PL/SQL Packages and Types Reference 10g Release 1 (10.1) Part Number B10802-01 |
|
|
View PDF |
The DBMS_DATA_MINING_TRANSFORM
package is a set of data transformation utilities available for use with the DBMS_DATA_MINING package for preparing mining data.
This chapter contains the following topics:
The DBMS_DATA_MINING_TRANSFORM
package has two motivations:
DBMS_DATA_MINING
package.dbmsdmxf.sql
(for UNIX, the path is $ORACLE_HOME/dm/admin/dbmsdmxf.sql)
and is available for inspection.The main principle behind the design of this package is the fact that SQL has enough power to perform most of the common mining transforms efficiently. For example, binning can be done using CASE
expression or DECODE
function, and linear normalization is a simple algebraic expression of the form (x
-
shift)/scale
where x
is the data value that is being transformed.
However, the queries that perform the transforms can be rather lengthy. So it is desirable to have some convenience routines that will help in generating queries. Thus, the goal of this package is to provide query generation services for the most common mining transforms, as well as to provide a framework that can be easily extended for implementing other transforms.
Note: Use of this package for preprocessing data input to DBMS_DATA_MINING
operations is not mandatory. You can use any third-party tool or any home-grown utilities that are customized for your application.
Data Type | Purpose |
---|---|
|
List of column names representing mining attributes, defined to be VARRAY(1000) OF VARCHAR2(32); |
The DBMS_DATA_MINING_TRANSFORM
package supports the following transformations for numerical and categorical attributes, which map to the NUMBER
and VARCHAR2/CHAR
Oracle data types respectively.
Binning involves mapping both continuous and discrete values to discrete values of reduced cardinality. For example, the age of persons can be binned into discrete numeric bins: 1-20 to 1, 21-40 to 2, and so on. Popular car manufacturers such as Ford, Chrysler, BMW, Volkswagen can be binned into discrete categorical bins: {Ford, Chrysler} to US_Car_Makers
, and {BMW, Volkswagen} to European_Car_Makers
. The package supports binning for both categorical and numerical attributes.
The bin definition for each attribute is computed based on the occurrence frequency of values that are computed from the data. The user specifies a particular number of bins, say N. Each of the bins bin
1, ...
, bin
N corresponds to the values with top frequencies. The bin bin
N+1 corresponds to all remaining values.
The bin definition for each attribute is computed based on the min
and max
values that are computed from the data. The user specifies a particular number of bins, say N
. Each of the bins bin
1,...,
bin
N span ranges of equal width of size inc
= (max
-
min)/N
, and bin
0 spans range (-inf, min
) and bin
N+1 range (max,
+
inf
).
The definition for each relevant column is computed based on the min
values for each quantile, where quantiles are computed from the data using NTILE
function. Bins bin
1, ...
, bin
N - 1 span ranges [min
I, min
I+1) and bin
N range [min
N,
max
N]. Bins with equal left and right boundaries are collapsed.
Normalization involves scaling continuous values down to specific range -- as in -1.0 to 1.0 or 0.0 to 1.0 such that x
new =
(x
old -
shift)/scale
. It applies only to numerical attributes.
The normalization definition for each attribute is computed based on the min
and max
values that are computed from the data. The values for shift
and scale
are computed to be shift
=
min
, and scale
=
(max
-
min)
respectively.
The normalization definition for each attribute is computed based on the values for mean
and standard deviation
that are computed from the data. The values for shift
and scale
are computed to be shift
=
mean
, and scale
=
standard deviation
respectively.
Some computations on attribute values can be significantly affected by extreme values. One approach to achieving a more robust computation is to winsorize or trim the data as a preprocessing step. Winsorizing involves setting the tail values of a particular attribute to some specified quantile of the data, while trimming removes the tails. In other words, trimmed values are ignored in further computations. This is achieved by setting the tails to NULL
. For example, for a 90% winsorization, the bottom 5% are set equal to the minimum value in the 6th percentile, while the upper 5% are set equal to the value corresponding to the maximum value in the 95th percentile.
The package provides three classes of convenience routines. The first two classes of routines define the transformation, and the last class of routines generate the queries that provide the transformed data.
CREATE
routines. The generated table is a transform-specific and has a pre-defined fixed schema. Each routine is equivalent to the SQL statement CREATE TABLE
table
(...)
INSERT
routines. Each routine is equivalent to the SQL statement INSERT
INTO
table
SELECT...
XFORM
routines. Each routine is equivalent to the SQL statement CREATE VIEW
view
AS SELECT ...
You must use CREATE_BIN_NUM
or CREATE_BIN_CAT
routines to create the bin definition tables. You must use CREATE_NORM_LIN
to create the normalization definition table. You must use CREATE_CLIP
to create the clipping definition table.
Usually, the consistency and integrity of transform definition tables is guaranteed by the creation process. Alternatively, it can be achieved by leveraging an integrity constraints mechanism. This can be done either by altering the tables created with CREATE
routines, or by creating the tables manually with the necessary integrity constraints.
The most common way of defining the transform (populating the transform definition tables) for each attribute is based on data inspection using some predefined methods (also known as automatic transform definition). Some of the most popular methods have been captured by the INSERT
routines in the package. For example, the z-score normalization method estimates mean and standard deviation from the data to be used as a shift and scale parameters of the linear normalization transform.
You can bin numerical attributes using INSERT_BIN_NUM_EQWIDTH
or INSERT_BIN_NUM_QTILE
and categorical attributes using INSERT_BIN_CAT_FREQ
. You can normalize numerical attributes using INSERT_NORM_LIN_ZSCORE
or INSERT_NORM_LIN_MINMAX
. You can either winsorize numerical attributes using INSERT_CLIP_WINZOR_TAIL
or trim them using INSERT_CLIP_TRIM_TAIL
. You can invoke these routines several times to transform all relevant attributes from various data sources till the definition table fully represents all mining attributes for a given problem.
After performing automatic transform definition, some or all of the definitions can be adjusted by issuing SQL DML statements against the transform definition tables, thus providing virtually infinite flexibility in defining custom transforms.
The INSERT
routines enable flexible transform definitions in several ways:
INSERT
routines do not necessarily have to be the data used for a particular model creation. It can be any data that contains adequate representation of the mining attributes.INSERT
routines can be called any number of times against the same or different dataset until all the attributes have their transformations defined. You can selectively exclude one or more attributes for a particular iteration of the INSERT
. In the most extreme case, each individual attribute can potentially have a unique transformation definition.NUMBER
columns in your table, and numerical binning and the normalization and clipping routines skip over VARCHAR2
/CHAR
columns in your input data.Query generation is driven by the simple transform-specific definition tables with a predefined schema. Query generation routines should be viewed as macros, and transform definition tables as parameters used in macro expansions. Similar to using #define
macros in the C language, the invoker is responsible for ensuring the correctness of the expanded macro, that is, that the result is a valid SQL query.
You can generate the views representing the transformation queries for binning using XFORM_BIN_CAT
and XFORM_BIN_NUM
, and for normalization using XFORM_NORM_LIN
, and for clipping using XFORM_CLIP
.
If your data contains a combination of numerical and categorical attributes, you must essentially feed the results of one transformation step to the next step. For example, the results of XFORM_BIN_CAT
can be fed to XFORM_BIN_NUM
or vice versa. The order is irrelevant since numerical and categorical transforms work on disjoint sets of attributes.
Given a dataset for a particular mining problem, any preprocessing and transformations on the mining data must be uniform across all mining operations. In other words, if the build data is preprocessed according to a particular transform definition, then it follows that the test data and the scoring data must be preprocessed using the same definition.
The general usage of routines in this package can be explained using this example. Assume that your input table for model build contains both numerical and categorical data that require to be binned. A possible sequence of operations will be:
CREATE_BIN_NUM
to generate a numerical bin definition table.INSERT_BIN_NUM_EQWIDTH
to define the transforms for all numerical attributes in the build data input (For the sake of simplicity, let us assume that all numericals are to be binned into 10 bins.)XFORM_BIN_NUM
with the numerical bin table and the build data table as inputs. The resulting object is a view that represents a SQL query against the build data table that performs numerical binning. Assume that you have named this result object build_bin_num_view
.CREATE_BIN_CAT
to create a categorical bin definition table.INSERT_BIN_CAT_FREQ
to define the transforms for all categorical attributes. (For the sake of simplicity, let us assume that all categorical attributes are to be binned into 10 bins.)XFORM_BIN_CAT
with the categorical bin table and the view name provided by XFORM_BIN_NUM
, namely build_bin_num_view
, as the inputs. This essentially amounts to combining the transformations from both stages.CREATE_MODEL
procedure in the DBMS_DATA_MINING
package.If this happens to be a classification model, and you want to APPLY
this model to scoring data, you must prepare the scoring data similar to the build data. You can achieve this in two simple steps:
XFORM_BIN_NUM
with the scoring data table and the numerical bin boundary table as inputs. The resulting object is a view that represents an SQL query against your scoring data table, influenced by the contents of the numerical bin boundary table. Assume that you have named this result object apply_bin_num_view
.XFORM_BIN_CAT
with the categorical bin table and the view name provided by XFORM_BIN_NUM
, namely apply_bin_num_view
, as the inputs.APPLY
procedure in the DBMS_DATA_MINING
package.This procedure creates a categorical binning definition table. This table is used as input to INSERT_BIN_CAT_FREQ
and XFORM_BIN_CAT
procedures.
DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_CAT ( bin_table_name IN VARCHAR2, bin_schema_name IN VARCHAR2 DEFAULT NULL);
Parameter | Description |
---|---|
|
Name of the bin definition table. |
|
Name of the schema hosting the bin definition table. |
The generated bin definition table will have the predefined schema:
col VARCHAR2(30) val VARCHAR2(4000) bin VARCHAR2(4000)
BEGIN DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_CAT('build_bin_cat_table'); END;
This procedure creates a numerical binning definition table. This table is used as input to INSERT_BIN_NUM_EQWIDTH
, INSERT_BIN_NUM_QTILE
, and XFORM_BIN_NUM
procedures.
DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM ( bin_table_name IN VARCHAR2, bin_schema_name IN VARCHAR2 DEFAULT NULL);
Parameter | Description |
---|---|
|
Name of the bin definition table. |
|
Name of the schema hosting the bin definition table. |
The generated bin definition table will have the predefined schema:
col VARCHAR2(30) val NUMBER bin VARCHAR2(4000)
BEGIN DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM('build_bin_num_table');END;
This procedure creates a clipping definition table. This table is used as input to INSERT_CLIP_WINSOR_TAIL
, INSERT_CLIP_TRIM_TAIL
, and XFORM_CLIP
procedures.
DBMS_DATA_MINING_TRANSFORM.CREATE_CLIP ( clip_table_name IN VARCHAR2, clip_schema_name IN VARCHAR2 DEFAULT NULL);
Parameter | Description |
---|---|
|
Name of the clipping definition table. |
|
Name of the schema hosting the clipping definition table. |
The generated clipping definition table will have the predefined schema:
col VARCHAR2(30) lcut NUMBER lval NUMBER rcut NUMBER rval NUMBER
BEGIN DBMS_DATA_MINING_TRANSFORM.CREATE_CLIP('build_clip_table'); END;
This procedure creates a linear normalization definition table. This table is used as input to INSERT_NORM_LIN_MINMAX
, INSERT_NORM_LIN_ZSCORE
, XFORM_NORM_LIN
procedures.
DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN ( norm_table_name IN VARCHAR2, norm_schema_name IN VARCHAR2 DEFAULT NULL);
Parameter | Description |
---|---|
|
Name of the normalization definition table. |
|
Name of the schema hosting the normalization definition table. |
The generated normalization table will have the predefined schema:
col VARCHAR2(30) shift NUMBER scale NUMBER
BEGIN DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN('build_norm_table');END;
This procedure finds the categorical binning definition for every VARCHAR2
, CHAR
column in the data table that is not specified in the exclusion list and inserts the definition into the categorical binning definition table created using CREATE_BIN_CAT
.
Definition for each relevant column is computed based on the occurrence frequency of column values that are computed from the data table. Each of the bin_num(N)
bins bin
1, ..., bin
N corresponds to the values with top frequencies when N
> 0 or bottom frequencies when N
< 0, and bin
N+1 to all remaining values, where bin
I = I
. Ordering ties among identical frequencies are broken by ordering on column values (ASC for N
> 0 or DESC for N
< 0). When the number of distinct values C
< N
only C
+1 bins will be created.
The parameter default_num
(D
) is used for pruning based on the number of values that fall into the default bin. When D
> 0 only columns that have at least D
defaults are kept while others are ignored. When D
< 0 only columns that have at most D
values are kept. No pruning is done when D is NULL
or D
= 0. Parameter bin_support
(SUP
) is used for restricting bins to frequent (SUP
> 0) values frq
>= SUP*tot
, or infrequent (SUP
< 0) ones frq
<= -SUP*tot
, where frq
is a given value count and tot
is a sum of all counts as computed from the data. No support filtering is done when SUP
is NULL
or when SUP
= 0.
DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_CAT_FREQ ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, bin_num IN PLS_INTEGER DEFAULT 9, exclude_list IN Column_List DEFAULT NULL, default_num IN PLS_INTEGER DEFAULT 2, bin_support NUMBER DEFAULT NULL, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
For a given input table, you can iteratively call this routine several times with different specifications for number of bins for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular binning specification.
Columns with all NULL
s are ignored. No bin definitions are populated when bin_num
=
0
, or bin_num
, is NULL
.
The simplest invocation of this routine populates bin definitions in the cat_bin_table
for all the categorical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM( 'cat_bin_table', 'build_table');END; /
This procedure finds the numerical binning definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the numerical binning definition table that was created using CREATE_BIN_NUM
.
Definition for each relevant column is computed based on the min
and max
values that are computed from the data table. Each of the bin_num(
N)
bins bin
1,...,
bin
N span ranges of equal width inc
= (
max - min
) /
N where bin
I = I when N
> 0 or bin
I = N+1-I
when N
< 0, and bin
0 = bin
N+1 = NULL
. The values of val
column are rounded to round_num
significant digits prior to scoring them in the definition table.
DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_EQWIDTH ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, bin_num IN PLS_INTEGER DEFAULT 10, exclude_list IN Column_List DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
For a given input table, you can iteratively call this routine several times with different specifications for number of bins for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular binning specification.
Columns with all NULL
s or only one unique value are ignored. No bin definitions are populated when bin_num
=
0
, or bin_num
is NULL
.
For example, when N=2
, col='mycol'
, min=10
, and max = 21
, the following three rows are inserted into the definition table (inc = 5.5
):
COL VAL BIN ----- ----- ----- mycol 10 NULL mycol 15.5 1 mycol 21 2
The simplest invocation of this routine populates bin definitions in the num_bin_ table for all the numerical attributes found in build_table. BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM( 'num_bin_table', 'build_table');END;
/
This procedure finds a numerical binning definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the binning definition table that was created using CREATE_BIN_NUM
.
The definition for each relevant column is computed based on the min
values for each quantile, where quantiles are computed from the data using NTILE
function. Bins bin
1, ...
, bin
N - 1 span ranges [min
I, min
I+1) and bin
N range [min
N,
max
N] with bin
I = I
when N
> 0 or bin
I = N+1
-I
when N
< 0, and bin
0 = bin
N+1 = NULL
. Bins with equal left and right boundaries are collapsed.
DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_QTILE ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, bin_num IN PLS_INTEGER DEFAULT 10, exclude_list IN Column_List DEFAULT NULL, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
For a given input table, you can iteratively call this routine several times with different specifications for bin_num
for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular specification. Columns with all NULL
s are ignored.
Example 1. When N
=
4, col='mycol'
, and data is {1,2,2,2,2,3,4}
, the following three rows are inserted into the definition table:
COL VAL BIN ----- ----- ----- mycol 1 NULL mycol 2 1 mycol 4 2
Here quantities are {1,2}
, {2,2}
, {2,3}
, {4}
and min(1)
=
1
, min(2)
= 2
, min(3)
= 2
, min(4)
= 4
, max(4)
= 4
, and ranges are [1,2)
, [2,2)
, [2,4)
, [4,4]
. After collapsing [1,2)
, [2,4]
.
The simplest invocation of this routine populates numerical binning definitions in the num_bin_table
for all the numerical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_QTILE( 'num_bin_table', 'build_table'); END;
This procedure finds the trimming definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the clipping definition table that was created using CREATE_CLIP
.
The definition for each relevant column is computed based on the non-NULL
values sorted in ascending order such that val(1)
<
val(2)
<
...
<
val(N)
, where N
is a total number of non-NULL
values in a column:
lcut = val(1+floor(N*q)) lval = NULL rcut = val(N-floor(*N*q)) rval = NULL
where q
=
ABS(NVL(tail_frac,0))
. Nothing is done when q
>=
0.5
.
DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_TRIM_TAIL ( clip_table_name IN VARCHAR2, data_table_name IN VARCHAR2, tail_frac IN NUMBER DEFAULT 0.025, exclude_list IN Column_List DEFAULT NULL, clip_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
For a given input table, you can iteratively call this routine several times with different specifications for tail_frac
for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list
parameter for a particular specification.
Example 1. When q
=
0.2
, col='mycol'
, and data is {1,2,2,2,3,4,4}
, the following row is inserted into the definition table:
COL LCUT LVAL RCUT RVAL ----- ----- ----- ----- ----- mycol 2 NULL 4 NULL
Here 1
+
floor(N*q)
=
1
+
floor(7*0.2)
= 2,
lcut
=
val(2)
=
2
.
N
- floor(N*q)
=
7
- floor(7*0.2) = 6, rcut
=
val(6)
=
4.
The simplest invocation of this routine populates clipping definitions in the clip_table
for all the numerical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_TRIM_TAIL( 'clip_table', 'build_table'); END;
This procedure finds winsorizing definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition into the clipping definition table that was created using CREATE_CLIP
.
Definition for each relevant column is computed based on the non-NULL
values sorted in ascending order such that val(1)
<
val(2)
<
...
<
val(N)
, where N
is a total number of non-NULL
values in a column:
lcut = val(1+floor(N*q)) lval = lcut rcut = val(N-floor(N*q)) rval = rcut
where q
= ABS(NVL(tail_fraq,0))
. Nothing is done when q >= 0.5
.
DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_WINSOR_TAIL ( clip_table_name IN VARCHAR2, data_table_name IN VARCHAR2, tail_frac IN NUMBER DEFAULT 0.025, exclude_list IN Column_List DEFAULT NULL, clip_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
For a given input table, you can iteratively call this routine several times with different specifications for tail_frac
for a given input table. For each iteration, you can selectively exclude attribute (that is, column names using the exclude_list parameter for a particular specification. Columns with all NULL
s are ignored.
Example 1. When q
=
0.2
, col='mycol'
, and data is {1,2,2,2,3,4,4}
, the following row is inserted into the definition table:
COL LCUT LVAL RCUT RVAL ----- ----- ----- ----- ----- mycol 2 2 4 4
Here 1
+
floor(N*q)
=
1
+
floor(7*0.2)
= 2,
lcut
=
val(2)
=
2
.
N
- floor(N*q)
=
7
- floor(7*0.2) = 6, rcut
=
val(6)
=
4.
The simplest invocation of this routine populates clipping definitions in the clip_table
for all the numerical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_WINSOR_TAIL( 'clip_table', 'build_table'); END;
This procedure finds the normalization definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition based on z-score normalization into the table that was created using CREATE_NORM_LIN
.
Definition for each relevant column is computed based on the min and max values that are computed from the data table, such that shift
= min
and scale
= max
- min
. The values of shift
and scale
are rounded to round_num
significant digits prior to storing them in the definition table.
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_ZSCORE ( norm_table_name IN VARCHAR2, data_table_name IN VARCHAR2, exclude_list IN Column_List DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, norm_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
For a given input table, you can iteratively call this routine several times with selective exclusion of attributes (that is, column names) using the exclude_list
parameter for a particular binning specification.
Columns with all NULL
s or only one unique value are ignored.
The simplest invocation of this routine populates normalization definitions in the norm_zscore_table
for all the numerical attributes found in build_table.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_ZSCORE( 'norm_zscore_table', 'build_table');END;
/
This procedure finds the normalization definition for every NUMBER
column in the data table that is not specified in the exclusion list and inserts the definition based on min-max normalization into the table that was created using CREATE_NORM_LIN
.
Definition for each relevant column is computed based on the mean and standard deviation that are computed from the data table, such that shift
= mean
and scale
= stddev
. The values of shift
and scale
are rounded to round_num
significant digits prior to storing them in the definition table.
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX ( norm_table_name IN VARCHAR2, data_table_name IN VARCHAR2, exclude_list IN Column_List DEFAULT NULL, round_num IN PLS_INTEGER DEFAULT 6, norm_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL);
For a given input table, you can iteratively call this routine several times with selective exclusion of attributes (that is, column names) using the exclude_list
parameter for a particular normalization specification.
Columns with all NULL
s or only one unique value are ignored.
The simplest invocation of this routine populates normalization definitions in the norm_minmax_table
for all the numerical attributes found in build_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX( 'norm_minmax_table', 'build_table');END;
This procedure creates the view that performs categorical binning. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.
DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_CAT ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, literal_flag IN BOOLEAN DEFAULT FALSE, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, xform_schema_name IN VARCHAR2 DEFAULT NULL);
The bin table created by CREATE_BIN_CAT
and populated with bin definitions by INSERT_BIN_CAT_FREQ
is used to guide the query generation process to construct categorical binning expressions of the following form:
DECODE("col", val1, bin1, ... valN, binN, NULL, NULL, binN+1) "col"
This expression maps values val
1,...
, val
N into N
bins bin1,...
, binN, and other values into bin
N+1, while NULL
values remain unchanged. bin
N+1 is optional. If not specified, it defaults to NULL
. To specify bin
N+1 provide a row with val
set to NULL
.
The literal_flag
parameter indicates whether the values in bin
are valid SQL literals. When the flag is set to TRUE
, the value of bin
is used as is in query generation; otherwise it is converted into a valid text literal (surrounded by quotes and each single quote is replaced by two single quotes). By default, the flag is set to FALSE
. One example of when it can be set to TRUE
is in cases when all bin
are numbers. In that case the transformed column will be numeric as opposed to textual (default behavior).
Note that col
is case-sensitive since it generates quoted identifiers. In cases when there are multiple entries with the same col
,val
combination with different bin
, the behavior is undefined -- any one of the bin
values might be used.
Example 1. bin_cat
contains four rows with col
= 'mycol'
:
{col = 'mycol', val = 'Waltham', bin = 'MA'} {col = 'mycol', val = 'Burlington', bin = 'MA'} {col = 'mycol', val = 'Redwood Shores', bin = 'CA'} {col = 'mycol', val = NULL, bin = 'OTHER'}
the following expression is generated:
DECODE("mycol", 'Waltham', 'MA', 'Burlington', 'MA', 'Redwood Shores', 'CA', NULL, NULL, 'OTHER') "mycol"
Example 2. bin_cat
contains three rows with col
= 'mycol'
:
{col = 'mycol', val = 'Waltham', bin = 'MA'} {col = 'mycol', val = 'Burlington', bin = 'MA'} {col = 'mycol', val = 'Redwood Shores', bin = 'CA'}
the following expression is generated:
DECODE("mycol", 'Waltham', 'MA', 'Burlington', 'MA', 'Redwood Shores', 'CA') "mycol"
Example 3. For the definition:
COL VAL BIN ----- ---------- --- mycol Waltham 1 mycol Burlington 1 mycol Redwood Shores 2
the following expression is generated when the literal flag is set to FALSE
:
DECODE ("mycol", 'Waltham', '1', 'Burlington' '1', 'Redwood Shores', '2') "mycol"
and when the flag is set to TRUE
:
DECODE("mycol", 'Waltham', 1, 'Burlington', 1, 'Redwood Shores', 2) "mycol"
The simplest invocation of this routine generates a view build_view
that represents the transformation query on build_table
based on bin definitions in the cat_bin_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_CAT( 'cat_bin_table', 'build_table', 'build_view');END; /
This procedure creates the view that performs numerical binning. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.
DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM ( bin_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, literal_flag IN BOOLEAN DEFAULT FALSE, bin_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, xform_schema_name IN VARCHAR2 DEFAULT NULL);
The bin table created by CREATE_BIN_NUM
and populated with bin definitions by INSERT_BIN_NUM_EQWIDTH
or INSERT_BIN_NUM_QTILE
is used to guide the query generation process to construct numerical binning expressions of the following form:
CASE WHEN "col" < val0 THEN 'bin0' WHEN "col" <= val1 THEN 'bin1' ... WHEN "col" <= valN THEN 'binN' WHEN "col" IS NOT NULL THEN 'binN+1' END "col"
This expression maps values in the range [
val0;
valN]
into N
bins bin
1, ...
, bin
N, values outside of this range into bin
0 or bin
N+1, such that
(-inf; val0) -> bin0 [val0; val1) -> bin1 ... (valN-1; valN] -> binN (valN; +inf) -> binN+1
NULL
values remain unchanged. bin
N+1 is optional. If it is not specified, the values ("
col
"
>
val
N)
are mapped to NULL
. To specify bin
N+1, provide a row with val
set to NULL
. The order of the WHEN
...
THEN
pairs is based on the ascending order of val
for a given col
.
The literal_flag
parameter indicates whether the values in bin
are valid SQL literals. When the flag is set to TRUE
, the value of bin
is used as is in query generation; otherwise it is converted into a valid text literal (surrounded by quotes and each single quote is replaced by two single quotes). By default, the flag is set to FALSE
. One example of when it can be set to TRUE
is in cases when all bin
are numbers. In that case the transformed column will be numeric as opposed to textual (default behavior).
Note that col
is case-sensitive since it generates quoted identifiers. In cases where there are multiple entries with the same col,val
combination with different bin
, the behavior is undefined -- any one of the bin
values might be used.
Example 1. bin_num
contains four rows with col
=
'mycol'
:
{col = 'mycol', val = 15.5, bin = 'small'} {col = 'mycol', val = 10, bin = 'tiny'} {col = 'mycol', val = 20, bin = 'large'} {col = 'mycol', val = NULL, bin = 'huge'}
the following expression is generated:
CASE WHEN "mycol" < 10 THEN 'tiny' WHEN "mycol" <= 15.5 THEN 'small' WHEN "mycol" <= 20 THEN 'large' WHEN "mycol" IS NOT NULL THEN 'huge' END "mycol"
Example 2. bin_num
contains three rows with col
= 'mycol'
:
{col = 'mycol', val = 15.5, bin = NULL} {col = 'mycol', val = 10, bin = 'tiny'} {col = 'mycol', val = 20, bin = 'large'}
the following expression is generated:
CASE WHEN "mycol" < 10 THEN NULL WHEN "mycol" <= 15.5 THEN 'small' WHEN "mycol" <= 20 THEN 'large' END "mycol"
Example 3. For the definition:
COL VAL BIN ----- ---- --- mycol 10 NULL mycol 15.5 1 mycol 21 2
the following expression is generated when the literal flag is set to FALSE
:
CASE WHEN "mycol" < 10 THEN NULL WHEN "mycol" <= 15.5 THEN '1' WHEN "mycol" <= 20 THEN '2' END "mycol"
and when the flag is set to TRUE
:
CASE WHEN "mycol" < 10 THEN NULL WHEN "mycol" <= 15.5 THEN 1 WHEN "mycol" <= 20 THEN 2 END "mycol"
The simplest invocation of this routine generates a view build_view
that represents the transformation query on build_table
based on transform definitions in bin definitions in the num_bin_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM( 'num_bin_table', 'build_table', 'build_view');END; /
This procedure creates the view that performs clipping. Only the columns that are specified in the transform definition are clipped; the remaining columns do not change.
DBMS_DATA_MINING_TRANSFORM.XFORM_CLIP ( clip_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, clip_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2,DEFAULT NULL, xform_schema_name IN VARCHAR2,DEFAULT NULL;
The clipping definition table created by CREATE_CLIP
and populated with clipping definitions by INSERT_CLIP_WINSOR_TAIL
or INSERT_CLIP_TRIM_TAIL
is used to guide query generation process to construct clipping expressions of the following form:
CASE WHEN "col" < lcut THEN lval WHEN "col" > rcut THEN rval ELSE "col" END "col"
Note that col
is case-sensitive since it generates quoted identifiers. When there are multiple entries in the transform definition table for the same col
, the behavior is undefined. Any one of the definitions may be used in query generation. NULL
values remain unchanged.
Example 1 (Winsorizing). When col =
'my_col'
, lcut
= -1.5
, lval
= -1.5
, and rcut
= 4.5
and rval
= 4.5
, the following expression is generated:
CASE WHEN "my_col" < -1.5 THEN NULL WHEN "my_col" > 4.5 THEN NULL ELSE "my_col" END "my_col"
The simplest invocation of this routine generates a view object build_view
that represents the transformation query on build_table
based on transform definitions in clipping definitions in the clip_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.XFORM_CLIP( 'clip_table', 'build_table', 'build_view'); END;
This procedure creates the view that performs linear normalization. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.
DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN ( norm_table_name IN VARCHAR2, data_table_name IN VARCHAR2, xform_view_name IN VARCHAR2, norm_schema_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, xform_schema_name IN VARCHAR2 DEFAULT NULL);
The normalization table created by CREATE_NORM_LIN
and populated with definitions by either INSERT_NORM_LIN_ZSCORE
or INSERT_NORM_LIN_MINMAX
is used to guide the query generation process to construct normalization expressions of the following form:
("col" - shift)/scale "col"
Note that col
is case-sensitive since it generates quoted identifiers. When there are multiple entries in the transform definition table for the same col
, the behavior is undefined. Any one of the definitions may be used in query generation. NULL
values remain unchanged.
For example, when col =
'my_col
', shift = -1.5
, and scale = 20
. The following expression is generated:
("my_col" - (-1.5))/20 "my_col"
The simplest invocation of this routine generates a view build_view
that represents the transformation query on build_table
based on normalization definitions in the norm_minmax_table
.
BEGIN DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN( 'norm_minmax_table', 'build_table', 'build_view');END;