Oracle Data Mining Application Developer's Guide 10g Release 1 (10.1), Part Number B10699-01
This chapter discusses topics related to writing data mining programs with the PL/SQL interface. It provides an overview of the steps required to perform basic Oracle Data Mining tasks. For detailed examples of how to perform these tasks, see the sample programs in Chapter 5.
This chapter does not include detailed descriptions of the PL/SQL subprograms. For that information, see the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM chapters in the PL/SQL Packages and Types Reference.
The DBMS_DATA_MINING package provides PL/SQL support for in-database data mining. You can use the package to build a mining model, test the model, and apply the model to your data to obtain predictive and descriptive information.
See also: DBMS_DATA_MINING_TRANSFORM, a supplied package that supports data preprocessing for mining (described in the PL/SQL Packages and Types Reference).
DBMS_DATA_MINING subprograms are presented in the PL/SQL Packages and Types Reference. Sample code is described in Chapter 5 of this manual; the code itself is in the dm/demo/sample/plsql directory.
The development methodology for data mining using the DBMS_DATA_MINING API is divided into two phases. The first phase covers your application and data analysis and design. The second phase involves developing a mining application using the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages.
You can prepare your data using the DBMS_DATA_MINING_TRANSFORM package, another third-party tool, or direct SQL or PL/SQL utility scripts, in a manner suitable for the chosen mining function and algorithm. An important caveat is that the three datasets referred to above (build, test, and scoring data) must be prepared in an identical manner for mining results to be meaningful. This preparation is an optional step.

The DBMS_DATA_MINING package creates a mining model for a mining function using a specified mining algorithm that supports the function. The algorithm can be influenced by specific algorithm settings.
A model is identified by its name. Like tables in the database, a model has storage associated with it. But unlike a table, the form, shape, and content of this storage is opaque to the user. However, the user can view the contents of a model -- that is, the patterns and rules that constitute a mining model -- using the algorithm-specific GET_MODEL_DETAILS functions. In addition, the dm_user_models view provides the model size in megabytes.
The DBMS_DATA_MINING package supports Classification, Regression, Association Rules, Clustering, and Feature Extraction. You can specify your choice of mining function through a parameter to the CREATE_MODEL procedure.
Each mining function can be implemented using one or more algorithms. Table 4-1 provides a list of supported algorithms. Oracle assumes a default algorithm for each mining function, but you can override this default through an explicit setting in the settings table.
Each algorithm has one or more settings or parameters that influence the way it builds the model. Oracle assumes a default set of algorithm settings for each mining algorithm. These defaults are available for your review through the table function GET_DEFAULT_SETTINGS. To override the defaults, you must provide the choice of the algorithm and the settings for the algorithm through a settings table input to the CREATE_MODEL procedure.
The settings table is a simple relational table with a fixed schema. You can choose the name of the settings table, but the column names and their types must be defined as specified below.
(setting_name   VARCHAR2(30),
 setting_value  VARCHAR2(128))
The values provided in the settings table override the default values assumed by the system. The values inserted into the setting_name column are one or more of several constants defined in the DBMS_DATA_MINING package. Depending on what the setting name denotes, the value for the setting_value column can be a predefined constant or the actual numerical value corresponding to the setting itself. The setting_value column is defined to be VARCHAR2, so you must cast numerical inputs to string using the TO_CHAR() function before input into the settings table.
Table 4-2 through Table 4-7 list the various setting names and the valid setting values, with a brief explanation of each setting.
You can create a settings table as shown in the example below for an SVM classification model, and edit the individual values using SQL DML.
CREATE TABLE drugstore_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

BEGIN
  -- override the default for convergence tolerance for SVM Classification
  INSERT INTO drugstore_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.svms_conv_tolerance, TO_CHAR(0.081));
  COMMIT;
END;
/
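With the settings in place, the model build itself is a single call to CREATE_MODEL. The following is a minimal sketch: the training table drugstore_customers, its columns, and the model name are hypothetical placeholders, and the settings table is the one created above.

BEGIN
  -- Build a classification model using the settings table above.
  -- To use SVM rather than the default classification algorithm, the settings
  -- table should also set algo_name to algo_support_vector_machines.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'drugstore_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'drugstore_customers',   -- hypothetical
    case_id_column_name => 'customer_id',           -- hypothetical
    target_column_name  => 'responder',             -- hypothetical
    settings_table_name => 'drugstore_settings');
END;
/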
The table function GET_DEFAULT_SETTINGS provides all the default settings for mining functions and algorithms. If you intend to override all the default settings, you can create a seed settings table and edit it using DML.
CREATE TABLE drug_store_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'SVMS_%';
-- update the values using appropriate DML
You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS, as shown in the example below.
CREATE TABLE my_new_model_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));
Consult Oracle Data Mining Concepts for an explanation of the prior probabilities table. You can specify a prior probabilities table as an optional function setting when building classification models.
You must create the prior probabilities table using the fixed schema shown below. For numerical targets, use the following schema:
target_value       NUMBER
prior_probability  NUMBER
For categorical targets, use the following schema:
target_value       VARCHAR2
prior_probability  NUMBER
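For instance, a priors table for a hypothetical binary categorical target could be created and populated as follows; the table name census_priors matches the settings example below, while the target values and probabilities are illustrative assumptions.

CREATE TABLE census_priors (
  target_value      VARCHAR2(30),
  prior_probability NUMBER);

-- hypothetical priors for a binary target
INSERT INTO census_priors VALUES ('0', 0.8);
INSERT INTO census_priors VALUES ('1', 0.2);
COMMIT;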
Next, provide the name of the prior probabilities table as input to the setting_value column in the settings table, with the corresponding value for the setting_name column set to DBMS_DATA_MINING.clas_priors_table_name, as shown below:
BEGIN
  INSERT INTO drugstore_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.clas_priors_table_name, 'census_priors');
  COMMIT;
END;
/
Consult Oracle Data Mining Concepts for an explanation of the cost matrix. You must create a cost matrix table with the fixed schema shown below. For numerical targets, use the following schema:
actual_target_value     NUMBER
predicted_target_value  NUMBER
cost                    NUMBER
For categorical targets, use the following schema:
actual_target_value     VARCHAR2
predicted_target_value  VARCHAR2
cost                    NUMBER
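For example, a cost matrix for a hypothetical binary categorical target could be created as follows; the table name and the cost values are illustrative. The resulting table can then be supplied to the COMPUTE procedures or to RANK_APPLY, as described below.

CREATE TABLE drugstore_costs (
  actual_target_value    VARCHAR2(30),
  predicted_target_value VARCHAR2(30),
  cost                   NUMBER);

-- hypothetical costs: false negatives cost more than false positives
INSERT INTO drugstore_costs VALUES ('0', '0', 0);
INSERT INTO drugstore_costs VALUES ('0', '1', 1);
INSERT INTO drugstore_costs VALUES ('1', '0', 4);
INSERT INTO drugstore_costs VALUES ('1', '1', 0);
COMMIT;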
The DBMS_DATA_MINING package enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.
The data input to each COMPUTE procedure in the package is the result generated from applying the model on test data. In addition, if you also provide a cost matrix as an input, the COMPUTE procedure generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it against the same test data for every iteration.
Once you arrive at an optimal cost matrix, you can then input this cost matrix to the RANK_APPLY procedure along with the results of APPLY on your scoring data. RANK_APPLY will provide your new data ranked by cost.
The DBMS_DATA_MINING package supports several classes of mining operations:

- Operations that create, drop, and rename models: CREATE_MODEL, DROP_MODEL, RENAME_MODEL.
- Operations that apply a model to new data: APPLY.
- Operations that rank APPLY results, or other data that is presented using the same schema as that of APPLY results: RANK_APPLY.
- Operations that provide information about a model: GET_MODEL_DETAILS, GET_MODEL_SETTINGS, GET_MODEL_SIGNATURE.
- Operations that compute test metrics based on the results of the APPLY operation on the test data, or based on any other data that is presented using the same schema as that of the APPLY results: COMPUTE_CONFUSION_MATRIX, COMPUTE_LIFT, and COMPUTE_ROC.
- Operations that export and import models: EXPORT_MODEL, IMPORT_MODEL.
- GET_DEFAULT_SETTINGS, which returns default values for all the settings.

Of these, the first set represents DDL-like operations, and the export and import operations represent utilities. The rest are query-like operations in that they do not modify the model. Besides these operations, additional capabilities are also provided as part of the Oracle Data Mining installation.
Mining results are either returned as result sets or persisted as fixed schema tables.
The CREATE_MODEL operation creates a mining model. The viewable contents of a mining model are provided to you through the GET_MODEL_DETAILS functions for each supported algorithm. In addition, GET_MODEL_SIGNATURE and GET_MODEL_SETTINGS provide descriptive information about the model.
The APPLY operation creates and populates a fixed schema table with a given name. The schema of this table varies based on the particular mining function, algorithm, and target attribute type -- numerical or categorical.
The RANK_APPLY operation takes this results table as input and generates another table with results ranked based on a top-N input, and, for classification models, also based on cost. The schema of this table varies based on the particular mining function, algorithm, and target attribute type -- numerical or categorical.
The COMPUTE routines provided in the package cover the most commonly used metrics for classification. They are not tied to a particular model -- they can compute the metrics from any meaningful data input, as long as the schema of the input tables fits the specification of the apply results table and the targets table. Consult any of the COMPUTE descriptions in this document for details.
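As an illustration, the sketch below computes a confusion matrix from hypothetical apply results and targets tables, optionally taking the cost matrix into account; verify the exact parameter list against the COMPUTE_CONFUSION_MATRIX entry in the PL/SQL Packages and Types Reference.

DECLARE
  v_accuracy NUMBER;
BEGIN
  DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX(
    accuracy                    => v_accuracy,
    apply_result_table_name     => 'drugstore_apply_results',  -- hypothetical
    target_table_name           => 'drugstore_test_targets',   -- hypothetical
    case_id_column_name         => 'customer_id',
    target_column_name          => 'responder',
    confusion_matrix_table_name => 'drugstore_confusion',
    cost_matrix_table_name      => 'drugstore_costs');         -- optional
  DBMS_OUTPUT.PUT_LINE('Accuracy: ' || v_accuracy);
END;
/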
The most commonly used metrics for regression models are root mean square error and mean absolute error. You can use the simple SQL queries provided below to compute these metrics by replacing the italicized tokens with table and column names appropriate for your application.
SELECT SQRT(AVG((A.prediction - B.target_column_name) *
                (A.prediction - B.target_column_name))) rmse
  FROM apply_results_table A,
       targets_table B
 WHERE A.case_id_column_name = B.case_id_column_name;
Given the targets_table generated from the test data with the schema:

(case_id_column_name  VARCHAR2,
 target_column_name   NUMBER)

and the apply results table for regression with the schema:

(case_id_column_name  VARCHAR2,
 prediction           NUMBER)

and a normalization table (optional) with the schema:

(attribute_name  VARCHAR2(30),
 scale           NUMBER,
 shift           NUMBER)

the query for mean absolute error is:
SELECT /*+ PARALLEL(T) PARALLEL(A) */
       AVG(ABS(T.actual_value - A.prediction)) mean_absolute_error
  FROM (SELECT B.case_id_column_name,
               (B.target_column_name * N.scale + N.shift) actual_value
          FROM targets_table B,
               normalization_table N
         WHERE N.attribute_name = 'target_column_name') T,
       apply_results_table A
 WHERE A.case_id_column_name = T.case_id_column_name;
You can fill in the italicized values with the actual column and table names chosen by you. If the data has not undergone normalization transformation, you can eliminate those references from the subquery. See dm/demo/sample/plsql/svmrdemo.sql for an example.
The data input for all the mining operations should be through standard relational tables, where each row in the table represents a case, and each column in the table uniquely represents a mining attribute. We call this format single-record case.
A table in the Oracle RDBMS supports up to 1,000 columns. The DBMS_DATA_MINING package requires a mandatory case identifier column, which implies that you can provide up to 999 mining attributes using this representation; this is adequate for most business applications.
The notable exceptions to the common scenario are applications in domains such as bioinformatics and text mining, where data is characteristically wide and shallow -- with relatively few cases numbering in the thousands, but with several thousand mining attributes.
You can provide such data in a multi-record case format, where (attribute, value) pairs are grouped into collections (nested tables) associated with a given case-id. You must use the fixed collection types DM_Nested_Numericals and DM_Nested_Categoricals to define columns that represent collections of numerical attributes and categorical attributes, respectively.
A few caveats on the use of multi-record case format:
From a data modeling perspective, the multi-record case format is analogous to the dimension-fact relationship commonly found in OLAP applications, where the dimension and fact tables are aggregated based on a common key for generating reports using materialized views or SQL queries.
The main distinction between the OLAP fact table and the multi-record case is this: Each row in the fact table represents a column value, whereas each row in the multi-record collection represents an attribute name (paired with its corresponding value in another column in the nested table). The two fixed collection types are provided to highlight and enforce this distinction. These two types are defined with the reasonable assumption that mining attributes of the same type (numericals versus categoricals) are generally grouped together, just as a fact table contains values that logically correspond to the same entity.
Oracle strongly recommends that you present your multi-record case data using object views, and present the view as input to CREATE_MODEL and APPLY operations. Apart from the benefit of providing all your mining attributes through a single row-source without impacting their physical data storage, the view acts as a join specification on the underlying tables that can be used by the server for efficiently accessing your data.
We illustrate this discussion on wide data with a real-world example of an analytical pipeline for brain tumor research. The datasets of relevance are gene expression data from the instruments (fact table), and the clinical data about the patient (dimension table). The schemas of these tables are provided below.
clinical_data (dimension table):

(case_id  NUMBER,
 name     VARCHAR2(30),
 type     VARCHAR2(30),
 subtype  VARCHAR2(30),
 gender   CHAR(1),
 age      NUMBER,
 status   VARCHAR2(30))
gene_expression_data (fact table):

(case_id  NUMBER,
 gene     VARCHAR2(30),
 expr     NUMBER)
Let us consider building a model with status as the prediction target, and gender, age, and expr as the predictors. You can provide the build data input using an object view that combines the clinical_data table and the gene_expression_data table, with the schema:
(case_id    NUMBER,
 gender     CHAR(1),
 age        NUMBER,
 gene_expr  DM_Nested_Numericals,
 status     VARCHAR2(30))
The query for constructing such an object view is relatively simple in Oracle SQL:
CREATE OR REPLACE VIEW gene_expr_build AS
SELECT C.case_id,
       C.gender,
       C.age,
       CAST(MULTISET(SELECT gene, expr
                       FROM gene_expression_data
                      WHERE case_id = C.case_id)
            AS DM_Nested_Numericals) gene_expr,
       C.status
  FROM clinical_data C;
Now gene_expr_build can be provided as the input data_table_name for CREATE_MODEL.
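A hypothetical build call using this view might look as follows; the model name is an assumption, and no settings table is passed, so the default classification algorithm and settings apply.

BEGIN
  -- Build a classification model directly on the object view.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'brain_tumor_model',   -- hypothetical
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'gene_expr_build',
    case_id_column_name => 'case_id',
    target_column_name  => 'status');
END;
/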
Oracle Data Mining handles categorical and numerical attributes. See Oracle Data Mining Concepts for explanation and examples of attribute types.
The DBMS_DATA_MINING package infers the attribute type based on the type of the column representing the mining attribute. You must define all columns representing numerical attributes to be of type NUMBER. You must define all columns representing categorical attributes to be of type VARCHAR2 or CHAR.
In situations where you have numbers that are deemed categorical data by your application logic, you must typecast such attribute values using the TO_CHAR() operator and populate them into a VARCHAR2 or CHAR column representing the mining attribute.
In situations where you have numeric attribute values stored in a CHAR or VARCHAR2 column, you must typecast those attribute values using the TO_NUMBER() operator and store them in a NUMBER column.
If persisting these transformed values in another table is not a viable option, you could also create a view with these conversions in place, and provide the view name to represent the training data input for the CREATE_MODEL operation.
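For example, a view along the following lines performs both conversions in place; the source table and all column names are hypothetical.

CREATE OR REPLACE VIEW customer_mining_view AS
SELECT customer_id,
       TO_CHAR(zip_code)     zip_code,   -- numeric codes treated as categorical
       TO_NUMBER(income_str) income,     -- numbers stored as text treated as numerical
       response
  FROM customer_source;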
Classification and Regression algorithms require a target attribute. The package supports single targets. The target attribute for all classification algorithms can be numerical or categorical. SVM Regression supports only numerical target attributes.
All mining operations assume the incoming data to be already prepared and transformed. You can preprocess your data using the DBMS_DATA_MINING_TRANSFORM package, by using any third-party tool, or by using your own homegrown utilities based on SQL and/or PL/SQL scripts.
If you preprocess or transform the input data, you must also reverse-transform the results from APPLY to map the data back to its original form.
See the PL/SQL Packages and Types Reference for a description of the DBMS_DATA_MINING_TRANSFORM package.
The suggestion to customers with wide data is to perform transformations before forming the wide-data views, on data in either single-record (2D) or multi-record (transactional) format. It is possible to use dbmsdmxf for the multi-record format; in this case, all attributes are transformed in a similar way. In most cases, attributes in transactional (multi-record) form are of the same scale, and thus this approach works. Otherwise, the suggestion is to split the data into sets of similar items and then transform them separately.
If you have mining attributes numbering in the few hundreds, and your application requires them to be represented as columns in the same row of the table, here are some data storage issues to consider. For a table with many columns, the key question to consider is the (average) row length, not the number of columns. Having more than 255 columns in a table built with a smaller block size typically results in intra-block chaining. Oracle stores multiple row pieces in the same block owing to design pragmatics, but the overhead to maintain the column information is minimal as long as all row pieces fit in a single data block. If the rows do not fit in a single data block, you may consider using a larger database block size (or using multiple block sizes in the same database). For more details, consult Oracle Data Mining Concepts and the Oracle Database Performance Tuning Guide.
The use of DBMS_DATA_MINING is subject to the following rules and limitations:

- The CREATE_MODEL and APPLY operations require a case identifier column of type VARCHAR2, CHAR, or NUMBER. The length of the case identifier values is limited to 128 bytes.
- Unlike Oracle schema objects, DBMS_DATA_MINING does not treat model names in a case-sensitive manner; all model names are up-cased internally.
- Objects with names prefixed DM_, DM$P, DM$J, and DM$T are system objects that should not be read or updated by the user. Direct queries against these tables may be possible, but the queries may provide meaningless results. Direct DML against these tables can irrevocably corrupt your model. Hence Oracle strongly recommends that you do not perform any operations on these system objects.
- GRANT and REVOKE of read and update privileges on a mining model across user schemas are not yet supported. The user can only read and update models that are generated in a given schema owned by that user. Models in one schema or database instance can be exported to other schemas or database instances using EXPORT_MODEL.

The DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages use the data types shown in Table 4-8.
Table 4-9 through Table 4-16 list the constants to be used for various settings in the settings table.
Constant | Purpose
---|---
nabs_singleton_threshold | Setting value for singleton threshold for Naive Bayes
nabs_pairwise_threshold | Setting value for pair-wise threshold for Naive Bayes
Table 4-17 lists the errors generated by DBMS_DATA_MINING.
Table 4-18 lists the user views provided by Oracle to obtain information about the models generated using DBMS_DATA_MINING.
User View | Purpose
---|---
dm_user_models | Lists all models in a given user's schema.
Data mining models can be moved between Oracle databases or schemas. For example, in an organization, data mining specialists may build and test data mining models in a data mining lab. After models are built and tested in the lab, the chosen model may be moved to a scoring engine used by applications. Because the data mining lab and the scoring engine usually do not share the same database, the model must be exported from the lab and then imported into the scoring engine. Model export and import can be a routine procedure: as new data accumulate, data mining specialists will build and test new models, and newer and better models will be loaded onto the scoring engine on a regular basis. DBAs will also want to back up and restore models in their routine database maintenance.
Native export and import of data mining models are supported in the following scenarios:

- When a DBA exports a full database using the utility expdp, all the existing data mining models in the database are exported. By the same token, when a DBA imports a database dump using the utility impdp, all the data mining models in the dump are restored.
- When a user or DBA exports a schema using expdp, all the data mining models in the schema are exported. When the user or DBA imports the schema dump using impdp, all the models in the dump are imported.
- A user can export specified models using DBMS_DATA_MINING.EXPORT_MODEL and import specified models using DBMS_DATA_MINING.IMPORT_MODEL.
The use of model export and import is subject to the following limitations:

- Only models created through the DBMS_DATA_MINING interface are supported.

Prerequisites for model export are as follows:

- A directory object must be available to receive the dump file set; see CREATE_DIRECTORY in the PL/SQL Packages and Types Reference.
- For full database and schema exports, the Data Pump export utility expdp must be used.

Prerequisites for model import are as follows:

- The dump file set must have been produced by expdp or EXPORT_MODEL.
- For full database and schema imports, the Data Pump import utility impdp must be used.
There are two ways to export models:
To export all data mining models in a user schema, you can either run expdp or use EXPORT_MODEL with the parameter model_filter set to NULL. Note the difference between the two operations: when you run expdp to export the schema, all objects in the schema including data mining models are exported; when you run EXPORT_MODEL with a NULL model_filter, only the models are exported.
There are also two ways to import models from the dump file:
In general, if you want to import the full dump file set, run impdp. This imports all database objects and data, including all data mining models, from the dump file set. If you want to import models only, use IMPORT_MODEL. When the parameter model_filter is set to NULL, all models are imported from the dump. If valid model names are assigned in model_filter, this operation imports only the named models from the dump file set.
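A corresponding import sketch follows, under the same assumptions about the directory object and file name; again, consult the PL/SQL Packages and Types Reference for the authoritative signature.

BEGIN
  -- Import only the named model from the dump file set.
  DBMS_DATA_MINING.IMPORT_MODEL(
    filename     => 'drugstore_models.dmp',
    directory    => 'DM_DUMP',
    model_filter => 'name = ''DRUGSTORE_MODEL''');
END;
/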
Data mining model export and import jobs create and manage two temporary tables in the user schema: DM$P_MODEL_EXPIMP_TEMP and DM$P_MODEL_TABKEY_TEMP. Users should not manipulate these tables.