Oracle Data Mining (ODM) is designed for programmers, systems analysts, project managers, and others interested in developing database applications that use data mining to discover hidden patterns and use that knowledge to make predictions.
Oracle Data Mining (ODM) embeds data mining in the Oracle database. The data never leaves the database -- the data, its preparation, model building, and model scoring activities all remain in the database. This enables Oracle to provide an infrastructure for data analysts and application developers to integrate data mining seamlessly with database applications.
ODM provides two interfaces that support in-database data mining: a Java interface and a PL/SQL interface. The Java interface is described in Oracle Data Mining Application Developer's Guide.
You can use the package to build a mining model, test the model, and apply this model to your data to obtain predictive and descriptive information. For detailed examples of how to perform these tasks, see the sample programs in Oracle Data Mining Application Developer's Guide. The sample code is in the demo directory.
General information about both the Java and the PL/SQL interface is contained in Oracle Data Mining Application Developer's Guide. See Oracle Data Mining Concepts for a discussion of data mining concepts.
This chapter contains an overview of the development methodology, followed by information on data types, settings, and constants, as well as detailed descriptions of the PL/SQL subprograms. The DBMS_DATA_MINING_TRANSFORM package supports data pre-processing for data mining.
Table 23-1 through Table 23-8 list the constants to be used for various settings in the settings table.
Constant | Purpose |
---|---|
NABS_SINGLETON_THRESHOLD | Setting value for singleton threshold for Naive Bayes. |
NABS_PAIRWISE_THRESHOLD | Setting value for pair-wise threshold for Naive Bayes. |
The DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages use the data types shown in Table 23-9.
Table 23-10 lists the errors generated by DBMS_DATA_MINING.
Table 23-11 lists the user views provided by Oracle to obtain information about the models generated using DBMS_DATA_MINING.
User View | Purpose |
---|---|
DM_USER_MODELS | Lists all models in a given user's schema. |
The development methodology for data mining using the DBMS_DATA_MINING interface is divided into two phases.
The first phase includes your application data analysis and design, where you perform the following two steps:
The second phase involves developing a mining application using the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages.
Prepare your data, using the DBMS_DATA_MINING_TRANSFORM package, another third-party tool, or direct SQL or PL/SQL utility scripts, in a manner suitable for the chosen mining function and algorithm. An important caveat is that the three datasets referred to earlier have to be prepared in an identical manner for mining results to be meaningful. This is an optional step, required only if your data is not prepared for mining.
See Oracle Data Mining Application Developer's Guide for a more general discussion of the PL/SQL interface to Oracle Data Mining.
The settings table is a simple relational table with a fixed schema. You can choose the name of the settings table, but the column names and their types must be defined as specified:
(setting_name VARCHAR2(30), setting_value VARCHAR2(128))
The values provided in the settings table override the default values assumed by the system. The values inserted into the setting_name column are one or more of several constants defined in the DBMS_DATA_MINING package. Depending on what the setting name denotes, the value for the setting_value column can be a predefined constant or the actual numerical or string value corresponding to the setting itself. The setting_value column is defined to be VARCHAR2, so you must cast numerical inputs to string using the TO_CHAR function before input into the settings table.
Table 23-12 through Table 23-17 list the various setting names and the valid setting values, with a brief explanation of each setting.
You can create a settings table as shown in the example that follows for an SVM classification model, and edit the individual values using SQL DML.
CREATE TABLE drugstore_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

BEGIN
  -- override the default for convergence tolerance for SVM Classification
  INSERT INTO drugstore_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.svms_conv_tolerance, TO_CHAR(0.081));
  COMMIT;
END;
/
The table function GET_DEFAULT_SETTINGS provides you with all the default settings for mining functions and algorithms. If you intend to override all the default settings, you can create a seed settings table and edit them using SQL DML.
CREATE TABLE drug_store_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'SVMS_%';
-- update the values using appropriate DML
You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS, as shown in the following example:
CREATE TABLE my_new_model_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));
Consult Oracle Data Mining Concepts for an explanation of the prior probabilities table. You can specify a prior probabilities table as an optional function setting when building classification models.
You must create the prior probabilities table using the fixed schema shown in the following code samples. For numerical targets, use the following schema:
(target_value NUMBER, prior_probability NUMBER)
For categorical targets, use the following schema:
(target_value VARCHAR2, prior_probability NUMBER)
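For instance, a priors table matching the categorical schema might be created and populated as follows (a minimal sketch; the census_priors name matches the example below, the column length of 32 is an arbitrary choice, and the target values '0' and '1' are hypothetical):

CREATE TABLE census_priors (
  target_value      VARCHAR2(32),
  prior_probability NUMBER);

-- hypothetical priors for a binary target
INSERT INTO census_priors VALUES ('0', 0.7);
INSERT INTO census_priors VALUES ('1', 0.3);
COMMIT;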
Next, provide the name of the prior probabilities table as input to the setting_value column in the settings table, with the corresponding value for the setting_name column set to DBMS_DATA_MINING.clas_priors_table_name, as shown:
BEGIN
  INSERT INTO drugstore_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.clas_priors_table_name, 'census_priors');
  COMMIT;
END;
/
Consult Oracle Data Mining Concepts for an explanation of the cost matrix. You must create a cost matrix table with the fixed schema shown in the following code samples. For numerical targets, use the following schema:
(actual_target_value NUMBER, predicted_target_value NUMBER, cost NUMBER)
For categorical targets, use the following schema:
(actual_target_value VARCHAR2, predicted_target_value VARCHAR2, cost NUMBER)
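For instance, a cost matrix for a binary categorical target could be created as follows (a minimal sketch; the census_cost_matrix name matches the later examples, and the cost values and target values are hypothetical):

CREATE TABLE census_cost_matrix (
  actual_target_value    VARCHAR2(32),
  predicted_target_value VARCHAR2(32),
  cost                   NUMBER);

-- penalize false negatives (actual '1', predicted '0') most heavily
INSERT INTO census_cost_matrix VALUES ('1', '0', 20);
INSERT INTO census_cost_matrix VALUES ('0', '1', 1);
INSERT INTO census_cost_matrix VALUES ('1', '1', 0);
INSERT INTO census_cost_matrix VALUES ('0', '0', 0);
COMMIT;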
The DBMS_DATA_MINING package enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.
The data input to each COMPUTE procedure in the package is the result generated from applying the model on test data. In addition, if you also provide a cost matrix as an input, the COMPUTE procedure generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it against the same test data for every iteration.
Once you arrive at an optimal cost matrix, you can then input this cost matrix to the RANK_APPLY procedure along with the results of APPLY on your scoring data. RANK_APPLY will provide your new data ranked by cost.
This procedure applies a mining model to the data to be scored, and generates the APPLY results in a table. This operation is applicable for predictive models (classification, regression) and also for descriptive models (clustering, feature extraction).
DBMS_DATA_MINING.APPLY (
   model_name          IN VARCHAR2,
   data_table_name     IN VARCHAR2,
   case_id_column_name IN VARCHAR2,
   result_table_name   IN VARCHAR2,
   data_schema_name    IN VARCHAR2 DEFAULT NULL);
The data provided for APPLY should match the data provided to CREATE_MODEL in terms of the schema definition and relevant content. The GET_MODEL_SIGNATURE function provides this information. If the data provided as input to CREATE_MODEL has been pre-processed, then the data input to APPLY must also be pre-processed using the statistics from the CREATE_MODEL data pre-processing. The case identifier is not considered to be a mining attribute during APPLY.
You must provide the name of the table in which the results of the apply operation are to be stored. APPLY creates a table with an algorithm-specific fixed schema in the user schema that owns the model.
The behavior of an APPLY operation is analogous to a SQL query operation, even though it is packaged as a procedure. It does not update the model contents and does not have any contention with CREATE_MODEL, DROP_MODEL, or RENAME_MODEL operations. The corollary is that if you drop or rename a model while it is being applied to scoring data, the APPLY operation may terminate with partial or unpredictable results.
The schema for the apply results from each of the supported algorithms is listed in subsequent sections. The case_id column will match the case identifier column name provided by you. The type of the incoming case_id column is preserved in the APPLY output.
The table containing the APPLY results for all classification models has the same schema. For numerical targets, the results table has the following schema:
(case_id VARCHAR2/NUMBER, prediction NUMBER, probability NUMBER)
For categorical targets, the results table has the following schema:
(case_id VARCHAR2/NUMBER, prediction VARCHAR2, probability NUMBER)
The results table has the following schema:
(case_id VARCHAR2/NUMBER, prediction NUMBER)
Clustering is an unsupervised mining function, and hence there are no targets. The results of an APPLY operation contain simply the cluster identifier corresponding to a case, and the associated probability. The results table has the following schema:
(case_id VARCHAR2/NUMBER, cluster_id NUMBER, probability NUMBER)
Feature extraction is also an unsupervised mining function, and hence there are no targets. The results of an APPLY operation contain simply the feature identifier corresponding to a case, and the associated match quality. The results table has the following schema:
(case_id VARCHAR2/NUMBER, feature_id NUMBER, match_quality NUMBER)
BEGIN
  /* build a model with name census_model.
   * (See example under CREATE_MODEL)
   */
  /* if build data was pre-processed in any manner,
   * perform the same pre-processing steps on the
   * scoring data also.
   * (See examples in the section on DBMS_DATA_MINING_TRANSFORM)
   */
  /* apply the model to data to be scored */
  dbms_data_mining.apply(
    model_name          => 'census_model',
    data_table_name     => 'census_2d_apply',
    case_id_column_name => 'person_id',
    result_table_name   => 'census_apply_result');
END;
/

-- View Apply Results
SELECT case_id, prediction, probability
  FROM census_apply_result;
This procedure creates a mining model for a given mining function using a specified mining algorithm.
DBMS_DATA_MINING.CREATE_MODEL (
   model_name           IN VARCHAR2,
   mining_function      IN VARCHAR2,
   data_table_name      IN VARCHAR2,
   case_id_column_name  IN VARCHAR2,
   target_column_name   IN VARCHAR2 DEFAULT NULL,
   settings_table_name  IN VARCHAR2 DEFAULT NULL,
   data_schema_name     IN VARCHAR2 DEFAULT NULL,
   settings_schema_name IN VARCHAR2 DEFAULT NULL);
The data provided to all subsequent operations such as APPLY must match the data provided to CREATE_MODEL in schema and relevant content. If the data provided as input to CREATE_MODEL has been pre-processed, then the data input to subsequent operations such as APPLY must also be pre-processed using the statistics from the CREATE_MODEL data pre-processing. The case identifier column is not considered to be a mining attribute during CREATE_MODEL.
You can view the default settings for each algorithm through GET_DEFAULT_SETTINGS. You can override the defaults by providing a settings table specifying your choice of mining algorithm and relevant overriding algorithm settings.
Once a model has been built, information about the attributes used for the model build can be obtained from GET_MODEL_SIGNATURE. To inspect or review model contents, you can use any of the algorithm-specific GET_MODEL_DETAILS functions.
The behavior of CREATE_MODEL is analogous to a SQL DDL CREATE operation. It contends with RENAME_MODEL and DROP_MODEL operations.
Note: The CREATE_MODEL operation creates a set of system tables in the owner's schema to store the patterns and information that constitute a mining model for a particular algorithm. The names of these tables have the prefix DM$. The number, schema, and content of these tables are Oracle proprietary and may change from release to release. You must not direct any queries or updates against these system tables.
Assume that you need to build a classification model using the Support Vector Machines algorithm.
/* prepare a settings table to override default
 * settings (Naïve Bayes is the default classifier)
 */
CREATE TABLE census_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

BEGIN
  /* indicate that SVM is the chosen classifier */
  INSERT INTO census_settings VALUES (
    dbms_data_mining.algo_name,
    dbms_data_mining.algo_support_vector_machines);

  /* override the default value for complexity factor */
  INSERT INTO census_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.svms_complexity_factor, TO_CHAR(0.081));
  COMMIT;

  /* build a model with name census_model */
  dbms_data_mining.create_model(
    model_name          => 'census_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'census_2d_build',
    case_id_column_name => 'person_id',
    target_column_name  => 'class',
    settings_table_name => 'census_settings');
END;
/
This procedure computes the confusion matrix for a classification model and also provides the accuracy of the model. See Oracle Data Mining Concepts for a description of the confusion matrix.
The inputs are a table containing the results of applying the model on the test data, and a table that contains only the target and case identifier columns from the test data.
DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
   accuracy                    OUT NUMBER,
   apply_result_table_name     IN VARCHAR2,
   target_table_name           IN VARCHAR2,
   case_id_column_name         IN VARCHAR2,
   target_column_name          IN VARCHAR2,
   confusion_matrix_table_name IN VARCHAR2,
   score_column_name           IN VARCHAR2 DEFAULT 'PREDICTION',
   score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY',
   cost_matrix_table_name      IN VARCHAR2 DEFAULT NULL,
   apply_result_schema_name    IN VARCHAR2 DEFAULT NULL,
   target_schema_name          IN VARCHAR2 DEFAULT NULL,
   cost_matrix_schema_name     IN VARCHAR2 DEFAULT NULL);
You can also provide a cost matrix as an optional input in order to have the cost of predictions reflected in the results.
It is important to note that the data inputs to COMPUTE_CONFUSION_MATRIX do not always have to be generated using APPLY. As long as the schema of the two input tables matches the ones discussed in this section, with appropriate content, the procedure can provide the confusion matrix and accuracy as outputs. The quality of the results depends on the quality of the data.
The data provided for testing your classification model must match the data provided to CREATE_MODEL in schema and relevant content. If the data provided as input to CREATE_MODEL has been pre-processed, then the data input to APPLY must also be pre-processed using the statistics from the CREATE_MODEL data pre-processing.
Before you use the COMPUTE_CONFUSION_MATRIX procedure, you must prepare two data input streams from your test data.
First, you must APPLY the model on your test data. The parameter apply_result_table_name in the COMPUTE_CONFUSION_MATRIX procedure represents the table that will be generated in your schema as a result of the APPLY operation.
Next, you must create a table or view containing only the case identifier column and the target column in its schema. The parameter target_table_name reflects this input. The schema of this view or table for a numerical target attribute is:
(case_identifier_column_name VARCHAR2/NUMBER, target_column_name NUMBER)
The schema of this view or table for a categorical target attribute is:
(case_identifier_column_name VARCHAR2/NUMBER, target_column_name VARCHAR2)
You must provide the name of the table in which the confusion matrix is to be generated. The resulting fixed schema table will always be created in the schema owning the model.
For numerical target attributes, the confusion matrix table will have the schema:
(actual_target_value NUMBER, predicted_target_value NUMBER, value NUMBER)
For categorical target attributes, the confusion matrix table will have the schema:
(actual_target_value VARCHAR2, predicted_target_value VARCHAR2, value NUMBER)
Assume that you have built a classification model census_model using the Naive Bayes algorithm, and you have been provided the test data in a table called census_2d_test, with case identifier column name person_id, and the target column name class.
DECLARE
  v_sql_stmt VARCHAR2(4000);
  v_accuracy NUMBER;
BEGIN
  /* apply the model census_model on test data */
  dbms_data_mining.apply(
    model_name          => 'census_model',
    data_table_name     => 'census_2d_test',
    case_id_column_name => 'person_id',
    result_table_name   => 'census_test_result');

  /* create a view from test data that projects
   * only the case identifier and target column */
  v_sql_stmt :=
    'CREATE VIEW census_2d_test_view AS ' ||
    'SELECT person_id, class FROM census_2d_test';
  EXECUTE IMMEDIATE v_sql_stmt;

  /* now compute the confusion matrix from the two
   * data streams, also providing a cost matrix as input. */
  dbms_data_mining.compute_confusion_matrix (
    accuracy                    => v_accuracy,
    apply_result_table_name     => 'census_test_result',
    target_table_name           => 'census_2d_test_view',
    case_id_column_name         => 'person_id',
    target_column_name          => 'class',
    confusion_matrix_table_name => 'census_confusion_matrix',
    cost_matrix_table_name      => 'census_cost_matrix');
  dbms_output.put_line('Accuracy of the model: ' || v_accuracy);
END;
/

-- View the confusion matrix using Oracle SQL
SELECT actual_target_value, predicted_target_value, value
  FROM census_confusion_matrix;
This procedure computes a lift table for a given positive target for a classification model. See Oracle Data Mining Concepts for a description of lift.
The inputs are a table containing the results of applying the model on the test data, and a table that contains only the target and case identifier columns from the test data.
DBMS_DATA_MINING.COMPUTE_LIFT (
   apply_result_table_name     IN VARCHAR2,
   target_table_name           IN VARCHAR2,
   case_id_column_name         IN VARCHAR2,
   target_column_name          IN VARCHAR2,
   lift_table_name             IN VARCHAR2,
   positive_target_value       IN VARCHAR2,
   score_column_name           IN VARCHAR2 DEFAULT 'PREDICTION',
   score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY',
   num_quantiles               IN NUMBER DEFAULT 10,
   cost_matrix_table_name      IN VARCHAR2 DEFAULT NULL,
   apply_result_schema_name    IN VARCHAR2 DEFAULT NULL,
   target_schema_name          IN VARCHAR2 DEFAULT NULL,
   cost_matrix_schema_name     IN VARCHAR2 DEFAULT NULL);
You can also provide a cost matrix as an optional input to have the cost of predictions reflected in the results.
It is important to note that the data inputs to COMPUTE_LIFT do not always have to be generated using APPLY. As long as the schema of the two input tables matches the ones discussed in this section, with appropriate content, the procedure can provide the lift table as output. The quality of the results depends on the quality of the data.
The data provided for testing your classification model must match the data provided to CREATE_MODEL in schema and relevant content. If the data provided as input to CREATE_MODEL has been pre-processed, then the data input to APPLY must also be pre-processed using the statistics from the CREATE_MODEL data pre-processing.
Before you use the COMPUTE_LIFT procedure, you must prepare two data input streams from your test data.
First, you must APPLY the model on your test data. The parameter apply_result_table_name in the COMPUTE_LIFT procedure represents the table that will be generated in your schema as a result of the APPLY operation.
Next, you must create a table or view containing only the case identifier column and the target column in its schema. The parameter target_table_name reflects this input. The schema of this view or table for a numerical target attribute is:
(case_identifier_column_name VARCHAR2/NUMBER, target_column_name NUMBER)
The schema of this view or table for a categorical target attribute is:
(case_identifier_column_name VARCHAR2/NUMBER, target_column_name VARCHAR2)
You must provide the name of the table in which the lift table is to be generated. The resulting fixed schema table is always created in the schema that owns the model.
The resulting lift table will have the following schema:
(quantile_number NUMBER, quantile_total_count NUMBER, quantile_target_count NUMBER, percent_records_cumulative NUMBER, lift_cumulative NUMBER, target_density_cumulative NUMBER, targets_cumulative NUMBER, non_targets_cumulative NUMBER, lift_quantile NUMBER, target_density NUMBER)
The output columns are explained in Oracle Data Mining Concepts.
Assume that you have built a classification model census_model using the Naive Bayes algorithm, and you have been provided the test data in a table called census_2d_test, with case identifier column name person_id, and the target column name class.
DECLARE
  v_sql_stmt VARCHAR2(4000);
BEGIN
  /* apply the model census_model on test data */
  dbms_data_mining.apply(
    model_name          => 'census_model',
    data_table_name     => 'census_2d_test',
    case_id_column_name => 'person_id',
    result_table_name   => 'census_test_result');

  /* next create a view from test data that projects
   * only the case identifier and target column */
  v_sql_stmt :=
    'CREATE VIEW census_2d_test_view AS ' ||
    'SELECT person_id, class FROM census_2d_test';
  EXECUTE IMMEDIATE v_sql_stmt;

  /* now compute lift with the default 10 quantiles
   * from the two data streams */
  dbms_data_mining.compute_lift (
    apply_result_table_name => 'census_test_result',
    target_table_name       => 'census_2d_test_view',
    case_id_column_name     => 'person_id',
    target_column_name      => 'class',
    lift_table_name         => 'census_lift',
    positive_target_value   => '1',
    cost_matrix_table_name  => 'census_cost_matrix');
END;
/

-- View the lift table contents using SQL
SELECT * FROM census_lift;
This procedure computes the receiver operating characteristic (ROC) for a binary classification model. See Oracle Data Mining Concepts for a description of the receiver operating characteristic.
The inputs are a table containing the results of applying the model on the test data, and a table that contains only the target and case identifier columns from the test data.
DBMS_DATA_MINING.COMPUTE_ROC (
   roc_area_under_curve        OUT NUMBER,
   apply_result_table_name     IN VARCHAR2,
   target_table_name           IN VARCHAR2,
   case_id_column_name         IN VARCHAR2,
   target_column_name          IN VARCHAR2,
   roc_table_name              IN VARCHAR2,
   positive_target_value       IN VARCHAR2,
   score_column_name           IN VARCHAR2 DEFAULT 'PREDICTION',
   score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY',
   apply_result_schema_name    IN VARCHAR2 DEFAULT NULL,
   target_schema_name          IN VARCHAR2 DEFAULT NULL);
It is important to note that the data inputs to COMPUTE_ROC do not always have to be generated using APPLY. As long as the schema of the two input tables matches the ones discussed in this section, with appropriate content, the procedure can provide the ROC table as output. The quality of the results depends on the quality of the data.
The data provided for testing your classification model must match the data provided to CREATE_MODEL in schema and relevant content. If the data provided as input to CREATE_MODEL has been pre-processed, then the data input to APPLY must also be pre-processed using the statistics from the CREATE_MODEL data pre-processing.
Before you use the COMPUTE_ROC procedure, you must prepare two data input streams from your test data.
First, you must APPLY the model on your test data. The parameter apply_result_table_name in the COMPUTE_ROC procedure represents the table that will be generated in your schema as a result of the APPLY operation.
Next, you must create a table or view containing only the case identifier column and the target column in its schema. The parameter target_table_name reflects this input. The schema of this view or table for a numerical target attribute is:
(case_identifier_column_name VARCHAR2/NUMBER, target_column_name NUMBER)
The schema of this view or table for a categorical target attribute is:
(case_identifier_column_name VARCHAR2/NUMBER, target_column_name VARCHAR2)
You must provide the name of the table in which the ROC table is to be generated. The resulting fixed schema table will always be created in the schema that owns the model. The resulting ROC table will have the following schema:
(probability NUMBER, true_positives NUMBER, false_negatives NUMBER, false_positives NUMBER, true_negatives NUMBER, true_positive_fraction NUMBER, false_positive_fraction NUMBER)
The output columns are explained in Table 23-24.
The typical use scenario is to examine the true_positive_fraction and false_positive_fraction to determine the most desirable probability_threshold. This threshold is then used to predict class values in subsequent apply operations. For example, to identify positively predicted cases in probability rank order from an apply result table, given a probability_threshold:
SELECT case_id_column_name
  FROM apply_result_table_name
 WHERE probability > probability_threshold
 ORDER BY probability DESC;
There are two procedures one might use to identify the most desirable probability_threshold. One procedure applies when the relative cost of positive class versus negative class prediction errors is known to the user; the other applies when such costs are not well known. In the first instance, one can apply the relative costs to the ROC table to compute the minimum-cost probability_threshold. Suppose the relative cost ratio, Positive Class Error Cost / Negative Class Error Cost = 20. Then execute a query like:
WITH
  cost AS (
    SELECT probability probability_threshold,
           20 * false_negatives + false_positives cost
      FROM ROC_table),
  minCost AS (
    SELECT min(cost) minCost
      FROM cost)
SELECT max(probability_threshold) probability_threshold
  FROM cost, minCost
 WHERE cost = minCost;
If relative costs are not well known, the user simply scans the values in the table (in sorted order) and makes a determination about which of the displayed trade-offs (misclassified positives versus misclassified negatives) is most desirable:
SELECT * FROM ROC_table ORDER BY probability;
Assume that you have built a classification model census_model using the SVM algorithm, and you have been provided the test data in a table called census_2d_test, with case identifier column name person_id, and the target column name class.
DECLARE
  v_sql_stmt         VARCHAR2(4000);
  v_area_under_curve NUMBER;
BEGIN
  /* apply the model census_model on test data */
  DBMS_DATA_MINING.apply(
    model_name          => 'census_model',
    data_table_name     => 'census_2d_test',
    case_id_column_name => 'person_id',
    result_table_name   => 'census_test_result');

  /* next create a view from test data that projects
   * only the case identifier and target column */
  v_sql_stmt :=
    'CREATE VIEW census_2d_test_view AS ' ||
    'SELECT person_id, class FROM census_2d_test';
  EXECUTE IMMEDIATE v_sql_stmt;

  /* now compute the receiver operating characteristics
   * from the two data streams ('1' is assumed to be the
   * positive target value) */
  DBMS_DATA_MINING.compute_roc (
    roc_area_under_curve    => v_area_under_curve,
    apply_result_table_name => 'census_test_result',
    target_table_name       => 'census_2d_test_view',
    case_id_column_name     => 'person_id',
    target_column_name      => 'class',
    roc_table_name          => 'census_roc',
    positive_target_value   => '1');
END;
/

-- View the ROC results using Oracle SQL
SELECT * FROM census_roc;
This procedure drops an existing mining model from the user's schema.
DBMS_DATA_MINING.DROP_MODEL (model_name IN VARCHAR2);
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
You can use DROP_MODEL to drop an existing mining model.
The behavior of DROP_MODEL is similar to a SQL DDL DROP operation. It blocks RENAME_MODEL and CREATE_MODEL operations. It neither blocks nor is blocked by APPLY, which is a SQL query-like operation that does not update any model data.
If an APPLY operation is using a model, and you attempt to drop the model during that time, the DROP will succeed and APPLY will return indeterminate results. This is in line with the conventional behavior in the RDBMS, where DDL operations do not block on query operations.
Assume the existence of a model census_model. The following example shows how to drop this model.
BEGIN
  DBMS_DATA_MINING.drop_model(model_name => 'census_model');
END;
/
This procedure exports specified data mining models into a dump file set.
DBMS_DATA_MINING.EXPORT_MODEL (
   filename     IN VARCHAR2,
   directory    IN VARCHAR2,
   model_filter IN VARCHAR2 DEFAULT NULL,
   filesize     IN VARCHAR2 DEFAULT NULL,
   operation    IN VARCHAR2 DEFAULT NULL,
   remote_link  IN VARCHAR2 DEFAULT NULL,
   jobname      IN VARCHAR2 DEFAULT NULL);
Use EXPORT_MODEL to export all or specific data mining models from the source. This procedure creates a dump file set that includes one or more files. The location of the dump files is specified by the directory parameter, which names a directory object created before this procedure is called. The user must have WRITE privileges on this directory object. The dump file name must be unique. When the export operation completes successfully, the dump file name is automatically expanded to filename01.dmp, even if there is only one file in the dump set.
A log file is created for every successful export operation in the directory mapped by directory. If jobname is specified, the log is named jobname.log. If jobname is set to NULL (the default), the log is named USERNAME_exp_nnnn.log, where nnnn is a number. If jobname is provided, it must be unique. The parameter model_filter is used to specify the models to be exported.
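For instance, to export only two named models (a sketch; the WHERE-clause-style filter expression and the model names shown here are assumed forms):

BEGIN
  -- export two named models into a single dump file set
  DBMS_DATA_MINING.EXPORT_MODEL(
    filename     => 'models_01',
    directory    => 'DM_DUMP',
    model_filter => 'name IN (''NB_MODEL_01'',''SVM_MODEL_01'')');
END;
/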
EXPORT_MODEL is not mutually exclusive with DDL-like operations such as CREATE_MODEL, DROP_MODEL, and RENAME_MODEL. In other words, if an export operation is currently underway and the model is dropped at that time, the results in the exported dump file are unpredictable.
The following example shows how to export one model from the current user schema into a dump file. The dump file will be located in the operating system directory /home/models, which is mapped to a directory object DM_DUMP. The directory object DM_DUMP must be created before executing the sample, and the user must be granted WRITE privileges on it.
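The directory object itself might be set up as follows (a sketch; this is run by a suitably privileged user, and dmuser is a placeholder account name):

-- map the directory object to the file system path
CREATE OR REPLACE DIRECTORY DM_DUMP AS '/home/models';
-- grant the exporting user access to the directory object
GRANT READ, WRITE ON DIRECTORY DM_DUMP TO dmuser;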
DECLARE
  job_name VARCHAR2(32) := 'model_exp_001';
  filename VARCHAR2(32);
BEGIN
  filename := job_name;
  DBMS_DATA_MINING.export_model(
    filename     => filename,
    directory    => 'DM_DUMP',
    model_filter => 'name=''NB_MODEL_01''',
    filesize     => '30M',
    operation    => 'EXPORT',
    remote_link  => NULL,
    jobname      => job_name);
  dbms_output.put_line(
    'Export_model ' || job_name || ' completed successfully!');
END;
/
This table function returns the rules from an Association model. The rows are an enumeration of the rules generated during the creation of the model.
DBMS_DATA_MINING.GET_ASSOCIATION_RULES (
   model_name IN VARCHAR2)
 RETURN DM_Rules pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
Pragmas: RNDS, WNDS, RNPS, WNPS
The table function pipes out rows with the schema:
(rule_id INTEGER, antecedent DM_Predicates, consequent DM_Predicates, rule_support NUMBER, rule_confidence NUMBER)
DM_Predicates is a collection of DM_Predicate objects. When un-nested, each object maps to a row of the form:
(attribute_name VARCHAR2(30), conditional_operator CHAR(2), attribute_num_value NUMBER, attribute_str_value VARCHAR2(4000), attribute_support NUMBER, attribute_confidence NUMBER)
The significance of piped output is that each row is materialized by the table function as soon as it is read from model storage, without any latency or wait for the generation of the complete DM_Rules object. All GET operations use pipelining. For more information on pipelined, parallel table functions, consult the PL/SQL User's Guide and Reference.
The examples shown in this section describe how to un-nest the values from each of the columns discussed earlier.
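For instance, assuming the market_model built in the example below, the antecedent predicates of each rule might be un-nested with a correlated TABLE() join (a sketch):

-- un-nest the antecedent predicates of each rule
SELECT R.rule_id,
       P.attribute_name,
       P.conditional_operator,
       P.attribute_str_value,
       P.attribute_support
  FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('market_model')) R,
       TABLE(R.antecedent) P;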
The following example demonstrates an Association model build followed by an invocation of the GET_ASSOCIATION_RULES table function from Oracle SQL.
-- prepare a settings table to override default settings
CREATE TABLE market_settings AS
SELECT *
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'ASSO_%';

BEGIN
  -- update the value of the minimum confidence
  UPDATE market_settings
     SET setting_value = TO_CHAR(0.081)
   WHERE setting_name = DBMS_DATA_MINING.asso_min_confidence;

  /* build an AR model */
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'market_model',
    mining_function     => DBMS_DATA_MINING.ASSOCIATION,
    data_table_name     => 'market_build',
    case_id_column_name => 'item_id',
    target_column_name  => NULL,
    settings_table_name => 'market_settings');
END;
/

-- View the (unformatted) rules from SQL*Plus
SELECT rule_id, antecedent, consequent, rule_support, rule_confidence
  FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('market_model'));
-- see ardemo.sql for retrieving formatted rules
This table function returns all the default settings for all mining functions and algorithms supported in the DBMS_DATA_MINING package.
DBMS_DATA_MINING.GET_DEFAULT_SETTINGS
 RETURN DM_Model_Settings pipelined;
Return Value | Description |
---|---|
DM_Model_Settings | Represents a set of rows with schema: (setting_name VARCHAR2(30), setting_value VARCHAR2(128)) |
Pragmas: RNDS, WNDS, RNPS, WNPS
The table function pipes out rows with the schema:
(setting_name VARCHAR2(30), setting_value VARCHAR2(128))
This function is particularly useful if you do not know what settings are associated with a particular function or algorithm, and you want to override some or all of them.
For example, if you want to override some or all of k-Means clustering settings, you can create a settings table as shown, and update individual settings as required.
CREATE TABLE mysettings AS
SELECT *
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'KMNS%';

-- now update individual settings as required
UPDATE mysettings
   SET setting_value = TO_CHAR(0.02)
 WHERE setting_name = dbms_data_mining.kmns_min_pct_attr_support;
This table function returns a set of rows that represent the frequent itemsets from an Association model. The rows are an enumeration of the frequent itemsets generated during the creation of the model. For a detailed description of frequent itemsets, consult Oracle Data Mining Concepts.
DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS (
   model_name IN VARCHAR2)
 RETURN DM_ItemSets pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
Return Value | Description |
---|---|
DM_ItemSets | Represents a set of rows with schema: (itemsets_id NUMBER, items DM_Items, support NUMBER, number_of_items NUMBER) |
Pragmas: RNDS, WNDS, RNPS, WNPS
The table function pipes out rows with the schema:
(itemsets_id NUMBER, items DM_items, support NUMBER, number_of_items NUMBER)
DM_Items is a nested table of VARCHAR2 strings representing individual item names.
The examples shown in this section describe how to un-nest the values from each of the columns discussed in the preceding sections.
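For instance, assuming the market_model built in the example below, the items nested table can be un-nested with a correlated TABLE() join, reading each string through the COLUMN_VALUE pseudocolumn (a sketch):

-- un-nest the item names of each frequent itemset
SELECT F.itemsets_id,
       I.COLUMN_VALUE AS item_name,
       F.support
  FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS('market_model')) F,
       TABLE(F.items) I;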
The following example demonstrates an Association model build followed by an invocation of the GET_FREQUENT_ITEMSETS table function from Oracle SQL.
-- prepare a settings table to override default settings
CREATE TABLE market_settings AS
SELECT *
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'ASSO_%';

BEGIN
  -- update the value of the minimum confidence
  UPDATE market_settings
     SET setting_value = TO_CHAR(0.081)
   WHERE setting_name = DBMS_DATA_MINING.asso_min_confidence;

  /* build an AR model */
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'market_model',
    mining_function     => DBMS_DATA_MINING.ASSOCIATION,
    data_table_name     => 'market_build',
    case_id_column_name => 'item_id',
    target_column_name  => NULL,
    settings_table_name => 'market_settings');
END;
/

-- View the (unformatted) itemsets from SQL*Plus
SELECT itemsets_id, items, support, number_of_items
  FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS('market_model'));
This table function returns a set of rows that provide the details of an Adaptive Bayes Network model.
DBMS_DATA_MINING.GET_MODEL_DETAILS_ABN (
   model_name IN VARCHAR2)
 RETURN DM_ABN_Details pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
Pragmas: RNDS, WNDS, RNPS, WNPS
The table function pipes out rows with the schema:
(rule_id INTEGER, antecedent DM_Predicates, consequent DM_Predicates, rule_support NUMBER, rule_confidence NUMBER)
The examples shown in this section describe how to un-nest the values from each of the columns discussed earlier.
This function returns details only for single-feature ABN models.
The following example demonstrates an ABN model build followed by an invocation of the GET_MODEL_DETAILS_ABN table function from Oracle SQL.
-- prepare a settings table to override the default algorithm
CREATE TABLE abn_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

BEGIN
  INSERT INTO abn_settings VALUES (
    dbms_data_mining.algo_name,
    dbms_data_mining.algo_adaptive_bayes_network);
  COMMIT;

  /* create a model; classification requires a target,
   * so a target column must be supplied ('class' here
   * is a placeholder name) */
  DBMS_DATA_MINING.CREATE_MODEL (
    model_name          => 'abn_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'abn_build',
    case_id_column_name => 'id',
    target_column_name  => 'class',
    settings_table_name => 'abn_settings');
END;
/

-- View the (unformatted) results from SQL*Plus
SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_ABN('abn_model'));
This table function returns a set of rows that provide the details of a k-Means clustering model. The rows are an enumeration of the clustering patterns generated during the creation of the model.
DBMS_DATA_MINING.GET_MODEL_DETAILS_KM (
   model_name IN VARCHAR2)
 RETURN DM_Clusters pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
Pragmas: RNDS, WNDS, RNPS, WNPS
The table function pipes out rows with the schema:
(id INTEGER, record_count NUMBER, parent NUMBER, tree_level NUMBER, dispersion NUMBER, child DM_Children, centroid DM_Centroids, histogram DM_Histograms, rule DM_Rule)
Each DM_Children value is a nested table of cluster IDs.
Each DM_Centroid value is itself a nested table that, when un-nested, can return rows of the form:
(attribute_name VARCHAR2(30), mean NUMBER, mode_value VARCHAR2(30), variance NUMBER)
Each DM_Histogram value is itself a nested table that, when un-nested, can return rows of the form:
(attribute_name VARCHAR2(30), bin_id NUMBER, lower_bound NUMBER, upper_bound NUMBER, label VARCHAR2(4000), count NUMBER)
Each DM_Rule value is an object with two nested DM_Predicates columns:
(rule_id INTEGER, antecedent DM_Predicates, consequent DM_Predicates, rule_support NUMBER, rule_confidence NUMBER)
DM_Predicates is a collection of DM_Predicate objects. When un-nested, each object maps to a row of the form:
(attribute_name VARCHAR2(30), conditional_operator CHAR(2), attribute_num_value NUMBER, attribute_str_value VARCHAR2(4000), attribute_support NUMBER, attribute_confidence NUMBER)
The examples shown in this section describe how to un-nest the values from each of the columns discussed earlier.
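For instance, assuming the eight_clouds model built in the example below, the centroid column might be un-nested as follows (a sketch):

-- un-nest the centroid attributes of each cluster
SELECT T.id,
       C.attribute_name,
       C.mean,
       C.variance
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_KM('eight_clouds')) T,
       TABLE(T.centroid) C;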
The following example demonstrates a k-Means clustering model build followed by an invocation of the GET_MODEL_DETAILS_KM table function from Oracle SQL.
-- create a settings table seeded with the default k-Means settings
CREATE TABLE cluster_settings AS
SELECT *
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'KMNS%';

BEGIN
  UPDATE cluster_settings
     SET setting_value = TO_CHAR(3)
   WHERE setting_name = DBMS_DATA_MINING.kmns_block_growth;

  /* build a k-Means clustering model */
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'eight_clouds',
    mining_function     => DBMS_DATA_MINING.CLUSTERING,
    data_table_name     => 'eight_clouds_build',
    case_id_column_name => 'id',
    target_column_name  => NULL,
    settings_table_name => 'cluster_settings');
END;
/

-- View the (unformatted) rules from SQL*Plus
SELECT id, record_count, parent, tree_level, dispersion,
       child, centroid, histogram, rule
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_KM('eight_clouds'));
This table function returns a set of rows that provide the details of a Naive Bayes model. The rows are an enumeration of the patterns generated during the creation of the model.
DBMS_DATA_MINING.GET_MODEL_DETAILS_NB (
   model_name IN VARCHAR2)
 RETURN DM_NB_Details pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
Pragmas: RNDS, WNDS, RNPS, WNPS
The table function pipes out rows with the schema:
(target_attr_name VARCHAR2(30), target_attr_num_value NUMBER, target_attr_str_value VARCHAR2(4000), prior_probability NUMBER, conditionals DM_Conditionals)
Each DM_Conditionals value is itself a nested table that, when un-nested, can return rows of type DM_Conditional, of the form:
(attribute_name VARCHAR2(30), attribute_num_value NUMBER, attribute_str_value VARCHAR2(4000), conditional_probability NUMBER)
The examples shown in this section describe how to un-nest the values from each of the columns discussed earlier.
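For instance, for the census_model used below, the conditionals column might be un-nested as follows (a sketch):

-- un-nest the conditional probabilities for each target value
SELECT T.target_attr_name,
       T.target_attr_str_value,
       C.attribute_name,
       C.attribute_str_value,
       C.conditional_probability
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NB('census_model')) T,
       TABLE(T.conditionals) C;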
Assume that you have built a classification model census_model using the Naive Bayes algorithm. You can retrieve the model details as shown in this example.
-- You can view the Naive Bayes model details in many ways.
-- Consult the Oracle Application Developer's Guide -
-- Object-Relational Features for different ways of
-- accessing Oracle Objects.

-- View the (unformatted) details from SQL*Plus
SELECT target_attr_name, target_attr_num_value, target_attr_str_value,
       prior_probability, conditionals
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NB('census_model'));
See nbdemo.sql for generation of formatted rules.
This table function returns a set of rows that provide the details of a Non-Negative Matrix Factorization model.
DBMS_DATA_MINING.GET_MODEL_DETAILS_NMF (
   model_name IN VARCHAR2)
 RETURN DM_NMF_Details pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
Pragmas: RNDS, WNDS, RNPS, WNPS
The table function pipes out rows with the schema:
(feature_id INTEGER, attribute_set DM_NMF_Attribute_Set)
Each DM_NMF_Attribute_Set value is itself a nested table that, when un-nested, can return rows of the form:
(attribute_name VARCHAR2, attribute_value VARCHAR2(4000), coefficient NUMBER)
The examples shown in this section describe how to un-nest the values from each of the columns discussed earlier.
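For instance, for the my_nmf_model used below, the attribute_set column might be un-nested as follows (a sketch):

-- un-nest the attribute coefficients of each feature
SELECT F.feature_id,
       A.attribute_name,
       A.attribute_value,
       A.coefficient
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NMF('my_nmf_model')) F,
       TABLE(F.attribute_set) A;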
Assume you have built an NMF model called my_nmf_model. You can retrieve model details as shown:
-- View (unformatted) details from SQL*Plus
SELECT feature_id, attribute_set
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NMF('my_nmf_model'));
This table function returns a set of rows that provide the details of a Support Vector Machines model. This is applicable only for classification or regression models built using a linear kernel. For any other kernel, the table function raises error ORA-40215.
DBMS_DATA_MINING.GET_MODEL_DETAILS_SVM (
   model_name IN VARCHAR2)
 RETURN DM_SVM_Linear_Coeff pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
The table function pipes out rows with the schema:
(class VARCHAR2(4000), attribute_set DM_SVM_Attribute_Set)
class represents the classification target values. For regression targets, class is NULL. For each classification target value (for classification models), or once only (for regression models), the DM_SVM_Attribute_Set value is itself a nested table that, when un-nested, can return rows of the form:
(attribute_name VARCHAR2(30), attribute_value VARCHAR2(4000), coefficient NUMBER)
The examples shown in this section describe how to un-nest the values from each of the columns discussed earlier.
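For instance, for the SVM_Clas_sample model built in the example below, the attribute_set column might be un-nested as follows (a sketch):

-- un-nest the linear coefficients per target class
SELECT S.class,
       A.attribute_name,
       A.attribute_value,
       A.coefficient
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_SVM('SVM_Clas_sample')) S,
       TABLE(S.attribute_set) A;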
The following example demonstrates an SVM model build followed by an invocation of the GET_MODEL_DETAILS_SVM table function from Oracle SQL:
-- Create SVM model
BEGIN
  dbms_data_mining.create_model(
    model_name          => 'SVM_Clas_sample',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'svmc_sample_build_prepared',
    case_id_column_name => 'id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'svmc_sample_settings');
END;
/

-- Display model details
SELECT *
  FROM TABLE(dbms_data_mining.get_model_details_svm('SVM_Clas_sample'))
 ORDER BY class;
This table function returns the list of settings that were used to build the model.
DBMS_DATA_MINING.GET_MODEL_SETTINGS(
   model_name IN VARCHAR2)
 RETURN DM_Model_Settings pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
Return Value | Description |
---|---|
DM_Model_Settings | Represents a set of rows with schema: (setting_name VARCHAR2(30), setting_value VARCHAR2(128)) |
Pragmas: RNDS, WNDS, RNPS, WNPS
You can use this table function to determine the settings that were used to build the model. This is for informational purposes only -- you cannot alter the model to adopt new settings.
Assume that you have built a classification model census_model using the Naive Bayes algorithm. You can retrieve the model settings using Oracle SQL as follows:
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('census_model'));
This table function returns the model signature, which is a set of rows that provide the name and type of each attribute required as input to the APPLY operation.
The case identifier is not considered a mining attribute. For classification and regression models, the target attribute is also not considered part of the model signature.
DBMS_DATA_MINING.GET_MODEL_SIGNATURE(
   model_name IN VARCHAR2)
 RETURN DM_Model_Signature pipelined;
Parameter | Description |
---|---|
model_name | Name of the model (see Rules and Limitations). |
Return Value | Description |
---|---|
DM_Model_Signature | Represents a set of rows with schema: (attribute_name VARCHAR2(30), attribute_type VARCHAR2(106)) |
Pragmas: RNDS, WNDS, RNPS, WNPS
You can use this table function to get the list of attributes used for building the model. This is particularly helpful to describe a model when an APPLY operation on test or scoring data is done a significant time period after the model is built, or after it is imported into another schema.
Assume that you have built a classification model census_model using the Naive Bayes algorithm. You can retrieve the model signature using Oracle SQL as follows:
SELECT attribute_name, attribute_type
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SIGNATURE('census_model'));
This procedure imports specified data mining models from a dump file set or from a remote database.
DBMS_DATA_MINING.IMPORT_MODEL (
   filename     IN VARCHAR2,
   directory    IN VARCHAR2,
   model_names  IN VARCHAR2 DEFAULT NULL,
   operation    IN VARCHAR2 DEFAULT NULL,
   remote_link  IN VARCHAR2 DEFAULT NULL,
   jobname      IN VARCHAR2 DEFAULT NULL,
   schema_remap IN VARCHAR2 DEFAULT NULL);
Use IMPORT_MODEL to import all or specific data mining models from a dump file set or from a remote database by means of a database link.
The dump file set must have been created by expdp or the EXPORT_MODEL procedure. The dump files must be located in the directory mapped by the directory object. The user must have READ and WRITE privileges on the directory object. In order to import models from a dump created by another user, you must have the IMP_FULL_DATABASE privilege or the SYS role.
When the import operation completes successfully, a log is created in the directory mapped by the directory object. If jobname is provided, the log is named jobname.log. If jobname is NULL (the default), the log is named username_imp_nnnn.log, where nnnn is a number. If the dump file set contains multiple files, you can use "%U" in the filename. For example, if your dump file set contains 3 files, archive01.dmp, archive02.dmp, and archive03.dmp, you may specify filename=>'archive%U'.
Use the parameter model_names to specify which models to import. The use of model_names depends on whether the source is a dump file set or a remote database. When importing from dump files, model names are the sole useful reference; in this case, you can assign model_names either a comma-delimited list of model names or a WHERE clause with names explicitly specified. When importing from a remote database, model_names can be specified in the same fashion as in the EXPORT_MODEL procedure. Details are listed in Table 23-48.
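For instance, to import a single named model from the dump file set (a sketch; the model name shown is hypothetical, and the exact form expected by model_names should be checked against the parameter table above):

BEGIN
  -- import only the named model from the dump file set
  DBMS_DATA_MINING.IMPORT_MODEL(
    filename    => 'model_exp_001.dmp',
    directory   => 'DM_DUMP',
    model_names => 'NB_MODEL_01');
END;
/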
IMPORT_MODEL is not mutually exclusive with DDL-like operations such as CREATE_MODEL, DROP_MODEL, and RENAME_MODEL. In other words, if an import operation is underway and a model of the same name is dropped at that time, the results in the schema are unpredictable.
The following example shows how user MARY imports all models from a dump file, model_exp_001.dmp, created by user SCOTT. The dump file is located in the file system directory mapped to a directory object called DM_DUMP. If user MARY does not have the IMP_FULL_DATABASE privilege, IMPORT_MODEL will raise an error.
-- import all models
DECLARE
  file_name VARCHAR2(40);
BEGIN
  file_name := 'model_exp_001.dmp';
  DBMS_DATA_MINING.import_model(
    filename     => file_name,
    directory    => 'DM_DUMP',
    schema_remap => 'SCOTT:MARY');
  dbms_output.put_line(
    'DBMS_DATA_MINING.import_model of all models from SCOTT done!');
END;
/
This procedure ranks the results of an APPLY operation based on a top-N specification for predictive and descriptive model results. For classification models, you can provide a cost matrix as input, and obtain the ranked results with costs applied to the predictions.
DBMS_DATA_MINING.RANK_APPLY (
   apply_result_table_name      IN VARCHAR2,
   case_id_column_name          IN VARCHAR2,
   ranked_apply_result_tab_name IN VARCHAR2,
   top_N                        IN INTEGER DEFAULT 1,
   cost_matrix_table_name       IN VARCHAR2 DEFAULT NULL,
   apply_result_schema_name     IN VARCHAR2 DEFAULT NULL,
   cost_matrix_schema_name      IN VARCHAR2 DEFAULT NULL);
You can use RANK_APPLY to generate ranked apply results based on a top-N filter, with costs applied to the predictions if a cost matrix is provided.
The behavior of RANK_APPLY is similar to that of APPLY with respect to other DDL-like operations such as CREATE_MODEL, DROP_MODEL, and RENAME_MODEL. The procedure does not depend on the model; the only input of relevance is the apply results generated in a fixed-schema table from APPLY.
The main intended use of RANK_APPLY is for the generation of the final APPLY results against the scoring data in a production setting. You can apply the model against test data using APPLY, compute various test metrics against various cost matrix tables, and use the candidate cost matrix for RANK_APPLY.
The schema for the apply results from each of the supported algorithms is listed in subsequent sections. The case_id column will be the same case identifier column as that of the apply results.
For numerical targets, the ranked results table will have the schema as shown:
(case_id VARCHAR2/NUMBER, prediction NUMBER, probability NUMBER, cost NUMBER, rank INTEGER)
For categorical targets, the ranked results table will have the following schema:
(case_id VARCHAR2/NUMBER, prediction VARCHAR2, probability NUMBER, cost NUMBER, rank INTEGER)
Clustering is an unsupervised mining function, and hence there are no targets. The results of an APPLY operation contain simply the cluster identifier corresponding to a case, and the associated probability. The cost matrix is not considered here. The ranked results table has the schema shown, and contains the cluster IDs ranked by top-N.
(case_id VARCHAR2/NUMBER, cluster_id NUMBER, probability NUMBER, rank INTEGER)
Feature extraction is also an unsupervised mining function, and hence there are no targets. The results of an APPLY operation contain simply the feature identifier corresponding to a case, and the associated match quality. The cost matrix is not considered here. The ranked results table has the schema shown, and contains the feature IDs ranked by top-N.
(case_id VARCHAR2/NUMBER, feature_id NUMBER, match_quality NUMBER, rank INTEGER)
BEGIN
  /* build a model with name census_model.
   * (See example under CREATE_MODEL)
   */
  /* if build data was pre-processed in any manner,
   * perform the same pre-processing steps on apply
   * data also.
   * (See examples in the section on DBMS_DATA_MINING_TRANSFORM)
   */
  /* rank the apply results, with costs applied */
  DBMS_DATA_MINING.rank_apply(
    apply_result_table_name      => 'census_apply',
    case_id_column_name          => 'person_id',
    ranked_apply_result_tab_name => 'census_ranked_apply',
    top_N                        => 3,
    cost_matrix_table_name       => 'census_cost_matrix');
END;
/

-- View Apply Results
SELECT * FROM census_ranked_apply;
This procedure renames a mining model to a specified new name.
DBMS_DATA_MINING.RENAME_MODEL (
   model_name     IN VARCHAR2,
   new_model_name IN VARCHAR2);
Parameter | Description |
---|---|
model_name | Old name of the model (see Rules and Limitations). |
new_model_name | New name of the model (see Rules and Limitations). |
You can use RENAME_MODEL to rename an existing mining model.
The behavior of RENAME_MODEL is similar to a SQL DDL RENAME operation. It blocks DROP_MODEL and CREATE_MODEL operations. It does not block APPLY, which is a SQL query-like operation that does not update any model data.
If an APPLY operation is using a model, and you attempt to rename the model during that time, the RENAME will succeed and APPLY will return indeterminate results. This is in line with the conventional behavior in the RDBMS, where DDL operations do not block on query operations.
Assume the existence of a model census_model. The following example shows how to rename this model.
BEGIN
  DBMS_DATA_MINING.rename_model(
    model_name     => 'census_model',
    new_model_name => 'census_new_model');
END;
/