Oracle Data Mining Application Developer's Guide 10g Release 1 (10.1), Part Number B10699-01
This chapter discusses topics related to writing data mining programs with the PL/SQL interface. It provides an overview of the steps required to perform basic Oracle Data Mining tasks. For detailed examples of how to perform these tasks, see the sample programs in Chapter 5.
This chapter does not include detailed descriptions of the PL/SQL subprograms. For that information, see the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM chapters in the PL/SQL Packages and Types Reference.
The DBMS_DATA_MINING package provides PL/SQL support for in-database data mining. You can use the package to build a mining model, test the model, and apply the model to your data to obtain predictive and descriptive information.
See also: DBMS_DATA_MINING_TRANSFORM, a supplied package that supports data preprocessing for mining (described in the PL/SQL Packages and Types Reference).
DBMS_DATA_MINING subprograms are presented in the PL/SQL Packages and Types Reference. Sample code is described in Chapter 5 of this manual; the code itself is in the dm/demo/sample/plsql directory.
The development methodology for data mining using the DBMS_DATA_MINING API is divided into two phases. The first phase covers your application and data analysis and design. The second phase involves developing a mining application using the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages.
You can prepare your data using the DBMS_DATA_MINING_TRANSFORM package, another third-party tool, or direct SQL or PL/SQL utility scripts, in a manner suitable for the chosen mining function and algorithm. An important caveat is that the three datasets referred to above (build, test, and scoring data) must be prepared in an identical manner for mining results to be meaningful. This preparation is an optional step.

The DBMS_DATA_MINING package creates a mining model for a mining function using a specified mining algorithm that supports the function. The algorithm can be influenced by specific algorithm settings.
A model is identified by its name. Like tables in the database, a model has storage associated with it. But unlike a table, the form, shape, and content of this storage is opaque to the user. However, the user can view the contents of a model -- that is, the patterns and rules that constitute a mining model -- using the algorithm-specific GET_MODEL_DETAILS functions. In addition, the dm_user_models view provides the model size in megabytes.
The DBMS_DATA_MINING package supports Classification, Regression, Association Rules, Clustering, and Feature Extraction. You can specify your choice of mining function through a parameter to the CREATE_MODEL procedure.
Each mining function can be implemented using one or more algorithms. Table 4-1 provides a list of supported algorithms. Oracle assumes a default algorithm for each mining function, but you can override this default through an explicit setting in the settings table.
Each algorithm has one or more settings or parameters that influence the way it builds the model. Oracle assumes a default set of algorithm settings for each mining algorithm. These defaults are available for your review through the table function GET_DEFAULT_SETTINGS. To override the defaults, you must provide the choice of the algorithm and the settings for the algorithm through a settings table input to the CREATE_MODEL procedure.
The settings table is a simple relational table with a fixed schema. You can choose the name of the settings table, but the column names and their types must be defined as specified below.
(setting_name   VARCHAR2(30),
 setting_value  VARCHAR2(128))
The values provided in the settings table override the default values assumed by the system. The values inserted into the setting_name column are one or more of several constants defined in the DBMS_DATA_MINING package. Depending on what the setting name denotes, the value for the setting_value column can be a predefined constant or the actual numerical value corresponding to the setting itself. The setting_value column is defined to be VARCHAR2, so you must cast numerical inputs to string using the TO_CHAR() function before input into the settings table.
Table 4-2 through Table 4-7 list the various setting names and the valid setting values, with a brief explanation of each setting.
You can create a settings table as shown in the example below for an SVM classification model, and edit the individual values using SQL DML.
CREATE TABLE drugstore_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

BEGIN
  -- override the default for convergence tolerance for SVM Classification
  INSERT INTO drugstore_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.svms_conv_tolerance, TO_CHAR(0.081));
  COMMIT;
END;
/
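With the settings in place, the model build itself is a single call to CREATE_MODEL. The following is a minimal sketch: the training table drugstore_customers, its columns, and the model name are hypothetical placeholders, and the settings table is the one created above.

BEGIN
  -- Build a classification model using the settings table above.
  -- To use SVM rather than the default classification algorithm, the settings
  -- table should also set algo_name to algo_support_vector_machines.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'drugstore_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'drugstore_customers',   -- hypothetical
    case_id_column_name => 'customer_id',           -- hypothetical
    target_column_name  => 'responder',             -- hypothetical
    settings_table_name => 'drugstore_settings');
END;
/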
The table function GET_DEFAULT_SETTINGS provides all the default settings for mining functions and algorithms. If you intend to override all the default settings, you can create a seed settings table and edit it using DML.
CREATE TABLE drug_store_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'SVMS_%';
-- update the values using appropriate DML
You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS, as shown in the example below.
CREATE TABLE my_new_model_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));
Consult Oracle Data Mining Concepts for an explanation of the prior probabilities table. You can specify a prior probabilities table as an optional function setting when building classification models.
You must create the prior probabilities table using the fixed schema shown below. For numerical targets, use the following schema:
target_value       NUMBER
prior_probability  NUMBER
For categorical targets, use the following schema:
target_value       VARCHAR2
prior_probability  NUMBER
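For instance, a priors table for a hypothetical binary categorical target could be created and populated as follows; the table name census_priors matches the settings example below, while the target values and probabilities are illustrative assumptions.

CREATE TABLE census_priors (
  target_value      VARCHAR2(30),
  prior_probability NUMBER);

-- hypothetical priors for a binary target
INSERT INTO census_priors VALUES ('0', 0.8);
INSERT INTO census_priors VALUES ('1', 0.2);
COMMIT;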
Next, provide the name of the prior probabilities table as input to the setting_value column in the settings table, with the corresponding value for the setting_name column set to DBMS_DATA_MINING.clas_priors_table_name, as shown below:
BEGIN
  INSERT INTO drugstore_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.clas_priors_table_name, 'census_priors');
  COMMIT;
END;
/
Consult Oracle Data Mining Concepts for an explanation of the cost matrix. You must create a cost matrix table with the fixed schema shown below. For numerical targets, use the following schema:
actual_target_value     NUMBER
predicted_target_value  NUMBER
cost                    NUMBER
For categorical targets, use the following schema:
actual_target_value     VARCHAR2
predicted_target_value  VARCHAR2
cost                    NUMBER
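For example, a cost matrix for a hypothetical binary categorical target could be created as follows; the table name and the cost values are illustrative. The resulting table can then be supplied to the COMPUTE procedures or to RANK_APPLY, as described below.

CREATE TABLE drugstore_costs (
  actual_target_value    VARCHAR2(30),
  predicted_target_value VARCHAR2(30),
  cost                   NUMBER);

-- hypothetical costs: false negatives cost more than false positives
INSERT INTO drugstore_costs VALUES ('0', '0', 0);
INSERT INTO drugstore_costs VALUES ('0', '1', 1);
INSERT INTO drugstore_costs VALUES ('1', '0', 4);
INSERT INTO drugstore_costs VALUES ('1', '1', 0);
COMMIT;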
The DBMS_DATA_MINING package enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.
The data input to each COMPUTE procedure in the package is the result generated from applying the model on test data. In addition, if you also provide a cost matrix as an input, the COMPUTE procedure generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it against the same test data for every iteration.
Once you arrive at an optimal cost matrix, you can then input this cost matrix to the RANK_APPLY procedure along with the results of APPLY on your scoring data. RANK_APPLY will provide your new data ranked by cost.
The DBMS_DATA_MINING package supports several classes of mining operations:

- Operations that create, drop, and rename models: CREATE_MODEL, DROP_MODEL, RENAME_MODEL.
- Operations that apply a model to new data: APPLY.
- Operations that rank APPLY results, or other data that is presented using the same schema as that of APPLY results: RANK_APPLY.
- Operations that provide information about a model: GET_MODEL_DETAILS, GET_MODEL_SETTINGS, GET_MODEL_SIGNATURE.
- Operations that compute test metrics based on the results of the APPLY operation on the test data, or based on any other data that is presented using the same schema as that of the APPLY results: COMPUTE_CONFUSION_MATRIX, COMPUTE_LIFT, and COMPUTE_ROC.
- Operations that export and import models: EXPORT_MODEL, IMPORT_MODEL.
- GET_DEFAULT_SETTINGS, which returns default values for all the settings.

Of these, the first set represents DDL-like operations, and the export and import operations represent utilities. The rest are query-like operations in that they do not modify the model. Besides these operations, additional capabilities are also provided as part of the Oracle Data Mining installation.
Mining results are either returned as result sets or persisted as fixed schema tables.
The CREATE_MODEL operation creates a mining model. The viewable contents of a mining model are provided to you through the GET_MODEL_DETAILS functions for each supported algorithm. In addition, GET_MODEL_SIGNATURE and GET_MODEL_SETTINGS provide descriptive information about the model.
The APPLY operation creates and populates a fixed schema table with a given name. The schema of this table varies based on the particular mining function, algorithm, and target attribute type -- numerical or categorical.
The RANK_APPLY operation takes this results table as input and generates another table with results ranked based on a top-N input, and, for classification models, also based on cost. The schema of this table varies based on the particular mining function, algorithm, and target attribute type -- numerical or categorical.
The COMPUTE routines provided in the package cover the most commonly used metrics for classification. They are not tied to a particular model -- they can compute the metrics from any meaningful data input, as long as the schema of the input tables fits the specification of the apply results table and the targets table. Consult any of the COMPUTE descriptions in this document for details.
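As an illustration, the sketch below computes a confusion matrix from hypothetical apply results and targets tables, optionally taking the cost matrix into account; verify the exact parameter list against the COMPUTE_CONFUSION_MATRIX entry in the PL/SQL Packages and Types Reference.

DECLARE
  v_accuracy NUMBER;
BEGIN
  DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX(
    accuracy                    => v_accuracy,
    apply_result_table_name     => 'drugstore_apply_results',  -- hypothetical
    target_table_name           => 'drugstore_test_targets',   -- hypothetical
    case_id_column_name         => 'customer_id',
    target_column_name          => 'responder',
    confusion_matrix_table_name => 'drugstore_confusion',
    cost_matrix_table_name      => 'drugstore_costs');         -- optional
  DBMS_OUTPUT.PUT_LINE('Accuracy: ' || v_accuracy);
END;
/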
The most commonly used metrics for regression models are root mean square error and mean absolute error. You can use the simple SQL queries provided below to compute these metrics by replacing the italicized tokens with table and column names appropriate for your application.
SELECT SQRT(AVG((A.prediction - B.target_column_name) *
                (A.prediction - B.target_column_name))) rmse
  FROM apply_results_table A,
       targets_table B
 WHERE A.case_id_column_name = B.case_id_column_name;
Given the targets_table generated from the test data with the schema:

(case_id_column_name  VARCHAR2,
 target_column_name   NUMBER)

and the apply results table for regression with the schema:

(case_id_column_name  VARCHAR2,
 prediction           NUMBER)

and a normalization table (optional) with the schema:

(attribute_name  VARCHAR2(30),
 scale           NUMBER,
 shift           NUMBER)

the query for mean absolute error is:
SELECT /*+ PARALLEL(T) PARALLEL(A) */
       AVG(ABS(T.actual_value - A.prediction)) mean_absolute_error
  FROM (SELECT B.case_id_column_name,
               (B.target_column_name * N.scale + N.shift) actual_value
          FROM targets_table B,
               normalization_table N
         WHERE N.attribute_name = 'target_column_name') T,
       apply_results_table A
 WHERE A.case_id_column_name = T.case_id_column_name;
You can fill in the italicized values with the actual column and table names chosen by you. If the data has not undergone normalization transformation, you can eliminate those references from the subquery. See dm/demo/sample/plsql/svmrdemo.sql for an example.
The data input for all the mining operations should be through standard relational tables, where each row in the table represents a case, and each column in the table uniquely represents a mining attribute. We call this format single-record case.
A table in the Oracle RDBMS supports up to 1,000 columns. The DBMS_DATA_MINING package requires a mandatory case identifier column, which implies that you can provide up to 999 mining attributes using this representation; this is adequate for most business applications.
The notable exceptions to the common scenario are applications in domains such as bioinformatics and text mining, where data is characteristically wide and shallow -- with relatively few cases numbering in the thousands, but with several thousand mining attributes.
You can provide such data in a multi-record case format, where (attribute, value) pairs are grouped into collections (nested tables) associated with a given case-id. You must use the fixed collection types DM_Nested_Numericals and DM_Nested_Categoricals to define columns that represent collections of numerical attributes and categorical attributes, respectively.
A few caveats on the use of multi-record case format:
From a data modeling perspective, the multi-record case format is analogous to the dimension-fact relationship commonly found in OLAP applications, where the dimension and fact tables are aggregated based on a common key for generating reports using materialized views or SQL queries.
The main distinction between the OLAP fact table and the multi-record case is this: Each row in the fact table represents a column value, whereas each row in the multi-record collection represents an attribute name (paired with its corresponding value in another column in the nested table). The two fixed collection types are provided to highlight and enforce this distinction. These two types are defined with the reasonable assumption that mining attributes of the same type (numericals versus categoricals) are generally grouped together, just as a fact table contains values that logically correspond to the same entity.
Oracle strongly recommends that you present your multi-record case data using object views, and present the view as input to CREATE_MODEL and APPLY operations. Apart from the benefit of providing all your mining attributes through a single row-source without impacting their physical data storage, the view acts as a join specification on the underlying tables that can be used by the server for efficiently accessing your data.
We illustrate this discussion on wide data with a real-world example of an analytical pipeline for brain tumor research. The datasets of relevance are gene expression data from the instruments (fact table), and the clinical data about the patient (dimension table). The schemas of these tables are provided below.
clinical_data (dimension table):

(case_id  NUMBER,
 name     VARCHAR2(30),
 type     VARCHAR2(30),
 subtype  VARCHAR2(30),
 gender   CHAR(1),
 age      NUMBER,
 status   VARCHAR2(30))
gene_expression_data (fact table):

(case_id  NUMBER,
 gene     VARCHAR2(30),
 expr     NUMBER)
Let us consider building a model with status as the prediction target, and gender, age, and expr as the predictors. You can provide the build data input using an object view that combines the clinical_data table and the gene_expression_data table, with the schema:
(case_id    NUMBER,
 gender     CHAR(1),
 age        NUMBER,
 gene_expr  DM_Nested_Numericals,
 status     VARCHAR2(30))
The query for constructing such an object view is relatively simple in Oracle SQL:
CREATE OR REPLACE VIEW gene_expr_build AS
SELECT C.case_id,
       C.gender,
       C.age,
       CAST(MULTISET(SELECT gene, expr
                       FROM gene_expression_data
                      WHERE case_id = C.case_id)
            AS DM_Nested_Numericals) gene_expr,
       C.status
  FROM clinical_data C;
Now gene_expr_build can be provided as the input data_table_name for CREATE_MODEL.
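A hypothetical build call using this view might look as follows; the model name is an assumption, and no settings table is passed, so the default classification algorithm and settings apply.

BEGIN
  -- Build a classification model directly on the object view.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'brain_tumor_model',   -- hypothetical
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'gene_expr_build',
    case_id_column_name => 'case_id',
    target_column_name  => 'status');
END;
/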
Oracle Data Mining handles categorical and numerical attributes. See Oracle Data Mining Concepts for explanation and examples of attribute types.
The DBMS_DATA_MINING package infers the attribute type based on the type of the column representing the mining attribute. You must define all columns representing numerical attributes to be of type NUMBER. You must define all columns representing categorical attributes to be of type VARCHAR2 or CHAR.
In situations where you have numbers that are deemed categorical data by your application logic, you must typecast such attribute values using the TO_CHAR() operator and populate them into a VARCHAR2 or CHAR column representing the mining attribute.
In situations where you have numeric attribute values stored in a CHAR or VARCHAR2 column, you must typecast those attribute values using the TO_NUMBER() operator and store them in a NUMBER column.
If persisting these transformed values in another table is not a viable option, you could also create a view with these conversions in place, and provide the view name to represent the training data input for the CREATE_MODEL operation.
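For example, a view along the following lines performs both conversions in place; the source table and all column names are hypothetical.

CREATE OR REPLACE VIEW customer_mining_view AS
SELECT customer_id,
       TO_CHAR(zip_code)     zip_code,   -- numeric codes treated as categorical
       TO_NUMBER(income_str) income,     -- numbers stored as text treated as numerical
       response
  FROM customer_source;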
Classification and Regression algorithms require a target attribute. The package supports single targets. The target attribute for all classification algorithms can be numerical or categorical. SVM Regression supports only numerical target attributes.
All mining operations assume the incoming data to be already prepared and transformed. You can preprocess your data using the DBMS_DATA_MINING_TRANSFORM package, by using any third-party tool, or by using your own homegrown utilities based on SQL and/or PL/SQL scripts.
If you preprocess or transform the input data, you must also reverse-transform the results from APPLY to map the data back to its original form.
See the PL/SQL Packages and Types Reference for a description of the DBMS_DATA_MINING_TRANSFORM package.
The suggestion to customers with wide data is to perform transformations before forming the wide-data views, on data in either single-record (2D) or multi-record (transactional) format. It is possible to use dbmsdmxf for the multi-record format; in this case, all attributes are transformed in a similar way. In most cases, attributes in transactional (multi-record) form are of the same scale, and thus this approach works. Otherwise, the suggestion is to split the data into sets of similar items and then transform them separately.
If you have mining attributes numbering in the few hundreds, and your application requires them to be represented as columns in the same row of the table, here are some data storage issues to consider. For a table with many columns, the key question to consider is the (average) row length, not the number of columns. Having more than 255 columns in a table built with a smaller block size typically results in intra-block chaining. Oracle stores multiple row pieces in the same block owing to design pragmatics, but the overhead to maintain the column information is minimal as long as all row pieces fit in a single data block. If the rows do not fit in a single data block, you may consider using a larger database block size (or using multiple block sizes in the same database). For more details, consult Oracle Data Mining Concepts and the Oracle Database Performance Tuning Guide.
The use of DBMS_DATA_MINING is subject to the following rules and limitations:

- The CREATE_MODEL and APPLY operations require a case identifier column of type VARCHAR2, CHAR, or NUMBER. The length of the case identifier values is limited to 128 bytes.
- Unlike Oracle schema objects, DBMS_DATA_MINING does not treat model names in a case-sensitive manner; all model names are up-cased internally.
- Objects with names prefixed DM_, DM$P, DM$J, and DM$T are system objects that should not be read or updated by the user. Direct queries against these tables may be possible, but the queries may provide meaningless results. Direct DML against these tables can irrevocably corrupt your model. Hence Oracle strongly recommends that you do not perform any operations on these system objects.
- GRANT and REVOKE of read and update privileges on a mining model across user schemas are not yet supported. The user can only read and update models that are generated in a given schema owned by that user. Models in one schema or database instance can be exported to other schemas or database instances using EXPORT_MODEL.

The DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages use the data types shown in Table 4-8.
Table 4-9 through Table 4-16 list the constants to be used for various settings in the settings table.
Constant | Purpose
---|---
nabs_singleton_threshold | Setting value for singleton threshold for Naive Bayes
nabs_pairwise_threshold | Setting value for pair-wise threshold for Naive Bayes
Table 4-17 lists the errors generated by DBMS_DATA_MINING.
Table 4-18 lists the user views provided by Oracle to obtain information about the models generated using DBMS_DATA_MINING.
User View | Purpose
---|---
dm_user_models | Lists all models in a given user's schema.
Data mining models can be moved between Oracle databases or schemas. For example, in an organization, data mining specialists may build and test data mining models in a data mining lab. After models are built and tested in the lab, the chosen model may be moved to a scoring engine used by applications. Because the data mining lab and the scoring engine usually do not share the same database, the model must be exported from the lab and then imported into the scoring engine. Model export and import can be a routine procedure: as new data accumulate, data mining specialists will build and test new models, and newer and better models will be loaded onto the scoring engine on a regular basis. DBAs will also want to back up and restore models in their routine database maintenance.
Native export and import of data mining models are supported in the following scenarios:

- When a DBA exports a full database using the utility expdp, all the existing data mining models in the database are exported. By the same token, when a DBA imports a database dump using the utility impdp, all the data mining models in the dump are restored.
- When a user or DBA exports a schema using expdp, all the data mining models in the schema are exported. When the user or DBA imports the schema dump using impdp, all the models in the dump are imported.
- A user can export specified models using DBMS_DATA_MINING.EXPORT_MODEL and import specified models using DBMS_DATA_MINING.IMPORT_MODEL.
The use of model export and import is subject to the following limitations:

- Only models created through the DBMS_DATA_MINING interface are supported.

Prerequisites for model export are as follows:

- A directory object must be available to receive the dump file set; see CREATE_DIRECTORY in the PL/SQL Packages and Types Reference.
- For full database and schema exports, the Data Pump export utility expdp must be used.

Prerequisites for model import are as follows:

- The dump file set must have been produced by expdp or EXPORT_MODEL.
- For full database and schema imports, the Data Pump import utility impdp must be used.
There are two ways to export models:
To export all data mining models in a user schema, you can either run expdp or use EXPORT_MODEL with the parameter model_filter set to NULL. Note the difference between the two operations: when you run expdp to export the schema, all objects in the schema including data mining models are exported; when you run EXPORT_MODEL with a NULL model_filter, only the models are exported.
There are also two ways to import models from the dump file:
In general, if you want to import the full dump file set, run impdp. This imports all database objects and data, including all data mining models, from the dump file set. If you want to import models only, use IMPORT_MODEL. When the parameter model_filter is set to NULL, all models are imported from the dump. If valid model names are assigned in model_filter, this operation imports only the named models from the dump file set.
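A corresponding import sketch follows, under the same assumptions about the directory object and file name; again, consult the PL/SQL Packages and Types Reference for the authoritative signature.

BEGIN
  -- Import only the named model from the dump file set.
  DBMS_DATA_MINING.IMPORT_MODEL(
    filename     => 'drugstore_models.dmp',
    directory    => 'DM_DUMP',
    model_filter => 'name = ''DRUGSTORE_MODEL''');
END;
/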
Data mining model export and import jobs create and manage two temporary tables in the user schema: DM$P_MODEL_EXPIMP_TEMP and DM$P_MODEL_TABKEY_TEMP. Users should not manipulate these tables.