Oracle Data Mining Application Developer's Guide 10g Release 1 (10.1) Part Number B10699-01
This chapter provides an overview of the steps required to perform basic Oracle Data Mining tasks and discusses the following topics related to writing data mining programs using the Java interface:
Detailed demo programs are provided as part of the installation.
Oracle Data Mining depends on the following Java archive (.jar) files:
$ORACLE_HOME/dm/lib/odmapi.jar
$ORACLE_HOME/jdbc/lib/ojdbc14.jar
$ORACLE_HOME/jlib/orai18n.jar
$ORACLE_HOME/lib/xmlparserv2.jar
These files must be in your CLASSPATH to compile and execute Oracle Data Mining programs.
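As a minimal sketch, the four archive paths can be joined into a CLASSPATH value programmatically. The class name `OdmClasspath` and the fallback ORACLE_HOME value are illustrative assumptions; only the four relative paths come from this chapter. The code uses current Java syntax rather than the Java version that shipped with 10g.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class OdmClasspath {
    // Locations of the archives relative to ORACLE_HOME, as listed in this chapter.
    static final String[] JARS = {
        "dm/lib/odmapi.jar",
        "jdbc/lib/ojdbc14.jar",
        "jlib/orai18n.jar",
        "lib/xmlparserv2.jar"
    };

    // Joins the archive paths with the given separator (":" on UNIX, ";" on Windows).
    static String build(String oracleHome, String separator) {
        List<String> entries = new ArrayList<>();
        for (String jar : JARS) {
            entries.add(oracleHome + "/" + jar);
        }
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        String home = System.getenv("ORACLE_HOME");
        if (home == null) {
            home = "/u01/app/oracle"; // hypothetical fallback, for illustration only
        }
        System.out.println(build(home, File.pathSeparator));
    }
}
```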
This section describes the steps required to perform several common data mining tasks using Oracle Data Mining. Data mining tasks are usually performed in a particular sequence. The following sequence is typical:
All work in Oracle Data Mining is done using MiningTask objects.
To implement a sequence of dependent task executions, you may periodically check the asynchronous task execution status using the getCurrentStatus method or block for completion using the waitForCompletion method. You can then perform the dependent task after completion of the previous task.
For example, follow these steps to perform the build, test, and compute lift sequence:
After the build completes, test the model using a ClassificationTestTask or RegressionTestTask object. Either periodically check the status of the test operation or block until the task completes.
After the test completes, compute lift using a MiningComputeLiftTask object.
You now have (with a little luck) a model that you can use in your data mining application.
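The dependency pattern described here (start a task, wait for it, then start the next one) can be sketched without the ODM library. Everything below is a stand-in: `StubTask`, the `Status` enum, and `runSequence` are hypothetical names, and only the method names `execute`, `getCurrentStatus`, and `waitForCompletion` echo the ones mentioned in this section. The real ODM signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class TaskSequenceSketch {
    enum Status { EXECUTING, SUCCESS, ERROR }

    // Hypothetical stand-in for an asynchronous mining task; this is NOT the ODM API.
    static class StubTask {
        final String name;
        private CompletableFuture<Void> future;

        StubTask(String name) {
            this.name = name;
        }

        // Queues the work on a background thread and returns immediately.
        void execute(Runnable work) {
            future = CompletableFuture.runAsync(work);
        }

        // Non-blocking status check, suitable for periodic polling after execute().
        Status getCurrentStatus() {
            if (!future.isDone()) {
                return Status.EXECUTING;
            }
            return future.isCompletedExceptionally() ? Status.ERROR : Status.SUCCESS;
        }

        // Blocks until the queued work finishes, then reports the final status.
        Status waitForCompletion() {
            try {
                future.join();
            } catch (RuntimeException e) {
                // The failure is reported through the returned status.
            }
            return getCurrentStatus();
        }
    }

    // Runs the named steps in order and stops at the first failure,
    // so a dependent step never starts before its predecessor succeeds.
    static List<String> runSequence(List<String> names, List<Runnable> work) {
        List<String> completed = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            StubTask task = new StubTask(names.get(i));
            task.execute(work.get(i));
            if (task.waitForCompletion() != Status.SUCCESS) {
                break;
            }
            completed.add(task.name);
        }
        return completed;
    }
}
```

Blocking with `waitForCompletion` keeps the sequence simple; polling `getCurrentStatus` instead lets the caller do other work between checks.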
Different algorithms require different preparation and preprocessing of the input data. Some algorithms require normalization; some require binning (discretization). In the Java interface the algorithms can prepare data automatically.
This section summarizes the steps required for different data preparation methodologies supported by the ODM Java API.
The ODM Java interface supports automated data preparation. If the user specifies active unprepared attributes, the data mining server automatically prepares the data for those attributes.
In the case of algorithms that use binning as the default data preparation, bin boundary tables are created and stored as part of the model. The model's bin boundary tables are used for the data preparation of the dataset used for testing or scoring using that model. In the case of algorithms that use normalization as the default data preparation, the normalization details are stored as part of the model. The model uses those details for preparing the dataset used for testing or scoring using that model.
The algorithms that use binning as the default data preparation are Naive Bayes, Adaptive Bayes Network, Association, k-Means, and O-Cluster. The algorithms that use normalization are Support Vector Machines and Non-Negative Matrix Factorization. For normalization, the ODM Java interface supports only the automated method.
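To illustrate why preparation details are stored with the model, the sketch below learns scaling parameters from build data and reuses them unchanged on scoring data. Min-max scaling is chosen purely for illustration; this chapter does not state which normalization ODM applies, and `StoredNormalization` is a hypothetical class, not part of the ODM API.

```java
class StoredNormalization {
    private final double min;
    private final double max;

    private StoredNormalization(double min, double max) {
        this.min = min;
        this.max = max;
    }

    // Learns the scaling parameters from the build data only.
    static StoredNormalization fit(double[] buildValues) {
        if (buildValues.length == 0) {
            throw new IllegalArgumentException("build data must not be empty");
        }
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double v : buildValues) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        return new StoredNormalization(lo, hi);
    }

    // Applies the stored parameters to any later value (test or scoring data).
    // Values outside the build range map outside [0, 1].
    double apply(double value) {
        if (max == min) {
            return 0.0; // constant attribute: avoid division by zero
        }
        return (value - min) / (max - min);
    }
}
```

Because the parameters are fixed at build time, a scoring value is always scaled the same way the model saw its training data.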
For certain distributions, you may get better results if you bin the data before the model is built.
External binning consists of two steps:
Specifically, the steps for external binning are as follows:
Create DiscretizationSpecification objects to specify the bin boundary specifications for the attributes.
Call the Transformation.createDiscretizationTables method to create bin boundaries.
Call the Transformation.discretize method to discretize/bin the data.
Note that in the case of external binning, the user needs to bin the data consistently for all build, test, apply, and lift operations.
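Binning consistently means reusing one set of boundaries for every dataset. The sketch below shows that idea with a plain Java class. `FixedBinning` is hypothetical and does not use the Transformation API; the convention that each boundary is an inclusive upper bound is also an assumption.

```java
class FixedBinning {
    // Upper bin boundaries in ascending order (inclusive).
    // Values above the last boundary fall into one extra final bin.
    private final double[] upperBounds;

    FixedBinning(double[] upperBounds) {
        this.upperBounds = upperBounds.clone();
    }

    // Returns the 1-based bin number for a single value.
    int binOf(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                return i + 1;
            }
        }
        return upperBounds.length + 1;
    }

    // Bins a whole column with the same boundaries.
    int[] discretize(double[] values) {
        int[] bins = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            bins[i] = binOf(values[i]);
        }
        return bins;
    }
}
```

A single `FixedBinning` instance would be applied to the build, test, apply, and lift datasets alike, so that a given value always lands in the same bin.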
Embedded binning allows users to define their own customized automated binning. The binning strategy is specified by providing a bin boundary table that is produced by the bin specification creation step of external binning.
Specifically, the steps for embedded binning are as follows:
Create DiscretizationSpecification objects to specify the bin boundary specifications for the attributes.
Call the Transformation.createDiscretizationTables method to create bin boundaries.
Call the setUserSuppliedDiscretizationTables method in the LogicalDataSpecification object to attach the user-created bin boundary tables to the mining function settings object.
Keep in mind that because binning can affect a model's accuracy, it is best done by an expert familiar with the data being binned and the problem to be solved. However, if no additional information is available to inform binning decisions, or if the goal is an initial exploration of the data and problem, ODM can bin the data using default settings, either by explicit user action or as part of the model build.
ODM groups the data into 5 bins by default. For categorical attributes, the 5 most frequent values are assigned to 5 different bins, and all remaining values are assigned to a 6th bin. For numerical attributes, the values are divided into 5 bins of equal size according to their order.
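The categorical rule can be sketched in a few lines: the most frequent values each get their own bin, and everything else shares one extra bin. `TopNCategoricalBinning` is a hypothetical class, and breaking frequency ties alphabetically is an assumption made only to keep the result deterministic. The numerical rule is left out because the text does not say whether "equal size" refers to bin width or to the number of values per bin.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopNCategoricalBinning {
    // Maps each distinct value to a 1-based bin number.
    // The topN most frequent values get bins 1..topN; all others share bin topN + 1.
    static Map<String, Integer> assignBins(List<String> values, int topN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }

        List<String> distinct = new ArrayList<>(counts.keySet());
        // Most frequent first; ties broken alphabetically (an assumption).
        distinct.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });

        Map<String, Integer> bins = new LinkedHashMap<>();
        for (int i = 0; i < distinct.size(); i++) {
            bins.put(distinct.get(i), i < topN ? i + 1 : topN + 1);
        }
        return bins;
    }
}
```

With `topN` set to 5, an attribute with seven distinct values ends up in six bins: one per frequent value plus the shared bin for the two rarest values.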
After the data is processed, you can build a model.
For an illustration of binning, see Appendix A.
This section summarizes the steps required to build a model.
Create a MiningFunctionSettings object.
Create a MiningBuildTask object.
Periodically call the getCurrentStatus method to get the status of the task. Alternatively, use the waitForCompletion method to wait until all asynchronous activity for the task completes.
After successful completion of the task, a model object is created in the database.
Models based on data sets with a large number of attributes can have very long build times. To minimize build time, you can use ODM Attribute Importance to identify the critical attributes and then build a model using only these attributes.
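The selection idea can be sketched as follows: given an importance score per attribute, keep only the highest-scoring ones for the model build. The class `AttributeSelection`, the score values, and the cutoff `k` are all hypothetical; how ODM reports importance and how a cutoff is chosen are not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AttributeSelection {
    // Returns the names of the k attributes with the highest importance scores,
    // ordered from most to least important.
    static List<String> topAttributes(Map<String, Double> importance, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(importance.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

        List<String> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            selected.add(entries.get(i).getKey());
        }
        return selected;
    }
}
```

Only the returned attributes would then be left active in the function settings for the second, faster build.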
Identify the most important attributes by building an Attribute Importance model as follows:
After identifying the important attributes, build a model using the selected attributes as follows:
Use adjustAttributeUsage, defined on MiningFunctionSettings. Only the attributes returned by Attribute Importance will be active for model building.
This section summarizes the steps required to test a classification or a regression model.
For classification, use ClassificationTestTask; for regression, use RegressionTestTask.
Call the execute method; the execute method queues the work for asynchronous execution and returns an execution handle to the caller.
Periodically call the getCurrentStatus method to get the status of the task. As an alternative, use the waitForCompletion method to wait until all asynchronous activity for the task completes.
For classification problems, results are represented using a ClassificationTestResult object; for regression problems, results are represented using a RegressionTestResult object.
This section summarizes the steps required to compute lift using a classification model.
Create a MiningLiftTask object.
Call the execute method; the execute method queues the work for asynchronous execution and returns an execution handle to the caller.
Periodically call the getCurrentStatus method to get the status of the task. As an alternative, use the waitForCompletion method to wait until all asynchronous activity for the task completes.
After successful completion of the task, a MiningLiftResult object is created in the DMS.
You make predictions by applying a model to new data, that is, by scoring the data.
Any table that you score (apply a model to) must have the same format as the table used to build the model. If you build a model using a table that is in multi-record (transactional) format, any table that you apply that model to must be in multi-record format. Similarly, if the table used to build the model was in nontransactional (single-record) format, any table to which you apply the model must be in nontransactional format.
Note that you can score a single record, which must also be in the same format as the table used to build the model.
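To make the two formats concrete, the sketch below unpivots one single-record row into multi-record rows. The three-column layout (case identifier, attribute name, value) and the names `RecordFormats` and `TransactionalRow` are assumptions for illustration; the exact column layout ODM expects is not described in this section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RecordFormats {
    // One row of a multi-record (transactional) table.
    static class TransactionalRow {
        final int caseId;
        final String attribute;
        final String value;

        TransactionalRow(int caseId, String attribute, String value) {
            this.caseId = caseId;
            this.attribute = attribute;
            this.value = value;
        }
    }

    // Turns one single-record row (attribute name -> value) into
    // one transactional row per attribute, all sharing the same case id.
    static List<TransactionalRow> toMultiRecord(int caseId, Map<String, String> singleRecord) {
        List<TransactionalRow> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : singleRecord.entrySet()) {
            rows.add(new TransactionalRow(caseId, e.getKey(), e.getValue()));
        }
        return rows;
    }
}
```

A model built on data shaped like `TransactionalRow` would need scoring data in that same shape, and likewise for single-record data.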
The steps required to apply a classification, clustering, or a regression model are as follows:
Create a MiningApplyTask object. The MiningApplyOutput object is used to specify the format of the apply output table.
Call the execute method; the execute method queues the work for asynchronous execution and returns an execution handle to the caller.
Periodically call the getCurrentStatus method to get the status of the task. As an alternative, use the waitForCompletion method to wait until all asynchronous activity for the task completes.
After successful completion of the task, a MiningApplyResult object is created in the DMS and the apply output table/view is created at the user-specified name and location.