2 ODM Java Programming

This chapter provides an overview of the steps required to perform basic Oracle Data Mining tasks and discusses the following topics related to writing data mining programs using the Java interface:

The requirements for compiling and executing programs.
How to perform common data mining tasks.

Detailed demo programs are provided as part of the installation.

2.1 Compiling and Executing ODM Programs

Oracle Data Mining depends on the following Java archive (.jar) files:

$ORACLE_HOME/dm/lib/odmapi.jar
$ORACLE_HOME/jdbc/lib/ojdbc14.jar
$ORACLE_HOME/jlib/orai18n.jar
$ORACLE_HOME/lib/xmlparserv2.jar

These files must be in your CLASSPATH to compile and execute Oracle Data Mining programs.
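As a convenience, the CLASSPATH value can be assembled from the four archive paths listed above. The following is a minimal sketch; the class name OdmClasspath and the fallback directory are illustrative assumptions and are not part of the product.

```java
import java.io.File;
import java.util.List;

// Hypothetical helper: joins the four archives listed above into one
// CLASSPATH string. Only the relative jar paths come from this chapter.
final class OdmClasspath {

    static final List<String> JARS = List.of(
        "dm/lib/odmapi.jar",
        "jdbc/lib/ojdbc14.jar",
        "jlib/orai18n.jar",
        "lib/xmlparserv2.jar");

    /** Prefixes each jar with oracleHome and joins them with the separator. */
    static String build(String oracleHome, String pathSeparator) {
        StringBuilder sb = new StringBuilder();
        for (String jar : JARS) {
            if (sb.length() > 0) {
                sb.append(pathSeparator);
            }
            sb.append(oracleHome).append('/').append(jar);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String home = System.getenv("ORACLE_HOME");
        if (home == null) {
            home = "/path/to/oracle_home"; // placeholder, adjust for your system
        }
        System.out.println(build(home, File.pathSeparator));
    }
}
```

The printed value can be passed to the -classpath option of javac and java, or exported as the CLASSPATH environment variable.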

2.2 Using ODM to Perform Mining Tasks

This section describes the steps required to perform several common data mining tasks using Oracle Data Mining. Data mining tasks are usually performed in a particular sequence. The following sequence is typical:

Collect and preprocess (bin or normalize) data. (This step is optional; ODM algorithms can automatically prepare input data.)
Build a model
Test the model and calculate lift (classification problems only)
Apply the model to new data

All work in Oracle Data Mining is done using MiningTask objects.

To implement a sequence of dependent task executions, you may periodically check the asynchronous task execution status using the getCurrentStatus method or block for completion using the waitForCompletion method. You can then perform the dependent task after completion of the previous task.

For example, follow these steps to perform the build, test, and compute lift sequence:

Perform the build task as described in Section 2.2.2 below.
After successful completion of the build task, start the test task by calling the execute method on a ClassificationTestTask or RegressionTestTask object. Either periodically check the status of the test operation or block until the task completes.
After successful completion of the test task, execute the compute lift task by calling the execute method on a MiningLiftTask object.

You now have (with a little luck) a model that you can use in your data mining application.
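The dependency pattern described above (start a task, block until it completes, start the next task only on success) can be sketched with the standard java.util.concurrent classes. This sketch does not use the ODM classes: Handle, runSequence, and the task names are hypothetical stand-ins, and the simulated work simply succeeds or fails by name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TaskSequenceSketch {

    /** Hypothetical stand-in for an execution handle; not the ODM class. */
    static final class Handle {
        private final Future<Boolean> future;

        Handle(Future<Boolean> future) {
            this.future = future;
        }

        /** Blocks until the task finishes; returns true only on success. */
        boolean waitForCompletion() {
            try {
                return future.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            } catch (ExecutionException e) {
                return false;
            }
        }
    }

    /** Runs tasks in order and returns the names of those that completed. */
    static List<String> runSequence(List<String> taskNames, String failingTask) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        List<String> completed = new ArrayList<>();
        try {
            for (String name : taskNames) {
                // Simulated work: succeeds unless this is the failing task.
                Callable<Boolean> work = () -> !name.equals(failingTask);
                Handle handle = new Handle(pool.submit(work)); // queue the work
                if (!handle.waitForCompletion()) {
                    break; // dependent tasks are not started
                }
                completed.add(name);
            }
        } finally {
            pool.shutdown();
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println(runSequence(List.of("build", "test", "lift"), null));
    }
}
```

Stopping at the first failure reflects the point of the sequence: a test task is meaningless without a built model, and lift is meaningless without a tested one.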

2.2.1 Prepare Input Data

Different algorithms require different preparation and preprocessing of the input data. Some algorithms require normalization; some require binning (discretization). In the Java interface the algorithms can prepare data automatically.

This section summarizes the steps required for different data preparation methodologies supported by the ODM Java API.

Automated Discretization (Binning) and Normalization

The ODM Java interface supports automated data preparation. If the user specifies active unprepared attributes, the data mining server automatically prepares the data for those attributes.

In the case of algorithms that use binning as the default data preparation, bin boundary tables are created and stored as part of the model. The model's bin boundary tables are used for the data preparation of the dataset used for testing or scoring using that model. In the case of algorithms that use normalization as the default data preparation, the normalization details are stored as part of the model. The model uses those details for preparing the dataset used for testing or scoring using that model.

The algorithms that use binning as the default data preparation are Naive Bayes, Adaptive Bayes Network, Association, k-Means, and O-Cluster. The algorithms that use normalization are Support Vector Machines and Non-Negative Matrix Factorization. For normalization, the ODM Java interface supports only the automated method.
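The idea that normalization details are stored with the model and reused for test and scoring data can be illustrated as follows. This is a sketch only: it assumes min-max scaling, which may differ from the scheme the data mining server actually applies, and the class MinMaxNormalizerSketch is hypothetical.

```java
// Hypothetical illustration: normalization parameters are derived once from
// the build data and then reused, unchanged, for any later data set.
final class MinMaxNormalizerSketch {
    final double min;
    final double max;

    private MinMaxNormalizerSketch(double min, double max) {
        this.min = min;
        this.max = max;
    }

    /** Derives the parameters from the build data (assumed: min-max scaling). */
    static MinMaxNormalizerSketch fit(double[] buildValues) {
        if (buildValues.length == 0) {
            throw new IllegalArgumentException("build data must not be empty");
        }
        double lo = buildValues[0];
        double hi = buildValues[0];
        for (double v : buildValues) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        return new MinMaxNormalizerSketch(lo, hi);
    }

    /** Applies the stored parameters; new values may fall outside [0, 1]. */
    double apply(double value) {
        if (max == min) {
            return 0.0; // constant attribute: no spread to scale by
        }
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        MinMaxNormalizerSketch n = fit(new double[] {10, 20, 30});
        System.out.println(n.apply(20));
    }
}
```

Because the parameters come from the build data only, a scoring value larger than anything seen at build time maps above 1 rather than silently changing the scale.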

External Discretization (Binning)

For certain distributions, you may get better results if you bin the data before the model is built.

External binning consists of two steps:

The user creates a binning specification, either explicitly or by examining the data and applying one of the predefined methods. For categorical attributes, there is only one method: Top-N Frequency. For numerical attributes, there are two methods: equi-width, and equi-width with winsorizing.
The user bins the data following the specification created.

Specifically, the steps for external binning are as follows:

Create DiscretizationSpecification objects to specify the bin boundary specifications for the attributes.
Call the Transformation.createDiscretizationTables method to create bin boundaries.
Call the Transformation.discretize method to discretize (bin) the data.

Note that in the case of external binning, the user needs to bin the data consistently for all build, test, apply, and lift operations.
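The consistency requirement can be made concrete with a small equi-width example: the bin boundaries are computed once from the build data, and the same boundaries are then used to bin every other data set. The class and method names below are hypothetical and do not correspond to the Transformation methods; clamping out-of-range values to the first or last bin is also an assumption of this sketch.

```java
// Hypothetical equi-width binning: boundaries are fixed at build time and
// reused for test, apply, and lift data.
final class EquiWidthBinsSketch {

    /** Returns numBins + 1 boundaries spanning the range of the build data. */
    static double[] boundaries(double[] buildValues, int numBins) {
        if (buildValues.length == 0 || numBins < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        double lo = buildValues[0];
        double hi = buildValues[0];
        for (double v : buildValues) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        double width = (hi - lo) / numBins;
        double[] b = new double[numBins + 1];
        for (int i = 0; i <= numBins; i++) {
            b[i] = lo + i * width;
        }
        b[numBins] = hi; // guard against rounding drift on the last boundary
        return b;
    }

    /** Returns a 1-based bin number; out-of-range values are clamped. */
    static int binOf(double value, double[] b) {
        int numBins = b.length - 1;
        for (int i = 1; i < numBins; i++) {
            if (value < b[i]) {
                return i;
            }
        }
        return numBins;
    }

    public static void main(String[] args) {
        double[] b = boundaries(new double[] {0, 100}, 5); // from build data
        System.out.println(binOf(42.0, b));  // build row
        System.out.println(binOf(150.0, b)); // apply row beyond the build range
    }
}
```

If the boundaries were recomputed from the apply data instead, the same raw value could land in a different bin than it did at build time, and the model's predictions would no longer be meaningful.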

Embedded Discretization (Binning)

Embedded binning allows users to define their own customized automated binning. The binning strategy is specified by providing a bin boundary table that is produced by the bin specification creation step of external binning.

Specifically, the steps for embedded binning are as follows:

Create DiscretizationSpecification objects to specify the bin boundary specifications for the attributes.
Call the Transformation.createDiscretizationTables method to create bin boundaries.
Call the setUserSuppliedDiscretizationTables method on the LogicalDataSpecification object to attach the user-created bin boundary tables to the mining function settings object.

Keep in mind that because binning can have an effect on a model's accuracy, it is best when the binning is done by an expert familiar with the data being binned and the problem to be solved. However, if there is no additional information that can inform decisions about binning or if what is wanted is an initial exploration and understanding of the data and problem, ODM can bin the data using default settings, either by explicit user action or as part of the model build.

ODM groups the data into 5 bins by default. For categorical attributes, the 5 most frequent values are assigned to 5 different bins, and all remaining values are assigned to a 6th bin. For numerical attributes, the values are divided into 5 bins of equal size according to their order.
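The default treatment of categorical attributes described above (the N most frequent values each get their own bin, and everything else shares bin N + 1) can be sketched as follows. The class TopNBinsSketch is hypothetical, and breaking frequency ties alphabetically is an assumption made only so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Top-N Frequency binning for one categorical attribute.
final class TopNBinsSketch {

    /** Maps each of the n most frequent values to bins 1..n. */
    static Map<String, Integer> fit(List<String> values, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> keys = new ArrayList<>(counts.keySet());
        // Most frequent first; ties broken alphabetically (an assumption).
        keys.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        Map<String, Integer> bins = new LinkedHashMap<>();
        for (int i = 0; i < keys.size() && i < n; i++) {
            bins.put(keys.get(i), i + 1);
        }
        return bins;
    }

    /** Values outside the top n, including unseen ones, go to bin n + 1. */
    static int binOf(String value, Map<String, Integer> bins, int n) {
        return bins.getOrDefault(value, n + 1);
    }

    public static void main(String[] args) {
        List<String> colors = List.of("red", "red", "red", "blue", "blue", "green");
        Map<String, Integer> bins = fit(colors, 5);
        System.out.println(bins);
    }
}
```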

After the data is processed, you can build a model.

For an illustration of binning, see Appendix A.

2.2.2 Build a Model

This section summarizes the steps required to build a model.

Preprocess and prepare the input data as required.
Construct and store a MiningFunctionSettings object.
Construct and store a MiningBuildTask object.
Call the execute method; the execute method queues the work for asynchronous execution and returns an execution handle to the caller.
Periodically call the getCurrentStatus method to get the status of the task. Alternatively, use the waitForCompletion method to wait until all asynchronous activity for the task completes.

After successful completion of the task, a model object is created in the database.
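The polling alternative in the steps above can be sketched with standard Java concurrency classes. Nothing here is the ODM API: the Status values, the Handle class, and executeAndPoll are hypothetical stand-ins that only mimic the shape of "execute, then call getCurrentStatus until the task is no longer running".

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BuildPollingSketch {

    /** Hypothetical status values; the real status type may differ. */
    enum Status { EXECUTING, SUCCESS, ERROR }

    /** Hypothetical execution handle backed by a Future. */
    static final class Handle {
        private final Future<?> future;

        Handle(Future<?> future) {
            this.future = future;
        }

        /** Non-blocking status check. */
        Status getCurrentStatus() {
            if (!future.isDone()) {
                return Status.EXECUTING;
            }
            try {
                future.get(); // already done, returns or throws immediately
                return Status.SUCCESS;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Status.ERROR;
            } catch (ExecutionException | CancellationException e) {
                return Status.ERROR;
            }
        }
    }

    /** Queues the work, then polls until it is no longer executing. */
    static Status executeAndPoll(Runnable work, long pollMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Handle handle = new Handle(pool.submit(work));
            Status status = handle.getCurrentStatus();
            while (status == Status.EXECUTING) {
                Thread.sleep(pollMillis);
                status = handle.getCurrentStatus();
            }
            return status;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Status.ERROR;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(executeAndPoll(() -> { /* simulated build */ }, 10));
    }
}
```

Polling leaves the calling thread free to do other work between checks; blocking is simpler when the caller has nothing else to do until the model exists.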

2.2.3 Find and Use the Most Important Attributes

Models based on data sets with a large number of attributes can have very long build times. To minimize build time, you can use ODM Attribute Importance to identify the critical attributes and then build a model using only these attributes.

Build an Attribute Importance Model

Identify the most important attributes by building an Attribute Importance model as follows:

Create a Physical Data Specification for the input data set.
Discretize (bin) the data if required.
Create and store mining settings for the Attribute Importance.
Build the Attribute Importance model.
Access the model and retrieve the attributes by threshold.

Build a Model Using the Selected Attributes

After identifying the important attributes, build a model using the selected attributes as follows:

Access the model and retrieve the attributes by threshold or by rank.
Modify the Data Usage Specification by calling the function adjustAttributeUsage defined on MiningFunctionSettings. Only the attributes returned by Attribute Importance will be active for model building.
Build a model using the new Mining Function Settings.
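Selecting attributes by threshold or by rank amounts to filtering a list of (attribute, importance) pairs. The sketch below shows both selections on plain Java collections; the class AttributeSelectionSketch, the sample attribute names, and the importance values are invented for illustration and are not produced by ODM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical selection logic over attribute importance scores.
final class AttributeSelectionSketch {

    /** Attribute names ordered from most to least important. */
    static List<String> ranked(Map<String, Double> importance) {
        List<String> names = new ArrayList<>(importance.keySet());
        names.sort((a, b) -> {
            int byScore = Double.compare(importance.get(b), importance.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return names;
    }

    /** Keeps attributes whose importance is at least the threshold. */
    static List<String> byThreshold(Map<String, Double> importance, double threshold) {
        List<String> selected = new ArrayList<>();
        for (String name : ranked(importance)) {
            if (importance.get(name) >= threshold) {
                selected.add(name);
            }
        }
        return selected;
    }

    /** Keeps the topK highest-ranked attributes. */
    static List<String> byRank(Map<String, Double> importance, int topK) {
        List<String> names = ranked(importance);
        return new ArrayList<>(names.subList(0, Math.min(topK, names.size())));
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("AGE", 0.30, "INCOME", 0.25, "ZIP", 0.01);
        System.out.println(byThreshold(scores, 0.05));
        System.out.println(byRank(scores, 2));
    }
}
```

A threshold suits cases where weak attributes should be dropped regardless of how many remain; a rank cutoff suits cases where the number of active attributes, and therefore the build time, must be bounded.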

2.2.4 Test the Model

This section summarizes the steps required to test a classification or a regression model.

Preprocess the test data as required. Test data must have all the active attributes used in the model and the target attribute in order to assess the model's accuracy.
Prepare (bin or normalize) the input data the same way the data was prepared for building the model.
Construct and store a task object. For classification problems, use ClassificationTestTask; for regression, use RegressionTestTask.
Call the execute method; the execute method queues the work for asynchronous execution and returns an execution handle to the caller.
Periodically, call the getCurrentStatus method to get the status of the task. As an alternative, use the waitForCompletion method to wait until all asynchronous activity for the task completes.
After successful completion of the task, a test result object is created in the data mining server (DMS). For classification problems, the results are represented by a ClassificationTestResult object; for regression problems, by a RegressionTestResult object.
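For a classification model, assessing accuracy on test data comes down to comparing the known target values with the predictions. The following sketch computes a confusion matrix and overall accuracy from two arrays; it illustrates the concept only and makes no claim about the contents of the test result object. The class and method names are hypothetical.

```java
// Hypothetical illustration of classification test metrics.
final class TestMetricsSketch {

    /** Rows are actual classes, columns are predicted classes. */
    static int[][] confusionMatrix(int[] actual, int[] predicted, int numClasses) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    /** Fraction of cases on the diagonal, that is, predicted correctly. */
    static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int r = 0; r < matrix.length; r++) {
            for (int c = 0; c < matrix[r].length; c++) {
                total += matrix[r][c];
                if (r == c) {
                    correct += matrix[r][c];
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no test cases");
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        int[] actual = {0, 0, 1, 1, 1};
        int[] predicted = {0, 1, 1, 1, 0};
        System.out.println(accuracy(confusionMatrix(actual, predicted, 2)));
    }
}
```

This is why the test data must contain the target attribute: without the actual values there is nothing to compare the predictions against.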

2.2.5 Compute Lift

This section summarizes the steps required to compute lift using a classification model.

The lift operation is typically performed on the test data. The data preparation steps described in the preceding section also apply to the lift operation.
Construct and store a MiningLiftTask object.
Call the execute method; the execute method queues the work for asynchronous execution and returns an execution handle to the caller.
Periodically, call the getCurrentStatus method to get the status of the task. As an alternative, use the waitForCompletion method to wait until all asynchronous activity for the task completes.
After successful completion of the task, a MiningLiftResult object is created in the DMS.
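Lift compares how concentrated the positive cases are among the highest-scored records with how common they are overall. The sketch below computes cumulative lift for the top fraction of cases using the standard textbook definition; it is not a description of how the data mining server computes lift or of what the lift result object contains, and the class LiftSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical cumulative lift calculation for a binary target.
final class LiftSketch {

    /**
     * Lift for the top fraction of cases ranked by score:
     * (positive rate among the top cases) / (positive rate overall).
     */
    static double cumulativeLift(double[] scores, boolean[] isPositive, double fraction) {
        int n = scores.length;
        if (n == 0 || isPositive.length != n) {
            throw new IllegalArgumentException("need equal-length, non-empty inputs");
        }
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Highest score first.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        int top = Math.min(n, Math.max(1, (int) Math.round(fraction * n)));
        int totalPositives = 0;
        for (boolean p : isPositive) {
            if (p) {
                totalPositives++;
            }
        }
        if (totalPositives == 0) {
            throw new IllegalArgumentException("no positive cases");
        }
        int topPositives = 0;
        for (int i = 0; i < top; i++) {
            if (isPositive[order[i]]) {
                topPositives++;
            }
        }
        return ((double) topPositives / top) / ((double) totalPositives / n);
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05};
        boolean[] positive = {true, true, false, true, false, false, false, true, false, false};
        System.out.println(cumulativeLift(scores, positive, 0.2));
    }
}
```

A lift above 1 for a given fraction means the model ranks positive cases toward the top better than random ordering would; a lift of exactly 1 over the whole data set is expected by construction.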

2.2.6 Apply the Model to New Data

You make predictions by applying a model to new data, that is, by scoring the data.

Any table that you score (apply a model to) must have the same format as the table used to build the model. If you build a model using a table that is in multi-record (transactional) format, any table that you apply that model to must be in multi-record format. Similarly, if the table used to build the model was in nontransactional (single-record) format, any table to which you apply the model must be in nontransactional format.

Note that you can score a single record, which must also be in the same format as the table used to build the model.
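The difference between the two formats can be seen by converting one case from single-record form (one row, one column per attribute) to multi-record form (one row per attribute value). The sketch assumes the multi-record layout consists of a case identifier, an attribute name, and a value; the class names and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical conversion between the two table formats for one case.
final class RecordFormatSketch {

    /** One row of the assumed multi-record layout. */
    static final class Triple {
        final int caseId;
        final String attribute;
        final String value;

        Triple(int caseId, String attribute, String value) {
            this.caseId = caseId;
            this.attribute = attribute;
            this.value = value;
        }

        @Override
        public String toString() {
            return caseId + "|" + attribute + "|" + value;
        }
    }

    /** Expands one single-record row into one row per attribute. */
    static List<Triple> toMultiRecord(int caseId, Map<String, String> row) {
        List<Triple> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            rows.add(new Triple(caseId, e.getKey(), e.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("AGE", "34");
        row.put("INCOME", "HIGH");
        for (Triple t : toMultiRecord(7, row)) {
            System.out.println(t);
        }
    }
}
```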

The steps required to apply a classification, clustering, or a regression model are as follows:

Preprocess the apply data as required. The apply data must have all the active attributes that were present in creating the model.
Prepare (bin or normalize) the input data the same way the data was prepared for building the model. If the data was prepared using the automated option at build time, then the apply data is also prepared using the automated option and other preparation details from building the model.
Construct and store a MiningApplyTask object. The MiningApplyOutput object is used to specify the format of the apply output table.
Call the execute method; the execute method queues the work for asynchronous execution and returns an execution handle to the caller.
Periodically, call the getCurrentStatus method to get the status of the task. As an alternative, use the waitForCompletion method to wait until all asynchronous activity for the task completes.
After successful completion of the task, a MiningApplyResult object is created in the DMS and the apply output table/view is created at the user-specified name and location.
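Before starting an apply task, it can be useful to confirm that the apply data contains every attribute that was active when the model was built, as the first step above requires. The following check is a sketch on plain Java sets; how the active attribute names and the apply table's column names are obtained is outside its scope, and the class ApplyDataCheckSketch is hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical pre-flight check for apply data.
final class ApplyDataCheckSketch {

    /** Returns the active attributes that the apply data does not provide. */
    static Set<String> missingAttributes(Set<String> activeAttributes, Set<String> applyColumns) {
        Set<String> missing = new TreeSet<>(activeAttributes);
        missing.removeAll(applyColumns);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> active = Set.of("AGE", "INCOME", "ZIP");
        Set<String> columns = Set.of("AGE", "ZIP", "NAME");
        System.out.println(missingAttributes(active, columns));
    }
}
```

Extra columns in the apply data, such as NAME in the example, are not reported; only attributes the model needs and cannot find are.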
