This chapter describes how to use the ODM Java interface to write data mining applications in Java. Our approach in this chapter is to use a simple example to describe the use of different features of the API.
For detailed descriptions of the class and method usage, refer to the Javadoc that is shipped with the product. See the administrator's guide for the location of the Javadoc.
To perform any mining operation in the database, first create an instance of the oracle.dmt.odm.DataMiningServer class. This instance is used as a proxy to create connections to a data mining server (DMS) and to maintain the connection. The DMS is the server-side, in-database component that performs the actual data mining operations within ODM. The DMS also provides a metadata repository consisting of mining input objects and result objects, along with the namespaces within which these objects are stored and retrieved.
In this step, we illustrate creating a DataMiningServer object and then logging in to get the connection. Note that there is a logout method to release all the resources held by the connection.
// Create an instance of the DMS server and get a connection.
// Specify the database JDBC URL, user name, and password for the
// data mining user schema
DataMiningServer dms = new DataMiningServer(
    "DB_URL",     // JDBC URL: jdbc:oracle:thin:@HostName:Port:SID
    "user_name",
    "password");
// Login to get the DMS connection
oracle.dmt.odm.Connection m_dmsConn = dms.login();
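When the application is done with the DMS, the connection should be released. A minimal sketch, assuming the logout method mentioned above is invoked on the Connection object:

// Release all the resources held by the DMS connection
m_dmsConn.logout();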
In the ODM Java interface, the oracle.dmt.odm.data.LocationAccessData (LAD) and oracle.dmt.odm.PhysicalDataSpecification (PDS) classes are used to describe the mining dataset (a table or view in the user schema). To represent a single-record format dataset, use an instance of the NonTransactionalDataSpecification class; to represent a multi-record format dataset, use the TransactionalDataSpecification class. Both classes inherit from the common superclass PhysicalDataSpecification. For more information about the data formats, refer to ODM Concepts.
In this step, we illustrate creating LAD and PDS objects for both types of formats.
The LocationAccessData (LAD) class encapsulates the dataset location details. The following code describes the creation of this object.
// Create a LocationAccessData by specifying the table/view name
// and the schema name
LocationAccessData lad =
    new LocationAccessData("input table name", "schema name");
The NonTransactionalDataSpecification class contains the LocationAccessData object and specifies the data format as the single-record case. The following code describes the creation of this object.
// Create the actual NonTransactionalDataSpecification
PhysicalDataSpecification pds =
    new NonTransactionalDataSpecification(lad);
The TransactionalDataSpecification class contains a LocationAccessData object; it specifies the data format as the multi-record case, and it specifies the column roles.

This dataset must contain three types of columns: a sequence-id/case-id column to identify each case, an attribute name column, and an attribute value column. This format is commonly used when the data has a large number of attributes. For more information, refer to ODM Concepts. The following code illustrates the creation of this object.
// Create the actual TransactionalDataSpecification for transactional data
PhysicalDataSpecification pds =
    new TransactionalDataSpecification(
        "CASE_ID",     // Column name for sequence id
        "ATTRIBUTES",  // Column name for attribute name
        "VALUES",      // Column name for attribute value
        lad            // Location access data
    );
The class oracle.dmt.odm.settings.function.MiningFunctionSettings (MFS) is the common superclass for all types of mining function settings classes. It encapsulates the details of function and algorithm settings, logical data, and data usage specifications. For more detailed information about logical data and data usage specifications, refer to the Javadoc documentation for oracle.dmt.odm.data.LogicalDataSpecification and oracle.dmt.odm.settings.function.DataUsageSpecification.
An MFS object is a named object that can be stored in the DMS. If no algorithm is specified, the DMS selects the default algorithm and its settings for that function; for example, Naive Bayes is the default algorithm for the classification function. The ODM Java interface has the following function settings classes, each listed with its associated algorithm settings classes:
oracle.dmt.odm.settings.function.ClassificationFunctionSettings
    oracle.dmt.odm.settings.algorithm.NaiveBayesSettings (Default)
    oracle.dmt.odm.settings.algorithm.AdaptiveBayesNetworkSettings
    oracle.dmt.odm.settings.algorithm.SVMClassificationSettings
oracle.dmt.odm.settings.function.RegressionFunctionSettings
    oracle.dmt.odm.settings.algorithm.SVMRegressionSettings (Default)
oracle.dmt.odm.settings.function.AssociationRulesFunctionSettings
    oracle.dmt.odm.settings.algorithm.AprioriAlgorithmSettings (Default)
oracle.dmt.odm.settings.function.ClusteringFunctionSettings
    oracle.dmt.odm.settings.algorithm.KMeansAlgorithmSettings (Default)
    oracle.dmt.odm.settings.algorithm.OClusterAlgorithmSettings
oracle.dmt.odm.settings.function.AttributeImportanceFunctionSettings
    oracle.dmt.odm.settings.algorithm.MinimumDescriptionLengthSettings (Default)
oracle.dmt.odm.settings.function.FeatureExtractionFunctionSettings
    oracle.dmt.odm.settings.algorithm.NMFAlgorithmSettings
In this step, we illustrate the creation of a ClassificationFunctionSettings object using the Naive Bayes algorithm.
The class oracle.dmt.odm.settings.algorithm.MiningAlgorithmSettings is the common superclass for all algorithm settings. It encapsulates all the settings that can be tuned by a data mining expert based on the problem and the data. ODM provides default values for algorithm settings; refer to the Javadoc documentation for more information about each of the algorithm settings. For example, Naive Bayes has two settings: singleton_threshold and pairwise_threshold. The default value for both of these settings is 0.01.
In this step, we create a NaiveBayesSettings object that will be used in the next step to create the ClassificationFunctionSettings object.
// Create the Naive Bayes algorithm settings, setting both the pairwise
// and singleton thresholds to 0.02
NaiveBayesSettings nbAlgo = new NaiveBayesSettings(0.02f, 0.02f);
An MFS object can be created in two ways: by using the constructor, or by using the create and adjust utility methods. If you have the input dataset, it is recommended that you use the create utility method because it simplifies the creation of this complex object.
In this example, the utility method is used to create a ClassificationFunctionSettings object for a dataset that has all unprepared categorical attributes and an ID column. Here we use automated binning; for more information about data preparation, see the data preparation section later in this chapter.
// Create classification function settings
ClassificationFunctionSettings mfs =
    ClassificationFunctionSettings.create(
        m_dmsConn,                         // DMS connection
        nbAlgo,                            // NB algorithm settings
        pds,                               // Build data specification
        "target_attribute_name",           // Target column
        AttributeType.categorical,         // Target attribute type
        DataPreparationStatus.unprepared   // Default preparation status
    );
// Set the ID attribute as an inactive attribute
mfs.adjustAttributeUsage(new String[]{"ID"}, AttributeUsage.inactive);
Because the MiningFunctionSettings object is a complex object, it is good practice to validate its correctness before persisting it. If you use the utility methods to create the MFS, the resulting object is valid.
The following code illustrates validation and persistence of the MFS object.
// Validate and store the ClassificationFunctionSettings object
try {
    mfs.validate();
    mfs.store(m_dmsConn, "Name_of_the_MFS");
} catch (ODMException invalidMFS) {
    System.out.println(invalidMFS.getMessage());
    throw invalidMFS;
}
The class oracle.dmt.odm.task.MiningTask is the common superclass for all mining tasks. This class provides asynchronous execution of mining operations in the database using DBMS_JOBS. For each execution of a task, an oracle.dmt.odm.task.ExecutionHandle object is created. The ExecutionHandle object provides methods to retrieve the status of the execution, along with utility methods such as waitForCompletion, terminate, and getStatusHistory. Refer to the Javadoc API documentation of these classes for more information.
The ODM Java interface has the following task classes:
oracle.dmt.odm.task.MiningBuildTask: used for building a mining model
oracle.dmt.odm.task.ClassificationTestTask: used for testing a classification model
oracle.dmt.odm.task.RegressionTestTask: used for testing a regression model
oracle.dmt.odm.task.CrossValidateTask: used for testing a Naive Bayes model using cross validation
oracle.dmt.odm.task.MiningLiftTask: used for computing lift for classification models
oracle.dmt.odm.task.MiningApplyTask: used for scoring new data using a mining model
oracle.dmt.odm.task.ModelImportTask: used for importing a PMML mining model to an ODM Java API native model
oracle.dmt.odm.task.ModelExportTask: used for exporting an ODM Java API native model to a PMML mining model

To build a mining model, the MiningBuildTask object is used. It encapsulates the input and output details of the model build operation.
In this step, we illustrate creating, storing, and executing the MiningBuildTask object, and retrieving the task execution status by using the ExecutionHandle object.
// Create a build task
MiningBuildTask buildTask = new MiningBuildTask(
    pds,
    "name_of_the_input_MFS",
    "name_of_the_model");
// Store the task
buildTask.store(m_dmsConn, "name_of_the_build_task");
// Execute the task
ExecutionHandle execHandle = buildTask.execute(m_dmsConn);
// Wait for the task execution to complete
MiningTaskStatus status = execHandle.waitForCompletion(m_dmsConn);
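Besides blocking with waitForCompletion, the handle can also report on the execution after the fact. The following is a hedged sketch using the getStatusHistory method mentioned above; the exact signature and return type are assumptions, so verify them against the Javadoc.

// Retrieve the status history of the build task execution
// (assumed signature and return type; check the Javadoc)
MiningTaskStatus[] statusHistory = execHandle.getStatusHistory(m_dmsConn);
for (int i = 0; i < statusHistory.length; i++) {
    System.out.println(statusHistory[i].toString());
}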
After the build task completes successfully, the model is stored in the DMS with a name specified by the user.
The class oracle.dmt.odm.model.MiningModel is the common superclass for all mining models. It is a wrapper class for the actual model stored in the DMS. Each model class provides methods for retrieving the details of the model. For example, AssociationRulesModel provides methods to retrieve the rules from the model using different filtering criteria. Refer to the Javadoc API documentation for more details about the model classes.
In this step, we illustrate restoring the NaiveBayesModel object and retrieving the ModelSignature object. The ModelSignature object specifies the input attributes required to apply data using a specific model.
//Restore the Naive Bayes model
NaiveBayesModel nbModel =
    (NaiveBayesModel)SupervisedModel.restore(m_dmsConn, "name_of_the_model");
//Get the model signature
ModelSignature nbModelSignature = nbModel.getSignature();
After creating the classification model, you can test the model to assess its accuracy and compute a confusion matrix using the test dataset.
In this step, we illustrate how to test the classification model using the ClassificationTestTask object and how to retrieve the test results using the ClassificationTestResult object.
To test the model, a compatible test dataset is required. For example, if the model was built using a single-record dataset, then the test dataset must also be a single-record dataset. All the active attributes and the target attribute column must be present in the test dataset.
To test a model, the user needs to specify the test dataset details using the PhysicalDataSpecification class.
//Create PhysicalDataSpecification
LocationAccessData lad = new LocationAccessData(
    "test_dataset_name",
    "schema_name");
PhysicalDataSpecification pds =
    new NonTransactionalDataSpecification(lad);
After creating the PhysicalDataSpecification object, create a ClassificationTestTask instance by specifying the input arguments required to perform the test operation. Before executing a task, it must be stored in the DMS. After invoking execute on the task, the task is submitted for asynchronous execution in the DMS. To wait for the completion of the task, use the waitForCompletion method.
//Create, store, and execute the test task
ClassificationTestTask testTask = new ClassificationTestTask(
    pds,                               // Test data specification
    "name_of_the_model_to_be_tested",
    "name_of_the_test_results_object");
testTask.store(m_dmsConn, "name_of_the_test_task");
testTask.execute(m_dmsConn);
//Wait for completion of the test task
MiningTaskStatus taskStatus = testTask.waitForCompletion(m_dmsConn);
After successful completion of the test task, you can restore the results object persisted in the DMS using the restore method. The ClassificationTestResult object has get methods for the accuracy and the confusion matrix. The toString method can be used to display the test results.
//Restore the test results
ClassificationTestResult testResult =
    ClassificationTestResult.restore(m_dmsConn,
        "name_of_the_test_results_object");
//Get the accuracy
double accuracy = testResult.getAccuracy();
//Get the confusion matrix
ConfusionMatrix confMatrix = testResult.getConfusionMatrix();
//Display the results
System.out.println(testResult.toString());
Lift is a measure of how much better prediction results are using a model than could be obtained by chance. You can compute lift after the model is built successfully. You can compute lift using the same test dataset. The test dataset must be compatible with the model as described in Section 2.2.4.
In this step, we illustrate how to compute lift by using the MiningLiftTask object and how to retrieve the results using the MiningLiftResult object.
To compute lift, a positive target value needs to be specified. This value depends on the dataset and the data mining problem. For example, for a marketing campaign response model, the positive target value could be "customer responds to the campaign". In the Java interface, the oracle.dmt.odm.Category class is used to represent the target value.
Category positiveCategory = new Category(
    "Display name of the positive target value",
    "String representation of the target value",
    DataType.intType  // Data type
);
To compute lift, create a MiningLiftTask instance by specifying the input arguments required to perform the lift operation. The user needs to specify the number of quantiles to be used. A quantile is the specific value of a variable that divides the distribution into two parts: the values greater than the quantile value and the values less than it. Here, the test dataset records are divided into the user-specified number of quantiles, and lift is computed for each quantile.
//Create, store, and execute the lift task
MiningLiftTask liftTask = new MiningLiftTask(
    pds,                                // Test data specification
    10,                                 // Number of quantiles
    positiveCategory,                   // Positive target value
    "name_of_the_input_model",
    "name_of_the_lift_results_object");
liftTask.store(m_dmsConn, "name_of_the_lift_task");
liftTask.execute(m_dmsConn);
//Wait for completion of the lift task
MiningTaskStatus taskStatus = liftTask.waitForCompletion(m_dmsConn);
After successful completion of the lift task, you can restore the MiningLiftResult object persisted in the DMS using the restore method. To get the lift measures for each quantile, use getLiftResultElements(). The toString() method can be used to display the lift results.
//Restore the lift results
MiningLiftResult liftResult =
    MiningLiftResult.restore(m_dmsConn, "name_of_the_lift_results_object");
//Get the lift measures for each quantile
LiftResultElement[] quantileLiftResults =
    liftResult.getLiftResultElements();
//Display the results
System.out.println(liftResult.toString());
A classification or clustering model can be applied to new data to make predictions; the process is referred to as "scoring data."
Similar to the test dataset, the apply dataset must have all the active attributes that were used to build the model. Unlike the test dataset, the apply dataset does not have a target attribute column; the apply process predicts the values of the target attribute. The ODM Java API supports real-time scoring in addition to batch scoring (that is, scoring with an input table).
In this step, we illustrate how to apply a model to a table/view to make predictions and how to apply a model to a single record for real-time scoring.
The apply operation requires an input dataset that has all the active attributes that were used to build the model. It produces an output table in the user-specified format.
//Create PhysicalDataSpecification
LocationAccessData lad = new LocationAccessData(
    "apply_input_table/view_name",
    "schema_name");
PhysicalDataSpecification pds =
    new NonTransactionalDataSpecification(lad);
//Output table location details
LocationAccessData outputTable = new LocationAccessData(
    "apply_output_table/view_name",
    "schema_name");
The DMS also needs to know the content of the scoring output. This information is captured in a MiningApplyOutput (MAO) object. An instance of MiningApplyOutput specifies the data (columns) to be included in the apply output table that is created as the result of an apply operation. The columns in the apply output table are described by a combination of ApplyContentItem objects. These columns can be either from the input table or generated by the scoring task (for example, prediction and probability). The following code creates a MiningApplyOutput object:
// Create a MiningApplyOutput object using default settings
MiningApplyOutput mao = MiningApplyOutput.createDefault();
// Add the source attributes to be returned with the scored result.
// For example, here we add the attribute "CUST_ID" from the original
// table to the apply output table
MiningAttribute sourceAttribute = new MiningAttribute(
    "CUST_ID", DataType.intType, AttributeType.notApplicable);
Attribute destinationAttribute = new Attribute(
    "CUST_ID", DataType.intType);
ApplySourceAttributeItem m_ApplySourceAttributeItem =
    new ApplySourceAttributeItem(sourceAttribute, destinationAttribute);
// Add the source and destination mapping
mao.addItem(m_ApplySourceAttributeItem);
To apply the model, create a MiningApplyTask instance by specifying the input arguments required to perform the apply operation.
//Create, store, and execute the apply task
MiningApplyTask applyTask = new MiningApplyTask(
    pds,                           // Apply input data specification
    "name_of_the_model",           // Input model name
    mao,                           // MiningApplyOutput object
    outputTable,                   // Apply output table location details
    "name_of_the_apply_results");  // Apply results name
applyTask.store(m_dmsConn, "name_of_the_apply_task");
applyTask.execute(m_dmsConn);
//Wait for completion of the apply task
MiningTaskStatus taskStatus = applyTask.waitForCompletion(m_dmsConn);
To apply the model to a single record, use the oracle.dmt.odm.result.RecordInstance class. Model classes that support record apply have a static apply method, which takes a RecordInstance object as input and returns the prediction and its probability.
In this step, we illustrate the creation of the RecordInstance object and scoring with the Naive Bayes model's static apply method.
//Create a RecordInstance object for a model with two active attributes
RecordInstance inputRecord = new RecordInstance();
//Add active attribute values to this record
AttributeInstance attr1 = new AttributeInstance("Attribute1_Name", value);
AttributeInstance attr2 = new AttributeInstance("Attribute2_Name", value);
inputRecord.addAttributeInstance(attr1);
inputRecord.addAttributeInstance(attr2);
//Record apply; the output record will have the prediction value
//and its probability value
RecordInstance outputRecord = NaiveBayesModel.apply(
    m_dmsConn, inputRecord, "model_name");
The class oracle.dmt.odm.CostMatrix is used to represent the costs of false positive and false negative predictions. It is used for classification problems to specify the costs associated with false predictions. A user can specify the cost matrix in the classification function settings. For more information about the cost matrix, see ODM Concepts.
The following code illustrates how to create a cost matrix object where the target has two classes: YES (1) and NO (0). Suppose a positive (YES) response to the promotion generates $2 and the cost of the promotion is $1. Then the cost of misclassifying a positive responder is $2. The cost of misclassifying a non-responder is $1.
// Define the list of categories
Category negativeCat = new Category("negativeResponse", "0", DataType.intType);
Category positiveCat = new Category("positiveResponse", "1", DataType.intType);
// Define the cost matrix
// addEntry(actual category, predicted category, cost value)
CostMatrix costMatrix = new CostMatrix();
// Row 1
costMatrix.addEntry(negativeCat, negativeCat, new Integer("0"));
costMatrix.addEntry(negativeCat, positiveCat, new Integer("1"));
// Row 2
costMatrix.addEntry(positiveCat, negativeCat, new Integer("2"));
costMatrix.addEntry(positiveCat, positiveCat, new Integer("0"));
// Set the cost matrix on the MFS
mfs.setCostMatrix(costMatrix);
The class oracle.dmt.odm.PriorProbabilities is used to represent the prior probabilities of the target values. It is used for classification problems when the actual data has a different distribution of target values than the data provided for the model build. A user can specify the prior probabilities in the classification function settings. For more information about prior probabilities, see ODM Concepts.
The following code illustrates how to create a PriorProbabilities object when the target has two classes, YES (1) and NO (0), and the probability of YES is 0.05 and the probability of NO is 0.95.
// Define the list of categories
Category negativeCat = new Category("negativeResponse", "0", DataType.intType);
Category positiveCat = new Category("positiveResponse", "1", DataType.intType);
// Define the prior probabilities
// addEntry(target category, probability value)
PriorProbabilities priorProbability = new PriorProbabilities();
// Row 1
priorProbability.addEntry(negativeCat, new Float("0.95"));
// Row 2
priorProbability.addEntry(positiveCat, new Float("0.05"));
// Set the prior probabilities on the MFS
mfs.setPriors(priorProbability);
Data mining algorithms require the data to be prepared in order to build mining models and to score. Data preparation requirements can be specific to a function and an algorithm. ODM algorithms require binning (discretization) or normalization, depending on the algorithm. For more information about which algorithm requires what type of data preparation, see ODM Concepts. The ODM Java API supports automated binning, automated normalization, external binning, winsorizing, and embedded binning.

In this section, we illustrate how to perform automated binning, automated normalization, external binning, and embedded binning.
In the MiningFunctionSettings, if any of the active attributes are set as unprepared, the DMS chooses the appropriate data preparation (that is, binning or normalization) depending on the algorithm, and prepares the data automatically before sending it to the algorithm.
The class oracle.dmt.odm.transformation.Transformation provides utility methods to perform external binning. Binning is a two-step process: first the bin boundary tables are created, and then the actual data is binned using the bin boundary tables as input.
The following code illustrates the creation of bin boundary tables for a table with one categorical attribute and one numerical attribute.
//Create an array of DiscretizationSpecification
//for the two columns in the table
DiscretizationSpecification[] binSpec = new DiscretizationSpecification[2];

//Specify binning criteria for the categorical column.
//In this example the top 5 frequent values are kept and
//the remaining less frequent values are treated as OTHER_CATEGORY
CategoricalDiscretization binCategoricalCriteria =
    new CategoricalDiscretization(5, "OTHER_CATEGORY");
binSpec[0] = new DiscretizationSpecification(
    "categorical_attribute_name", binCategoricalCriteria);

//Specify binning criteria for the numerical column.
//In this example equal-width binning with 10 bins is used, and the
//winsorizing technique filters a 1 percent tail
float tailPercentage = 1.0f; //Tail percentage value
NumericalDiscretization binNumericCriteria =
    new NumericalDiscretization(10, tailPercentage);
binSpec[1] = new DiscretizationSpecification(
    "numerical_attribute_name", binNumericCriteria);

//Create a PhysicalDataSpecification object for the input data
LocationAccessData lad = new LocationAccessData(
    "input_table_name", "schema_name");
PhysicalDataSpecification pds =
    new NonTransactionalDataSpecification(lad);

//Create the bin boundary tables
Transformation.createDiscretizationTables(
    m_dmsConn,   //DMS connection
    lad, pds,    //Input data details
    binSpec,     //Binning criteria
    "numeric_bin_boundaries_table",
    "categorical_bin_boundaries_table",
    "schema_name");

//Resulting discretized view location
LocationAccessData resultViewLocation = new LocationAccessData(
    "output_discretized_view_name", "schema_name");

//Perform binning
Transformation.discretize(
    m_dmsConn,   //DMS connection
    lad, pds,    //Input data details
    "numeric_bin_boundaries_table",
    "categorical_bin_boundaries_table",
    "schema_name",
    resultViewLocation,  //Location of the resulting binned view
    true                 //Open-ended binning
);
In the case of external binning, the user needs to maintain the bin boundary tables and use these tables to bin the data. In the case of embedded binning, the user can give the bin boundary tables as an input to the model build operation. The model maintains these tables internally and uses them to bin the data for the build, apply, test, and lift operations.
The following code illustrates how to associate the bin boundary tables with the mining function settings.
//Create location access data objects for the bin boundary tables
LocationAccessData numBinBoundaries = new LocationAccessData(
    "numeric_bin_boundaries_table", "schema_name");
LocationAccessData catBinBoundaries = new LocationAccessData(
    "categorical_bin_boundaries_table", "schema_name");
//Get the logical data specification from the MiningFunctionSettings
LogicalDataSpecification lds = mfs.getLogicalDataSpecification();
//Set the bin boundary tables on the logical data specification
lds.setUserSuppliedDiscretizationTables(numBinBoundaries, catBinBoundaries);
The ODM Java API supports text mining for the SVM and NMF algorithms. For these algorithms, an input table can have a combination of categorical, numerical, and text columns. The data mining server (DMS) internally performs the transformations required for the text data before building the model.
Note that for text mining, the case-id column must be specified in the NonTransactionalDataSpecification object, and the case-id column must have non-NULL, unique values.
The following code illustrates how to set the text attribute in the ODM Java API.
//Set a case-id/sequence-id column for the dataset with an active text attribute
Attribute sequenceAttr = new Attribute("case_id_column_name", DataType.intType);
pds.setSequenceAttribute(sequenceAttr);
//Set the text attribute
mfs.adjustAttributesType(
    new String[] {"text_attribute_column"}, AttributeType.text);
All the demo programs listed in the tables below are located in the directory $ORACLE_HOME/dm/demo/sample/java. A summary description of these sample programs is also provided in $ORACLE_HOME/dm/demo/sample/java.101/README.txt.
Note: Before executing these programs, make sure that the SH schema and the user schema are installed with the datasets used by these programs. You also need to provide the database URL, user name, and password in the login method, and a valid data schema name by changing the DATA_SCHEMA_NAME constant value in the program.