Association

Evaluation: DBMiner 2.0 Enterprise

20 October 2000

Course: Knowledge Discovery in Data

Professor: Dr. Osmar Zaiane

Jeffery Antoniuk

Andrew Foss

Yaun Ji

Weinan Wang

Introduction:

One of the many database mining tools available is DBMiner. DBMiner originated as a data mining research prototype at Simon Fraser University. Led by Dr. Jiawei Han, the Data Mining Research Group at Simon Fraser University's Intelligent Database System Research Laboratory developed the application. The prototype evolved into a commercially available product. DBMiner Technology Inc., a B.C based company maintains and enhances DBMiner while maintaining close relations with Simon Fraser University. The current version is DBMiner 2.0 Enterprise released in August 1999.

System Requirements:

- MS Windows NT

- MS SQL Server 7.0's OLAP service

- MS Excel 2000's ability to browse and to visualize OLAP functions

Features:

DBMiner allows for a connection to various vendors' relational database products and data warehouse products. DBMiner is a MS Windows based application that allows the user to interactively manipulate data visualizations (OLAP functions) and to interactively adjust the data mining process. The 3D Cube Explorer viewing utility produces variable views of the warehouse data and changeable views of the results created by OLAP functions. An OLAP browser allows the user to complete OLAP functions such a slicing and dicing.

A user can interact with the data mining process using DMSQL or a GUI interface to help gain the desired data mining results. As of the current release, DBMiner contains three data mining modules along with visualizations for each. The modules are for association rules, classification and clustering. Association rule visualizations include: a spreadsheet, a 3D bar graph or a 3D ball graph. Classification visualizations include: a 3D tree grid and a tabular tree view. Clustering visualizations include: 2D plot graph or a tabular tree grid view.

DBMiner Data Set and Data Cube

DBMiner works on top of a customized data cube. In DBMiner E2.0, it depends on MS SQL Server's OLAP service to build these data cubes. The user must define data on which the OLAP or the mining function will be performed. In order to do that, the user needs to first use SQL server's OLAP service to manipulate data sets, and transfer these data sets into a data cube.

DBMiner E1.1 can integrate with MS Access, MS SQL Server, Oracle 7.3.3 and text files because it simply takes in relational data and builds its data cube internally.
DBMiner E2.0 does not build its own data cube; instead it uses cubes built in MS SQL Server 7.0. SQL Server 7.0 is therefore required to run this version of DBMiner.

After creating the data cubes in SQL server and starting DBMiner, the user will be able to view the DBMiner’s available data cubes and dimension information about these data cubes through a tree structure. This tree structure always appears at the left part of DBMiner’s window to show the user the global picture of the data cubes on the system.

In DBMiner, data cubes appear in two formats: users can see a two-dimensional table using OLAP browser, or users can see a three-dimensional data cube through 3D Cube Explorer. When a user selects a data cube, starts to view data cube or mines knowledge, the following window will appear prompting the user to make a selection. >From here, the user can either view a data cube or begin mining a data cube.

The user can browse a table to get its structure. The user cannot see the detailed data contained in the original data tables, what the user can see is the processed data in the data cube. Now if the user selects to view the data cube through OLAP browser, the user will see the OLAP browser window like the following:

What the OLAP browser shows is always a two-dimensional table. The user specifies a set of fields to create the table the user wants. The user can select three data characteristics: row fields, column fields, and data item, to specify the table the user is interested in. The user can also add constraints on the page fields from the pivot table. These constraints specify what part of the data the user is interested in. The table will then display the corresponding table with the specified fields, like the following.

User can change the scale of the fields and constraints, thus she/he can drill down or roll up the table. This OLAP browser is actually based on MS Excel, so a user also needs to have support from MS Excel in order to run OLAP browser.

If the user selects to view data through 3D Cube Explorer, he will see a picture like this:

The figure shows the three-dimensional data cube. The user can then manipulate this data cube, enlarge or shrink it, rotate it, add or delete the grid, move it, etc. There are buttons for all these functionalities. The user can also select the three dimensions for the 3-D cube, and also change other properties of the data cube through this control window. Slicing or dicing can also be used to prepare the data within the data cube.

In general, we think that DBMiner’s data set and data cube is nice looking and easy to use, but we feel that it is DBMiner’s shortcoming that this version of DBMiner is not an "independent" software. For example, it does not have its own warehousing ability, so it cannot directly handle raw data. This is not convenient for the user because the user has to buy SQL server together with DBMiner, and needs to learn SQL server. At least data files have to be in format readable by the OLAP Services Manager and one has to use the OLAP SM to set up the data cube. Fortunately, Microsoft have provided an excellent tutorial with their product which is largely copied in the DBMiner help. Another example, the DBMiner OLAP browser is based on MS Excel. So, we feel that this version of DBMiner that we have played with is still not mature in its data set/data cube capability. Maybe later versions will improve in this aspect.

Association

If a user wants to do association mining on, say, the "Sales" cube of the MS example file, with five dimensions, namely "Customers", "Education Level", "Gender", "Marital Status" and "Product", simply perform each of the following steps:

Start the Mining Wizard.

Follow the steps below:

Select the data cube "Sales"

Choose "Association" option for mining

Select the five dimensions you are interested in

Specify the association category and various constraints (optional)

Verify the default support and confidence thresholds

Verify the DMQL query and click "Finish"

Analyzing Association Results

Initially the user is presented with a spreadsheet rule form display as shown in following figure:

Once the rules are formed, it is possible to view them in several different ways. The user can choose from the following display bar:

Click To

Display the mining statistics and progress information

Display the frequent itemsets.

Display in rule form.

Display rules in a plane view (LHS, RHS).

Display rules in a ball graph view.

Change ball graph settings.

Change mining settings.

Frequent Itemset View

Frequent itemsets are displayed in a grid as shown below.

3D Bar Chart View

By clicking the button the user can achieve the following plane view.

Each association rule contains the rule body (LHS) and rule head (RHS), which are represented by labeled axes. By pointing to any label on either axis, the text information for that label will be shown in the left upper corner of the screen.

3D Ball Graph View

To view a single rule using the ball graph, a user can simply double-click on the corresponding bar in the rule plane. Clicking on the button in the toolbar, displays all the mined rules in the 3D ball graph format.

A ball graph consists of a set of nodes and arrows. All the nodes are yellow, green or blue. The blue nodes are active nodes representing the items in the rule in which the user is interested. The yellow nodes are passive nodes representing items related to the active nodes in some way. The green nodes merely assist in visualizing two or more items in either the head or the body of the rule. The conventions of a ball graph in DBMiner are as follows.

A circular node represents a frequent (large) data item. The volume of the ball represents the support of the item. Only those items that occur sufficiently frequent (i.e. no less than the minimum support) will appear in the rule graph.

An arrow between two nodes represents the rule implication between the two items. An arrow will be drawn only when the support of a rule is no less than the minimum support for the participating items, and the confidence of the rule is no less than the minimum confidence.

Again, the association rule may be viewed in the text form by placing the cursor over the arrow between the balls for a short period of time. The rule will be displayed right away.

When there are too many arrows in the graph, press CTRL+SELECT (left mouse button) or SHIFT+SELECT to activate, neutralize or disable the balls. The legend in the top right corner tells you what each ball color represents.

Classification

Each user can better understand how to perform classification by considering the following tutorial example. Suppose the user wishes to mine the following, expressed in DMQL:

MINE Classification Rules

ANALYZE Gender

ON DIMENSIONS Customer, Education Level, Product, Promotion Media

from CUBE FoodMart_Sales

seT Classification Threshold = 90.00%

Noise Threshold = 2.00%

Training Set Threshold = 80.00%

Simply perform each of the following steps:

Start the Mining Wizard.

Follow the steps below:

Select the FoodMart_Sales cube
Choose the "Classification" option
Select "Gender" to be analyzed
Select Customer, Education Level, Product, Promotion Media
Focus the query by supplying a where clause (optional)
Verify the DMQL query and click "Finish"

Analyzing Classification Results

Once the rules are formed, it is possible to view them in several different ways. The user can choose from the following display bar:

Click To

Display the mining statistics and progress information.

Display the decision tree in grid form.

Display the decision tree in graph view.

Display the rules derived from the decision tree.

Change the mining settings.

3D Tree Graph View

By default, the decision tree is displayed in graphical form as following:

A classification tree is a hierarchical structure consisting of a set of pie charts and the branches (links) between them. Each node shows the classification situation at that point. Each pie chart represents the distribution proportion of both the classes in "Gender". With generalization and a good classification algorithm, the classification tree is usually relatively small and has a good classification result (many leaves of the tree contain one majority class).

To give users a clear impression of the classification process, the classifier displays the complete tree first. Users may click on any non-leaf node in the tree to expand it (showing all of its children nodes as well) or shrink it (shrinking all of its descendents to the current node).

Tree Grid View

Alternatively, the user may view this in tabular form by clicking the button. A view similar to the one in the following figure will appear.

Clustering

The clustering module currently only uses the k-means algorithm. The user can learn how to use the Clustering module by following a demonstration. Suppose the user wishes to mine the following, expressed in DMQL:

MINE Clustering

ON DIMENSIONS Customer WITH WEIGHT 1.000

Store WITH WEIGHT 1.000

FROM CUBE FoodMart_Sales

SET Number of clusters = 4,

Maximum clustering passes = Unlimited

Simply perform each of the following steps:

Start the Mining Wizard.

Follow these steps:

Select the FoodMart_Sales
Choose the "Clustering" option
Select Product and Store as the involved dimensions.
Focus the query by supplying dimension weights and/or the number of clusters required. (Optional)
Verify the DMQL query and click "Finish"

2D Plot Graph View

The clustering result can be visualized by a plot graph consisting of several sets of plots and center marks. Simply click , the user can get this view. Each plot represents a point in the original data set. Each plot shape, such as triangle, square, etc., represents each cluster. The red crosses marked inside each group of plots, indicate the centers of those clusters.

Tree Grid View

Alternatively, the user may view this in tabular form by clicking the button. A view similar to the one in following figure will appear. "Number of counts" indicates the sum of the cell entries (counts) of each cell in this cluster.

Conclusion:

DBMiner maintains its strong ties with Simon Fraser University. [HANJ1996] DBMiner receives new features and improvements incrementally as research in this new field of data mining expands what is known in the area. Another attractive feature is the price. DBMiner is affordable at $999 U.S. plus applicable taxes. However, DBMiner’s requirements of MS SQL Server 7.0’s OLAP module and MS Excel’s spreadsheet and graphics tools can be a drawback for organizations that are unacquainted with these products though they are widely used.

The clustering module is limited as it currently only uses the k-means algorithm. The number of incomprehensible error messages was also a significant drawback. It was impossible to know why DBMiner had failed to complete the task.

The user interface is intuitive yet powerfully represents the OLAP and data mining information. In previous sections, various figures and their accompanying descriptions backup the claim that DBMiner contains good visualization techniques of the data cube, OLAP functions and data mining functions. The three dimensional display provides an informative overall view of the data mining results to aid in the discovery of useful knowledge.

Although DBMiner was not compared to other software in terms of speed, the MS SQL Server sample data took anywhere form a few seconds to a few minutes to complete various data mining functions. According to [HANJ1996], DBMiner has been tested using several large databases including the NSERC research grant database with satisfactory performance.

Future Improvements:

Currently, there are plans to transform DBMiner into a parallel data mining system in order to increase performance. Another area of increased performance is the research into new, more efficient algorithms. A number of new data mining modules are planed such as: a data dispersion model, a time series analysis module and a prediction module. Because DBMiner has strong ties to Simon Fraser University's research environment, it will continue to evolve as a product that uses cutting edge data mining research.

Bibliography

J. Han, Y. Fu, W. Wan, J. Chiang, O. R. Zaiane, and K. Koperski, "DBMiner: Interactive Mining of Multiple-Level Knowledge in Relational Databases", Proc. 1996 ACM-SIGMOD Int'l Conf. On Management of Data (SIGMOD'96), Montreal, Canada, June 1996.

The Data Mining Research Group, ``DBMiner User Manual'', December 1997.

Jiawei Han, Yongjian Fu, Wei Wang, Jenny Chiang, Wan Gong, Krzysztof Koperski, Deyi Li, Yijun Lu, Amynmohamed Rajan, Nebojsa Stefanovic, Betty Xia, Osmar R. Zaiane, "DBMiner: A System for Mining Knowledge in Large Relational Databases", Proc. 1996 Int'l Conf. on Data Mining and Knowledge Discovery (KDD'96), Portland, Oregon, August 1996, pp. 250-255.

J. Han, J. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, K. Koperski, G. Liu, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaiane, S. Zhang, H. Zhu, "DBMiner: A System for Data Mining in Relational Databases and Data Warehouses", Proc. CASCON'97: Meeting of Minds, Toronto, Canada, November 1997.