CMPUT 690: KDD Principales

[an error occurred while processing this directive]

                 

Activities


Class presentations
Each student is expected to read and present one paper from the research literature that I provided. You should view this as if you were presenting the paper at a conference. Be prepared to answer detailed technical questions about the content of the paper or related to the topic of the paper. However, you do not need to be an advocate for the paper. You do not have to defend it. If you want to critique it and highlight the problems and weaknesess: go ahead. In your presentation, you should properly place the work within the field, cite if possible related work, and clearly indicate the innovative aspects of the work and its contribution to the field, as well as explain the details of the ideas, algorithms and experiments presented in the paper.

Students must prepare their own materials for presentation. I highly recommend powerpoint slides. However, transparencies are fine too. Bring me or send me by e-mail the final slides one hour before class. I will make sure the files are transfered to the classroom computer before the class starts. These slides are also made available to the other students via our course web site.

If the student is not present (does not show up) for his/her presentation, the evaluation is automatically 0 (zero) unless a good medical justification is given. For any serious absence, I should be advised early enough to organize a reshuffling of the presentations.

In addition to the presentation slides, the student making a presentation is also expected to review the paper, and write a written report for the assigned paper. The review should be about 2 pages and should be written as if you were reviewing a journal article or a paper submitted to a conference program committee.

The review should contain at least these sections:
1-Brief summary of the main contributions of the paper
2-Elaboration on the positive aspects presented in the paper
3-Elaboration on the negative aspects presented in the paper
4-Comments on how to improve the ideas/issues/experiments presented in the paper.

Here is an article giving suggestions on how to review a paper (Alan Jay Smith, The Task of the Referee, IEEE, 1990)

Reviews should be an independent analysis of the paper. Collaboration to understand the intricacies of the paper is acceptable. However, students should write their own reviews and interpretation.

Reviews should be sent by e-mail (in pdf format) the week before the presentation.. Defaulted reviews count for 0 (zero). The reviews will be put on-line for all students to read (along with the paper) before the presentation is given.
DayPaperPresenter
October 28, 2004 Closet+: Searching for the best strategies for mining frequent closed itemsets, J.Wang, J. Han, and J. Pei, Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD),Washington, DC, USA, 2003. Nasimeh Asgarian (slides )
October 28, 2004 CHARM: An Efficient Algorithm for Closed Itemset Mining , M. Zaki, and C-J Hsiao, SIAM SDM 2002, Arlington, VA, April 2002. Junfeng Wu (slides )
Novemver 9, 2004 Efficiently mining long patterns from databases, R. J. Bayardo, ACM SIGMOD international conference on Management of data, 1998. Dean Cheng (slides )
Novemver 9, 2004 Mafia: A maximal frequent itemset algorithm for transactional databases , D. Burdick, M. Calimlim, and J. Gehrke, 17th International Conference on Data Engineering (ICDE), April 2001.

See also:

MAFIA: A Performance Study of Mining Maximal Frequent Itemsets, Doug Burdick, Manuel Calimlim, Jason Flannick, Johannes Gehrke, and Tomi Yiu, Workshop on Frequent Itemset Mining Implementations (FIMI'03). Melbourne, Florida, November 2003.

Ben Chu (slides )
November 16, 2004 Dualminer: A dual-pruning algorithm for itemsets with constraints , C. Bucila, J. Gehrke, D. Kifer, and W. White, Data Mining and Knowledge Discovery, Vol. 7, Issue 4, July 2003, pages 241-272 Paul Nalos (slides )
November 16, 2004 Constrained Frequent Pattern Mining: A Pattern-Growth View , J. Han and J. Pei, ACM SIGKDD Explorations (Special Issue on Constraints in Data Mining), June 2002. Rafal Rak (slides )
November 18, 2004 Mining Sequential Patterns , R. Agrawal, R. Srikant, International Conference on Data Engineering (ICDE), 1995. Yunping Wang (slides )
November 18, 2004 PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth , J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M. Hsu, 17th International Conference on Data Engineering (ICDE), April 2001. Wojciech Stach (slides )
November 23, 2004 A Robust Outlier Detection Scheme in Large Data Sets , J. Tang, Z. Chen, A. Fu , D. Cheung, the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining, (PAKDD), Taipei, 6-8 May, 2002. John Sheldon (slides )
November 23, 2004 LOF: Identifying Density-Based Local Outliers , M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander, ACM SIGMOD Int. Conf. on Management of Data, 2000. Jessica Enright (slides )
November 25, 2004 Mining Top-n Local Outliers in Large Databases , W. Jin, K.H. Tung and J. Han, ACM SIGKDD 2001, San Jose, California, Aug. 2001. Hongqin Fan
November 25, 2004 Rainforest - a framework for fast decision tree construction of large datasets , J. Gehrke, R. Ramakrishnan and V. Ganti, Proc. Very Large DataBases (VLDB), 1998. Leila Homaeian
November 30, 2004 Data Bubbles for Non-Vector Data: Speeding-up Hierarchical Clustering in Arbitrary Metric Spaces, , J.Zhou and J. Sander, Conf. on Very Large DataBases (VLDB), 2003. Jon Klippenstein
November 30, 2004 Privacy-Preserving Data Mining , R. Agrawal and R. Srikant, ACM SIGMOD 2000, Dallas, May 2000. Haobin Li

Important Dates

See table of the paper presentations above.

The class presentation will count 16% of the overall grade.

Assignments
There are 4 assignments. The first is an implementation of Frequent Itemset mining, the second is an evaluation of a data mining tool, the thisr is the evaluation of classification models using Weka, and the last one is a review of a paper presented in class (see above).

Important Dates

 Due DateComments
Assignment 1October 8Counts for 5%
Assignment 2November 5counts for 5%
Assignment 3TBAcounts for 5%
Assignment 4variablecounts for 5%

Assignments will count together 20% of the overall grade.

Projects
All projects should be implemented preferably using C++ on kylix. Kylix is a freely available C++ programming environment (compiler + debugger) for Linux by Borland. Projects dealing with text documents can also use scripting languages such as Perl or Python. Exceptionally, Java can be used for web-based interfaces.

Teams need to meet with me on these days:
Monday October 25
9:00 to 10:00Rafal Rak + Wojciech (project 4)
Jon Klippenstein (project 6)
Haobin Li (project 9)
11:00 to 11:30Nasimeh Asgarian (project 2)
11:30 to 12:00Paul Nalos (project 1)
12:30 to 13:00Leila Homaeian (project 7)
13:00 to 13:30Junfeng Wu and Hongqin Fan (project 8)
Tuesday October 26
11:00 to 11:30John Sheldon + Dean Cheng (project 11)
11:30 to 12:00Ben Chu + Jess Enright (project 13)
Wednesday October 27
12:00 to 12:30Yunping Wang (project 3)
Monday November 8-15-22
9:00 to 9:30Rafal Rak + Wojciech (project 4)
9:30 to 10:00Jon Klippenstein (project 6)
10:00 to 10:30Haobin Li (project 9)
11:00 to 11:30Nasimeh Asgarian (project 2)
11:30 to 12:00Paul Nalos (project 1)
12:30 to 13:00Leila Homaeian (project 7)
14:00 to 14:30Junfeng Wu and Hongqin Fan (project 8)
Tuesday November 9-16-23
11:00 to 11:30John Sheldon + Dean Cheng (project 11)
11:30 to 12:00Ben Chu + Jess Enright (project 13)
Wednesday November 10-17-24
12:00 to 12:30Yunping Wang (project 3)

Suggested Projects: (strong suggestion)

Project 1: Constaint-Based Mining

One or Two People (coordinated by Osmar and Mohammad El-Hajj)
Paul Nalos
Constraint-based Association Rule Mining for queries such as "find all frequent itemsets where the total price is no more than $100" has received much attention, for it can improve the mining efficiency. Some important classes of rule constraints such as antimonotone, monotone, succinct constraint are discussed in Ch6. In this project, you are required to implement these constraints based on the Apriori algorithm, and compare your constraint-based algorithm with the original Apriori algorithm. When implementing the rule constraints, the naive approach is to deal with them one by one, and only one type of constraint is handled at a time. Can you handle more then one type at the same time to further improve the efficiency? For example, can you simultaneously take advantage of both monotone and antimonotone to mine constraint-based itemsets? Implement DualMiner to take care of both types of constraints.

Project 2: Sequential Patterns vs. Association Rules

One Person (coordinated by Osmar)
Nasimeh Asgarian
Sequential Patterns and Association Rules are two similar data mining techniques. They are only different in that the patterns in the former are time-ordered. It seems that Sequential Patterns plays a more important role in Web Usage Mining since it considers the time order between items. However, some people argue that when the application is to provide dynamic recommendations for web users, Association Rules performs better. In this project, you are required to build a prototype of a recommender agent using both of the techniques and compare their performance in your system.

Project 3: Sequential Patterns mining using COFI trees

One Person (coordinated by Osmar and Mohammad El-Hajj)
Yunping Wang
GSP finds sequential patterns based on the Apriori algorithm. Treespan uses the FP-tree idea. Can we take advantage of the COFI tree aproach to find sequential patterns?

Project 4: Associative classifier with reoccurring items

Two People (coordinated by Osmar and Luiza Antonie)
Rafal Rak and Wojciech Stach
Recently a new classification method was proposed in the Data Mining community: associative classification. In this case the learning model consists of association rules mined. The main idea behind this approach is to discover strong patterns that are associated with the class labels. Three such models were presented in the literature: CMAR, CBA and ARC-AC/ARC-BC. All of these algorithms deal with binary association rules and assumes attribute values to appear at most ones in each transaction and use simple apriori-like or FP-growth to associations. Consider that attribute values can reoccur in the same transactions, devise an associative classifier that takes into account reoccurring items.

Project 5: User interface for text categorization

One Person (coordinated by Osmar and Luiza Antonie)
Knowledge Discovery is an iterative process and even data mining one single step of this process is also iterative, interactive sub-process needing a lot of tuning, trials, adjustments, interpretations, etc. This also true for the case of text categorization. There is a step in which a user selects a document set, designates a split (training/testing) and starts training. The model is tested and evaluated in a second step, possibly using different accuracy measures. In a final step the model is used to categorize new documents. One can also consider yet another step where the result of the classification (or the model) is visualized. Design a user interface that incorporates all these steps for a text document categorizer using an associative classifier. Assume you have a module that gets as input transactions and some thresholds and returns a set of association rules, as well as a module that classifies new transactions. The user interface should allow to control the parameters, the control of different accuracy measures, as well as visualize the association rules (model) and the classification results.

Project 6: Protein localization

One Person (coordinated by Osmar and Luiza Antonie)
Jon Klippenstein
Recently a new classification method was proposed in the Data Mining community: associative classification. In this case the learning model consists of association rules mined. The main idea behind this approach is to discover strong patterns that are associated with the class labels. Three such models were presented in the literature: CMAR, CBA and ARC-AC/ARC-BC. None of these algorithms have been tested on biological data for protein localization. Given features from sequences of amino-acides such as protein compositions or frequent sub sequences, can we use associative classifiers to predict whether a protein is intra or extra cellular?

Project 7: Privacy Preservation

One Person (coordinated by Osmar and Stanley Oliveira)
Leila Homaeian
The sanitization of transactional data is preprocessing that eliminates items and transactions from transactional data to avoid the discovery of sensitive association rules. The project consists of using sanitization algorithms to "blacken" sensitive text from textual documents. Given a text database, Apriori can be used to find and display all associations of text. Selecting some sensitive rules to hide, IGA (or other algorithm) could be used to identify text to sanitize.

Project 8: Outlier detection

Two People (coordinated by Osmar and Andrew Foss)
Junfeng Wu and Hongqin Fan
LOF and COF are measures are given to points in space to rank and identify local outliers. TURN* is a clustering algorithm that also identifies outliers. Can we use the resolution idea behind TURN* to identify top-n local outliers based on distance?

Project 9: SPAM detection

One Person (coordinated by Osmar and Luiza Antonie)
Haobin Li
e-mail is either SPAM or non SPAM. This is a typical 2-class classification problem? Which classification model would work best in this case: Naive Bayesian, Associative classifier, or other?

Project 10: Immersive data mining

One Person (coordinated by Osmar and Ayman Ammoura)
An initial prototype for visualizing 3-dimentional data cubes in an interactive virtual-reality immersive environment has been implemented using the CAVE at UofA. The project would be to reimplement this prototype using the SGI performer library and include an 3D interactive torus menu system allowing OLAP operations and other ad-hoc data mining operations.

Project 11: Gene Promoter Sequence Analysis

Two People (Coordinated by Osmar and Marcelo Marcet)
John Sheldon and Dean Cheng
The central dogma of molecular biology states that genetic information flows from DNA code (genes), to RNA code and then to protein code. Proteins could be thought of as the cell's hands. When proteins are produced at abnormal concentrations (too high or too low) cell processes are affected in many cases leading to live threatening disorders such as asthma. Thus it is important to understand how proteins production is regulated. The most important mechanism to control protein production takes place at the DNA level inside the nucleus of the cell. The process of converting DNA code to RNA code (transcription) is the first level at which the cell controls protein production. Transcription regulation takes place in a DNA region called the promoter. This DNA sequence, unlike genes, does not code for protein. The promoter is directly in front of genes, and for the gene to be transcribed, a number of special proteins called transcription factors must bind to specific DNA sequences in the promoter to form a 3D DNA-protein complex which will determine whether the gene is transcribed or not. Different proteins are regulated in different ways because their genes have different promoter sequences. The aim of this study is to understand and compare different promoter sequences. The promoter-analysis software will be developed to extract promoter sequences from a database whereas the DNA sequence of transcription factors-binding sites will be obtained from a second database. The promoter-analysis software should also provide a graphic interface to facilitate the analysis and comparison of promoters.

Project 12: Sumarization

Two People (coordinated by Osmar)
Characteristic rules and discriminant rules are very useful descriptive patterns for large data. The project consists of implementing the Attribute-oriented induction algorithm to generate interactive table and cross-table as well as typical charts for characterization and discrimination. Interactivity includes OLAP-like operations. An existing package for charting would be used.

Project 13: Intelligent Tutoring Agent

Two People (coordinated by Osmar)
Ben Chu and Jessica Enright
The project consists of building a prototype for a tutoring system providing exercises about functional dependencies. The tutoring system would store user actions in activity logs and mine these logs to identify user behaviours that could be used for improving the tutoring agent recommendations.

Project demo schedule
GroupDate
Project 11: Gene Promoter Sequence Analysis
John Sheldon and Dean Cheng
December 2
Project 8: Outlier detection
Junfeng Wu and Hongqin Fan
December 2
Project 4: Associative classifier with reoccurring items
Rafal Rak and Wojciech Stach
December 2
Project 13: Intelligent Tutoring Agent
Ben Chu and Jessica Enright
December 2
Project 9: SPAM detection
Haobin Li
November 30 (CSC B-02 3:00-3:30)
Project 1: Constaint-Based Mining
Paul Nalos
December 7
Project 2: Sequential Patterns vs. Association Rules
Nasimeh Asgarian
December 7
Project 3: Sequential Patterns mining using COFI trees
Yunping Wang
December 7
Project 6: Protein localization
Jon Klippenstein
December 7
Project 7: Privacy Preservation
Leila Homaeian
December 7

Important Dates

  • Project proposal due: Tuesday 19 October. (10%)
  • Project Initial Demo: Week of November 22-26. (35%)
  • Project report due: Thursday December 2. (15%)
  • Final Demo: Thursday December 2 and Tuesday December 7. (40%)

Projects will count 39% of the overall grade.

[an error occurred while processing this directive]
Copyright Osmar R. Zaiane, 1999