[an error occurred while processing this directive]
|
|
Activities
Class presentations |
Each student is expected to read and present one paper from the research literature that I provided. You should view this as if
you were presenting the paper at a conference. Be prepared to answer detailed technical questions about the content of the paper or related to the topic of the paper. However, you do not need to be an advocate
for the paper. You do not have to defend it. If you want to
critique it and highlight the problems and weaknesess: go ahead.
In your presentation, you should properly place the work within the field, cite
if possible related work, and clearly indicate the innovative aspects of the work
and its contribution to the field, as well as explain the details of the
ideas, algorithms and experiments presented in the paper.
Students must prepare their own materials for presentation. I highly
recommend powerpoint slides. However, transparencies are fine
too. Bring me or send me by e-mail the final slides one hour before
class. I will make sure the files are transfered to the classroom
computer before the class starts.
These slides are also made available to the other students via our
course web
site.
If the student is not present (does not show up) for his/her presentation, the evaluation is automatically 0 (zero) unless a good medical justification is given. For any serious absence, I should be advised early enough to organize a reshuffling of the presentations.
In addition to the presentation slides, the student making a
presentation is also expected to review the paper, and write a written
report for the assigned paper. The review should be about 2 pages and should be written as if you were reviewing a journal article or a paper submitted to a conference program committee.
The review should contain at least these sections:
1-Brief summary of the main contributions of the paper
2-Elaboration on the positive aspects presented in the paper
3-Elaboration on the negative aspects presented in the paper
4-Comments on how to improve the ideas/issues/experiments presented in the paper.
Here is an article giving suggestions on how to review a paper (Alan Jay Smith, The Task of the Referee, IEEE, 1990)
Reviews should be an independent analysis of the paper.
Collaboration to understand the intricacies of the paper is
acceptable. However, students should write their own reviews and interpretation.
Reviews should be sent by e-mail (in pdf format) the week before the
presentation.. Defaulted reviews count for 0 (zero). The reviews will
be put on-line for all students to read (along with the paper) before
the presentation is given.
Day | Paper | Presenter |
October 28, 2004 |
Closet+: Searching for the best strategies
for mining frequent closed itemsets, J.Wang, J. Han, and J. Pei,
Ninth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining
(SIGKDD),Washington, DC, USA, 2003.
| Nasimeh Asgarian (slides ) |
October 28, 2004 |
CHARM: An
Efficient Algorithm for Closed Itemset Mining
, M. Zaki, and C-J Hsiao, SIAM SDM 2002, Arlington, VA, April 2002.
| Junfeng Wu (slides ) |
Novemver 9, 2004 |
Efficiently mining long patterns from databases, R. J. Bayardo,
ACM SIGMOD international conference on Management of data, 1998.
| Dean Cheng (slides ) |
Novemver 9, 2004 |
Mafia: A maximal frequent
itemset algorithm for transactional databases ,
D. Burdick, M. Calimlim, and J. Gehrke, 17th International
Conference on Data Engineering (ICDE), April 2001.
See also:
MAFIA:
A Performance Study of Mining Maximal Frequent Itemsets, Doug
Burdick, Manuel Calimlim, Jason Flannick, Johannes Gehrke, and Tomi
Yiu, Workshop on Frequent Itemset Mining Implementations
(FIMI'03). Melbourne, Florida, November 2003.
| Ben Chu (slides ) |
November 16, 2004 |
Dualminer: A dual-pruning algorithm for itemsets
with constraints , C. Bucila, J. Gehrke, D. Kifer, and W. White,
Data Mining and Knowledge Discovery, Vol. 7, Issue 4, July 2003, pages
241-272
| Paul Nalos (slides ) |
November 16, 2004 |
Constrained
Frequent Pattern Mining: A Pattern-Growth View , J. Han and J. Pei, ACM SIGKDD Explorations
(Special Issue on Constraints in Data Mining), June 2002.
| Rafal Rak (slides ) |
November 18, 2004 |
Mining Sequential Patterns , R. Agrawal, R. Srikant, International
Conference on Data Engineering (ICDE), 1995.
| Yunping Wang (slides ) |
November 18, 2004 |
PrefixSpan: Mining sequential patterns efficiently by prefix-projected
pattern growth ,
J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M. Hsu,
17th International
Conference on Data Engineering (ICDE), April 2001.
| Wojciech Stach (slides ) |
November 23, 2004 |
A Robust Outlier Detection Scheme in Large Data Sets ,
J. Tang, Z. Chen, A. Fu , D. Cheung, the Sixth Pacific-Asia Conference
on Knowledge Discovery and Data Mining, (PAKDD), Taipei, 6-8 May, 2002.
| John Sheldon (slides ) |
November 23, 2004 |
LOF: Identifying Density-Based Local Outliers , M. Breunig, H.-P. Kriegel, R. Ng, and
J. Sander, ACM SIGMOD Int. Conf. on Management of Data, 2000.
| Jessica Enright (slides ) |
November 25, 2004 |
Mining Top-n Local Outliers in Large Databases ,
W. Jin, K.H. Tung and J. Han, ACM SIGKDD 2001, San Jose, California,
Aug. 2001.
| Hongqin Fan |
November 25, 2004 |
Rainforest - a
framework for fast decision tree construction of large datasets , J. Gehrke, R. Ramakrishnan and V. Ganti,
Proc. Very Large DataBases (VLDB), 1998.
| Leila Homaeian |
November 30, 2004 |
Data Bubbles for Non-Vector Data: Speeding-up Hierarchical
Clustering in Arbitrary Metric Spaces, ,
J.Zhou and J. Sander, Conf. on Very Large DataBases (VLDB), 2003.
| Jon Klippenstein |
November 30, 2004 |
Privacy-Preserving Data Mining ,
R. Agrawal and R. Srikant, ACM SIGMOD 2000, Dallas, May 2000.
| Haobin Li |
Important Dates
See table of the paper presentations above.
The class presentation will count 16% of the overall grade.
|
Assignments |
There are 4 assignments. The first is an implementation of Frequent
Itemset mining, the second is an evaluation of a data mining tool, the
thisr is the evaluation of classification models using Weka, and the
last one is a review of a paper presented in class (see above).
Important Dates
Assignments will count together 20% of the overall grade.
|
Projects |
All projects should be implemented preferably using C++ on
kylix. Kylix is a freely available C++ programming environment (compiler
+ debugger) for Linux by Borland. Projects dealing with text documents
can also use scripting languages such as Perl or
Python. Exceptionally, Java can be used for web-based interfaces.
Teams need to meet with me on these days:
Monday October 25 |
9:00 to 10:00 | Rafal Rak + Wojciech (project 4)
Jon Klippenstein (project 6) Haobin Li (project 9)
|
11:00 to 11:30 | Nasimeh Asgarian (project 2) |
11:30 to 12:00 | Paul Nalos (project 1) |
12:30 to 13:00 | Leila Homaeian (project 7) |
13:00 to 13:30 | Junfeng Wu and Hongqin Fan (project 8) |
Tuesday October 26 |
11:00 to 11:30 | John Sheldon + Dean Cheng (project 11) |
11:30 to 12:00 | Ben Chu + Jess Enright (project 13) |
Wednesday October 27 |
12:00 to 12:30 | Yunping Wang (project 3) |
Monday November 8-15-22 |
9:00 to 9:30 | Rafal Rak + Wojciech (project 4) |
9:30 to 10:00 | Jon Klippenstein (project 6) |
10:00 to 10:30 | Haobin Li (project 9) |
11:00 to 11:30 | Nasimeh Asgarian (project 2) |
11:30 to 12:00 | Paul Nalos (project 1) |
12:30 to 13:00 | Leila Homaeian (project 7) |
14:00 to 14:30 | Junfeng Wu and Hongqin Fan (project 8) |
Tuesday November 9-16-23 |
11:00 to 11:30 | John Sheldon + Dean Cheng (project 11) |
11:30 to 12:00 | Ben Chu + Jess Enright (project 13) |
Wednesday November 10-17-24 |
12:00 to 12:30 | Yunping Wang (project 3) |
Suggested Projects: (strong suggestion)
Project 1: Constaint-Based Mining
One or Two People (coordinated by Osmar and Mohammad El-Hajj)
Paul Nalos
Constraint-based Association Rule Mining for queries such as "find all
frequent itemsets where the total price is no more than $100" has
received much attention, for it can improve the mining
efficiency. Some important classes of rule constraints such as
antimonotone, monotone, succinct constraint are discussed in
Ch6. In this project, you are required to implement these
constraints based on the Apriori algorithm, and compare your
constraint-based algorithm with the original Apriori algorithm. When
implementing the rule constraints, the naive approach is to deal with
them one by one, and only one type of constraint is handled at a
time. Can you handle more then one type at the same time to further
improve the efficiency? For example, can you simultaneously take
advantage of both monotone and antimonotone to mine constraint-based
itemsets? Implement DualMiner to take care of both types of constraints.
Project 2: Sequential Patterns vs. Association Rules
One Person (coordinated by Osmar)
Nasimeh Asgarian
Sequential Patterns and Association Rules are two similar data mining
techniques. They are only different in that the patterns in the former
are time-ordered. It seems that Sequential Patterns plays a more
important role in Web Usage Mining since it considers the time order
between items. However, some people argue that when the application is
to provide dynamic recommendations for web users, Association Rules
performs better. In this project, you are required to build a
prototype of a recommender agent using both of the techniques and
compare their performance in your system.
Project 3: Sequential Patterns mining using COFI trees
One Person (coordinated by Osmar and Mohammad El-Hajj)
Yunping Wang
GSP finds sequential patterns based on the Apriori algorithm. Treespan
uses the FP-tree idea. Can we take advantage of the COFI tree aproach
to find sequential patterns?
Project 4: Associative classifier with reoccurring items
Two People (coordinated by Osmar and Luiza Antonie)
Rafal Rak and Wojciech Stach
Recently a new classification method was proposed in the Data Mining
community: associative classification. In this case the learning model
consists of association rules mined. The main idea behind this
approach is to discover strong patterns that are associated with the
class labels. Three such models were presented in the literature:
CMAR, CBA and ARC-AC/ARC-BC. All of these algorithms deal with binary
association rules and assumes attribute values to appear at most ones
in each transaction and use simple apriori-like or FP-growth to
associations. Consider that attribute values can reoccur in the same
transactions, devise an associative classifier that takes into account
reoccurring items.
Project 5: User interface for text categorization
One Person (coordinated by Osmar and Luiza Antonie)
Knowledge Discovery is an iterative process and even data mining one
single step of this process is also iterative, interactive sub-process
needing a lot of tuning, trials, adjustments, interpretations,
etc. This also true for the case of text categorization. There is a
step in which a user selects a document set, designates a split
(training/testing) and starts training. The model is tested and
evaluated in a second step, possibly using different accuracy
measures. In a final step the model is used to categorize new
documents. One can also consider yet another step where the result of
the classification (or the model) is visualized. Design a user
interface that incorporates all these steps for a text document
categorizer using an associative classifier. Assume you have a module
that gets as input transactions and some thresholds and returns a set
of association rules, as well as a module that classifies new
transactions. The user interface should allow to control the
parameters, the control of different accuracy measures, as well as
visualize the association rules (model) and the classification results.
Project 6: Protein localization
One Person (coordinated by Osmar and Luiza Antonie)
Jon Klippenstein
Recently a new classification method was proposed in the Data Mining
community: associative classification. In this case the learning model
consists of association rules mined. The main idea behind this
approach is to discover strong patterns that are associated with the
class labels. Three such models were presented in the literature:
CMAR, CBA and ARC-AC/ARC-BC. None of these algorithms have been tested
on biological data for protein localization. Given features from
sequences of amino-acides such as protein compositions or frequent sub
sequences, can we use associative classifiers to predict whether a
protein is intra or extra cellular?
Project 7: Privacy Preservation
One Person (coordinated by Osmar and Stanley Oliveira)
Leila Homaeian
The sanitization of transactional data is preprocessing that
eliminates items and transactions from transactional data to avoid the
discovery of sensitive association rules. The project consists of
using sanitization algorithms to "blacken" sensitive text from textual
documents. Given a text database, Apriori can be used to find and
display all associations of text. Selecting some sensitive rules to
hide, IGA (or other algorithm) could be used to identify text to
sanitize.
Project 8: Outlier detection
Two People (coordinated by Osmar and Andrew Foss)
Junfeng Wu and Hongqin Fan
LOF and COF are measures are given to points in space to rank and
identify local outliers. TURN* is a clustering algorithm that also
identifies outliers. Can we use the resolution idea behind TURN* to
identify top-n local outliers based on distance?
Project 9: SPAM detection
One Person (coordinated by Osmar and Luiza Antonie)
Haobin Li
e-mail is either SPAM or non SPAM. This is a typical 2-class
classification problem? Which classification model would work best in
this case: Naive Bayesian, Associative classifier, or other?
Project 10: Immersive data mining
One Person (coordinated by Osmar and Ayman Ammoura)
An initial prototype for visualizing 3-dimentional data cubes in an
interactive virtual-reality immersive environment has been implemented
using the CAVE at UofA. The project would be to reimplement this
prototype using the SGI performer library and include an 3D
interactive torus menu system allowing OLAP operations and other
ad-hoc data mining operations.
Project 11: Gene Promoter Sequence Analysis
Two People (Coordinated by Osmar and Marcelo Marcet)
John Sheldon and Dean Cheng
The central dogma of molecular biology states that genetic information flows from DNA code (genes), to RNA code and then to protein code. Proteins could be thought of as the cell's hands. When proteins are produced at abnormal concentrations (too high or too low) cell processes are affected in many cases leading to live threatening disorders such as asthma. Thus it is important to understand how proteins production is regulated.
The most important mechanism to control protein production takes place at the DNA level inside the nucleus of the cell. The process of converting DNA code to RNA code (transcription) is the first level at which the cell controls protein production. Transcription regulation takes place in a DNA region called the promoter. This DNA sequence, unlike genes, does not code for protein. The promoter is directly in front of genes, and for the gene to be transcribed, a number of special proteins called transcription factors must bind to specific DNA sequences in the promoter to form a 3D DNA-protein complex which will determine whether the gene is transcribed or not. Different proteins are regulated in different ways because their genes have different promoter sequences. The aim of this study is to understand and compare different promoter sequences. The promoter-analysis software will be developed to extract promoter sequences from a database whereas the DNA sequence of transcription factors-binding sites will be obtained from a second database. The promoter-analysis software should also provide a graphic interface to facilitate the analysis and comparison of promoters.
Project 12: Sumarization
Two People (coordinated by Osmar)
Characteristic rules and discriminant rules are very useful
descriptive patterns for large data. The project consists of
implementing the Attribute-oriented induction algorithm to generate
interactive table and cross-table as well as typical charts for
characterization and discrimination. Interactivity includes OLAP-like
operations. An existing package for charting would be used.
Project 13: Intelligent Tutoring Agent
Two People (coordinated by Osmar)
Ben Chu and Jessica Enright
The project consists of building a prototype for a tutoring system
providing exercises about functional dependencies. The tutoring system
would store user actions in activity logs and mine these logs to
identify user behaviours that could be used for improving the tutoring
agent recommendations.
Project demo schedule
Group | Date |
Project 11: Gene Promoter Sequence Analysis
John Sheldon and Dean Cheng | December 2 |
Project 8: Outlier detection
Junfeng Wu and Hongqin Fan | December 2 |
Project 4: Associative classifier with reoccurring items
Rafal Rak and Wojciech Stach | December 2 |
Project 13: Intelligent Tutoring Agent
Ben Chu and Jessica Enright | December 2 |
Project 9: SPAM detection
Haobin Li | November 30 (CSC B-02 3:00-3:30) |
Project 1: Constaint-Based Mining
Paul Nalos | December 7 |
Project 2: Sequential Patterns vs. Association Rules
Nasimeh Asgarian | December 7 |
Project 3: Sequential Patterns mining using COFI trees
Yunping Wang | December 7 |
Project 6: Protein localization
Jon Klippenstein | December 7 |
Project 7: Privacy Preservation
Leila Homaeian | December 7 |
Important Dates
- Project proposal due: Tuesday 19 October. (10%)
- Project Initial Demo: Week of November 22-26. (35%)
- Project report due: Thursday December 2. (15%)
- Final Demo: Thursday December 2 and Tuesday December 7. (40%)
Projects will count 39% of the overall grade.
|
[an error occurred while processing this directive]
|