Principles of Knowledge Discovery in Data

Projects

Due Date: June 16th, 2006 at 13:00 (hard deadline)
Percentage overall grade: 30%
Penalties: 20% off per day for late delivery
Maximum Marks: 100


Project 1

Spatial outlier detection in the presence of physical constraints.
Outliers are points that deviate significantly from the rest of the data. Outlier detection consists of discovering these abnormalities automatically. Clustering is the grouping of similar data points. Intuitively, clustering and outlier detection are related to each other.
Some work has been published on clustering in the presence of obstacles. In spatial data, we could have obstacles such as rivers or highways that may prevent similar data points from grouping together. An algorithm that takes such physical constraints into account would avoid clustering points that are close together but separated by an obstacle. There could also be bridges that facilitate the connection of far-apart points, and algorithms that take these connectors into consideration would cluster points together if they are linked by such bridges.
Can an outlier detection algorithm also take physical constraints into account when detecting outliers? Yes: obstacles can create outliers, and bridges can bring outliers to join clusters.
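
To make any distance-based method aware of obstacles, one simple option (only an illustrative sketch, not the actual DBCluC mechanism; representing obstacles as line segments and all function names below are my assumptions) is a constrained distance that becomes infinite whenever the straight line between two points crosses an obstacle:

    import math

    def ccw(a, b, c):
        # True if the points a, b, c make a counter-clockwise turn
        return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

    def segments_intersect(p1, p2, q1, q2):
        # Orientation test (general position): the segments p1-p2 and q1-q2 cross
        # iff the endpoints of each segment straddle the other segment.
        return ccw(p1, q1, q2) != ccw(p2, q1, q2) and ccw(p1, p2, q1) != ccw(p1, p2, q2)

    def constrained_distance(p, q, obstacles):
        # obstacles: list of line segments ((x1, y1), (x2, y2)) modelling rivers, highways, ...
        # If the straight path between p and q is blocked, treat the points as infinitely far apart.
        for o1, o2 in obstacles:
            if segments_intersect(p, q, o1, o2):
                return float('inf')
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Two nearby points separated by a vertical "river" segment
    river = [((5.0, -10.0), (5.0, 10.0))]
    print(constrained_distance((4.0, 0.0), (6.0, 0.0), river))   # inf  (blocked by the river)
    print(constrained_distance((4.0, 0.0), (4.5, 0.0), river))   # 0.5  (same side of the river)

A bridge could be modelled the other way around, for instance by leaving a gap in the obstacle segments where the bridge crosses.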

Consider the algorithm DBCluC and the algorithm LOF, both based on DBSCAN. Combine them to produce a spatial outlier detection algorithm that takes physical constraints into account.
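
As a starting point, the sketch below follows the standard LOF computation of Breunig et al. with a brute-force neighbour search (no handling of distance ties; the function names are mine). Making it constraint-aware essentially amounts to passing in a constrained distance such as the one sketched above:

    import math

    def lof_scores(points, k, dist=None):
        # Local Outlier Factor (Breunig et al., 2000), computed by brute force.
        # 'dist' may be an obstacle-aware distance; it defaults to the Euclidean distance.
        # Assumes len(points) > k and that every point has at least k neighbours at finite distance.
        if dist is None:
            dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
        n = len(points)
        knn, kdist = [], []
        for i in range(n):
            d = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
            knn.append([j for _, j in d[:k]])   # indices of the k nearest neighbours of i
            kdist.append(d[k - 1][0])           # k-distance of i
        def lrd(i):
            # local reachability density of point i
            reach = [max(kdist[j], dist(points[i], points[j])) for j in knn[i]]
            return len(reach) / sum(reach)
        lrds = [lrd(i) for i in range(n)]
        # LOF: average ratio of the neighbours' density to the point's own density
        return [sum(lrds[j] for j in knn[i]) / (k * lrds[i]) for i in range(n)]

Scores clearly above 1 indicate local outliers; under a constrained distance, a point whose Euclidean neighbours all lie on the other side of a river would be scored against its reachable neighbourhood instead.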

References:
- O. R. Zaiane and C.-H. Lee, Clustering Spatial Data When Facing Physical Constraints, in Proc. of the IEEE International Conference on Data Mining (ICDM 2002), pp. 737-740, 2002
- O. R. Zaiane and C.-H. Lee, Clustering Spatial Data in the Presence of Obstacles: A Density-Based Approach, in Proc. of the Sixth International Database Engineering and Applications Symposium (IDEAS 2002), pp. 214-223, 2002
- H. Fan, O. R. Zaiane, A. Foss and J. Wu, A Nonparametric Outlier Detection for Effectively Discovering Top-N Outliers from Engineering Data, in Proc. of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2006
- M. M. Breunig, H.-P. Kriegel, R. Ng and J. Sander, LOF: Identifying Density-Based Local Outliers, in Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 93-104, 2000
- M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226-231, 1996


Project 2

Contrast sets for nominal data.
A contrast set is a conjunction of attribute-value pairs, similar to the antecedent of an association rule, whose support differs meaningfully across groups. In other words, contrast sets represent combinations of attributes and values whose frequency differs significantly from one group to another. Contrast sets give insight into the fundamental differences between groups.
Example: we may want to compare good students with not-so-good students, and by mining the student database we may discover the following contrast sets:
(Good_Student => Hours_Week = High AND NB_Years = High, 35%)
(NotGood_Student => Hours_Week = Low AND NB_Years = Low, 25%)
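
To make the example concrete, one can count the support of a candidate conjunction in each group and test whether the difference is significant with a chi-square test on the 2x2 contingency table of group membership versus truth of the conjunction (the kind of test STUCCO applies; the counts below are made up for illustration only):

    def chi_square_2x2(a, b, c, d):
        # Contingency table:               group 1   group 2
        #   conjunction holds                  a         b
        #   conjunction does not hold          c         d
        n = a + b + c + d
        expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                    (c + d) * (a + c) / n, (c + d) * (b + d) / n]
        return sum((o - e) ** 2 / e for o, e in zip([a, b, c, d], expected))

    # Hypothetical counts: Hours_Week = High AND NB_Years = High holds for
    # 35 of 100 good students but only 10 of 100 not-so-good students.
    print(chi_square_2x2(35, 10, 65, 90))

With one degree of freedom, a value above 3.84 is significant at the 0.05 level (before any correction for testing many candidate sets).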

The paper that introduced contrast sets is "Detecting Change in Categorical Data: Mining Contrast Sets" by S. Bay and M. Pazzani, in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 302-306, 1999. An extended version is "Detecting Group Differences: Mining Contrast Sets" by S. Bay and M. Pazzani, in Data Mining and Knowledge Discovery, 5(3), pp. 213-246, 2001.

You are asked to implement their algorithm STUCCO, then adapt the Apriori algorithm for discovering association rules so that it finds contrast sets, and compare the two algorithms.
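
A minimal levelwise sketch of the Apriori conversion is given below. This is not STUCCO: the function names, the single min_dev threshold, and the omission of significance testing and multiple-testing control are simplifications of mine.

    def mine_contrast_sets(group_a, group_b, min_dev=0.2, max_size=3):
        # group_a, group_b: lists of records, each record a set of (attribute, value) pairs.
        # Returns conjunctions whose support differs by at least min_dev between the groups.
        def support(conj, group):
            return sum(1 for record in group if conj <= record) / len(group)

        items = sorted(set().union(*group_a, *group_b))
        results = []
        level, size = [frozenset([i]) for i in items], 1
        while level and size <= max_size:
            promising = []
            for c in level:
                sa, sb = support(c, group_a), support(c, group_b)
                if abs(sa - sb) >= min_dev:
                    results.append((sorted(c), sa, sb))
                # Prune: supersets of c can never deviate by min_dev if c itself
                # is not supported at level min_dev in at least one group.
                if max(sa, sb) >= min_dev:
                    promising.append(c)
            # candidate generation for the next level
            level = {c | {i} for c in promising for i in items if i not in c}
            size += 1
        return results

    # Toy usage with hypothetical student records:
    good = [{("Hours_Week", "High"), ("NB_Years", "High")},
            {("Hours_Week", "High"), ("NB_Years", "Low")}]
    weak = [{("Hours_Week", "Low"), ("NB_Years", "Low")},
            {("Hours_Week", "Low"), ("NB_Years", "High")}]
    print(mine_contrast_sets(good, weak, min_dev=0.5))

The pruning is safe because support is anti-monotone in each group and the support difference of a conjunction can never exceed its larger group support; STUCCO adds the statistical machinery (chi-square tests with a corrected significance level) on top of such a search.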

Other useful references are:
- B. Minaei-Bidgoli, P.-N. Tan and W. Punch, Mining Interesting Contrast Rules for a Web-Based Educational System, in Proc. of the International Conference on Machine Learning (ICML), 2004
- R. J. Hilderman and T. Peckham, A Statistically Sound Alternative Approach to Mining Contrast Sets, in Proc. of the 4th Australasian Data Mining Conference (AusDM), 2005
- G. Webb, S. Butler and D. Newlands, On Detecting Differences Between Groups, in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2003


Project 3

Contrast sets for sequential data.
There is a body of research, in the statistical community as well as in the data mining community, on understanding the differences between groups. Data mining researchers call this problem mining for contrast sets. It is a variant of association rule mining: basically association rules with similar attribute definitions in the antecedents, but with the consequent being one of the groups to compare. Unfortunately for us, the work on contrast sets focuses on discrete data. What if we have groups of sequences, i.e. sequential data?
If we have two sets of sequences A and B and we want to identify the subsequences that distinguish each set from the other, we could proceed by following these steps:

  1. Each group (i.e. set of event sequences) is processed to extract its frequent subsequences (locally in the group). Each group is then represented by its frequent subsequences. You could use any algorithm such as GSP or a prefix-tree-based method, including an implementation from the Internet if you find one that works and that you trust.
  2. Define a distance measure between subsequences of events. Here you can benefit from work in bioinformatics on comparing genes or proteins (i.e. sequences over an alphabet of 4 or 20 symbols). The distance has to be symmetric, but I am not sure whether it has to be a metric (triangle inequality). A minimal edit-distance sketch is given after this list.
  3. Given this distance measure, each frequent subsequence in group A has a closest match in group B. Let us define a contrasting subsequence as a subsequence S1 in group A that has the largest distance to its closest match in group B. S1 may not exist in B, so there is no symmetry, and there is another contrasting subsequence S2 in group B that has the largest distance to its closest match in group A. We could define a certain number k of most contrasting subsequences by taking the k subsequences that are most distant from their respective closest matches.
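
For step 2, one simple candidate, sketched below, is the classical edit (Levenshtein) distance used for bioinformatics-style sequence comparison; whether unit costs suit your event alphabet is for you to decide:

    def edit_distance(s, t):
        # Dynamic-programming edit distance between two event sequences,
        # with unit cost for insertion, deletion and substitution.
        m, n = len(s), len(t)
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            cur = [i] + [0] * n
            for j in range(1, n + 1):
                cur[j] = min(prev[j] + 1,                            # delete s[i-1]
                             cur[j - 1] + 1,                         # insert t[j-1]
                             prev[j - 1] + (s[i - 1] != t[j - 1]))   # substitute or match
            prev = cur
        return prev[n]

    print(edit_distance(list("ABCD"), list("ABD")))   # 1

Edit distance is symmetric and also satisfies the triangle inequality, so it is a metric either way.
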
The last step can be done in a naive, brute-force manner with two nested loops. I can explain to you how to improve this with a simple test that breaks the second loop and makes the algorithm faster.
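
A brute-force sketch of step 3 follows (the names are mine, and the early break shown, stopping as soon as an exact match is found, is just one obvious test, not necessarily the improvement mentioned above):

    def top_k_contrasting(freq_a, freq_b, dist, k=5):
        # freq_a, freq_b: frequent subsequences of groups A and B (lists of event sequences).
        # For every subsequence of A, find the distance to its closest match in B,
        # then return the k subsequences of A that are farthest from their closest match.
        scored = []
        for s in freq_a:
            best = float('inf')
            for t in freq_b:
                d = dist(s, t)
                if d < best:
                    best = d
                if best == 0:      # exact match found: no closer match is possible
                    break
            scored.append((best, s))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    # Because the definition is not symmetric, run it in both directions:
    # top_k_contrasting(freq_a, freq_b, edit_distance) and top_k_contrasting(freq_b, freq_a, edit_distance)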


IMPORTANT:

Deliverables:

  • A short report explaining your project and showing experimental results
  • A demonstration in class
  • The source code of your implementation.

Posted on April 23. Last updated (April 23, 2006 - 21:00)