Principles of Knowledge Discovery in Data
Projects
Due Date: June 16th, 2006 at 13:00(hard deadline)
Percentage overall grade: 30%
Penalties: 20% off a day for late delivery
Maximum Marks: 100
Project 1
Spatial outlier detection in the presence of physical constraints.
Outliers are points that deviates significantly from the rest of the
data. Outlier detection consists in discovering these abnormalities
automatically. Clustering is grouping similar data
points. Intuitively, clustering and outlier detection are related to
each other. There is some work published about clustering in the
presence of obstacles. In spacial data, we could have obstacles such
as rivers or highways that may prevent similar datapoints from
grouping together. An algorithm that takes into account such physical
constraints would avoid clustering points that are close together but
separated by an obstacle. There could be also bridges that facilitate
the connection of far apart points and algorithms that take these
connectors into consideration would cluster points together if their
are linked by such bridges.
Can an outlier detection algorithm also consider physical constraints
in the detection of outliers? Yes. Obstacles can create outliers and
bridges can bring outliers to join clusters.
Consider the algorithm DBCluC and the algorithm LOF, both based on
DBScan. Combine them all together to generate a spatial outlier
detection that considers physical constraints.
References:
- O. R. Zaiane, C-H. Lee, Clustering Spatial Data When Facing
Physical Constraints , in Proc. of the IEEE International Conference
on Data Mining (ICDM'2002), pp 737-740, 2002
- O. R. Zaiane and Chi-Hoon Lee, Clustering Spatial Data in the
Presence of Obstacles: a Density-Based Approach, Sixth International
Database Engineering and Applications Symposium (IDEAS 2002), pp
214-223, 2002
- H. Fan, O. R. Zaiane, A. Foss, J. Wu, A
Nonparametric Outlier Detection for Effectively Discovering Top-N
outliers from Engineering Data, Pacific-Asia Conference on Knowledge
Discovery and Data Mining (PAKDD), 2006
-Breunig M. M., Kriegel H.-P., Ng R., Sander J.: LOF: Identifying
Density-Based Local Outliers, Proc. ACM SIGMOD Int. Conf. on
Management of Data (SIGMOD), pp. 93-104, 2000
- Ester M., Kriegel H.-P., Sander J., Xu X.: A Density-Based Algorithm
for Discovering Clusters in Large Spatial Databases with Noise,
Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 226-231., 1996
Project 2
Contrast sets for nominal data.
A contrast set is a conjunction of association-rule-like attribute-value
pairs for which a conjunction is true for some group. In other words,
they represent attributes, values and instances that differ
significantly across groups. Contrast sets give insight into the
fundamental differences between groups.
Example: we may want to compare good students from not so good
students and mining the database of students we may discover the
following contrast set (Good_Student => Hours_week = High AND
NB_Years = High, 35%) (NotGood_Student => Hours_Week = Low AND
NB_Years=Low, 25%)
The paper that introduced contrast sets is "Detecting Change in
Categorical Data: Mining Contrast Sets" by S. Bay and M. Pazzani,
in ACM SIGKDD International Conference on Knowledge Discoveryu and
Data Mining, p 302-306, 1999. An extended version is "Detecting
Group Differences: Mining Contrast Sets" by S. Bay and M. Pazzani,
in Data Mining and Knowledge Discovery, 5(3), p 213-246, 2001.
You are asked to implement their algorithm STUCCO then convert the
Apriori algorithm for discovering association rules to find contrast
sets and compare the 2 algorithms.
Other useful references are:
- Mining Interesting Contrast Rules for web-based Educational System, by
B. Minaei-Bidgoli, P-N Tan, W. Punch, in International Conference on
Machine Learning (ICML), 2004
- A Statistically Sound Alternative Approach
to Mining Contrast Sets, by R.J. Hilderman and T. Peckham, in 4th
Australasian Data Mining Conference (AusDM), 2005
- On detecting differences between groups, by G. Webb, S. Butler,
D. Newlands, in ACM SIGKDD, 2003.
Project 3
Contrast sets for sequential data.
There is a body of research regarding understanding the differences
between groups from the statistical community as well as the data
mining community. Data mining researchers call this problem mining for contrast sets. It
is a variant of association rules mining: basically association rules
with similar attribute definitions in the antecedents but with the
consequent either one of the groups to compare. Unfortunately for us,
the works on contrast sets focus on discrete data. What is we have
groups of sequences - i.e. sequential data?
If we have two sets of sequences A and B and we want to identify the
subsequences that distinguish each set from the other we could proceed
the by following these steps:
- each group (i.e. set of event sequences) is treated to extract
the frequent subsequences (locally in the group). Now each
group is
represented by its frequent subsequences. You could use any
algorithm such as GSP or prefix-tree from the internet if you
find one implementation that works and you trust.
- define a distance measure between subsequences of
events. Here you can benefit from work in bioinformatics
comparing
genes or proteins (i.e. sequences of an alphabet of 4 or 20
symbols).
The distance has to be symmetric but I am not sure if it has
to be a
metric (triangle of inequality).
- Given this distance measure, each frequent subsequence in group
A has a closest match in group B.
Let's define a contrasting subsequence a subsequence S1 in
group
A that has the largest distance to its closest match in group
B. S1 may
not exist in B so there is no symmetry and there is another
contrasting
subsequence S2 in group B that has the largest distance to its
closest
match in group A.
We could define a certain number k of most contrasting
subsequences by
taking the k most distant subsequences from their respective
closest
matches.
The last step can be done with a naive way in a brute force maner with
two nested loops. I can explai to you who to improve this with a
simple test to break the second loop and make the algorithm faster.
IMPORTANT:
Deliverables:
- A short report explaining your project and showing experimental
results
- A demonstration in class
- The source code of your implementation.
|