Assignment 1
CMPUT 695 (Fall 2004)
Due Date: October 8th, 2004 at 17:00(hard deadline)
Percentage overall grade: 5%
Penalties: 20% off a day for late assignments
Maximum Marks: 10
This assignment is about mining for association rules. You are asked
to implement (on your own), without copying any code from any on-line
sources, the Apriori algorithm as published in the paper:
The algorithm for finding frequent itemsets is as follows:
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k
INPUT: database and min_support
L1 = {frequent items};
for (k = 1; Lk is not empty; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in
Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return all Lk;
|
To find all strong rules, the following algorithm can be used:
For each frequent itemset, f
generate all non-empty subsets of f.
For every non-empty subset s of f do
output rule s -> (f-s) if support(f)/support(s) >= min_confidence
end
|
Here is a transactional database with 10,000 transactions each with 12
items on average from a set of 500 distinct items (dataT10K500D12L).
Test your implementation on this dataset. you are asked to provide
execution time measures as well as numbers of frequent itemsets and
number of association rules for the following cases:
- 10000 transactions
- 1000 transactions
- 100 transactions
with the minimum support set to 0.1% and minimum confidence set to
80%.
For each case, the output should be:
Number of frequent 1_itemsets: | |
Number of frequent 2_itemsets: | |
Number of frequent 3_itemsets: | |
Number of frequent 4_itemsets: | |
... |
Number of frequent k_itemsets: | |
Number of association rules | |
IMPORTANT:
- It would be better to implement the algorithm in c/c++. The
implementations will be compared with among other criteria the
execution time and scalability.
- The implementation will be tested with other datasets.
- Plagiarism will not be tolerated. Write your own code.
- Marks will be based on correctness first, then efficiency and
cleanliness of code.
- If the execusion time takes more than twice the average execusion
time of the class, the implementation will not be marked for
efficiency.
- The number of transactions and the
number of distinct items are not given and must be detemined by
the program while reading the input file.
Deliverables:
This assignment is to be submitted via email. Send only one tar file that contains all that
you are submitting:
- cpp files (should be compatible with gcc compiler on linux). The
executable should get four parameters as input: 1- file name;
2-minimum-support; and 3- minimum confidence. The thresholds
should be numbers between 0 and 1. the forth parameter is either
"r", "f", "a", or absent. When "r", then all strong association rules are
displayed. When "f" then all frequent itemsets are displayed. When
"a" then all frequent itemsets and all strong association rules
are displayed. When absent, then only the number of frequent itemsets
of different sizes and the number of strong rules are displayed.
The code should be sufficiently (and reasonably) commented.
When displaying the frequent
itemsets, print their support "(s)". When displaying the strong
rules, print their support and confidence "(s, c)". In both cases
with 2 decimal point precision.
- One file containing all association rules for the 10,000
transaction case. Each line should contain a rule with the
following syntax:
x, y, z -> a, b (s,c)
meaning that items are separated by ", " and the body and head of
the rule are separated by " -> ". Do not forget the spaces. "s"
and "c" are support and confidence respectively. They need to be
with 2 decimal point precision.
- A file containing the time and count measures as explained above.
|