Instructor: Alex Thomo
Phone: (250) 472-5786
Office: ECS 556
Office Hours: T 2:30 - 3:30 p.m., F 1:30 - 2:30
Email: thomo@cs.uvic.ca
TA: Marina Barsky
Email: mgbarsky@uvic.ca
Course Outline: Link
Books:
Introduction to Data Mining (First Edition)
by Pang-Ning Tan, Michael Steinbach, Vipin Kumar.
Addison Wesley, 2005. (PSK)
2 hours reserve in the library.
Data Mining: Practical Machine Learning Tools and Techniques
by Ian H. Witten, Eibe Frank.
Morgan Kaufmann; 2nd edition, 2005. (WF)
2 hours reserve in the library.
Programming Collective Intelligence
by Toby Segaran
O'Reilly; 1st edition, 2007. (SEG)
Accessible online through the UVic library: link.
Marks so far:
link.
Midterm Solutions:
link.
Reading list:
link
Assignments:
Assignment 1.
Hints.
Solutions.
Assignment 2.
Solutions.
Assignment 3.
Solutions.
Project:
Description.
Labs (by Marina Barsky):
Lab1,
Lab2,
Lab3,
Lab4,
Lab5,
Lab6,
Lab7,
Lab8
Lecture Handouts:
Predictive Data Mining
- Intro to Data Mining
Slides.
- Applying Decision Trees. Learning Decision Trees. Measures of Node Impurity, Entropy. Information Gain.
Decision Trees with Numerical Attributes. Regression Trees.
Slides (1).
Slides (2).
Python Code.
- SLIQ and SPRINT Decision Tree Algorithms.
Slides.
SLIQ paper.
SPRINT paper.
- MapReduce Framework.
Slides.
MapReduce paper.
Python test code.
Word count example.
- Rule-Based Classifiers. Coverage and Accuracy.
Decision Trees vs. rules. Ordered Rule Set.
Separate-and-conquer algorithms. PRISM and RIPPER algorithms.
Slides.
- Uncertain knowledge. Belief and Probability. Conditional
probability. Bayes' Rule. Conditional Independence. Normalization constant.
Naive Bayes Classifier. Text Categorization.
Slides.
- Bayesian Belief Networks: Semantics, Inference, Classification, Construction, Complexity.
Slides.
- Bayesian Belief Networks: Practice.
Slides (a).
Slides (b).
- Credibility: Evaluating what's been learned. Predicting performance. Confidence intervals.
Holdout estimation. Cross-validation. The bootstrap.
Counting the cost.
Slides.
- ROC curves.
Slides.
(A more concise version is here. A useful page with tutorials and code is here.)
- Linear Separators: Hyperplane Geometry, Margin, Perceptron Algorithm.
Beyond Linear Separability: Kernel Trick. Support Vectors.
Slides.
See also
Point-LineDistance.
- Beyond Linear Separability: Artificial Neural Networks.
Slides.
- Genetic Algorithms.
Slides.
- Instance-Based Learning.
Slides.
- Recommender Systems.
Slides.
Association Analysis
- Frequent Itemset Generation: The Apriori Principle, Apriori Algorithm, Candidate Generation and Pruning, Support Counting.
Slides.
- More on Apriori Algorithm. Rule Generation: Confidence-Based Pruning, Rule Generation in Apriori
Algorithm. Compact Representation of Frequent Itemsets: Maximal Frequent Itemsets, Closed Frequent Itemsets.
Slides.
- Alternative Methods for Frequent Itemset Generation.
FP-Growth Algorithm: FP-Tree Representation, Frequent Itemset Generation in FP-Growth Algorithm.
Slides.
- FPTree/FPGrowth Complete Example.
Slides.
- Evaluation of Association Patterns: Objective Measures of Interestingness.
Simpson's Paradox.
Skewed distribution, Cross-support patterns, Lowest confidence rule.
Slides.
- Data Engineering: Transforming attributes. Multi-level Association Rules.
Mining word associations. Min-Apriori.
Slides.
- Mining of sequences. Candidate Generation. Timing Constraints.
Slides.
- Mining Graphs.
Frequent Subgraph Mining. Edge Growing. Multiplicity of Candidates.
Slides.
- Finding Similar Items. Minhashing. Locality Sensitive Hashing.
Slides.
Cluster Analysis
- Applications of Cluster Analysis. Types of Clusters. K-means Algorithm.
Problems with Selecting Initial Points. Bisecting K-means.
Limitations of K-means.
Slides.
- Agglomerative Hierarchical Clustering. Density-Based Clustering: DBSCAN.
Slides.
- Self-Organizing Maps. HICAP: Hierarchical Clustering with Pattern Preservation.
Mining the Web
Assignments: There will be three assignments.
Interesting Links:
A Map Reduce Framework for Programming Graphics Processors
Mars: A MapReduce Framework on Graphics Processors