We are ready for the second assignment, due on November 29; this gives you almost a whole month. Please note that the deadline cannot be extended: after the exam there is little time to enter all marks into the system, and I shall need some of that time for marking both the examination scripts and your assignments.
The first assignment was on visualization; the second is on data understanding.
Please read the comments below carefully.
Do not focus your work only on accuracy, or on new methods and their improvement: the assignment is not a research report!
The main goal for this assignment is to learn something about the data, not about the methods.
Try to find the simplest description: most informative features, rules from trees, associations between variables, clusters, understanding of category structure.
Accuracy is of secondary importance as long as it is not very bad. Fashionable methods like SVMs may provide some knowledge if only a small number of support vectors is left and they are carefully analyzed, or if the SVM is used with different subsets of features, etc., but we should not be happy with a high-accuracy black box that does not explain its decisions.
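As a rough illustration of what "most informative features" and simple rules can mean in practice, here is a minimal Python sketch. It ranks features by information gain and reads off a one-feature rule from the best one. The toy data, the feature names, and the helper functions are all hypothetical, invented only for this example; your own analysis should of course use real data and whatever software you prefer.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical toy data (invented for illustration only):
# each row is (outlook, windy, play), where "play" is the class label.
ROWS = [
    ("sunny", "no", "yes"),
    ("sunny", "yes", "yes"),
    ("rainy", "no", "no"),
    ("rainy", "yes", "no"),
    ("sunny", "no", "yes"),
    ("rainy", "yes", "no"),
]
FEATURES = ["outlook", "windy"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def group_labels(rows, col):
    """Map each value of feature `col` to the class labels of its rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    return groups


def information_gain(rows, col):
    """Reduction in class entropy obtained by splitting on feature `col`."""
    labels = [row[-1] for row in rows]
    groups = group_labels(rows, col)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def one_rule(rows, col):
    """One-feature rule: predict the majority class for each feature value."""
    return {value: Counter(g).most_common(1)[0][0]
            for value, g in group_labels(rows, col).items()}


# Rank features from most to least informative.
ranking = sorted(
    ((information_gain(ROWS, i), name) for i, name in enumerate(FEATURES)),
    reverse=True,
)
best_gain, best_feature = ranking[0]
rule = one_rule(ROWS, FEATURES.index(best_feature))

for gain, name in ranking:
    print(f"{name}: information gain = {gain:.3f}")
print(f"Rule on '{best_feature}': {rule}")
```

The point of such a sketch is the readable output (a feature ranking and an explicit rule), not the accuracy figure; the same kind of reading can be done on the rules produced by any decision tree package.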
The deadline should be on the 30th of November - we should finish before the exam!
Electronic copy submission is sufficient: please zip or compress all files, give the file your name, write in the email Your-Name, Paper Title, Method Used, Software Used, Data Used (this will be placed in the Table below), and send it to me.
Please keep this format to make it easy for me; renaming all your files from "assignment2" etc. to your names is not fun.
If more than one file is sent, please zip or compress them, give the file YOUR NAME and send it to me.
Try to minimize the size of the file; I want to attach the files to this page and I do not have much WWW space!
Q & A:
1. Is a formal report required? If so, how long should it be?
Your paper is a report. The length should be sufficient for others who have taken this course to be able to understand the method and interpret the results.
2. What must we do to score high marks for this assignment? Do we have to study the methods in much greater depth than what was covered in the lectures?
Find an interesting method, perhaps a new variant of one of the methods I talked about,
find interesting data,
provide interpretation of the results,
describe what we can learn from it,
mention potential applications.
A great book with pointers on how to understand data is here.
Topics taken for the second assignment in 2006. The maximum number of points is 10.
| No | Name | Paper | Method | Software | Data | Remarks | Mark |
|---|---|---|---|---|---|---|---|
| 39 | Annamalai Parimal | Analyzing Boston housing data using classification and regression trees | CART | CART | Boston housing | Good CART description, few experiments, but well focused on the data | 7 |
| 25 | Ardian Kristanto Poernomo | Data Understanding for Iris Plant Data | C45 + 1R + PCA | WEKA | Iris | Very good description of data and new features; quite interesting experiments for such simple data; why PCA fails if Fig. 4 shows clear separability? | 8 |
| 12 | Bramandia Ramadhana | Decision Tree for Glass Type Identification | DT from DTreg | DTReg | Glass | General DT intro, but not DTreg-specific; fine description of data, interesting experiments; a few unclear statements or undefined concepts (primary feature) | 9 |
| 42 | Chen Tze Chiang | ----------- | -- | ---- | ----- | --------- | 0 |
| 2 | Cheung Eng Yeow | Nomograms Visualization of Naive Bayes Classification on Liver Disorders Data | Naive Bayesian Classifier | Orange | Liver Disorder Data set | Good intro to Naive Bayes, detailed dataset description, interesting software and method, exploratory data analysis, ROC curves and nomograms, very nice! | 10 |
| 1 | Chia Yong Sang, Alex | Analyzing the Creditability of Borrowers using the SSV Decision Tree | SSV Tree | Ghostminer 3.0 | Credit scoring | Well organized and written; very good theory, data description and knowledge extraction in several experiments, great paper. | 10 |
| 23 | Chong Yee Seng | Classification and Regression Tree for Spam prediction | CART DT | CART | UCI Spam e-Mail Database | Eq (1) should be i>j, otherwise fine description, interesting data and experiments | 9 |
| 38 | Dang Xuan Hong | Heart disease analysis using Reduced-Error Pruning Tree | DT with reduced-error pruning | Weka REPtree | heart disease | Very good theory, intro to Weka, good analysis of extracted knowledge | 9 |
| 3 | Han Shuguo | Analysis of Thyroid disease based on C4.5 | C4.5 | WEKA | Thyroid disease | Very good in all aspects! | 10 |
| 26 | Hu Meiqun | Predictive Modeling on Mushroom Data | CHAID DT | SAS Enterprise Miner 5.2 | Mushrooms (UCI) | some refs left as [?], no formulas, unexplained properties (ex. Bonferroni), explicit rules not given, but quite detailed, serious work. | 7 |
| 18 | Hu Meishan | Caravan Insurance Policy customer properties | Naive Bayes + Feature Sel | Weka 3.5 | Insurance Company benchmark (COIL Challenge 2000 Data) | fine NB although short on feature selection; very good on software side; altogether nice analysis of feature importance using 3 methods | 9 |
| 13 | Huang Dong | Decision Trees on Lymphography Domain | C4.5 | WEKA | Lymphography Data | good theory and description of data; why paste Weka output as graphics? More on pruning than extracting knowledge from data | 7 |
| 24 | Huang Yi | Naive Bayes for Grayscale document images | NB + Preprocessing | Matlab and dipum_1.1.4 toolbox | Grayscale document images | Theory OK, interesting data, some feature selection experiments and detailed analysis, no variance but accuracy given to 0.0001%, still very good work | 8 |
| 8 | Iti Chaturvedi | Kernel ICA on PET data | Kernel ICA | Matlab | PET single-subject motor experiment | Brief intro to ICA in kernel canonical version; undefined acronyms (ex HRF), a bit in a hurry | 7 |
| 17 | John Felix Charles Joseph | Kernel space approximation and kernel target alignment for detecting network intrusions | Least Squares – SVM based Kernel approx | LS-SVM toolbox for Matlab | Non-Symbolic Features of KDD Cup 1999 | Great theory; excellent paper on predictive mining but nothing on data understanding! | 7 |
| 27 | Koh Chin Wei, Eugene | Maximum Likelihood Classification of Audio Segments with Gaussian Mixture Models | J48 | WEKA | Audio Segments | Quite ambitious paper, in all respects! | 10 |
| 28 | Le Minh Nghia | Learning from data using RIPPER rules | RIPPER | Weka | Diabetes + US income | Fine description of REP, IREP, RIPPER rule extraction methods; Google "Logical rules extracted from data" to see some rules for Diabetes, two experiments only | 6 |
| 15 | Maggy Anastasia Suryanto | Predicting the Choice of Contraceptive Method using QUEST | QUEST | SPSS | Contraceptive Method Choice (CMC) | Great intro to Quest; interesting data and past usage; very nice interpretation | 10 |
| 33 | Mohamad Hirwan | ------------ | --- | --- | --- | --- | 0 |
| 16 | Nah Hock Choon, Edwin | Classification of Intensive Care Unit patient by Decision Tree | J48 | WEKA | Intensive Care Unit | Theory a bit disordered, base rate is 80% but not mentioned, single rule CRE =< 1 not found ... | 6 |
| 34 | Nai Hong Hwa Francis | ------------- | --- | --- | --- | traveling abroad ... | 0 |
| 29 | Nguwi Yok Yen | Wisconsin Breast Cancer Data with Decision Tree | Decision Tree | DTreg | Wisconsin Breast Cancer | Not much on theory except photos of trees ... good description of software; no interpretation of rules | 7 |
| 30 | Nguyen Luong Dong | Finding logical rules in a mushroom database using C4.5 Decision Tree | C4.5 DT | Weka | Mushrooms | Standard theory + Weka description, past usage copied verbatim, long output copies, what is sensibility? | 7 |
| 40 | Nguyen Minh Nhut | Car Evaluation Classification using Classification and Regression Tree | CART DT | SPSS 13 | Car Evaluation Database | Good DT intro; interesting hierarchical data; only CART used, although other DT are in the same package, too many complex trees, but good analysis | 8 |
| 31 | Nguyen Trung Hieu | Analyzing rule-based spam filtering system | PART | Weka | Spam Database | Good PART intro; interesting analysis of experiments ('font' is probably left from html) | 8 |
| 32 | Pham Manh Tung | C4.5 Decision Tree Algorithm on Wisconsin Diagnostic Breast Cancer | C4.5 DT | SBSS Software | WBCD Breast Cancer | Average description and experiments | 7 |
| 6 | Phua Si Jie | SVM on Pen-Based Handwritten Digits | SVM SMO | SPRT Tools | Pen-Based Handwritten Digits | Great theory; many experiments, good feature ranking; poor conclusions | 9 |
| 5 | Puah Wee Choo | Allele frequencies | EM algorithm | Biological ESTEEM module | ABO allele frequencies in SE-Asia | Quite interesting report; theory could be extended | 8 |
| 33 | Ronny | Colon Cancer Data Analysis | CS4 DT ensemble | SBSS Software | Colon Cancer Data Analysis | Intro to DT meta-improvements, interesting CS4 method, but has not been used, equations pasted look bad, no knowledge analyzed | 7 |
| 34 | Sim Sian Hui, Kelvin | Using Naive Bayesian Classifier to Classify Stocks Based on Financial Ratios | NB+C4.5 | Weka | Stocks Based on Financial Ratios | Many misunderstandings and not much knowledge extracted ... | 6 |
| 35 | Song Hengjie | C4.5 Tree on labor-negotiations data | C4.5 | Yale | Labor-negotiations | Good intro to trees, one simple experiment | 7 |
| 9 | Tan Wi-Meng, Javan | Prediction of income based on US Census Data | Naive Bayes | TANAGRA | 1994 US Census dataset | Brief on theory and software, good data description, but no attempt has been made on feature selection or data understanding, the topic of this assignment | 7 |
| 21 | Teng Teck Hou | Forest cover type analysis of Colorado's Roosevelt National Forest | C4.5 + AdaBoost M1 | Weka, Matlab | Forest cover type from UCI KDD Repository | Good intro to methods used and Weka, many nice experiments, interesting knowledge found | 10 |
| 20 | Tu Tong | Analysis on Internet Advertisements with C4.5 Decision Tree | C4.5 tree | Weka J48 | Internet Advertisements | Many errors, "majar classifier"? Not much on theory; no experiments with buckets in One-R; good attempt to analyze rules from C4.5 | 7 |
| 11 | Umair Rafique | LMT for Glass and Image Segmentation | LMT (Logistic Model Trees) | WEKA | Glass data, Image segmentation | Good intro to logistic model trees; good choice, many experiments, a bit too long | 9 |
| 36 | Wan Kong Wah | Uncovering Discriminative Features in Text and Images | Feature ranking + NB/DT/SVM | Weka | Reuters document categorization + Pascal 06 image challenge | Nice combination of feature selection and methods; good remarks on methods; text and image data used, nice experiments | 9 |
| 4 | Wang Di | Mortgage rate prediction using multivariate adaptive regression splines | MARS | MARS@Salford Systems | Federal Reserve Economic Data | Interesting algorithm with detailed description; several experiments with MARS regression but still a few problems, still a worthy effort | 9 |
| 19 | Wang Lin | Visualization of PCA and FDA on Waveform Data | C4.5 | Weka | house-votes-84 | Very detailed theory; great software description; many experiments but focus on accuracy, not knowledge from data | 9 |
| 37 | Woo Huizhen | Automated Classification of Protein Sequences based on Domain Architecture | Smith-Waterman and Partitioning Around Medoids | JACOP | Protein Sequences | Not much theory; some conclusions are drawn but it would be better to see actual JACOP probes | 7 |
| 7 | Wu Min | Decision Tree for Animal Classification | C 4.5 | Yale | Zoo Data | Quite brief theoretical intro; interesting experiments to discover class descriptors, good educational value. | 8 |
| 22 | Yeong Sui Sum | Clustering for Protein Localization Sites Discovery | Cluster + Java Treeview | ?? | Yeast localization signal | General intro to clustering but experiments limited and not much learned | 6 |
| 10 | Zhang Xuejie | SSV Tree on Mushroom and Glass Identification Databases | SSV Tree | GhostMiner 3.0 | Mushroom from UCI | Good description of method and software, well focused on knowledge, small errors | 9 |
| 14 | Zhao Guopeng | Data Analysis with Fuzzy Inference Systems | ANFIS, FIS | Matlab Fuzzy Toolbox | new-thyroid (UCI) | Fine intro to fuzzy rules and ANFIS, a lot of work; it would be nice to see crisp rules from a decision tree as a reference | 10 |