Wlodzislaw Duch
H6429 fall 2006 course, second assignment.



We are ready for the second assignment, due on November 29, which gives you almost a whole month. Please note that this deadline cannot be extended: after the exam there is little time to enter all marks into the system, and I shall need some time to mark the examination scripts and your assignments. The first assignment was on visualization; the second is on data understanding. Please read the comments below carefully.
Do not focus your work only on accuracy, or on new methods and their improvement; the assignment is not a research report! The main goal of this assignment is to learn something about the data, not about the methods. Try to find the simplest description: the most informative features, rules from trees, associations between variables, clusters, an understanding of the category structure. Accuracy is of secondary importance as long as it is not very bad. Fashionable methods like SVMs may provide some knowledge if only a small number of support vectors is left and they are carefully analyzed, or if the SVM is used with different subsets of features, etc., but we should not be happy with a high-accuracy black box that does not explain its decisions.
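As an illustration of what a "simplest description" can look like, here is a minimal sketch of a One-R style rule: for each feature, predict the majority class for every feature value and keep the feature whose rule makes the fewest training errors. The function name `one_r` and the tiny data set are made up for this example; they are not taken from any particular software package or from the course data.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (feature_index, rule, training_accuracy) for the best one-feature rule.

    rows: list of tuples of categorical feature values; labels: list of classes.
    """
    best = None
    for j in range(len(rows[0])):
        # Count class labels separately for every value of feature j.
        counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            counts[row[j]][y] += 1
        # The rule predicts the majority class for each feature value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        accuracy = correct / len(rows)
        if best is None or accuracy > best[2]:
            best = (j, rule, accuracy)
    return best


# Invented toy data: (odor, cap_color) -> class. Not real mushroom records.
rows = [("none", "brown"), ("none", "white"), ("foul", "brown"),
        ("foul", "white"), ("none", "brown"), ("foul", "brown")]
labels = ["edible", "edible", "poisonous", "poisonous", "edible", "poisonous"]

feature, rule, accuracy = one_r(rows, labels)
print(feature, rule, accuracy)
```

On this toy data the first feature alone separates the classes, so the whole "model" is a two-line rule that can be read and questioned directly. The accuracy reported here is training accuracy only; it says nothing yet about how the rule behaves on new data.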

  1. Pick your favorite data analysis method: either one that I have presented or another that you find through the WWW links.
    Here you will find some links to various data analysis methods, and here more links to software.
    You may also find some new methods browsing the Internet on your own.
  2. To avoid duplication of the same methods applied to similar problems, please let me know what you intend to do: send me an email with your name, software name, method and data, and I shall put it on this page to let the others know that it has been taken.
  3. Describe the method in detail in your report or on a Web page; the description should be sufficiently detailed to explain the algorithm, to understand what the software is doing, and to interpret its results.
  4. Find software that implements the method and describe it briefly. If you choose a larger software package, select one or two methods; don't describe the whole package!
  5. Find some interesting data: either generate some data yourself, or grab some data from the network. Some interesting data sets for visualization are here; a good place containing many small datasets is the UCI repository. You may use the same data as for the first assignment if you are not bored with it.
  6. Prepare your data (change of format or standardization may be needed), and analyze it paying attention to the following points:
    • Try to understand the data, not only predict the values; this favors rule-based methods, but discrimination, SVM and nearest neighbor methods may also teach you something.
    • What are the most important factors that influence the decision?
    • What is the expected accuracy of your method? How confident are you in the answer?
    • Evaluate the usefulness of your results.
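One way to approach the questions about expected accuracy and confidence is sketched below: standardize the features using training data only, estimate accuracy by k-fold cross-validation, and report the mean together with the spread over folds, next to the base rate of the majority class. This is only a sketch under stated assumptions: the data are synthetic, the classifier is a plain 1-nearest-neighbor rule chosen for brevity, and all function names are invented for this example.

```python
import random
import statistics
from collections import Counter


def standardize(train, test):
    """Scale both sets with the mean and standard deviation of the training set only."""
    cols = list(zip(*train))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant features

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

    return scale(train), scale(test)


def predict_1nn(train_x, train_y, x):
    """Label of the closest training point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[dists.index(min(dists))]


def cross_validate(x, y, k=5, seed=0):
    """Return the per-fold accuracies of 1-NN under k-fold cross-validation."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        tr_x, te_x = standardize([x[i] for i in train_idx], [x[i] for i in fold])
        tr_y = [y[i] for i in train_idx]
        hits = sum(predict_1nn(tr_x, tr_y, row) == y[i] for row, i in zip(te_x, fold))
        scores.append(hits / len(fold))
    return scores


# Synthetic two-class data: feature 0 is informative, feature 1 is large-scale noise.
rng = random.Random(1)
x = [[rng.gauss(0, 1), rng.gauss(0, 100)] for _ in range(60)] + \
    [[rng.gauss(3, 1), rng.gauss(0, 100)] for _ in range(40)]
y = ["a"] * 60 + ["b"] * 40

base_rate = Counter(y).most_common(1)[0][1] / len(y)
scores = cross_validate(x, y)
print(f"base rate {base_rate:.2f}, "
      f"accuracy {statistics.mean(scores):.2f} +/- {statistics.stdev(scores):.2f}")
```

Reporting the base rate next to the cross-validated accuracy shows how much the method really adds over always guessing the majority class, and the spread over folds indicates how many digits of the accuracy are worth quoting.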

The deadline should be on the 30th of November - we should finish before the exam!

Electronic submission is sufficient: please zip or compress all files into a single archive, give the archive YOUR NAME, and send it to me. In the email, write Your-Name, Paper Title, Method Used, Software Used, Data Used (this will be placed in the table below). Please keep this format to make it easy for me; renaming all your files from "assignment2" etc. to your names is not fun.
Try to minimize the size of the file: I want to attach the files to this page and I do not have much WWW space!

Q & A:

1. Is a formal report required? If so, how long should it be?
Your paper is a report. The length should be sufficient for others who have taken this course to be able to understand the method and interpret the results.

2. What must we do to score high marks for this assignment? Do we have to study the methods in much greater depth than what was covered in the lectures?
Find an interesting method, perhaps a new variant of one of the methods I talked about; find interesting data; provide an interpretation of the results; describe what we can learn from them; and mention potential applications.

A great book with pointers on how to understand data is here.

Topics taken for the second assignment in 2006. The maximum number of points is 10.

No | Name | Paper | Method | Software | Data | Remarks | Mark
39 | Annamalai Parimal | Analyzing Boston housing data using classification and regression trees | CART | CART | Boston housing | Good CART description, few experiments, but well focused on the data | 7
25 | Ardian Kristanto Poernomo | Data Understanding for Iris Plant Data | C45 + 1R + PCA | WEKA | Iris | Very good description of data and new features; quite interesting experiments for such simple data; why does PCA fail if Fig. 4 shows clear separability? | 8
12 | Bramandia Ramadhana | Decision Tree for Glass Type Identification | DT from DTreg | DTReg | Glass | General DT intro, but not DTreg-specific; fine description of data, interesting experiments; a few unclear statements or undefined concepts (primary feature) | 9
42 | Chen Tze Chiang | - | - | - | - | - | 0
2 | Cheung Eng Yeow | Nomograms Visualization of Naive Bayes Classification on Liver Disorders Data | Naive Bayesian Classifier | Orange | Liver Disorder Data set | Good intro to Naive Bayes, detailed dataset description, interesting software and method, exploratory data analysis, ROC curves and nomograms, very nice! | 10
1 | Chia Yong Sang, Alex | Analyzing the Creditability of Borrowers using the SSV Decision Tree | SSV Tree | Ghostminer 3.0 | Credit scoring | Well organized and written; very good theory, data description and knowledge extraction in several experiments, great paper. | 10
23 | Chong Yee Seng | Classification and Regression Tree for Spam prediction | CART DT | CART | UCI Spam e-Mail Database | Eq (1) should be i>j, otherwise fine description, interesting data and experiments | 9
38 | Dang Xuan Hong | Heart disease analysis using Reduced-Error Pruning Tree | DT with reduced-error pruning | Weka REPtree | heart disease | Very good theory, intro to Weka, good analysis of extracted knowledge | 9
3 | Han Shuguo | Analysis of Thyroid disease based on C4.5 | C4.5 | WEKA | Thyroid disease | Very good in all aspects! | 10
26 | Hu Meiqun | Predictive Modeling on Mushroom Data | CHAID DT | SAS Enterprise Miner 5.2 | Mushrooms (UCI) | some refs left as [?], no formulas, unexplained properties (ex. Bonferroni), explicit rules not given, but quite detailed, serious work. | 7
18 | Hu Meishan | Caravan Insurance Policy customer properties | Naive Bayes + Feature Sel | Weka 3.5 | Insurance Company benchmark (COIL Challenge 2000 Data) | fine NB although short on feature selection; very good on software side; altogether nice analysis of feature importance using 3 methods | 9
13 | Huang Dong | Decision Trees on Lymphography Domain | C4.5 | WEKA | Lymphography Data | good theory and description of data; why paste Weka output as graphics? More on pruning than extracting knowledge from data | 7
24 | Huang Yi | Naive Bayes for Grayscale document images | NB + Preprocessing | Matlab and dipum_1.1.4 toolbox | Grayscale document images | Theory OK, interesting data, some feature selection experiments and detailed analysis, no variance but accuracy given to 0.0001%, still a very good work | 8
8 | Iti Chaturvedi | Kernel ICA on PET data | Kernel ICA | Matlab | PET single-subject motor experiment | Brief intro to ICA in kernel canonical version; undefined acronyms (ex HRF), a bit in a hurry | 7
17 | John Felix Charles Joseph | Kernel space approximation and kernel target alignment for detecting network intrusions | Least Squares – SVM based Kernel approx | LS-SVM toolbox for Matlab | Non-Symbolic Features of KDD Cup 1999 | Great theory; excellent paper on predictive mining but nothing on data understanding! | 7
27 | Koh Chin Wei, Eugene | Maximum Likelihood Classification of Audio Segments with Gaussian Mixture Models | J48 | WEKA | Audio Segments | Quite ambitious paper, in all respects! | 10
28 | Le Minh Nghia | Learning from data using RIPPER rules | RIPPER | Weka | Diabetes + US income | Fine description of REP, IREP, RIPPER rule extraction methods; Google "Logical rules extracted from data" to see some rules for Diabetes, two experiments only | 6
15 | Maggy Anastasia Suryanto | Predicting the Choice of Contraceptive Method using QUEST | QUEST | SPSS | Contraceptive Method Choice (CMC) | Great intro to Quest; interesting data and past usage; very nice interpretation | 10
33 | Mohamad Hirwan | - | - | - | - | - | 0
16 | Nah Hock Choon, Edwin | Classification of Intensive Care Unit patient by Decision Tree | J48 | WEKA | Intensive Care Unit | Theory a bit disordered, base rate is 80% but not mentioned, single rule CRE =< 1 not found ... | 6
34 | Nai Hong Hwa Francis | - | - | - | - | traveling abroad ... | 0
29 | Nguwi Yok Yen | Wisconsin Breast Cancer Data with Decision Tree | Decision Tree | DTreg | Wisconsin Breast Cancer | Not much on theory except photos of trees ... good description of software; no interpretation of rules | 7
30 | Nguyen Luong Dong | Finding logical rules in a mushroom database using C4.5 Decision Tree | C4.4 DT | Weka | Mushrooms | Standard theory + Weka description, past usage copied verbatim, long output copies, what is sensibility? | 7
40 | Nguyen Minh Nhut | Car Evaluation Classification using Classification and Regression Tree | CART DT | SPSS 13 | Car Evaluation Database | Good DT intro; interesting hierarchical data; only CART used, although other DT are in the same package, too many complex trees, but good analysis | 8
31 | Nguyen Trung Hieu | Analyzing rule-based spam filtering system | PART | Weka | Spam Database | Good PART intro; interesting analysis of experiments ('font' is probably left from html) | 8
32 | Pham Manh Tung | C4.5 Decision Tree Algorithm on Wisconsin Diagnostic Breast Cancer | C4.5 DT | SBSS Software | WBCD Breast Cancer | Average description and experiments | 7
6 | Phua Si Jie | SVM on Pen-Based Handwritten Digits | SVM SMO | SPRT Tools | Pen-Based Handwritten Digits | Great theory; many experiments, good feature ranking; poor conclusions | 9
5 | Puah Wee Choo | Allele frequencies | EM algorithm | Biological ESTEEM module | ABO allele frequencies in SE-Asia | Quite interesting report; theory could be extended | 8
33 | Ronny | Colon Cancer Data Analysis | CS4 DT ensemble | SBSS Software | Colon Cancer Data Analysis | Intro to DT meta-improvements, interesting CS4 method, but has not been used, equations pasted look bad, no knowledge analyzed | 7
34 | Sim Sian Hui, Kelvin | Using Naive Bayesian Classifier to Classify Stocks Based on Financial Ratios | NB+C4.5 | Weka | Stocks Based on Financial Ratios | Many misunderstandings and not much knowledge extracted ... | 6
35 | Song Hengjie | C4.5 Tree on labor-negotiations data | C4.5 | Yale | Labor-negotiations | Good intro to trees, one simple experiment | 7
9 | Tan Wi-Meng, Javan | Prediction of income based on US Census Data | Naive Bayes | TANAGRA | 1994 US Census dataset | Brief on theory and software, good data description, but no attempt has been made on feature selection or data understanding, the topic of this assignment | 7
21 | Teng Teck Hou | Forest cover type analysis of Colorado's Roosevelt National Forest | C4.5 + AdaBoost M1 | Weka, Matlab | Forest cover type from UCI KDD Repository | Good intro to methods used and Weka, many nice experiments, interesting knowledge found | 10
20 | Tu Tong | Analysis on Internet Advertisements with C4.5 Decision Tree | C4.5 tree | Weka J48 | Internet Advertisements | Many errors, "majar classifier"? Not much on theory; no experiments with buckets in One-R; good attempt to analyze rules from C4.5 | 7
11 | Umair Rafique | LMT for Glass and Image Segmentation | LMT (Logistic Model Trees) | WEKA | Glass data, Image segmentation | Good intro to logistic model trees; good choice, many experiments, a bit too long | 9
36 | Wan Kong Wah | Uncovering Discriminative Features in Text and Images | Feature ranking + NB/DT/SVM | Weka | Reuters document categorization + Pascal 06 image challenge | Nice combination of feature selection and methods; good remarks on methods; text and image data used, nice experiments | 9
4 | Wang Di | Mortgage rate prediction using multivariate adaptive regression splines | MARS | MARS@Salford Systems | Federal Reserve Economic Data | Interesting algorithm with detailed description; several experiments with MARS regression but still a few problems, still a worthy effort | 9
19 | Wang Lin | Visualization of PCA and FDA on Waveform Data | C4.5 | Weka | house-votes-84 | Very detailed theory; great software description; many experiments but focus on accuracy, not knowledge from data | 9
37 | Woo Huizhen | Automated Classification of Protein Sequences based on Domain Architecture | Smith-Waterman and Partitioning Around Medoids | JACOP | Protein Sequences | Not much theory; some conclusions are drawn but it would be better to see actual JACOP probes | 7
7 | Wu Min | Decision Tree for Animal Classification | C 4.5 | Yale | Zoo Data | Quite brief theoretical intro; interesting experiments to discover class descriptors, good educational value. | 8
22 | Yeong Sui Sum | Clustering for Protein Localization Sites Discovery | Cluster + Java Treeview | ?? | Yeast localization signal | General intro to clustering but experiments limited and not much learned | 6
10 | Zhang Xuejie | SSV Tree on Mushroom and Glass Identification Databases | SSV Tree | GhostMiner 3.0 | Mushroom from UCI | Good description of method and software, well focused on knowledge, small errors | 9
14 | Zhao Guopeng | Data Analysis with Fuzzy Inference Systems | ANFIS, FIS | Matlab Fuzzy Toolbox | new-thyroid (UCI) | Fine intro to fuzzy rules and ANFIS, a lot of work; it would be nice to see crisp rules from a decision tree as a reference | 10