You have to pick one of the proposed subjects and write a report at least 7500 words long.
The idea is that you begin by reading the proposed material and then search the internet for other relevant papers on the subject. With this material, write a brief introductory paper describing the problem and its motivation and the different approaches to it, giving a brief explanation of each and commenting on whether some approaches are better than others or what advantages each has over the rest. Include in your paper all the relevant bibliography that you collect while researching the subject.
Think of this coursework as if you were to write an entry on Wikipedia about the subject.
The deadline for this report is January 11th. You can either deliver a hardcopy of your report to my mailbox (office S202b, omega-K2M building) or deliver your report in electronic format by e-mail to bejar@lsi.upc.edu
Subject 1: Cluster ensembles/consensus
The goal of cluster combination is to obtain a more accurate clustering of a dataset by combining the results of a set of clusterings. The different approaches can be embedded in the clustering process or work only with the resulting partitions.
Papers
- Cluster Ensembles for High Dimensional Clustering: An Empirical Study, Fern, Brodley
- Clustering Ensembles: Models of Consensus and Weak Partitions, Topchy, Jain, Punch
- Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions, Strehl, Ghosh
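As a rough illustration of an approach that works only with the resulting partitions, the sketch below builds a co-association table (how often each pair of items is placed in the same cluster) and links the pairs that agree in a majority of clusterings. The function names and the 0.5 threshold are assumptions made for this example; it is not the method of any of the papers listed above.

```python
from itertools import combinations

def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in partitions:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(partitions) for pair, c in counts.items()}

def consensus(partitions, threshold=0.5):
    """Link items co-clustered in more than `threshold` of the partitions
    (threshold is an assumed parameter) and label the connected components."""
    n = len(partitions[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), frac in co_association(partitions).items():
        if frac > threshold:
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Three clusterings of five items; label values need not match across clusterings.
partitions = [[0, 0, 1, 1, 2], [1, 1, 0, 0, 0], [0, 0, 0, 1, 1]]
print(consensus(partitions))
```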
Subject 2: Graph clustering
Graph clustering is a specific area of clustering that deals with finding groups in data that can be represented as a graph. There are many applications for these algorithms, for example the analysis of sociological data, vision, social networks or web pages.
Papers
- Graph Clustering S. Schaeffer
- A tutorial on spectral clustering U. von Luxburg
Subject 3: Unsupervised attribute selection
Attribute selection is a preprocessing step needed in unsupervised knowledge discovery in order to reduce the number of irrelevant attributes that obscure the data.
Papers
- Feature selection for unsupervised learning J. Dy, C. Brodley
- Unsupervised feature selection using feature similarity P. Mitra, C. Murthy, S. Pal
Subject 4: Clustering of datastreams
An important problem in knowledge discovery arises when the data is a continuous stream. This means that the whole dataset is not available for processing at the beginning. The goal is to develop algorithms that can incrementally build a model of the data. This model has to adapt to any changes in the concepts described by the datastream.
Papers
- Mining high speed datastreams P. Domingos, G. Hulten
- Clustering Datastreams S. Guha, N. Mishra, R. Motwani, L. O'Callaghan
- Mining time changing datastreams Geoff Hulten, Laurie Spencer, Pedro Domingos
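To make the idea of incrementally building a model concrete, here is a minimal sketch of a sequential k-means update that processes one point at a time and never stores the stream. The class name `OnlineKMeans` and the optional `rate` parameter are assumptions for this example; a fixed `rate` makes old points fade so the centroids can follow changes in the stream, which is only one simple way to handle such changes and is not taken from the papers above.

```python
class OnlineKMeans:
    """Sequential k-means over a stream (illustrative sketch)."""

    def __init__(self, initial_centroids, rate=None):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)
        # rate=None: running mean of assigned points.
        # rate in (0, 1]: exponential forgetting, to track a drifting stream.
        self.rate = rate

    def update(self, point):
        """Assign the point to its nearest centroid and move that centroid."""
        dists = [sum((p - c) ** 2 for p, c in zip(point, cen))
                 for cen in self.centroids]
        k = dists.index(min(dists))
        self.counts[k] += 1
        step = self.rate if self.rate is not None else 1.0 / self.counts[k]
        self.centroids[k] = [c + step * (p - c)
                             for p, c in zip(point, self.centroids[k])]
        return k

model = OnlineKMeans([[0.0], [10.0]])
for x in ([1.0], [9.0], [3.0]):
    model.update(x)
print(model.centroids)
```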
Subject 5: Frequent trees/graphs discovery
The next step in knowledge discovery is to use structured datasets in the discovery process. A lot of data can be represented as trees or graphs; the discovery of frequent substructures aims to extend the research on association rules to structured data.
Papers
- Frequent subtree mining: an overview Y. Chi, R. Muntz, S. Nijssen, J. Kok
- Frequent Subgraph Discovery Michihiro Kuramochi and George Karypis
- On canonical forms for frequent graph mining C. Borgelt
Subject 6: Clustering in bioinformatics
The particularities of the data in the bioinformatics area call for specific clustering methodologies. The data mining of DNA and proteins has yielded new problems and a new kind of clustering algorithms.
Papers
- Cluster Analysis for Gene Expression Data: A Survey Daxin Jiang and Aidong Zhang
- Algorithmic approaches to cluster gene expression data R. Shamir, R. Sharan
Subject 7: Clustering of documents
One of the applications of clustering algorithms is the organization of large corpora of documents. This area lies between data mining and document retrieval.
Papers
- Hierarchical document clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester
- Web Document Clustering: A Feasibility Demonstration Oren Zamir and Oren Etzioni
- Evaluation of Hierarchical Clustering Algorithms for Document Datasets Ying Zhao, George Karypis
- Hierarchical Clustering Algorithms for Document Datasets Ying Zhao, George Karypis
Subject 8: Parallel/Distributed Clustering
The need to cluster huge amounts of data has led to algorithms that reduce the computational cost by dividing the task. There are two different approaches: on one hand, algorithms that use parallel processing with multiple threads that need to communicate to maintain cluster information; on the other hand, algorithms that use the map/reduce paradigm and merge the results of the same clustering algorithm on different partitions of the dataset.
Papers
- Clustering Very Large Multi-dimensional Datasets with MapReduce Robson L. F. Cordeiro, Caetano Traina Jr., Agma J. M. Traina, Julio López, U Kang, Christos Faloutsos
- Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching Andrew McCallum, Kamal Nigam
- Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics Victor Olman, Fenglou Mao, Hongwei Wu, and Ying Xu
- Parallel Spectral Clustering Yangqiu Song, Wen-Yen Chen, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang
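The map/reduce idea can be sketched with a single k-means iteration on one-dimensional data: each partition is summarized independently (map), and the summaries are merged into new centroids (reduce). The function names and the choice of k-means are assumptions made for illustration only; the papers above use different and more elaborate schemes.

```python
from functools import reduce

def map_partition(points, centroids):
    """Map step: per-centroid (count, sum) for one partition of 1-D points."""
    stats = {}
    for x in points:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        n, s = stats.get(k, (0, 0.0))
        stats[k] = (n + 1, s + x)
    return stats

def merge_stats(a, b):
    """Reduce step: add up the counts and sums of two partial summaries."""
    merged = dict(a)
    for k, (n, s) in b.items():
        n0, s0 = merged.get(k, (0, 0.0))
        merged[k] = (n0 + n, s0 + s)
    return merged

def kmeans_step(partitions, centroids):
    """One k-means iteration; a centroid with no points keeps its position."""
    total = reduce(merge_stats,
                   (map_partition(p, centroids) for p in partitions), {})
    return [total[k][1] / total[k][0] if k in total else c
            for k, c in enumerate(centroids)]

partitions = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
print(kmeans_step(partitions, [0.0, 8.0]))
```

Because the map step only needs its own partition and the current centroids, the partitions could be processed on different machines; here they are processed in a plain loop.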
Subject 9: One-class classification
Sometimes we are only interested in a model/representation of a specific class, and we either have no information about examples from other classes or have only a very small subset of them compared with the data from the target class. The goal is to have a model that can classify new examples, up to a confidence factor, as members or non-members of the single class.
Papers
- One-Class Classification David Tax, PhD Thesis
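A very simple one-class model, given only as a sketch, describes the target class by its centroid and accepts a new example when it lies within a distance that covers a chosen fraction of the training data. The names `fit_one_class`, `is_member` and `accept_fraction` are assumptions for this example, with `accept_fraction` standing in for the confidence factor; the methods studied in the thesis above are considerably more refined.

```python
import math

def fit_one_class(samples, accept_fraction=0.95):
    """Return (center, radius) so that roughly `accept_fraction` of the
    target-class samples fall within `radius` of their centroid."""
    dim = len(samples[0])
    center = [sum(s[d] for s in samples) / len(samples) for d in range(dim)]
    dists = sorted(math.dist(s, center) for s in samples)
    idx = min(len(dists) - 1,
              max(0, math.ceil(accept_fraction * len(dists)) - 1))
    return center, dists[idx]

def is_member(model, x):
    """True if x lies inside the accepted region of the target class."""
    center, radius = model
    return math.dist(x, center) <= radius

target = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
model = fit_one_class(target, accept_fraction=1.0)
print(is_member(model, (1.0, 1.0)), is_member(model, (5.0, 5.0)))
```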