TAVA/AAAC Coursework

You have to pick one of the proposed subjects and write a report at least 7500 words long.

The idea is that you begin by reading the proposed material and then search the internet for other relevant papers on the subject. With this material you have to write a brief introductory paper describing the problem and its motivation and the different approaches that exist, giving a brief explanation of each and commenting on whether some approaches are better than others or what advantages each has over the rest. Include in your paper all the relevant bibliography that you collect during your research on the subject.

Think of this coursework as if you were to write an entry on Wikipedia about the subject.

The deadline for this report is January 11th. You can either deliver a hard copy of your report to my mailbox (office S202b, omega-K2M building) or send it in electronic format by e-mail to bejar@lsi.upc.edu.

Subject 1: Cluster ensembles/consensus

The goal of cluster combination is to obtain a more accurate clustering of a dataset by combining the results of a set of clusterings. The different approaches can be embedded in the clustering process or work only with the resulting partitions.
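
As an illustration of an approach that works only with the resulting partitions, the following minimal sketch builds a co-association matrix (the fraction of partitions in which two points share a cluster) and links the points that are co-clustered in more than half of the partitions. The function names, the threshold value and the toy partitions are assumptions made for this example; this is only one of several possible consensus functions and is not taken from the papers listed for this subject.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of partitions in which each pair of points shares a cluster."""
    n = len(partitions[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Link points co-clustered in more than `threshold` of the partitions
    and return the connected components as consensus labels."""
    n = len(partitions[0])
    co = co_association(partitions)
    parent = list(range(n))  # union-find forest over the points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel the component roots as 0, 1, 2, ... in order of appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three hypothetical clusterings of six points; the second one uses
# permuted label names and disagrees with the others about point 2.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
labels = consensus(partitions)
print(labels)
```

Because the matrix only records whether two points share a cluster, the names of the labels in each input partition do not need to match.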

Papers

Cluster Ensembles for High Dimensional Clustering: An Empirical Study, Fern, Brodley
Clustering Ensembles: Models of Consensus and Weak Partitions, Topchy, Jain, Punch
Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions, Strehl, Ghosh

Subject 2: Graph clustering

Graph clustering is a specific area of clustering that deals with finding groups in data that can be represented as a graph. These algorithms have many applications, for example the analysis of sociological data, vision, social networks or web pages.

Papers

Graph Clustering S. Schaeffer
A tutorial on spectral clustering U. von Luxburg

Subject 3: Unsupervised attribute selection

Attribute selection is a preprocessing step needed in unsupervised knowledge discovery in order to reduce the number of irrelevant attributes that obscure the data.

Papers

Feature selection for unsupervised learning J. Dy, C. Brodley
Unsupervised feature selection using feature similarity P. Mitra, C. Murthy, S. Pal

Subject 4: Clustering of datastreams

An important problem in knowledge discovery arises when the data is a continuous stream. This means that the whole dataset is not available for processing at the beginning. The goal is to develop algorithms that can incrementally build a model of the data. This model has to adapt to any changes in the concepts described by the data stream.
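
To make the idea of building a model incrementally more concrete, here is a minimal sketch of a sequential (online) k-means update, in which each incoming point moves its nearest centroid and is then discarded. The class name, the `rate` parameter and the toy stream are assumptions made for this example; it is not the method of any of the papers listed for this subject. Using a constant `rate` instead of the running mean makes old points fade out, which is one simple way to follow changes in the stream.

```python
class OnlineKMeans:
    """One-pass k-means: each point is seen once and never stored."""

    def __init__(self, k, rate=None):
        self.k = k
        self.rate = rate      # None: running mean; constant: forget old data
        self.centroids = []
        self.counts = []

    def update(self, x):
        """Assign x to its nearest centroid, move that centroid, return its index."""
        if len(self.centroids) < self.k:
            # The first k points seed the centroids.
            self.centroids.append(list(x))
            self.counts.append(1)
            return len(self.centroids) - 1
        nearest = min(
            range(self.k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(self.centroids[c], x)),
        )
        self.counts[nearest] += 1
        step = self.rate if self.rate is not None else 1.0 / self.counts[nearest]
        self.centroids[nearest] = [
            c + step * (a - c) for c, a in zip(self.centroids[nearest], x)
        ]
        return nearest


model = OnlineKMeans(k=2)
for point in [(0, 0), (10, 10), (0, 2), (10, 12)]:  # hypothetical stream
    model.update(point)
print(model.centroids)
```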

Papers

Mining high speed datastreams P. Domingos, G. Hulten
Clustering Datastreams S. Guha, N. Mishra, R. Motwani, L. O'Callaghan
Mining time changing datastreams Geoff Hulten, Laurie Spencer, Pedro Domingos

Subject 5: Frequent trees/graphs discovery

The next step in knowledge discovery is to use structured datasets in the discovery process. A lot of data can be represented as trees or graphs; the discovery of frequent substructures aims to extend the research on association rules to structured data.

Papers

Frequent subtree mining: an overview Y. Chi, R. Muntz, S. Nijssen, J. Kok
Frequent Subgraph Discovery Michihiro Kuramochi and George Karypis
On canonical forms for frequent graph mining C. Borgelt

Subject 6: Clustering in bioinformatics

The particularities of the data in the bioinformatics area call for specific clustering methodologies. The data mining of DNA and proteins has yielded new problems and a new kind of clustering algorithm.

Papers

Cluster Analysis for Gene Expression Data: A Survey Daxin Jiang and Aidong Zhang
Algorithmic approaches to cluster gene expression data R. Shamir, R. Sharan

Subject 7: Clustering of documents

One of the applications of clustering algorithms is the organization of large corpora of documents. This area lies between data mining and document retrieval.

Papers

Hierarchical document clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester
Web Document Clustering: A Feasibility Demonstration Oren Zamir and Oren Etzioni
Evaluation of Hierarchical Clustering Algorithms for Document Datasets Ying Zhao, George Karypis
Hierarchical Clustering Algorithms for Document Datasets Ying Zhao, George Karypis

Subject 8: Parallel/Distributed Clustering

The need to cluster huge amounts of data has led to algorithms that reduce the computational cost by dividing the task. There are two different approaches: on the one hand, algorithms that use parallel processing, with multiple threads that need to communicate to maintain cluster information; on the other hand, algorithms that use the map/reduce paradigm, which merge the results of running the same clustering algorithm on different partitions of the dataset.

Papers

Clustering Very Large Multi-dimensional Datasets with MapReduce Robson L. F. Cordeiro, Caetano Traina Jr., Agma J. M. Traina, Julio López, U Kang, Christos Faloutsos

Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching Andrew McCallum, Kamal Nigam,

Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics Victor Olman, Fenglou Mao, Hongwei Wu, and Ying Xu
Parallel Spectral Clustering Yangqiu Song, Wen-Yen Chen, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang

Subject 9: One-class classification

Sometimes we are only interested in a model/representation of a specific class, and we either have no information about examples from other classes or have only a very small subset of them compared with the data from the target class. The goal is to have a model that can classify new examples, up to a confidence factor, as members or non-members of the target class.
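
As a rough illustration of the setting, the sketch below describes the target class by the centroid of its examples and a radius chosen so that a given fraction of the training examples is accepted; a new example is a member if it falls within that radius. The function names, the `accept_fraction` parameter and the toy data are assumptions made for this example, and this simple boundary is not claimed to be the method developed in the reference below.

```python
import math


def fit_one_class(examples, accept_fraction=0.95):
    """Return (center, radius) so that about `accept_fraction` of the
    target-class training examples lie within `radius` of `center`."""
    dim = len(examples[0])
    center = [sum(x[d] for x in examples) / len(examples) for d in range(dim)]
    distances = sorted(math.dist(x, center) for x in examples)
    index = min(len(distances) - 1,
                math.ceil(accept_fraction * len(distances)) - 1)
    return center, distances[index]


def is_member(x, model):
    """True if x lies inside the boundary learned from the target class."""
    center, radius = model
    return math.dist(x, center) <= radius


# Hypothetical target-class examples; no examples of other classes are used.
target = [(0, 0), (1, 0), (0, 1), (1, 1)]
model = fit_one_class(target)
print(is_member((0.5, 0.5), model), is_member((5, 5), model))
```

Lowering `accept_fraction` tightens the boundary, trading more rejected target examples for fewer accepted outliers.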

Papers

One-Class Classification David Tax, PhD thesis

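Técnicas Avanzadas de Aprendizaje (TAVA) [Advanced Learning Techniques]

Máster en Inteligencia Artificial [Master in Artificial Intelligence]

Aprendizaje Automático y Adquisición del Conocimiento [Machine Learning and Knowledge Acquisition]

Doctorado en Inteligencia Artificial [PhD in Artificial Intelligence]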

2011 Fall Term