This is the first option for the coursework of the unsupervised learning part of the course.
You have to pick one of the proposed subjects and write a state-of-the-art report at least 7500 words long.
You can pick any of the topics; there is no problem if more than one student works on the same topic. The only constraint is that the work has to be done individually.
If you want to propose another topic related to the course, contact me and we will talk about it.
The work
One of the first things you need to do when you begin to work on a research topic is to look for recent (and not so recent) papers written on the topic to get an idea of the state of the art.
This involves collecting information about the problems that define the area, the formal definitions of the problems, the main solutions proposed, the different extensions to the solutions and the open problems.
The idea of this assignment is to search for material that defines the state of the art of one of the proposed subjects and write a report summarizing all your findings. The report has to be organized and clear, so that anyone who is interested in the subject can use it as a starting point.
The report has to describe the problems involved in the topic and their motivation, and the different approaches that exist to the problem, giving a brief explanation of each and commenting on whether some approaches are better than others or what advantages they have over the others.
So, the main tasks of the assignment are:
- To look for papers related to the topic
- To collect relevant bibliography on the topic
- To choose the most relevant papers
- To summarize the problems described and the approaches presented in those papers
- To situate the problems in the area and to describe the relation of the topic with other areas
- To describe the main areas of application of the topic
Here you have two state-of-the-art papers on two different topics, so you can get an idea of what a paper of this kind looks like:
- S. B. Kotsiantis, Feature selection for machine learning classification problems: a recent overview
- Lior Rokach, Ensemble-based classifiers
In order to look for papers on a subject you can use different sources of information, but mainly you can use sites specialized in scientific bibliographic search, such as:
- The Collection of Computer Science Bibliographies
- Google Scholar
- Citeseer
- The DBLP Computer Science Bibliography
- IEEE Xplore
- SpringerLink
- ScienceDirect
The deadline for this report is January 9th.
You have a document posted in the Racó explaining the assignment and an evaluation rubric that details what is expected from the report.
You have to upload the report in PDF format to the Racó following the instructions that will be posted at the beginning of January.
Subject 1: Cluster ensembles/consensus
The goal of cluster combination is to obtain a more accurate clustering of a dataset by combining the results of a set of clusterings. The different approaches can be embedded in the clustering process or work only with the resulting partitions.
Keywords: Consensus clustering, unsupervised ensembles, cluster aggregation, cluster diversity, consensus measures, multiple partitions
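As a rough illustration of an approach that works only with the resulting partitions, the sketch below builds a co-association matrix (the fraction of partitions in which two items share a cluster) and groups items whose co-association exceeds a threshold. The function names, the 0.5 threshold and the toy partitions are assumptions made for this example; it is only one of many consensus schemes, not a reference method.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Connected components of the graph linking pairs with co-association > threshold."""
    ca = co_association(partitions)
    n = len(ca)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and ca[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three toy partitions of six items; the second uses different label
# names and places item 2 in the other group.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
print(consensus(partitions))
```

Because the matrix only records whether two items share a cluster, the arbitrary label names used by each base clustering do not matter.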
Subject 2: Graph clustering
Graph clustering is a specific area of clustering that deals with finding groups in data that can be represented as a graph, or in datasets where the instances are graphs. There are many applications for these algorithms, for example the analysis of sociological data, vision, social networks or web pages.
Keywords: Graph partitioning, spectral graph methods, graph distances, graph connectivity
Subject 3: Unsupervised attribute selection
Attribute selection is a preprocessing step needed in unsupervised knowledge discovery in order to reduce the number of irrelevant attributes that obfuscate the data. There are many methods for supervised attribute selection; the unsupervised methods use very different approaches.
Keywords: Cluster Feature weighting, feature salience, attribute clustering, Laplacian score, PCA feature selection, model clustering feature selection
Subject 4: Clustering of datastreams
An important problem in knowledge discovery arises when the data that we have is a continuous stream. This means that the whole dataset is not available to process at the beginning. The goal is to develop algorithms that can incrementally build a model of the data. This model has to adapt to any changes in the concepts described by the datastream.
Keywords: Evolving datastreams, microclusters, sliding window clustering, hierarchical stream clustering
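To make the idea of incrementally building a model more concrete, here is a minimal sketch of sequential (online) k-means, where each arriving point updates only its nearest centroid. The function name, the starting centroids and the toy stream are assumptions made for the example, and this simple version does not handle concept change, which real stream clustering methods have to address.

```python
def update(centroids, counts, x):
    """Assign point x to its nearest centroid and move that centroid towards x."""
    nearest = min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)),
    )
    counts[nearest] += 1
    rate = 1.0 / counts[nearest]  # running-mean step size
    centroids[nearest] = [a + rate * (b - a) for a, b in zip(centroids[nearest], x)]
    return nearest


# Hypothetical starting centroids, each counted as one observation.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]

# Points are processed one at a time and never stored.
stream = [[1.0, 1.0], [9.0, 11.0], [2.0, 0.0]]
assignments = [update(centroids, counts, x) for x in stream]
print(assignments)
print(centroids)
```

Replacing the running-mean step size with a fixed one would weight recent points more heavily, which is one simple way to let the model follow a changing stream.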
Subject 5: Frequent trees/graphs discovery
The next step in knowledge discovery is to use structured datasets in the discovery process. A lot of data can be represented as trees or graphs; the discovery of frequent substructures aims to extend the research on association rules to structured data.
Keywords: Frequent graphs, maximal subgraphs, closed subgraphs, canonical graph representation
Subject 6: Semisupervised Clustering
Sometimes we have additional information about our data and this can be used to obtain better results. Semisupervised clustering assumes that we can obtain from the domain information about some of the examples, mainly whether they have to be put together in the same cluster or must belong to different clusters. This information can also be used to learn distance functions that warp the space of instances and allow for a better representation of their relations.
Keywords: Metric Learning, kernel clustering, feature projection, constrained clustering
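The pairwise information described for Subject 6 is commonly written as must-link and cannot-link constraints. The sketch below only checks which constraints a given cluster assignment breaks; the function name, the constraint lists and the labels are illustrative assumptions, and a constrained clustering algorithm would use such a check (or a penalty based on it) while building the clusters.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the pairwise constraints that a cluster assignment breaks."""
    broken = []
    for i, j in must_link:
        if labels[i] != labels[j]:
            broken.append(("must-link", i, j))
    for i, j in cannot_link:
        if labels[i] == labels[j]:
            broken.append(("cannot-link", i, j))
    return broken


# Toy assignment of five examples to two clusters.
labels = [0, 0, 1, 1, 1]
must_link = [(0, 1), (1, 2)]    # pairs that should share a cluster
cannot_link = [(2, 3), (0, 4)]  # pairs that should be separated
print(violated_constraints(labels, must_link, cannot_link))
```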
Subject 7: One-class classification
Sometimes we are only interested in a model/representation of a specific class and we have no information about examples from other classes, or we have only a very small subset of them compared with the data from the target class. The goal is to have a model that allows us to classify new examples, up to a confidence factor, as members or non-members of the only class.
Keywords: One class Learning, One class SVM, density one class classification, boundary one class classification, autoencoders
Subject 8: Time Series Motifs, Novelty Detection, Burst Detection
When analyzing time series, sometimes we are interested in detecting specific episodes that appear frequently, or in detecting novel patterns that represent episodes very different from the rest of the series behaviour. The goal is to extract these episodes efficiently from univariate or multivariate time series.
Keywords: Time series Motifs, Time series segmentation, Time series representation, surprising motifs, time series distances, time series feature extraction
Subject 9: Subspace Clustering/Coclustering
Sometimes clusters belong to smaller attribute spaces, so using all the attributes to describe them makes the discovery process more difficult. Including an attribute selection/weighting process during the clustering allows the specific subspaces to be discovered. Usually a grid or model based approach is the better option because that selection/weighting scheme comes naturally. More specific application areas like bioinformatics have developed related algorithms such as biclustering and coclustering algorithms.
Keywords: Subspace clustering, Correlation clustering, BiClustering, coclustering
Subject 10: Document clustering
Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster have high similarity to one another, but are dissimilar to documents in other clusters. This task can be used for different problems involving text mining and web mining and it is closely related to the information retrieval area.
Keywords: Hierarchical clustering, conceptual clustering, information retrieval, text/web mining
Subject 11: Unsupervised Deep Learning
Deep learning is currently a hot topic, but most of the applications use supervised learning. There is another side of deep learning that is focused on unsupervised methods for extracting features from data without a specific target goal. These methods include autoencoders and all their variations (sparse, variational) and their relationship with PCA. Other methods include Restricted Boltzmann Machines as part of architectures such as Deep Belief Networks.
Keywords: Deep learning, Autoencoders, Sparse autoencoders, Restricted Boltzmann Machines, Deep Belief networks
Subject 12: Clustering and Image Processing
Clustering is an important method for image processing, from the extraction of characteristics, to the segmentation and extraction of the elements in an image, and the organization of large datasets of images for retrieval.
Different methods can be used for each one of those problems, but there are some specificities that have to be addressed, and the current size of image repositories makes scalability a real challenge (for example for search engines).
Keywords: Image Segmentation, Image Clustering, Image Organization and retrieval.
Subject 13: Clustering and Microarray Data
The analysis of microarray data presents a real challenge for machine learning methods. The main problem is the imbalance between the number of attributes and the number of examples. Several methods have been proposed for this kind of dataset; some are variations of classical algorithms, but others are related to the coclustering and biclustering areas.
Keywords: Microarray Data, Coclustering, biclustering.
Subject 14: Distributed Clustering
The problem of scaling algorithms to large-scale datasets has resulted in different variations of classical algorithms and new strategies to merge distributed results. There are different strategies for distributed clustering, including map-reduce approaches or P2P networks of clustering processes. Some solutions only guarantee an approximation of the single-process algorithms; others use clever strategies to obtain the same results.
Keywords: Map reduce, P2P clustering, consensus algorithms.