Your Course Project
Your class project is an opportunity for you to explore an interesting machine learning problem of your choice in the context of a realworld data set. Below, you will find some project ideas, but the best idea would be to combine machine learning with problems in your own research area. Your class project must be about new things you have done this semester; you can't use results you have developed in previous semesters.
Projects can be done by you as an individual, or in teams of two students. Each project will also be assigned a 701 instructor as a project consultant/mentor. They will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 20% of your final class grade, and will have two final deliverables:

a writeup in the format of a NIPS paper (8 pages maximum, 6 pages minimum, in NIPS format, including references; this page limit is strict), due to December 9 (emailed to the instructors list, NO LATE SUBMISSION ACCEPTED since we need to get your grades in), worth 60% of the project grade, and

a poster presenting your work for a special ML class poster session (December 4, 3pm6pm, in the NSH atrium) worth 20% of the project grade.
In addition, you must turn in a midway progress report (5 pages maximum in NIPS format, including references) describing the results of your first experiments by the milestone due date (Nov 12th, by 10:30pm, through email to your project TA), worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.
You must
turn in a brief project proposal (1page maximum) on Oct 7th, in class.
A list of suggested projects and data sets are posted below. Read the list
carefully. You are encouraged to use one of the suggested
data sets, because we know that they have been successfully used for
machine learning in the past. If you prefer to use a
different data set, we will consider your proposal, but you must have
access to this data already, and present a clear proposal for what you
would do with it.
Project proposal format: Proposals should be one page maximum. Include the following information:

Your name and andrew ID on top of the page

Project title

Data set

Project idea.This should be approximately two paragraphs.

Software you will need to write.

Papers to read. Include 13 relevant papers. You will probably want to read at least one of them before submitting your proposal.

Teammate: will you have a teammate? If so, whom? Maximum team size is two students.

What will you complete by the project milestone due date? Experimental results of some kind are expected here.
Here are some details on the poster format.
 The two most common ways to make your poster are:
 You can create a huge slide (in powerpoint, using beamer, etc) and print it out at one of the poster printers available (for SCS, details below). You'd want the shorter dimension to be less than 36'' since the special poster printers use paper rolls of width 36''.
 You can create a bunch of "normal" presentation slides, print out each one on a piece of (lettersized) paper, and put them all together on a poster board. This is somewhat messier (and not as pretty) but you don't need to wait for the special poster printer to print.
 We will provide you with foam boards onto which you can tack on your poster (or your slides for the poster). The foam boards are 32'' by 40''.
 Printing: Unfortunately, we don't have a budget to pay for printing.
If you are an SCS student,
SCS has a poster printer you can use which prints on a 36" wide
roll of paper. You have to call Operations and request to be added to
the queue and they will call you back and tell you when you are able
to print. It takes a while to print however especially if there are
people ahead of you already printing so you can't wait until the last
hour (or last day). But they are open 24/7. The printer sometimes runs
out of paper, and you should really start early!!! The number of
operations is 82608 or you can call the help desk (84231) to get the
number.
If you are a student outside SCS, you will need to check with your department to see if there are printing facilities for big posters (we're not sure what is offered outside SCS), or print a set of regular sized pages.
Datasets and project suggestions
You can see the datasets and the project suggestions of the 2007 class by clicking here. Here are some project ideas and datasets for this year:CMU Multi Modal Activity Database (KATE)
Sequence Labeling in Natural Language Processing (SHAY)
Most natural language processing models involve structured prediction, where we have a complex structure that we are trying to predict, usually, based on a natural language sentence.
The simplest of all of these structures is a chain, as it relates to sequence labeling. Such sequence labeling requires labeling each word in a sentence with some label, which describe its properties as it relates to the whole sentence or to the whole corpus of sentences.
In this project, you will choose one of the common NLP sequence labeling data sets, and implement a method to do learning and prediction on that data set.
For a short survey of sequence labeling algorithms, see: Comparisons of Sequence Labeling Algorithms and Extensions, N. Nguyen and Y. Guo, Proceedings of ICML, 2007
 Named Entity Recongition Data  from the CoNLL 2003 shared task. With named entity recongition, you are required to label each word in a sentence with its entity type, such as organization name, a person name, or a location. If the word does not belong to one of these categories, you label it with a null label. See Wikipedia's entry about NER for more information.
 PartofSpeech Tagging Data  With partofspeech tagging, you are required to label each word with its partofspeech, such as noun, verb, etc. (usually a more finegrained classes are devised.) For this, you would need data such as the WSJ Penntree bank, which is available from the LDC consortium. Because of copyright issues, we cannot put the data online. However, if you want to get the data and cannot access it, contact us at the instructors list, and we will send it to you. If you want to do partofspeech tagging in different languages, and not just English, check out this page. Although it contains data sets for a different task, it also has data in many languages with sentences and their partofspeech tags.
 Semantic Role Labeling Data  from the CoNLL 2005 shared task. Semantic role labeling tries to dive deeper into the semantics of language, and label each word as being potentially an argument to a verb or not. SRL requires a deeper knowledge of NLP (because it usually requires parse trees), but at the bottom line, it is also resembles a sequence labeling task.
 NP bracketing  NP bracketing is an NLP problem which is a little bit more advanced than sequence labeling. For this task, you need to recognize nonoverlapping parts of sentences that constitue noun phrases.
Dimensionality Reduction using Spectral Methods and Fractal Dimension (BABIS)
High dimensional data typically are not high dimensional! Precisely, they are not intrinsically high dimensional. Assume that you have a dataset of n different points lying in a high dimensional Euclidean space. Even if the dimensionality of the ambient space is high, in many cases the data lie on some low dimensional subspace. A toy dataset is shown in the image below, and is known as the swiss roll. How do you learn the swiss role underlying geometry
by a sufficiently large set of points sampled from it?
A wealth of dimensionality reduction techniques, linear and nonlinear exist and the goal of this project is to experiment with those, learn about them, understand them and apply them to a dataset of your interest. Such methods are the following: Principal Component Analysis and Metric Multidimensional Scaling (Linear), ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps, Maximum Variance Unfolding, Locally Linear Embeddings, Kernel PCA.
You will also have the opportunity to learn about fractal dimension. Even if the notion of intrinsic dimensionality seems clear, i.e., the intrinsic dimensionality of a random vector should be the minimum number of variables to describe it, it is not. A generalization of this is the fractal dimension. You will learn tools (e.g., correlation plot) which compute the intrinsic dimension of a dataset and more on fractals.
Datasets
Just an indicative list of what type of dataset you can use:
 UCI repository
 Flowingdata  30 Resources to Find the Data You Need (thanks to Brendan O`Connor)
 Amazon Web Services Public Data Sets (thanks to Wolfgang Richter)
 Rawdad  A Community Resource for Archiving Wireless Data At Dartmouth (thanks to Wolfgang Richter)
 Digit and face datasets be downloaded using this link.
Related Papers:
 Hessian Eigenmaps
 Laplacian Eigenmaps
 ISOMAP
 Kernel PCA
 Maximum Variance Unfolding
 Nonlinear Dimensionality Reduction by Locally Linear Embedding
 A paper which surveys most of the above methods is Spectral Methods for Dimensionality Reduction
 Fractal Dimension for Data Mining
 For your information, a nice book which describes the aforementioned methods is Nonlinear Dimensionality Reduction by Lee, Verleysem.
Related Videos
Graph methods and geometry of data
Mikhail Belkin
Semisupervised Learning, Manifold Methods
Mikhail Belkin
Geometric Methods and Manifold Learning
Partha Niyogi, Mikhail Belkin
2 videos
Learning using Tensors (BABIS)
Many real world processes generate data which are multiaspect. Few such processes are the following:
A) In a sensor network we have a set of sensors which measure some types of measurements over time. These types can be voltage of the battery of the sensor, humidity, temperature and light intensity. This data is naturally modeled by a three way tensor, that is a matrix with three different aspects: time, sensor id and measurement id.
B) fMRI brain scans result also in multiaspect data. For example consider having a set if subjects (people) performing a set of different tasks for several times (trials). During each experiment we measure the activation of voxels over time. Again this type of data is naturally modeled by a tensor voxels x subjects x trials x task conditions x timeticks.
C) One of the big successes in Machine Learning is digit classification. US Post offices heavily rely on machine learning to classify among others digits. One way to model this type of data is to use 10 different tensors one for each digit. Each tensor i has three aspects: pixels (horizontically) x pixels (vertically) x #images of digit i.
D) Consider taking pictures of a set of people under different light conditions, asking them to perform a set of different face expressions. Again this data can be modeled as a tensor: person id x face expression x light conditions.
The goal of this project is to learn more about tensor decompositions and apply them in a dimensionality reduction problem and for learning (e.g. classifying digits).
Datasets
 fMRI data
 3 Mode Company with several datasets and related papers.
 Rasmus Bro web page with several datasets.
 Sensor Network Data
Related Papers
 Survey on Tensors
 Two heads better than one: pattern discovery in timeevolving multiaspect data
 Handwritten digit classification using higher order singular value decomposition
 TensorCUR decompositions for tensorbased data Another interesting application of CUR is to apply it for dimensionality reduction, i.e., choose a good set of features out of a large number of them. Many times, choosing a set of existing features rather than a linear combination is desired since there is a straightforward interpretation of them.
 MACH: Fast Randomized Tensor Decompositions
The next lines are for those interested in randomized algorithms in the context of tensors. These algorithms rougly speaking sacrifice some accuracy for the sake of speed. But they also have other advantages that you will have the chance to explore.
Related Videos
Multilinear (tensor) manifold data modeling (Vasilescu)
Graph Embeddings for Learning (BABIS)
Many of the spectral methods that are used for nonlinear dimensionality reduction have a first step, which creates a graph from the underlying set of points. The performance of the method depends on how ''good'' the graph is. Recently a paper in ICML proposes a way to create such graphs, which is very interesting from a combinatorial point of view too.
The goal in this project is to investigate their method and compare it to other standard methods of constructing a graph (k nearest neigbors, epsilon balls). By investigative we mean the following: you will use the different methods to create a graph and then you will input those graphs to 13 algorithms (e.g., spectral clustering, Laplacian Eigenmaps etc) for performing different machine learning tasks such as clustering, classification and regression and compare the performance.
Datasets
 UCI repository , see related paper for the datasets used therein to get an idea.
Related Papers
Fitting a Graph to Vector Data
Related Videos
Fitting a Graph to Vector Data
Daniel A. Spielman
Fitting a Graph to Vector Data
Samuel I. Daitch