Your Course Project

Your class project is an opportunity for you to explore an interesting machine learning problem of your choice in the context of a real-world data set.  Below, you will find some project ideas, but the best idea would be to combine machine learning with problems in your own research area. Your class project must be about new things you have done this semester; you can't use results you have developed in previous semesters.

Projects can be done by you as an individual, or in teams of two students.   Each project will also be assigned a 701 instructor as a project consultant/mentor.   They will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours.  Your project will be worth 20% of your final class grade, and will have two final deliverables:

  1. a writeup in the format of a NIPS paper (8 pages maximum, 6 pages minimum, in NIPS format, including references; this page limit is strict), due on December 9 (emailed to the instructors list; NO LATE SUBMISSIONS ACCEPTED, since we need to get your grades in), worth 60% of the project grade, and

  2. a poster presenting your work for a special ML class poster session (December 4, 3pm-6pm, in the NSH atrium) worth 20% of the project grade. 

 In addition, you must turn in a midway progress report (5 pages maximum in NIPS format, including references) describing the results of your first experiments by the milestone due date (Nov 12th, by 10:30pm, through email to your project TA), worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.


Project Proposal

You must turn in a brief project proposal (1-page maximum) on Oct 7th, in class.  A list of suggested projects and data sets is posted below. Read the list carefully.  You are encouraged to use one of the suggested data sets, because we know that they have been used successfully for machine learning in the past.   If you prefer to use a different data set, we will consider your proposal, but you must already have access to this data, and you must present a clear proposal for what you would do with it. 

Project proposal format:  Proposals should be one page maximum.  Include the following information:



Here are some details on the poster format.


Datasets and project suggestions    

You can see the datasets and the project suggestions of the 2007 class by clicking here.  

Here are some project ideas and datasets for this year:

CMU Multi Modal Activity Database (KATE)
Description of this data set is available here.

Sequence Labeling in Natural Language Processing (SHAY)

Most natural language processing models involve structured prediction: we have a complex structure that we are trying to predict, usually based on a natural language sentence.

The simplest of these structures is a chain, which arises in sequence labeling. Sequence labeling requires labeling each word in a sentence with some label that describes its properties in relation to the whole sentence or to the whole corpus of sentences.

In this project, you will choose one of the common NLP sequence labeling data sets, and implement a method to do learning and prediction on that data set.
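To make the task concrete, here is a minimal sketch of one classical approach: a hidden Markov model decoded with the Viterbi algorithm. The tag set, transition, and emission probabilities below are invented toy values, not taken from any real data set.

```python
# Minimal Viterbi decoder for a toy POS-tagging HMM.
# All probabilities here are invented for illustration.
def viterbi(words, tags, start, trans, emit):
    """Return the most likely tag sequence for `words`."""
    # V[-1][t] = (prob of best path ending in tag t, that path)
    V = [{t: (start[t] * emit[t].get(words[0], 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            # Best previous tag s to transition from into t.
            p, path = max(
                (V[-1][s][0] * trans[s][t] * emit[t].get(w, 1e-6), V[-1][s][1])
                for s in tags
            )
            layer[t] = (p, path + [t])
        V.append(layer)
    return max(V[-1].values())[1]

tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {"DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
         "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
         "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
emit = {"DET": {"the": 0.9},
        "NOUN": {"dog": 0.5, "walks": 0.1},
        "VERB": {"walks": 0.6, "dog": 0.05}}

print(viterbi(["the", "dog", "walks"], tags, start, trans, emit))
```

A real project would learn these tables from labeled data (e.g., by counting, or discriminatively with a CRF) rather than hand-coding them.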

For a short survey of sequence labeling algorithms, see:
The following data sets are available, but you are welcome to find data sets for sequence labeling on your own:

Dimensionality Reduction using Spectral Methods and Fractal Dimension (BABIS)

High dimensional data typically are not high dimensional! More precisely, they are not intrinsically high dimensional. Assume that you have a dataset of n different points lying in a high dimensional Euclidean space. Even if the dimensionality of the ambient space is high, in many cases the data lie on some low dimensional subspace. A toy dataset, known as the swiss roll, is shown in the image below. How do you learn the swiss roll's underlying geometry from a sufficiently large set of points sampled from it?

A wealth of dimensionality reduction techniques, linear and non-linear, exists, and the goal of this project is to experiment with them, learn about them, understand them, and apply them to a dataset of your interest. Such methods include: Principal Component Analysis and Metric Multidimensional Scaling (linear), ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps, Maximum Variance Unfolding, Locally Linear Embeddings, and Kernel PCA.
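To make the linear case concrete, here is a minimal pure-Python sketch of PCA in two dimensions, projecting points onto their direction of maximum variance; the data points are invented for illustration (they lie roughly along the line y = 2x).

```python
import math

# Toy 2-D PCA: project points onto the direction of maximum variance.
def pca_1d(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    # Leading eigenvalue of a symmetric 2x2 matrix (closed form).
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Its eigenvector is (b, lam - a), normalized (valid when b != 0).
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # 1-D coordinate of each point: projection of the centered point.
    return [(x - mx) * vx + (y - my) * vy for x, y in points]

pts = [(0.0, 0.1), (1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]
coords = pca_1d(pts)
```

In higher dimensions you would of course use a library eigensolver instead of the 2x2 closed form; the point is only that the leading eigenvector of the covariance matrix gives the best 1-D linear summary of the data.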
You will also have the opportunity to learn about fractal dimension. Although the notion of intrinsic dimensionality seems clear (the intrinsic dimensionality of a random vector should be the minimum number of variables needed to describe it), it is not, and the fractal dimension is one generalization of it. You will learn about tools (e.g., the correlation plot) that compute the intrinsic dimension of a dataset, and more about fractals.
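A rough sketch of the idea behind the correlation plot, in the spirit of the Grassberger-Procaccia correlation dimension: count the fraction of point pairs within radius r, and read the intrinsic dimension off the slope of log C(r) versus log r. The helix data set and radii below are illustrative; the helix is a curve in 3-D, so the estimated dimension should come out near 1.

```python
import math
import random

# Correlation sum C(r): fraction of point pairs closer than r.
def correlation_sum(points, r):
    n = len(points)
    close = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(points[i], points[j]) < r
    )
    return 2.0 * close / (n * (n - 1))

random.seed(0)
# Points on a helix: intrinsically 1-dimensional, ambient dimension 3.
pts = [(math.cos(t), math.sin(t), 0.1 * t)
       for t in (random.uniform(0, 3) for _ in range(300))]

# Slope of log C(r) vs log r between two radii estimates the
# intrinsic (correlation) dimension.
r1, r2 = 0.05, 0.2
slope = (math.log(correlation_sum(pts, r2)) -
         math.log(correlation_sum(pts, r1))) / (math.log(r2) - math.log(r1))
```

A proper correlation plot would use many radii and fit the slope over the linear regime rather than just two points; this two-radius version only conveys the mechanism.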

Just an indicative list of what type of dataset you can use:

Related Papers:

Related Videos

Graph methods and geometry of data

Mikhail Belkin

Semi-supervised Learning, Manifold Methods

Mikhail Belkin

Geometric Methods and Manifold Learning

Partha Niyogi, Mikhail Belkin
2 videos

Learning using Tensors (BABIS)
Many real world processes generate multi-aspect data. A few such processes are the following:

A) In a sensor network we have a set of sensors, each taking several types of measurements over time. These types can be the voltage of the sensor's battery, humidity, temperature, and light intensity. This data is naturally modeled by a three-way tensor, that is, an array with three different aspects: time, sensor id, and measurement id.

B) fMRI brain scans also produce multi-aspect data. For example, consider a set of subjects (people) each performing a set of different tasks several times (trials). During each experiment we measure the activation of voxels over time. Again, this type of data is naturally modeled by a tensor: voxels x subjects x trials x task conditions x timeticks.

C) One of the big successes in Machine Learning is digit classification. US Post Offices rely heavily on machine learning to classify, among other things, digits. One way to model this type of data is to use 10 different tensors, one for each digit. Tensor i has three aspects: pixels (horizontal) x pixels (vertical) x #images of digit i.

D) Consider taking pictures of a set of people under different lighting conditions, asking them to perform a set of different facial expressions. Again, this data can be modeled as a tensor: person id x facial expression x lighting condition.

The goal of this project is to learn more about tensor decompositions and apply them to a dimensionality reduction problem and to learning (e.g., classifying digits).
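As a small starting point, the basic operation behind decompositions such as Tucker/HOSVD is the mode-n unfolding (matricization) of a tensor: flattening it into a matrix whose rows are indexed by one mode and whose columns run over the remaining modes. The 2 x 2 x 2 tensor below is a toy example.

```python
# Mode-n unfolding (matricization) of a 3-way tensor stored as nested lists.
def unfold(tensor, mode):
    """Unfold a tensor of shape (I, J, K) along `mode` (0, 1, or 2).

    Columns are ordered with the earlier remaining mode varying fastest.
    """
    I, J, K = len(tensor), len(tensor[0]), len(tensor[0][0])
    if mode == 0:   # rows indexed by i; columns run over (j, k)
        return [[tensor[i][j][k] for k in range(K) for j in range(J)]
                for i in range(I)]
    if mode == 1:   # rows indexed by j; columns run over (i, k)
        return [[tensor[i][j][k] for k in range(K) for i in range(I)]
                for j in range(J)]
    # mode == 2: rows indexed by k; columns run over (i, j)
    return [[tensor[i][j][k] for j in range(J) for i in range(I)]
            for k in range(K)]

# A 2 x 2 x 2 toy tensor, e.g. sensor x measurement x time.
T = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]
```

In HOSVD, for example, one computes an SVD of each unfolding to obtain the factor matrix for that mode; every unfolding contains exactly the same entries, just rearranged.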


Related Papers

Related Videos

Multilinear (tensor) manifold data modeling (Vasilescu)

Graph Embeddings for Learning (BABIS)
Many of the spectral methods used for non-linear dimensionality reduction have a first step that creates a graph from the underlying set of points. The performance of the method depends on how ''good'' the graph is. A recent ICML paper proposes a way to create such graphs that is also very interesting from a combinatorial point of view.
The goal in this project is to investigate their method and compare it to other standard methods of constructing a graph (k nearest neighbors, epsilon balls). By investigate we mean the following: you will use the different methods to create a graph, and then you will input those graphs to 1-3 algorithms (e.g., spectral clustering, Laplacian Eigenmaps, etc.) performing different machine learning tasks such as clustering, classification, and regression, and compare the performance.
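As a warm-up, the two standard baseline constructions mentioned above can be sketched in a few lines; the points and the parameters k and epsilon below are illustrative.

```python
import math

# Two baseline graph constructions over a point set:
# the (symmetrized) k-nearest-neighbor graph and the epsilon-ball graph.
def knn_graph(points, k):
    """Symmetrized k-NN graph, as a set of undirected edges (i, j), i < j."""
    edges = set()
    for i, p in enumerate(points):
        # Sort all other points by distance; ties broken by index.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def epsilon_graph(points, eps):
    """Connect every pair of points at distance less than eps."""
    return {
        (i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if math.dist(points[i], points[j]) < eps
    }

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
```

Note the characteristic difference: the k-NN graph connects the outlier at (5, 5) to its nearest neighbor no matter how far away it is, while the epsilon-ball graph leaves it isolated; this is exactly the kind of behavior you would compare against the ICML paper's construction.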


Related Papers

Fitting a Graph to Vector Data

Related Videos

Fitting a Graph to Vector Data

Daniel A. Spielman

Fitting a Graph to Vector Data

Samuel I. Daitch