Computational Intelligence Lab 2020
This laboratory course teaches fundamental concepts in computational science and machine learning based on matrix factorization,
where a data matrix X is (approximately) factorized into two matrices A and B. Depending on the choice of approximation quality measure and the constraints on A and B, this factorization yields a powerful framework of numerical linear algebra that encompasses many important techniques, such as dimensionality reduction, clustering, and sparse coding.
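As a minimal illustration of the framework, the sketch below approximates a data matrix X by a rank-k product A @ B using the truncated SVD, which is the optimal rank-k approximation in Frobenius norm. The matrix sizes and rank are arbitrary choices for the example.

```python
import numpy as np

# Approximate X (n x m) by A @ B with A (n x k) and B (k x m), k << min(n, m).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))

k = 10
U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = U[:, :k] * s[:k]   # 100 x k: left singular vectors scaled by singular values
B = Vt[:k, :]          # k x 50: top-k right singular vectors
X_hat = A @ B          # rank-k reconstruction of X

err = np.linalg.norm(X - X_hat, "fro")
print(f"rank-{k} reconstruction error: {err:.3f}")
```

Other factorization methods covered in the course (e.g. non-negative matrix factorization, sparse coding) replace the Frobenius objective and/or add constraints on A and B.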
News
17.02.2020 | The Q&A forum Piazza is online. |
20.02.2020 | The first exercise sessions will take place on 20./21.02, i.e. in the first week of the semester. |
27.02.2020 | Kaggle competitions for the projects are launched. |
11.03.2020 | In light of the SARS-CoV-2 pandemic, we are switching to remote teaching via Zoom. The Zoom id for the lectures is https://ethz.zoom.us/j/6976230715. The Zoom id for the exercise sessions is https://zoom.us/j/2313680826. |
17.03.2020 | The lecture & exercise session recordings can be downloaded from our Polybox. The PW can be found on piazza. |
16.06.2020 | The deadline to hand in the project reports, as well as the deadline for the Kaggle competitions, is extended to 31 July, 23:59. |
Schedule
Week | Topic | Lecture | Exercises |
---|---|---|---|
8 | Introduction / Linear Autoencoder | ||
9 | Principal Component Analysis | ||
10 | Matrix Reconstruction | ||
11 | Non-Negative Matrix Factorization | ||
12 | Word embeddings | ||
13 | Clustering Mixture Models | ||
14 | Neural Networks | ||
15 | Easter Break (no lecture) | ||
16 | Easter Break (no lecture) | ||
17 | Generative models | ||
18 | Labour Day (no lecture) | ||
19 | Sparse Coding | ||
20 | Dictionary Learning | ||
21 | TBD | ||
Classes
Lectures | Fri 8-10 | HG E 7 |
Exercises | Thu 14-16 | CHN C 14 |
| Fri 15-17 | CAB G 61 |
Presence Hour | Mo 11-12 | CAB H 53 |
Video Recordings
Video Recordings of the 2020 Lecture, Video Recordings of the 2019 Lecture, Video Recordings of the 2018 Lecture
Piazza Q&A Forum
Please pose questions on the Q&A forum Piazza. You can sign up here using the lecture ID 263-0008-00L. You are more than welcome to participate in the discussion of your peers' questions.
Piazza Live Q&A
Prof. Hofmann's Live Q&A Notes. The Live Q&A recording can be downloaded from the CIL Polybox.
Exercises
Exercise sheets provide pen-and-paper as well as implementation problems, which help you solidify the theory presented in the lecture and identify gaps in your understanding. Active participation in the exercise classes is highly recommended.
Written Exam
The written exam takes 120 minutes. The language of examination is English. NO WRITTEN AIDS ALLOWED.
Old Exams
cil-exam-2010.pdf, cil-exam-2012.pdf, cil-exam-2015.pdf, cil-exam-2016.pdf, cil-exam-2017.pdf, cil-exam-2018.pdf
Grade
Your final grade will be determined by the written final exam (70% weight) and the semester project (30% weight). The project must be passed on its own and has a bonus/penalty function. Failing the project results in a failing grade for the overall examination of the CIL course.
Binding performance assessment information can be found in the Course Catalog
Semester Project
The semester project is an integral part of the CIL course. Participation is mandatory. Failing the project results in a failing grade for the overall CIL course.
You work in groups of three to four students (no more, no less) to develop novel solutions to one of four topics. You may use piazza to find team members or join an existing team.
Building on the implementations you develop during the semester, you and your teammates create a novel solution by combining and extending previous work. You compare your solution to at least two baselines and submit it to the online ranking system for competitive evaluation. Finally, you write up your methodology and present your experimental results in the form of a short scientific paper.
Project reports are due on Friday, 31 July, 23:59. Competitive submission deadlines are given on Kaggle.
Students repeating the course can either carry over their project grade from the previous year as-is, in which case they must inform us in advance, or resubmit a new project in a (new) regular group.
As part of the semester project, you and your teammates are expected to
- Develop a novel solution, e.g. by combining and extending methods from the programming exercises.
- Compare your novel solution to at least two baseline algorithms.
- Submit your novel solution for evaluation to the kaggle online ranking system.
- Write up your findings in a short scientific paper.
As a rough guide, you may approach the problem as follows: (i) Study the project description sheet. (ii) Download the training data and implement the baselines. (iii) Develop, debug and optimise your novel solution on the training data. (iv) Submit your solution for online evaluation on test data. (v) See where you stand in a ranking of all submissions.
Developing a Novel Solution
You are free to exploit any idea you have, provided it is not identical to any other group submission or existing implementation of an algorithm on the internet.
Comparison to Baselines
You must compare your solution to at least two baseline algorithms. For the baselines, you can use the implementations you developed as part of the programming exercises or come up with your own relevant baselines.
Ranking of Novel Solution
You must submit your novel algorithm to the kaggle online ranking system. See project descriptions.
Scientific Report
For instructions on how to write a scientific paper, see the following PDF, source.
The write-up must be a maximum of 4 pages long (excluding references).
Project Submission
To submit your report, please go to https://cmt3.research.microsoft.com/ETHZCIL2020, register and follow the instructions given there. You can resubmit any number of times until the deadline passes.
- When finally uploading your report, you are also required to upload the Python code that you used for your final Kaggle submission. The code should be well documented and should generate the predictions in the required format as uploaded to Kaggle. For reproducibility, you should also include the additional code used to produce plots, run additional experiments, etc.
- Include the name of your group in the header of the submitted PDF file, e.g. \author{Author1, Author2 & Author3, group: cil_nerds, Department of Computer Science, ETH Zurich, Switzerland}
- Attach the signed plagiarism form at the end of your paper (scan).
Project Grading
The project grade is composed of a competitive (30%) and a non-competitive (70%) part.
Competitive grade (30%): The ranks in the Kaggle competition system will be converted on a linear scale to a grade between 4 and 6.
Non-competitive grade (70%): The following criteria are graded based on an evaluation by the teaching assistants: quality of paper (30%), creativity of solution (20%), quality of implementation (20%). Each project is graded by two independent reviewers. The grades of each reviewer are de-biased so that the average grade across all projects graded by a reviewer is comparable between reviewers.
Computational infrastructure
Use ETH's new Leonhard cluster
Report Grading Guidelines
Your paper will be graded by two independent reviewers according to the following three criteria:
1) Quality of paper (30%)
6.0: Good enough for submission to an international conference.
5.5: Background, method, and experiment are clear. May have minor issues in one or two sections. Language is good. Scores and baselines are well documented.
5.0: Explanation of work is clear, and the reader is able to identify the novelty of the work. Minor issues in one or two sections. Minor problems with language. Has all the recommended sections in the howto-paper
4.5: Able to identify contribution. Major problems in the presentation of results, ideas, and/or reproducibility/baselines.
4.0: Hard to identify contribution, but still there. One or two good sections should get students a pass.
3.5: Unable to see novelty. No comparison with any baselines.
2) Creativity of solution (20%)
6.0: Elegant proposal, either making a useful assumption, studying a particular class of data, or using a novel mathematical fact.
5.5: A non-obvious combination of ideas presented in the course or published in a paper (Depending on the difficulty of that idea).
5.0: A novel idea or combination not explicitly presented in the course.
4.5: An idea mentioned in a published paper with small extensions / changes, but not so trivial to implement.
4.0: A trivial idea taken from a published paper.
3) Quality of implementation (20%)
6.0: Idea is executed well. The experiments done make sense in order to answer the proposed research questions. There are no obvious experiments not done that could greatly increase clarity. The submitted code and other supplementary material is understandable, commented, complete, clean and there is a README file that explains it and describes how to reproduce your results.
Subtractions from this grade will be made if:
- the submitted code is unclear, does not run, experiments cannot be reproduced, or there is no description of it;
- the experiments done are useless for gaining understanding or of unclear nature, or obviously useful experiments have been left undone;
- comparisons to baselines are not done.
Project Option 1: Collaborative Filtering
A recommender system is concerned with presenting items (e.g. books on Amazon, movies at Movielens or music at lastFM) that are likely to interest the user. In collaborative filtering, we base our recommendations on the (known) preference of the user towards other items, and also take into account the preferences of other users.
Resources
All the necessary resources (including training data) are available at https://inclass.kaggle.com/c/cil-collab-filtering-2020
Training Data
For this problem, we have acquired ratings of 10000 users for 1000 different items. All ratings are integer values between 1 and 5 stars.
Evaluation Metrics
Your collaborative filtering algorithm is evaluated according to the following weighted criteria:
- prediction error, measured by root-mean-squared error (RMSE)
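The RMSE over the observed (user, item) ratings can be sketched as follows; the function name and interface are illustrative, and the actual Kaggle scorer may differ in details such as clipping predictions to the 1-5 range.

```python
import numpy as np

# Root-mean-squared error between predicted and true ratings,
# evaluated only on the observed entries of the rating matrix.
def rmse(predictions, targets):
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.sqrt(np.mean((predictions - targets) ** 2))

# Example: three observed ratings and their predictions.
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))  # ≈ 0.6455
```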
Project Option 2: Text Sentiment Classification
The use of microblogging and text messaging as media of communication has greatly increased over the past 10 years. Such large volumes of data amplify the need for automatic methods to understand the opinions conveyed in text.
Resources
All the necessary resources (including training data) are available at https://inclass.kaggle.com/c/cil-text-classification-2020
Training Data
For this problem, we have acquired 2.5M tweets classified as either positive or negative.
Evaluation Metrics
Your approach is evaluated according to the following criteria:
- Classification Accuracy
Project Option 3: Road Segmentation
Segmenting an image consists of partitioning it into multiple segments (formally, one has to assign a class label to each pixel). A simple baseline is to partition an image into a set of patches and classify every patch according to some simple features (e.g. average intensity). Although this can produce reasonable results for simple images, natural images typically require more complex procedures that reason about the entire image or very large windows.
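The patch baseline described above can be sketched as follows; the 16x16 patch size and the 0.5 intensity threshold are illustrative assumptions, not the project's prescribed values.

```python
import numpy as np

# Cut the image into non-overlapping patches and label each patch
# road (1) or background (0) by thresholding its mean intensity.
def patch_baseline(img, patch=16, threshold=0.5):
    h, w = img.shape
    labels = np.zeros((h // patch, w // patch), dtype=int)
    for i in range(h // patch):
        for j in range(w // patch):
            block = img[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            labels[i, j] = int(block.mean() > threshold)
    return labels

# Toy 64x64 image with a bright vertical "road" in columns 16..47.
img = np.zeros((64, 64))
img[:, 16:48] = 1.0
print(patch_baseline(img))
```

A real solution would replace the mean-intensity feature with a learned classifier, but the patch decomposition and per-patch labeling stay the same.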
Resources
All the necessary resources (including training data) are available at https://inclass.kaggle.com/c/cil-road-segmentation-2020
Training Data
For this problem, we provide 100 aerial images acquired from Google Maps. We also provide ground-truth labels, where each pixel is assigned a probability in [0,1] of being road (road=1, background=0). Your goal is to train a classifier to segment roads in these images, i.e. assign such a probabilistic label to each pixel.
Evaluation Metrics
Your approach is evaluated according to the following criteria:
- prediction accuracy, measured by fraction of correctly predicted patches
Project Option 4: Galaxy Image Generation
In this project, you are given a mix of realistic cosmology images, corrupted cosmology images, and images which show other concepts like landscapes.
Most of the images have been scored according to their similarity to the concept of a prototypical 'cosmology image' in our data set. A similarity score like 2.61 means that the image almost coincides with the prototypical cosmology image; a low similarity score like 0.00244 means that the image is a poor representative of a cosmology image -- probably because it has a different subject, like a landscape, or is corrupted. You can assume that similarity scores lie in the interval [0.0, 8.0].
Beyond the scored images, you are given a smaller set of labeled images, which you can assume are drawn from the same distribution as the scored images. For these images you are not given a similarity score; instead you get labels: 1.0 means that the image is a real cosmology image, whereas 0.0 means it has been corrupted or shows another subject.
Task description
You are required to use the combination of scored/labeled images to build a generative model of the concept of 'realistic cosmology image', and then use this model to solve the following two tasks:
a) Generate a set of realistic cosmology images, i.e. they have a high similarity to the concept of 'cosmology image' according to our data-set. You are encouraged to submit a set of diverse images, i.e. you should not submit images that are perturbed versions of each other or perturbed versions of the scored images. This part of the competition is not judged via Kaggle but uses a custom submission at the end of the project.
b) For a set of query images, assign a similarity score to each of them. This part of the competition is judged via Kaggle; you can submit solution CSV files and track your public leaderboard scores.
Project Resources
All the necessary resources (including training data) are available at https://inclass.kaggle.com/c/cil-cosmology-2020
Evaluation Metrics
Your approach is evaluated according to the following criteria:
- Similarity score prediction (MAE)
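The MAE between predicted and true similarity scores can be sketched as follows; the function name and interface are illustrative, and the actual Kaggle scorer may differ in details.

```python
import numpy as np

# Mean absolute error between predicted and true similarity scores,
# both assumed to lie in the interval [0.0, 8.0].
def mae(predicted, actual):
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(np.abs(predicted - actual))

print(mae([2.61, 0.1], [2.0, 0.00244]))
```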
Reading Material
Here is a list of additional material if you want to read up on a specific topic of the course.
Linear Algebra
3Blue1Brown Essence of Linear Algebra (15 short youtube videos)
Introduction to Linear Algebra by Gilbert Strang (2016). See also his 35 Lectures on Linear Algebra (youtube)
Matrices
The Matrix Cookbook by Petersen & Pedersen, (2012). Contains useful formulas for derivatives w.r.t. vectors etc.
Machine Learning
Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong.
Pattern Recognition and Machine Learning by Christopher M. Bishop, Springer (2006).
Deep Learning
Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
People
Professor | Thomas Hofmann |
Head TA | Kevin Roth |
Head TA | Viktor Gal |
TA | Yannic Kilcher |
TA | Jonas Kohler |
TA | Leonard Adolphs |
TA | Antonio Orvieto |
TA | Dario Pavllo |
TA | Gregor Bachmann |
TA | Calin Cruceru |
TA | Giambattista Parascandolo |