MVE441 / MSA220 Statistical learning for big data Spring 21

Planned structure

Please check out the plan for the course that details all important dates.

Student representatives

The following students have been appointed (master's program in parentheses):

Leo Benson (MPENM)
Hanna Skytt (MPENM)
Arunachalam Narasimhan (MPCAS)
Oskar Thune (MPCAS)
Xinrong Zhao (MPDSC)

In short, the student representatives and I will meet once during the course and once after it has ended to discuss how everything is working or has worked. If you have opinions about the course, you are always welcome to contact me directly. Alternatively, you can contact one of the student representatives, who will collect the feedback and pass it on to me.

You can read more about what a student representative is at the following link. The representatives can be reached through the messaging function in Canvas. Simply go to your inbox, start a new message, choose the course, and type the name of the student you want to reach.

Course PM

This page contains a description of the program of the course. Other information, such as learning outcomes, teachers, recommended course literature, project work and examination, is in a separate course PM.


Please take the time to read the course PM, as it contains valuable information on the course setup.


The schedule of the course is in TimeEdit.

Contents (preliminary, subject to detail changes)

  • Model-based Classification
    • Logistic, probit and softmax regression
    • Nearest centroids/Naive Bayes
    • Linear, quadratic and diagonal discriminant analysis
  • Model Assessment for Predictive Learning / Model Selection through Cross-Validation
  • Tree-based methods
    • Classification and Regression Trees (CART)
    • Bagging and the bootstrap
    • Random Forests & Variable Importance
  • Data representations:
    • Singular Value Decomposition
    • Principal Component Analysis
    • Regularized Discriminant Analysis
    • Factor analysis
    • Non-negative Matrix Factorization
    • Intro to kernels and the kernel trick
    • Kernel PCA
    • Other applications of the kernel trick: Kernel ridge regression
    • Multi-dimensional scaling, Isomap, t-SNE
  • Clustering
    • Combinatorial Clustering
    • k-means
    • k-medoids/partitioning around medoids
    • Selection of Cluster Count
    • Hierarchical Clustering
    • Gaussian Mixture Models
    • Expectation Maximization and Clustering
    • Mixture Discriminant Analysis
    • Density-based clustering / DBSCAN
  • Penalized regression/classification methods
    • Regularization and Variable selection (Ridge Regression, Lasso)
    • Nearest Shrunken Centroids
    • Elastic Net
    • Group Lasso
    • Oracle estimators
    • SCAD
    • Graphical Lasso
    • Sparse logistic regression
  • High-dimensional clustering:
    • Subspace clustering/co-clustering
    • Spectral clustering
  • Large sample methods
    • Randomized Projection
    • Randomized SVD
    • Divide and Conquer
    • Random Forests for big-n
    • m-out-of-n bootstrap
    • Bag of little bootstraps
    • Leveraging

Course requirements

The official course-specific prerequisites, as stated in the course plan, are:

The prerequisites for the course are a basic course in statistical inference and MVE190/MSG500 Linear Statistical Models. Students can also contact the course instructor for permission to take the course.

This means you should be familiar with the following:

  • Basic vector calculus and linear algebra (Matrices, vectors, gradients, ...)
  • Basic statistics (probability density and mass functions, cumulative distribution functions, expected value as an integral, (co-)variance, correlation, …)
  • Common distributions (Normal, Student-t, Gamma, Chi-Square, ...)
  • In terms of multivariate distributions, at least the multivariate Normal distribution
  • Parameter estimation in the framework of maximum likelihood
  • Knowledge about least squares methods and their statistical properties
  • Linear regression and how to interpret its results
  • Programming skills (Knowledge of basic control flow and ideally some basic knowledge of statistical programming, e.g. how to generate random numbers, how to perform simple simulations, ...; R or Python are recommended for this course)
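
As a rough illustration of the level of statistical programming meant in the last item, the sketch below generates random numbers and runs a simple simulation in Python. It is only an example under assumed settings and is not taken from the course material: the function name simulate_sample_means and all parameter values (sample size, mean, standard deviation, seed) are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_sample_means(n_reps=2000, n=30, mu=1.0, sigma=2.0, seed=0):
    """Draw n_reps samples of size n from N(mu, sigma^2) and return their means.

    All default values are arbitrary illustration choices.
    """
    rng = random.Random(seed)  # seeded generator so the run is reproducible
    means = []
    for _ in range(n_reps):
        # One simulated data set: n independent normal draws.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


means = simulate_sample_means()

# Compare the simulated spread of the sample mean with the
# theoretical standard error sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.stdev(means), 3))
print("theoretical sd:      ", round(2.0 / 30 ** 0.5, 3))
```

If writing and interpreting a loop like this feels comfortable, the programming side of the prerequisites is probably covered. The same simulation can be written in R just as easily.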
