Course syllabus

Course-PM

DAT346 / DIT346 Techniques for large-scale data lp4 VT21 (7.5 hp)

The course is offered by the Department of Computer Science and Engineering.

Contact details

Course purpose

The aim of this course is to deepen the students' knowledge and skills and to familiarize them with the technical and technological side of data science, including relevant data models and software and hardware environments.

In particular, the course will include
  • an overview of computer architectures, algorithmic approaches, and high-performance computing infrastructures with a focus on limitations for processing large-scale data,
  • an introduction to relevant frameworks for cluster computing with large-scale data,
  • implementation of data analysis tools on a cluster using Python and appropriate software frameworks,
  • index structures, query processing and optimisation; concurrency, recovery,
  • an overview of non-relational database technologies,
  • semantic web and related technologies,
  • an overview of ethical questions regarding large-scale data.

Schedule

TimeEdit

Course literature

The following are recommended for consultation:

  • "The Data Science Design Manual", Steven S. Skiena, Springer, 2017, ISBN: 9783319554433, 9783319554440 (Chalmers Library eBook)
  • "Parallel programming for multicore and cluster systems", Second edition, Thomas Rauber and Gudula Rünger, Springer-Verlag, 2013, ISBN: 3642438067, 9783642438066, 3642378005, 9783642378003 (Chalmers Library eBook)
  • "Introduction to HPC with MPI for Data Science", Frank Nielsen, Springer, 2016, ISBN: 9783319219028, 9783319219035 (Chalmers Library eBook)
  • "Database systems: the complete book", 2. ed., new internat. ed., Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer Widom, Pearson Education Ltd, 2014, ISBN: 129202447X, 9781292024479 (Chalmers Library eBook)
  • "NoSQL distilled: a brief guide to the emerging world of polyglot persistence", Pramod J. Sadalage and Martin Fowler, Addison-Wesley, 2013, ISBN: 9780321826626, 0321826620 (Chalmers Library book)
  • "Graph Databases: New Opportunities for Connected Data", 2nd edition, Ian Robinson, Jim Webber and Emil Eifrem, O'Reilly Media, 2015 (Free eBook via Neo4j website)

There are links to the Chalmers and University of Gothenburg libraries' catalogue entries for these books on the separate Course Literature page.

Links to additional material will be made available through Canvas during the course.

Course design

The preliminary schedule for the course is as follows:

Preliminary schedule
Date            Activity    Topic
Mon 2021-03-22  Lecture     Course organisation, overview, policies. The Scale of Data. Multiprocessing.
Wed 2021-03-24  Lecture     Relevant Aspects of Computer and System Architecture, Effects and Consequences of the Memory Hierarchy. Multiprocessing.
Wed 2021-03-24  Lab         Introduction to working on Bayes. Supervised work on Assignment 1.
Mon 2021-03-29  No lecture
Wed 2021-03-31  Lecture     Explicit Multiprocessing on shared and distributed memory computer systems. Cleaning Data.
Wed 2021-03-31  Lab         Supervised work on Assignments 1 and 2.
Easter
Mon 2021-04-12  No lecture
Wed 2021-04-14  Lecture     The Map-Reduce paradigm: implicit parallelism.
Wed 2021-04-14  Lab         Supervised work on Assignments 2 and 3.
Mon 2021-04-19  Lecture     Cluster Computing Frameworks: Map-Reduce, Apache Spark.
Wed 2021-04-21  Lecture     Cluster Computing Frameworks cont.: Apache Spark. Algorithmic approaches to large-scale data. Bloom filters.
Wed 2021-04-21  Lab         Supervised work on Assignment 3.
Mon 2021-04-26  Lecture     Approximate computation (Sketching), Spatial indexes, ML methods implemented using indexes.
Wed 2021-04-28  Lecture     Database system architecture; relational database management systems.
Wed 2021-04-28  Lab         Supervised work on Assignment 3.
Mon 2021-05-03  Lecture     Query processing and optimisation; database design.
Wed 2021-05-05  Lecture     Semantic Web; semantic data modelling; RDF; RDF Schema; SPARQL.
Wed 2021-05-05  Lab         Supervised work on Assignment 3. Supervised work: getting started with SPARQL. In this lab you can experiment with DBpedia, Wikidata and the database management system GraphDB.
Mon 2021-05-10  Lecture     Ontologies.
Wed 2021-05-12  Lecture     NoSQL systems; aggregation-orientation; CAP theorem.
Wed 2021-05-12  Lab         Getting started with Protégé. Supervised work on Assignment 4.
Mon 2021-05-17  Lecture     Graph databases; labelled property graph model; querying graph databases.
Wed 2021-05-19  Lecture     "Ten simple rules"; discussing ethical issues.
Wed 2021-05-19  Lab         Supervised work on Assignment 5.
Mon 2021-05-24  No class
Wed 2021-05-26  Lecture     Revision.
Wed 2021-05-26  Lab

All classes (lectures and labs) will be held online using Zoom. Zoom provides facilities for sharing screens, chat, interactive polls and breakout discussions. To participate fully in online classes, we recommend using a device with a microphone (and camera, if available).

Lab classes will also be held online. Teachers and/or teaching assistants will be available online to provide help; they will not be present in the lab rooms. Chalmers’ campuses, however, are not closed. Lecture halls and classrooms are locked, but group rooms and computer labs are open to students with pass cards, until further notice. Lab room bookings (made before the move to online teaching was confirmed) are listed for lab sessions in TimeEdit and you may use these rooms (following any instructions on the Chalmers web site) if this is your best option. However, we recommend that you work from home where possible.

Zoom links for online classes will be given on the course Home page.

Lecture slides, supplementary material and assignment task descriptions will be made available through the Canvas system. Canvas will also be used for assignment submissions.

Changes made since the last occasion

No major changes have been made since the last occasion.

Learning objectives and syllabus

Learning objectives:

On successful completion of the course the student will be able to:

Knowledge and understanding

  • discuss important technological aspects when designing and implementing analysis solutions for large-scale data,
  • describe index structures and discuss their utility,
  • describe data models and software standards for sharing data on the web.

Skills and abilities

  • implement applications for transforming and analyzing large-scale data with appropriate software frameworks,
  • provide access to and utilize structured data over the web with appropriate data models and software tools.

Judgement and approach

  • suggest appropriate computational infrastructures for analysis tasks and discuss their advantages and drawbacks,
  • discuss mechanisms for concurrency and recovery in database systems,
  • discuss the efficiency of query plans,
  • discuss large-scale data processing from an ethical point of view.

Link to the syllabus: Chalmers

Link to the syllabus: GU

Examination form

The course is examined by an individual written exam carried out in an examination hall, as well as mandatory written assignments, some of which will be carried out individually and some of which will be carried out in groups of up to 4 students. There will be non-obligatory individual assignments which grant bonus points for the written exam. These bonus points are valid for the whole academic year.
