Course syllabus
Course-PM
DAT346 / DIT346 Techniques for large-scale data lp4 VT21 (7.5 hp)
The course is offered by the Department of Computer Science and Engineering.
Contact details
- Graham Kemp (Examiner), kemp@chalmers.se
- Shirin Tavara (Teacher), tavara@chalmers.se
- Fazeleh Sadat Hoseini (Teaching Assistant), fazeleh@chalmers.se
- Simon Johansson (Teaching Assistant), simoj@chalmers.se
- Denitsa Saynova (Teaching Assistant), saynova@chalmers.se
Course purpose
The aim of this course is to deepen the students' knowledge and skills and to familiarize them with the technical and technological side of data science, including relevant data models and both software and hardware environments. The course covers:
- an overview of computer architectures, algorithmic approaches, and high-performance computing infrastructures with a focus on limitations for processing large-scale data,
- an introduction to relevant frameworks for cluster computing with large-scale data,
- implementation of data analysis tools on a cluster using Python and appropriate software frameworks,
- index structures, query processing and optimisation; concurrency, recovery,
- an overview of non-relational database technologies,
- semantic web and related technologies,
- an overview of ethical questions regarding large-scale data.
Schedule
Course literature
The following are recommended for consultation:
- "The Data Science Design Manual", Steven S. Skiena, Springer, 2017, ISBN: 9783319554433, 9783319554440 (Chalmers Library eBook)
- "Parallel programming for multicore and cluster systems", Second edition, Thomas Rauber and Gudula Rünger, Springer-Verlag, 2013, ISBN: 3642438067, 9783642438066, 3642378005, 9783642378003 (Chalmers Library eBook)
- "Introduction to HPC with MPI for Data Science", Frank Nielsen, Springer, 2016, ISBN: 9783319219028, 9783319219035 (Chalmers Library eBook)
- "Database systems: the complete book", 2. ed., new internat. ed., Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer Widom, Pearson Education Ltd, 2014, ISBN: 129202447X, 9781292024479 (Chalmers Library eBook)
- "NoSQL distilled: a brief guide to the emerging world of polyglot persistence", Pramod J. Sadalage and Martin Fowler, Addison-Wesley, 2013, ISBN: 9780321826626, 0321826620 (Chalmers Library book)
- "Graph Databases: New Opportunities for Connected Data", 2nd edition, Ian Robinson, Jim Webber and Emil Eifrem, O'Reilly Media, 2015 (Free eBook via Neo4j website)
There are links to the Chalmers and University of Gothenburg libraries' catalogue entries for these books on the separate Course Literature page.
Links to additional material will be made available through Canvas during the course.
Course design
The preliminary schedule for the course is as follows:
Date | Activity | Topic |
---|---|---|
Mon 2021-03-22 | Lecture | Course organisation, overview, policies. The Scale of Data. Multiprocessing |
Wed 2021-03-24 | Lecture | Relevant Aspects of Computer and System Architecture, Effects and Consequences of the Memory Hierarchy. Multiprocessing. |
Wed 2021-03-24 | Lab | Introduction to working on Bayes. Supervised work on Assignment 1. |
Mon 2021-03-29 | No Lecture | |
Wed 2021-03-31 | Lecture | Explicit Multiprocessing on shared and distributed memory computer systems. Cleaning Data. |
Wed 2021-03-31 | Lab | Supervised work on Assignments 1 and 2. |
Easter | | |
Mon 2021-04-12 | No Lecture | |
Wed 2021-04-14 | Lecture | The Map-Reduce paradigm: implicit parallelism. |
Wed 2021-04-14 | Lab | Supervised work on Assignments 2 and 3. |
Mon 2021-04-19 | Lecture | Cluster Computing Frameworks: Map-Reduce, Apache Spark |
Wed 2021-04-21 | Lecture | Cluster Computing Frameworks cont.: Apache Spark. Algorithmic approaches to large-scale data. Bloom filters. |
Wed 2021-04-21 | Lab | Supervised work on Assignment 3. |
Mon 2021-04-26 | Lecture | Approximate computation (Sketching), Spatial indexes, ML methods implemented using indexes. |
Wed 2021-04-28 | Lecture | Database system architecture; relational database management systems |
Wed 2021-04-28 | Lab | Supervised work on Assignment 3. |
Mon 2021-05-03 | Lecture | Query processing and optimisation; database design |
Wed 2021-05-05 | Lecture | Semantic Web; semantic data modelling; RDF; RDF Schema; SPARQL |
Wed 2021-05-05 | Lab | Supervised work on Assignment 3. Supervised work: getting started with SPARQL. In this lab you can experiment with DBpedia, Wikidata and the database management system GraphDB. |
Mon 2021-05-10 | Lecture | Ontologies |
Wed 2021-05-12 | Lecture | NoSQL systems; aggregation-orientation; CAP theorem |
Wed 2021-05-12 | Lab | Getting started with Protégé. Supervised work on Assignment 4. |
Mon 2021-05-17 | Lecture | Graph databases; labelled property graph model; querying graph databases |
Wed 2021-05-19 | Lecture | "Ten simple rules"; discussing ethical issues |
Wed 2021-05-19 | Lab | Supervised work on Assignment 5. |
Mon 2021-05-24 | No class | |
Wed 2021-05-26 | Lecture | Revision |
Wed 2021-05-26 | Lab | |
All classes (lectures and labs) will be held online using Zoom. Zoom provides facilities for sharing screens, chat, interactive polls and breakout discussions. To participate fully in online classes, we recommend using a device with a microphone (and camera, if available).
Lab classes will also be held online. Teachers and/or teaching assistants will be available online to provide help; they will not be present in the lab rooms. Chalmers’ campuses, however, are not closed. Lecture halls and classrooms are locked, but group rooms and computer labs are open to students with pass cards, until further notice. Lab room bookings (made before the move to online teaching was confirmed) are listed for lab sessions in TimeEdit and you may use these rooms (following any instructions on the Chalmers web site) if this is your best option. However, we recommend that you work from home where possible.
Zoom links for online classes will be given on the course Home page.
Lecture slides, supplementary material and assignment task descriptions will be made available through the Canvas system. Canvas will also be used for assignment submissions.
Changes made since the last occasion
No major changes have been made since the last occasion.
Learning objectives and syllabus
Learning objectives:
On successful completion of the course the student will be able to:
Knowledge and understanding
- discuss important technological aspects when designing and implementing analysis solutions for large-scale data,
- describe index structures and discuss their utility,
- describe data models and software standards for sharing data on the web.
Skills and abilities
- implement applications for transforming and analyzing large-scale data with appropriate software frameworks,
- provide access to and utilize structured data over the web with appropriate data models and software tools.
Judgement and approach
- suggest appropriate computational infrastructures for analysis tasks and discuss their advantages and drawbacks,
- discuss mechanisms for concurrency and recovery in database systems,
- discuss the efficiency of query plans,
- discuss large-scale data processing from an ethical point of view.
Link to the syllabus: Chalmers
Link to the syllabus: GU
Examination form
The course is examined by an individual written exam carried out in an examination hall, as well as mandatory written assignments, some of which will be carried out individually and some of which will be carried out in groups of up to 4 students. There will be non-obligatory individual assignments which grant bonus points for the written exam. These bonus points are valid for the whole academic year.