Overview
In this course we study technologies and algorithms for working with massive datasets. The course starts with an introduction to Big Data and then focuses on distributed technologies for processing semi-structured data. After that comes a brief introduction to graphs, and the course finishes with algorithms and techniques for mining data streams.
Course topics
Part 1. Introduction to Big Data
- General overview of the course
- Introduction to Hadoop and Spark
- Pig and PySpark comparison
- Public (big) datasets and resources
Part 2. Processing Large Structured and Semi-structured Data
- Preprocessing user-generated content and other semi-structured data
- Introduction to PySpark
- Scikit-learn at large scale
- Basics of machine learning with PySpark
Part 3. Graphs
- Introduction to graphs
- Temporal graphs representation
- Centrality measures for graphs (PageRank, HITS, etc.)
- Large graphs with NoSQL databases. Communities / Clusters / Cliques in Graphs
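
As a small preview of the centrality measures listed above, the following is a minimal sketch of PageRank computed by power iteration on a tiny in-memory graph. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy edge list are illustrative assumptions, not course material; real large-graph workloads would use a distributed or database-backed implementation.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration (illustrative sketch)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        # Each node shares its rank equally among the nodes it links to.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: a -> b -> c -> a, plus d -> c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items()):
    print(node, round(score, 3))
```

Node `d` has no incoming links, so it only receives the baseline share of rank, while `c` collects rank from both `b` and `d`.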
Part 4. Working with Data Streams
- Data streams: how are they different from static data?
- Online vs Offline Algorithms
- Applied ML in (large) streams
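
To make the online vs. offline distinction above concrete, here is a minimal sketch that computes a mean and variance in both styles. The offline version needs the full dataset in memory; the online version (Welford's update) sees each value once and keeps only three numbers. The class name `RunningStats` and the sample values are assumptions chosen for illustration.

```python
import statistics


class RunningStats:
    """Online mean and population variance via Welford's update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.count if self.count else 0.0


# Hypothetical sample standing in for a stream of measurements.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Offline: the whole dataset is available at once.
offline_mean = statistics.fmean(data)
offline_var = statistics.pvariance(data)

# Online: values arrive one at a time and are never stored.
stats = RunningStats()
for value in data:
    stats.update(value)

print(offline_mean, offline_var)
print(stats.mean, stats.variance)
```

Both approaches agree on this sample, but only the online version keeps constant memory as the stream grows.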