Mining Massive Databases
Course topics
Part 1. Introduction to Big Data
- General overview of the course
- Introduction to Hadoop and Spark
- Pig and PySpark comparison
- Public (big) datasets and resources
Part 2. Processing Structured and Semi-structured Large Data
- Preprocessing user-generated content and other semi-structured data
- Introduction to PySpark
- Scikit-learn at large scale
- Basics of machine learning with PySpark
Part 3. Graphs
- Introduction to graphs
- Temporal graph representation
- Centrality measures for graphs (PageRank, HITS, etc.)
- Large graphs with NoSQL databases
- Communities, clusters, and cliques in graphs
Part 4. Working with Data Streams
- Data streams: how do they differ from static data?
- Online vs. offline algorithms
- Applied ML in (large) streams