Mining Massive Databases

Course topics

Part 1. Introduction to Big Data

  • General overview of the course
  • Introduction to Hadoop and Spark
  • Pig and PySpark comparison
  • Public (big) datasets and resources

Part 2. Processing Large Structured and Semi-structured Data

  • Preprocessing User Generated Content and other semi-structured data
  • Introduction to PySpark
  • Scikit-learn at large scale
  • Basics of machine learning with PySpark

Part 3. Graphs

  • Introduction to graphs
  • Temporal graph representation
  • Centrality measures for graphs (PageRank, HITS, etc.)
  • Large graphs with NoSQL databases. Communities / Clusters / Cliques in Graphs

Part 4. Working with Data Streams

  • Data streams: how do they differ from static data?
  • Online vs Offline Algorithms
  • Applied ML in (large) streams

Prerequisites