Mining Massive Databases

Overview

In this course we study technologies and algorithms for working with massive datasets. The course starts with an introduction to Big Data, then focuses on distributed technologies for processing semi-structured data. After that comes a brief introduction to graphs, and the course ends with algorithms and techniques for mining data streams.

Course topics

Part 1. Introduction to Big Data

  • General overview of the course
  • Introduction to Hadoop and Spark
  • Comparison of Pig and PySpark
  • Public (big) datasets and resources

Part 2. Processing Large Structured and Semi-structured Data

  • Preprocessing User Generated Content and other semi-structured data
  • Introduction to PySpark
  • Scikit-learn at large scale
  • Basics of machine learning with PySpark

Part 3. Graphs

  • Introduction to graphs
  • Temporal graph representation
  • Centrality measures for graphs (PageRank, HITS, etc.)
  • Large graphs with NoSQL databases. Communities / Clusters / Cliques in graphs

Part 4. Working with Data Streams

  • Data streams: how do they differ from static data?
  • Online vs Offline Algorithms
  • Applied ML in (large) streams

Prerequisites