Introduction to Natural Language Processing 2016

Course topics

  • Sequential models (Hidden Markov Models, Conditional Random Fields)
  • Part-of-Speech tagging
  • Shallow syntactic parsing
  • Deep syntactic parsing
  • Knowledge representation (taxonomies, componential semantics, frame semantics, supersenses)
  • Text classification (supervised, weakly supervised, semi-supervised, and unsupervised)
  • Topic modeling
  • Word-sense disambiguation
  • Sentiment analysis
  • Machine translation
  • Grammatical inference
  • Word embeddings

Course tools

Python libraries (we will not cover all of these in detail, but throughout the course we will borrow methods and classes from all of them):

  • NLTK
  • gensim (word2vec, topic modeling)
  • CLiPS pattern
  • sklearn
  • scipy
  • spacy
  • numpy


Raw text corpora (provided by lecturer)

  • Full list to be confirmed soon


Annotated corpora (provided by lecturer)

  • CrowdFlower’s public sentiment analysis dataset
  • Word-sense-annotated corpus
  • PoS-tagged corpus


Course prerequisites

  • Good level of English.
  • Familiarity with mathematical notation and scientific formalization.
  • Familiarity with basic probability theory and Bayesian statistics.
  • Familiarity with basic concepts of information retrieval (precision and recall).
  • Strong problem formalization skills, particularly probabilistic factorization (for instance, as applied to a company’s expected sales volume: “if each unit sells for x euro and y units are sold in a given period; if some ratio r1 of all units sold is returned and some ratio r2 is damaged; if the items actually sold result in an average ratio r3 euro of additional sales per quarter; and if the bad reviews from damaged items result in an average ratio r4 of lost sales per quarter, what is the total expected profit for a quarter in which 1,000 units were sold?”).
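
The sales example in the last item can be factorized in more than one way, since the statement leaves several details open. The Python sketch below shows one possible reading and is not a reference solution. It assumes that returned and damaged units are disjoint groups and that both are fully refunded, that r3 is the additional sales generated per euro of retained revenue, and that r4 is the fraction of each damaged unit's price lost in future sales. Because no costs are given, it computes expected net revenue rather than profit. The function name and all numeric values are hypothetical.

```python
def expected_quarterly_revenue(x, y, r1, r2, r3, r4):
    """Expected net revenue (euro) for one quarter, under the assumptions stated above."""
    returned = y * r1                # units returned (assumed refunded in full)
    damaged = y * r2                 # units damaged (assumed disjoint from returns, also refunded)
    kept = y - returned - damaged    # units actually sold and kept
    base = kept * x                  # revenue from the kept units
    follow_on = base * r3            # assumed: r3 euro of extra sales per euro of kept revenue
    lost = damaged * x * r4          # assumed: each damaged unit costs r4 of its price in lost sales
    return base + follow_on - lost


# Illustrative values only; none of them come from the course description.
print(expected_quarterly_revenue(x=10.0, y=1000, r1=0.05, r2=0.02, r3=0.10, r4=0.50))
```

Writing each factor as a separate named term makes the assumptions behind the total easy to inspect and to change, which is the point of the exercise.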


Lecturer

Mr. Jordi Carrera Ventura
Computational linguist from Barcelona with many years of experience working on industrial NLP applications.

Affiliation: Quarizmi AdTech / AAA Group / Sumplify, Catalonia, Spain

I started my career working on automatic knowledge extraction and semantic annotation from unstructured text as a means to improve syntactic parsing. I then joined a US-based machine translation company for one year, where I helped localize their technology into Spanish.

After that year, I spent two more years working on sentiment analysis and document classification, focusing on semantically driven pattern matching and clustering algorithms for short-text classification. Since then, I have worked for a large e-commerce localization company and several small start-ups, developing a variety of applications, from automatic spellchecking to general-purpose linguistic APIs and ad analytics.

Most of my work focuses on statistical semantic models and knowledge extraction, and my passion is to turn unstructured data into structured data.

Fields of interest: Natural Language Understanding, Semantic Vector Space Models, Statistical Modelling, Syntactic Parsing, Chunking, Grammatical Inference, Clustering, Semantic Labeling, Taxonomies, Text Classification