From Natural Language to Data Science

  • Course codes: COMP 345 & LING 345 (Winter 2024)
  • Instructors: Siva Reddy[office hours: Tuesdays 3pm ENGMC 104N], Morgan Sonderegger
  • Teaching Assistants: Vaibhav Adlakha, Gaurav Kamath, Aayush Kapur, Arkil Patel
  • Classroom: McConnell Engineering Building 13
  • Time: Tuesdays and Thursdays, 10:05 am – 11:25 am
  • Links: MyCourses: announcements, slides

Description

Course Format: The course format is in person unless notified.

The last decades have seen phenomenal increases in our scientific and engineering understanding of language from a computational perspective. A large part of this success has been the rapid and unprecedented expansion of different kinds of language data as well as new computational tools for dealing with this data. This course provides an introduction to the data science of language. The emphasis will be on learning basic tools for working with language data for both engineering and scientific applications.

Goals

The goal of this course is to learn how to think about language data (predominantly raw text) and to work with it computationally. Along the way, we will learn a number of mathematical and computational tools for data collection, representation, processing, modeling, and analysis. The main emphasis of the course will be on transferring theory into practice. By the end of the course, you will know how to visualize large amounts of text, study the nature of words, build thesaurus, search for information, predict outcomes from social media text while also studying the ethical implications. And one of the side goals is to have fun with text!!

Students will

  • Learn to visualize text.
  • Learn how to analyze text data using simple statistical methods.
  • Learn how to build simply models of language domains.
  • Learn how to predict outcomes based on text.
  • Learn the biases of current NLP technology and their implications.
  • Learn how to process different kinds of language data using Python.
  • Learn how to query linguistic databases.

Prerequisites

Required: Programming background at the level of COMP 202 or equivalent.

Recommended: Programming background at the level of COMP 250 or higher. Mathematics background at the level of MATH 240. Basic calculus and linear algebra will be helpful but not critical.

Grading

Note that the details below are subject to change. There is no exam.

  • Warm-Up Problem Sets 1 and 2 (20%): These will be a warm-up to get used to implementing Python and submitting (10% each)
  • Problem Sets (80%): 4 problem sets equally weighted (20% each).

Topics (Tentative)

Language data and applications

  1. Data (Web documents, Reviews, Social Networks)
  2. Applications (Information retrieval/extraction, Sentiment analysis, Recommendation systems)

Searching through data

  1. Regular expressions and corpus query language
  2. Symbolic and distributional representations
  3. Tree regular expressions and semantic regular expressions
  4. Hashing

How to make sense of data

  1. Mutual information, collocation
  2. Keywords
  3. Vector space models
  4. Distributionally similar words
  5. Topic models

Language Modeling

  1. Spell checkers
  2. Detecting fake content
  3. Language generation

Language to decisions

  1. Feature-based models (logistic regression)
  2. Black-box models (neural)
  3. Sentiment analysis
  4. Robustness of models (build it and break it)

Information Retrieval

  1. Indexing
  2. Pagerank
  3. Reading comprehension systems

Information Extraction

  1. Knowledge representation
  2. Knowledge bases
  3. Question answering

Social Networks (Twitter and Facebook data)

  1. Representations
  2. Search through structured data
  3. Applications

Ethics

  1. Biases in data
  2. Biases in machine learning models
  3. Applications and ethical questions

Schedule

Lecture Date Topic Instructor Notes
1 Jan 4 Introduction Reddy, Sonderegger  
2 Jan 9 Regular Expressions Reddy Assignment 1 (Reddy) – Python basics – 10% Exercise, Advanced
3 Jan 11 Keywords, Association metrics Reddy  
4 Jan 16 Vector space model Reddy Add or drop deadline
5 Jan 18 Vector space model, LSA Reddy  
6 Jan 23 LSA / Word embeddings Reddy Assignment 1 due
Assignment 2 (Reddy) – Corpus Query Language – 10%
7 Jan 25 Compositionality / Sentence Representations Reddy  
8 Jan 30 Document Representation / Topic Models / Contextuality Reddy  
9 Feb 1 Contextuality Reddy  
10 Feb 6 Classification and Regression Models Sonderegger Assignment 2 due
Assignment 3 (Reddy)
– Vector space model, topic models – 20%
11 Feb 8 Classification and Regression Models Sonderegger  
12 Feb 13 Classification and Regression Models Sonderegger  
13 Feb 15 Classification and Regression Models Sonderegger  
14 Feb 20 Classification and Regression Models Sonderegger  
15 Feb 22 Classification and Regression Models Sonderegger  
16 Feb 27 Unsupervised learning Sonderegger Assignment 3 due
Assignment 4 (Sonderegger) 
– Data analysis and regression – 20%
17 Feb 29 Unsupervised learning Sonderegger  
18 Mar 5 Reading week    
19 Mar 7 Reading week    
20 Mar 12 Unsupervised learning Sonderegger  
21 Mar 14 Unsupervised learning Sonderegger Assignment 4 due
Assignment 5 (Sonderegger)
– clustering, language phylogeny – 20%
22 Mar 19 Language Modeling Reddy  
23 Mar 21 Language Modeling Reddy  
24 Mar 26 Dialogue Systems/Semantic Parsing Reddy  
25 Mar 28 Ethics and bias Reddy  
26 Apr 2 TBD Reddy (At Barbados Workshop) Assignment 5 due
Assignment 6 (Reddy)
– language modeling, semantic parsing, information retrieval, and bias – 20%
27 Apr 4 Data Science for Speech Sonderegger  
28 Apr 9 Neural Networks Reddy Assignment 6 due date is Apr 13

Language of Submission

In accord with McGill University’s Charter of Students’ Rights, students in this course have the right to submit in English or in French any written work that is to be graded.

Academic Integrity

McGill University values academic integrity. Therefore, all students must understand the meaning and consequences of cheating, plagiarism and other academic offences under the Code of Student Conduct and Disciplinary Procedures (see www.mcgill.ca/students/srr/honest/ for more information)

Inclusivity

As the instructor of this course I endeavor to provide an inclusive learning environment. However, if you experience barriers to learning in this course, do not hesitate to discuss them with me or the Office for Students with Disabilities.