From Natural Language to Data Science

Course codes: COMP 345 & LING 345 (Winter 2025)
Instructors:
- Siva Reddy[office hours: Thursdays 1:00pm–2:00pm ENGMC 104N (immediately after the lecture)]
- Morgan Sonderegger [office hours: TBA, starting late February]
Teaching Assistants: Parishad BehnamGhader, Mehar Bhatia, Gaurav Kamath, Arkil Patel, Gandharv Patil
Classroom: Macdonald-Harrington Building G-10 (MDHAR G-10)
Time: Tuesdays and Thursdays, 11:35 am – 12:55 pm
Links: MyCourses: announcements, slides

Description

Course Format: The course format is in person unless notified.

The last decades have seen phenomenal increases in our scientific and engineering understanding of language from a computational perspective. A large part of this success has been the rapid and unprecedented expansion of different kinds of language data as well as new computational tools for dealing with this data. This course provides an introduction to the data science of language. The emphasis will be on learning basic tools for working with language data for both engineering and scientific applications.

Goals

The goal of this course is to learn how to think about language data (predominantly raw text) and to work with it computationally. Along the way, we will learn a number of mathematical and computational tools for data collection, representation, processing, modeling, and analysis. The main emphasis of the course will be on transferring theory into practice. By the end of the course, you will know how to visualize large amounts of text, study the nature of words, build thesauruses, search for information, and predict outcomes from social media text while also studying the ethical implications. An important side goals is to have fun with text!

Students will

Learn to visualize text.
Learn how to analyze language data using simple statistical and machine learning methods.
Learn how to build simple models of language domains.
Learn how to predict outcomes based on text.
Learn the biases of current NLP technology and their implications.
Learn how to process different kinds of language data using Python.
Learn how to query linguistic databases.

Prerequisites

Required: Programming background at the level of COMP 202 or equivalent.

Recommended: Programming background at the level of COMP 250 or higher. Mathematics background at the level of MATH 240. Basic calculus and linear algebra will be helpful but not critical.

Grading

There are six problem sets and no exam.

Warm-Up Problem Sets 1 and 2 (20%): These will be a warm-up to get used to implementing Python and submitting (10% each).
Problem Sets (80%): 4 problem sets equally weighted (20% each).

Because they are the primary method of assessment, successful completion of each of problem sets 3-6 is required to pass the course. In other words, if you do not submit or receive an F on one of these problem sets, you may fail the course.

The details above are subject to change. If there are changes to the means of assessment after the add/drop date, you will have the option to decide whether the original or modified means of assessment will determine your final grade.

Topics (Tentative)

Language data and applications

Data (Web documents, Reviews, Social Networks)
Applications (Information retrieval/extraction, Sentiment analysis, Recommendation systems)

Searching through data

Regular expressions and corpus query language
Symbolic and distributional representations
Tree regular expressions and semantic regular expressions
Hashing

How to make sense of data

Mutual information, collocation
Keywords
Vector space models
Distributionally similar words
Topic models

Language Modeling

Spell checkers
Detecting fake content
Language generation

Language to decisions

Feature-based models (logistic regression)
Black-box models (neural)
Sentiment analysis
Robustness of models (build it and break it)

Information Retrieval

Indexing
Pagerank
Reading comprehension systems

Information Extraction

Knowledge representation
Knowledge bases
Question answering

Social Networks (Twitter and Facebook data)

Representations
Search through structured data
Applications

Ethics

Biases in data
Biases in machine learning models
Applications and ethical questions

Data analysis: supervised

Classification
Regression
Ensenble methods

Data analysis: unsupervised

Probabilistic clustering
Hierarchical clustering
Dimensionality reduction

Speech and information

Information theory
Speech technology: overview
Speech data science

Schedule

Lecture	Date	Topic	Instructor	Notes
1	Jan 7	Introduction	Reddy, Sonderegger
2	Jan 9	Regular Expressions	Reddy	Assignment 1 (Reddy) – Python basics – 10% Exercise, Advanced
3	Jan 14	Keywords, Association metrics	Reddy	Add or drop deadline
4	Jan 16	Vector space model	Reddy
5	Jan 18	Vector space model, LSA	Reddy
6	Jan 21	LSA / Word embeddings	Reddy
7	Jan 23	Compositionality / Sentence Representations	Reddy	Assignment 1 due Assignment 2 (Reddy) – Corpus Query Language – 10%
8	Jan 28	Document Representation / Topic Models / Contextuality	Reddy
9	Feb 30	Contextuality	Reddy
10	Feb 4	Language Modeling	Reddy
11	Feb 6	Language Modeling	Reddy	Assignment 2 due Assignment 3 (Reddy) – Vector space model, topic models - 20%
12	Feb 11	Dialogue Systems/Semantic Parsing	Reddy
13	Feb 13	Ethics and bias	Reddy
14	Feb 18	Neural Networks	Reddy
15	Feb 20	Neural Networks	Reddy	Assignment 3 due Assignment 4 (Reddy) – language modeling, semantic parsing, information retrieval, and bias – 20%
16	Feb 25	Classification and Regression Models	Sonderegger
17	Feb 27	Classification and Regression Models	Sonderegger
18	Mar 4	Reading week		Reading week
19	Mar 6	Reading week		Reading week
20	Mar 11	Classification and Regression Models	Sonderegger
21	Mar 13	Classification and Regression Models	Sonderegger	Assignment 4 due Assignment 5 (Sonderegger) – Data analysis and regression – 20%
22	Mar 18	Classification and Regression Models	Sonderegger
23	Mar 20	Unsupervised learning	Sonderegger
24	Mar 25	Unsupervised learning	Sonderegger
25	Mar 27	Unsupervised learning	Sonderegger	Assignment 5 due Assignment 6 (Sonderegger) – clustering, language phylogeny – 20%
26	Apr 1	Information Theory	Sonderegger
27	Apr 3	Neural Nets	Sonderegger (guest lecture by Gaurav Kamath)
28	Apr 8	Speech Technology	Sonderegger
29	Apr 10	Data Science for Speech	Sonderegger	Assignment 6 due

Generative AI Policy

If you use any Generative AI tool for your submitted work (e.g., ChatGPT, Github Copilot, Claude), you must cite it and submit a detailed statement describing its use, as well as a log of the chat where you used it.

You may not use AI tools for:

Copying AI-generated prose or non-trivial code chunks – this is plagiarism
Writing whole assignments or code files
Writing large chunks of an assignment or code
Using AI without citing it in your assignment

Inappropriate use of Generative AI may result in penalties on grades or referral to disciplinary authorities. If you have any question about appropriate use of AI applications for course work, please contact the instructors in their office hours.

Language of Submission

In accord with McGill University’s Charter of Students’ Rights, students in this course have the right to submit in English or in French any written work that is to be graded.

Academic Integrity

McGill University values academic integrity. Therefore, all students must understand the meaning and consequences of cheating, plagiarism and other academic offences under the Code of Student Conduct and Disciplinary Procedures (see www.mcgill.ca/students/srr/honest/ for more information)

Inclusivity

As the instructors of this course we endeavor to provide an inclusive learning environment. However, if you experience barriers to learning in this course, do not hesitate to discuss them with one of us, or the Student Accessibility and Achievement office. Note that accommodation requests for a given problem set must be made before the problem set is due.

Extraordinary Circumstances

In the event of extraordinary circumstances beyond the control of McGill University, assessment tasks in a course are subject to change, provided students are sent adequate and timely communications regarding the change.