From Natural Language to Data Science

  • Course codes: COMP 345 & LING 345 (Winter 2026)
  • Instructor: Gaurav Kamath [Office Hours: Thursday 12:00-13:00, Rm 304, 1085 Av. Dr. Penfield] [Email: gaurav<dot>kamath<at>mcgill<dot>ca]

    • Note: Please add the [NL2DS] tag to any email subject line, to help your instructor respond to you more quickly.
  • Teaching Assistants:
    • Jay Gala [jay<dot>gala<at>mila<dot>quebec]
    • Fengyuan Liu [fengyuan<dot>liu<at>mila<dot>quebec]
    • Jessica Ojo [jessica<dot>ojo<at>mail<dot>mcgill<dot>ca]
    • Mehar Bhatia [mehar<dot>bhatia<at>mila<dot>quebec]
    • Fahad Amik [fahad<dot>amik<at>mail<dot>mcgill<dot>ca]
  • Classroom: McConnell Engineering Building 204
  • Time: Tuesdays and Thursdays, 10:05 am – 11:25 am

Description

Course Format: The course is held in person unless you are notified otherwise. Lecture recordings will be available on MyCourses.

The last decades have seen phenomenal increases in our scientific and engineering understanding of language from a computational perspective. A large part of this success has been the rapid and unprecedented expansion of different kinds of language data as well as new computational tools for dealing with this data. This course provides an introduction to the data science of language. The emphasis will be on learning basic tools for working with language data for both engineering and scientific applications.

Goals

The goal of this course is to learn how to think about language data (predominantly raw text) and to work with it computationally. Along the way, we will learn a number of mathematical and computational tools for data collection, representation, processing, modeling, and analysis. The main emphasis of the course will be on transferring theory into practice. By the end of the course, you will know how to process large amounts of text, study the nature of words, search for information, examine statistical relationships between linguistic data, and apply these tools to study scientific questions about language. And, hopefully, along the way, you’ll also just have fun :)

Students will

  • Learn how to process text data using Python.
  • Learn the fundamentals of data science, including different statistical models.
  • Learn how to analyze text data using these statistical methods.
  • Learn how to build simple models of language domains.
  • Learn how to predict outcomes based on text.
  • Learn about the biases of current NLP systems and their implications.

Prerequisites

Required: Programming background at the level of COMP 202 or equivalent.

Recommended: Programming background at the level of COMP 250 or higher. Mathematics background at the level of MATH 240. Basic calculus and linear algebra will be helpful but not critical.

Grading

Note that the details below are subject to change.

  • Midterm Exam (25%)
  • Final Exam (25%)
  • Problem Sets (50%): 4 problem sets equally weighted (12.5% each).
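
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function name `course_grade` and the assumption that every score is a percentage between 0 and 100 are illustrative only; they are not part of the official grading scheme, which is subject to change.

```python
# Weights taken from the grading breakdown above (subject to change).
WEIGHTS = {
    "midterm": 0.25,
    "final": 0.25,
    "problem_sets": 0.50,  # split equally across 4 problem sets (12.5% each)
}
NUM_PROBLEM_SETS = 4


def course_grade(midterm, final, problem_sets):
    """Return the weighted course grade as a percentage.

    Hypothetical helper: all scores are assumed to be percentages (0-100),
    and `problem_sets` is a list of exactly 4 scores.
    """
    if len(problem_sets) != NUM_PROBLEM_SETS:
        raise ValueError(f"expected {NUM_PROBLEM_SETS} problem set scores")
    # Equal weighting means the problem-set component is a simple average.
    ps_average = sum(problem_sets) / NUM_PROBLEM_SETS
    return (
        WEIGHTS["midterm"] * midterm
        + WEIGHTS["final"] * final
        + WEIGHTS["problem_sets"] * ps_average
    )


if __name__ == "__main__":
    # Made-up example scores, for illustration only.
    print(course_grade(80, 90, [100, 90, 80, 70]))
```

Because the four problem sets are equally weighted, averaging them and applying the 50% weight is equivalent to weighting each one at 12.5%.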

Why does a course like this have exams? AI tools are increasingly integrated into coding interfaces, and into how we write code more generally. This is mostly a good thing: tools like Copilot or ChatGPT can make us drastically more productive, and even offer personalized help when learning to code. But it’s still important that you understand the fundamentals yourself, so that you can tell whether a coding agent is implementing what you actually want or need, and distinguish between good and bad AI-generated code. Unfortunately, in-person exams are the most straightforward way of testing these fundamentals in a way that can’t be gamed using AI tools.

What will the exams be like? They will be multiple-choice, won’t be hard, and won’t contain any trick questions. The idea is simple: if you show up to class regularly and understand how to implement the code covered in this course, you should find them extremely easy. If you never show up to class and completely outsource all coding to AI tools, they’ll be hard.

Schedule

Lecture | Date | Topic | Notes
1 | Jan 6 | Introduction |
2 | Jan 8 | Tokenization, Regular Expressions & Syntactic Parses |
3 | Jan 13 | Keywords & Association Metrics |
4 | Jan 15 | Classification & Regression Models | Assignment 1 Released
5 | Jan 20 | Classification & Regression Models | Add or drop deadline
6 | Jan 22 | Classification & Regression Models |
7 | Jan 27 | Classification & Regression Models |
8 | Jan 29 | Neural Nets | Assignment 1 Due; Assignment 2 Released
9 | Feb 3 | Vector Representations of Words, Sentences & Documents |
10 | Feb 5 | Vector Representations of Words, Sentences & Documents |
11 | Feb 10 | Vector Representations of Words, Sentences & Documents |
12 | Feb 12 | Language Modelling |
13 | Feb 17 | Language Modelling |
14 | Feb 19 | Language Modelling | Assignment 2 Due; Assignment 3 Released
15 | Feb 24 | Recap Class |
16 | Feb 26 | Midterm Exam |
17 | Mar 3 | Reading week (No Class) |
18 | Mar 5 | Reading week (No Class) |
19 | Mar 10 | Decision Trees & Random Forests |
20 | Mar 12 | Unsupervised Learning Algorithms |
21 | Mar 17 | Unsupervised Learning Algorithms |
22 | Mar 19 | Unsupervised Learning Algorithms |
23 | Mar 24 | Information Theory | Assignment 3 Due; Assignment 4 Released
24 | Mar 26 | Statistical Hypothesis Testing |
25 | Mar 31 | Ethics & Bias |
26 | Apr 2 | Multilingual NLP |
27 | Apr 7 | Digital Humanities & Computational Social Sciences |
28 | Apr 9 | Recap Class (Final Class) |
29 | Apr 14 | No Class (as per McGill Academic Calendar) | Assignment 4 Due


Generative AI Policy

You may use any Generative AI tool (e.g., ChatGPT, GitHub Copilot, Claude), as long as you (i) use it to better engage with course materials, rather than to do all of your assignments for you; and (ii) declare which tools you used, where you used them, and what you used them for.

Feel free to use any AI tools to:

  • Help yourself understand and debug code.
  • Write / auto-complete small portions of code.
  • Dig deeper on any of the topics covered in class.

But you may NOT:

  • Use AI tools to write whole assignments or code files for you.
  • Submit AI-generated prose in your assignments.
  • Use AI tools without citing them in your assignment.

Inappropriate use of Generative AI may result in penalties on grades or referral to disciplinary authorities. (You’ll also just struggle on the exams if you try to completely outsource assignments to AI tools.) If you have any questions about appropriate use of AI applications for course work, please contact the instructor during his office hours.

Language of Submission

In accord with McGill University’s Charter of Students’ Rights, students in this course have the right to submit in English or in French any written work that is to be graded.

Academic Integrity

McGill University values academic integrity. Therefore, all students must understand the meaning and consequences of cheating, plagiarism and other academic offences under the Code of Student Conduct and Disciplinary Procedures (see www.mcgill.ca/students/srr/honest/ for more information).

Inclusivity

As the instructor of this course, I endeavor to provide an inclusive learning environment. However, if you experience barriers to learning in this course, do not hesitate to discuss them with me or the Office for Students with Disabilities.

Extraordinary Circumstances

In the event of extraordinary circumstances beyond the control of McGill University, assessment tasks in a course are subject to change, provided students are sent adequate and timely communications regarding the change.

Territory Acknowledgment

McGill University is on land which has long served as a site of meeting and exchange amongst Indigenous peoples, including the Haudenosaunee and Anishinabeg nations. This history remains relevant to this day, particularly in terms of the current state of Indigenous languages. And since COMP / LING 345 is about computational approaches to language, it’s also specifically relevant to our course. I’m trying to organize a guest lecture on some of the language technology work being done in Canada for Indigenous languages: more to come!