From Natural Language to Data Science
- Course codes: COMP 345 & LING 345 (Winter 2025)
- Instructors:
- Siva Reddy [office hours: Thursdays 1:00pm–2:00pm ENGMC 104N (immediately after the lecture)]
- Morgan Sonderegger [office hours: TBA, starting late February]
- Teaching Assistants: Parishad BehnamGhader, Mehar Bhatia, Gaurav Kamath, Arkil Patel, Gandharv Patil
- Classroom: Macdonald-Harrington Building G-10 (MDHAR G-10)
- Time: Tuesdays and Thursdays, 11:35 am – 12:55 pm
- Links: MyCourses: announcements, slides
Description
Course Format: The course is held in person unless otherwise notified.
Recent decades have seen phenomenal increases in our scientific and engineering understanding of language from a computational perspective. A large part of this success has been the rapid and unprecedented expansion of different kinds of language data, as well as new computational tools for dealing with this data. This course provides an introduction to the data science of language. The emphasis will be on learning basic tools for working with language data for both engineering and scientific applications.
Goals
The goal of this course is to learn how to think about language data (predominantly raw text) and to work with it computationally. Along the way, we will learn a number of mathematical and computational tools for data collection, representation, processing, modeling, and analysis. The main emphasis of the course will be on transferring theory into practice. By the end of the course, you will know how to visualize large amounts of text, study the nature of words, build thesauruses, search for information, and predict outcomes from social media text while also studying the ethical implications. An important side goal is to have fun with text!
Students will
- Learn to visualize text.
- Learn how to analyze language data using simple statistical and machine learning methods.
- Learn how to build simple models of language domains.
- Learn how to predict outcomes based on text.
- Learn about the biases of current NLP technology and their implications.
- Learn how to process different kinds of language data using Python.
- Learn how to query linguistic databases.
Prerequisites
Required: Programming background at the level of COMP 202 or equivalent.
Recommended: Programming background at the level of COMP 250 or higher. Mathematics background at the level of MATH 240. Basic calculus and linear algebra will be helpful but not critical.
Grading
There are six problem sets and no exam.
- Warm-Up Problem Sets 1 and 2 (20%): These are a warm-up to get used to programming in Python and to the submission process (10% each).
- Problem Sets 3–6 (80%): Four problem sets, equally weighted (20% each).
Because they are the primary method of assessment, successful completion of each of problem sets 3–6 is required to pass the course. In other words, if you do not submit one of these problem sets, or receive an F on one, you may fail the course.
The details above are subject to change. If there are changes to the means of assessment after the add/drop date, you will have the option to decide whether the original or modified means of assessment will determine your final grade.
Topics (Tentative)
Language data and applications
- Data (Web documents, Reviews, Social Networks)
- Applications (Information retrieval/extraction, Sentiment analysis, Recommendation systems)
Searching through data
- Regular expressions and corpus query language
- Symbolic and distributional representations
- Tree regular expressions and semantic regular expressions
- Hashing
How to make sense of data
- Mutual information, collocation
- Keywords
- Vector space models
- Distributionally similar words
- Topic models
Language Modeling
- Spell checkers
- Detecting fake content
- Language generation
Language to decisions
- Feature-based models (logistic regression)
- Black-box models (neural)
- Sentiment analysis
- Robustness of models (build it and break it)
Information Retrieval
- Indexing
- PageRank
- Reading comprehension systems
Information Extraction
- Knowledge representation
- Knowledge bases
- Question answering
Social Networks (Twitter and Facebook data)
- Representations
- Search through structured data
- Applications
Ethics
- Biases in data
- Biases in machine learning models
- Applications and ethical questions
Data analysis: supervised
- Classification
- Regression
- Ensemble methods
Data analysis: unsupervised
- Probabilistic clustering
- Hierarchical clustering
- Dimensionality reduction
Speech and information
- Information theory
- Speech technology: overview
- Speech data science
Schedule
Lecture | Date | Topic | Instructor | Notes |
---|---|---|---|---|
1 | Jan 7 | Introduction | Reddy, Sonderegger | |
2 | Jan 9 | Regular Expressions | Reddy | Assignment 1 (Reddy) – Python basics – 10% Exercise, Advanced |
3 | Jan 14 | Keywords, Association metrics | Reddy | Add or drop deadline |
4 | Jan 16 | Vector space model | Reddy | |
5 | Jan 18 | Vector space model, LSA | Reddy | |
6 | Jan 21 | LSA / Word embeddings | Reddy | |
7 | Jan 23 | Compositionality / Sentence Representations | Reddy | Assignment 1 due; Assignment 2 (Reddy) – Corpus Query Language – 10% |
8 | Jan 28 | Document Representation / Topic Models / Contextuality | Reddy | |
9 | Jan 30 | Contextuality | Reddy | |
10 | Feb 4 | Language Modeling | Reddy | |
11 | Feb 6 | Language Modeling | Reddy | Assignment 2 due; Assignment 3 (Reddy) – Vector space model, topic models – 20% |
12 | Feb 11 | Dialogue Systems/Semantic Parsing | Reddy | |
13 | Feb 13 | Ethics and bias | Reddy | |
14 | Feb 18 | Neural Networks | Reddy | |
15 | Feb 20 | Neural Networks | Reddy | Assignment 3 due; Assignment 4 (Reddy) – language modeling, semantic parsing, information retrieval, and bias – 20% |
16 | Feb 25 | Classification and Regression Models | Sonderegger | |
17 | Feb 27 | Classification and Regression Models | Sonderegger | |
18 | Mar 4 | Reading week | Reading week | |
19 | Mar 6 | Reading week | Reading week | |
20 | Mar 11 | Classification and Regression Models | Sonderegger | |
21 | Mar 13 | Classification and Regression Models | Sonderegger | Assignment 4 due; Assignment 5 (Sonderegger) – Data analysis and regression – 20% |
22 | Mar 18 | Classification and Regression Models | Sonderegger | |
23 | Mar 20 | Unsupervised learning | Sonderegger | |
24 | Mar 25 | Unsupervised learning | Sonderegger | |
25 | Mar 27 | Unsupervised learning | Sonderegger | Assignment 5 due; Assignment 6 (Sonderegger) – clustering, language phylogeny – 20% |
26 | Apr 1 | Information Theory | Sonderegger | |
27 | Apr 3 | Dimensionality Reduction | Sonderegger | |
28 | Apr 8 | Speech Technology | Sonderegger | |
29 | Apr 10 | Data Science for Speech | Sonderegger | Assignment 6 due |
Generative AI Policy
If you use any Generative AI tool for your submitted work (e.g., ChatGPT, GitHub Copilot, Claude), you must cite it and submit a detailed statement describing its use, as well as a log of the chat where you used it.
You may not:
- Copy AI-generated prose or non-trivial code chunks – this is plagiarism
- Use AI to write whole assignments or code files
- Use AI to write large chunks of an assignment or code
- Use AI without citing it in your assignment
Inappropriate use of Generative AI may result in grade penalties or referral to disciplinary authorities. If you have any questions about the appropriate use of AI applications for course work, please contact the instructors during their office hours.
Language of Submission
In accord with McGill University’s Charter of Students’ Rights, students in this course have the right to submit in English or in French any written work that is to be graded.
Academic Integrity
McGill University values academic integrity. Therefore, all students must understand the meaning and consequences of cheating, plagiarism and other academic offences under the Code of Student Conduct and Disciplinary Procedures (see www.mcgill.ca/students/srr/honest/ for more information).
Inclusivity
As the instructors of this course, we endeavor to provide an inclusive learning environment. However, if you experience barriers to learning in this course, do not hesitate to discuss them with one of us or with the Student Accessibility and Achievement office. Note that accommodation requests for a given problem set must be made before the problem set is due.
Extraordinary Circumstances
In the event of extraordinary circumstances beyond the control of McGill University, assessment tasks in a course are subject to change, provided students are sent adequate and timely communications regarding the change.