From Natural Language to Data Science
- Course codes: COMP 345 & LING 345 (Winter 2024)
- Instructors: Siva Reddy (office hours: Tuesdays 3 pm, ENGMC 104N), Morgan Sonderegger
- Teaching Assistants: Vaibhav Adlakha, Gaurav Kamath, Aayush Kapur, Arkil Patel
- Classroom: McConnell Engineering Building 13
- Time: Tuesdays and Thursdays, 10:05 am – 11:25 am
- Links: MyCourses (announcements, slides)
Description
Course Format: The course is held in person unless otherwise notified.
Recent decades have seen phenomenal advances in our scientific and engineering understanding of language from a computational perspective. A large part of this success is due to the rapid and unprecedented expansion of different kinds of language data, as well as new computational tools for dealing with this data. This course provides an introduction to the data science of language. The emphasis will be on learning basic tools for working with language data for both engineering and scientific applications.
Goals
The goal of this course is to learn how to think about language data (predominantly raw text) and to work with it computationally. Along the way, we will learn a number of mathematical and computational tools for data collection, representation, processing, modeling, and analysis. The main emphasis of the course will be on transferring theory into practice. By the end of the course, you will know how to visualize large amounts of text, study the nature of words, build a thesaurus, search for information, and predict outcomes from social media text, while also studying the ethical implications. A side goal is to have fun with text!
Students will
- Learn to visualize text.
- Learn how to analyze text data using simple statistical methods.
- Learn how to build simple models of language domains.
- Learn how to predict outcomes based on text.
- Learn the biases of current NLP technology and their implications.
- Learn how to process different kinds of language data using Python.
- Learn how to query linguistic databases.
Prerequisites
Required: Programming background at the level of COMP 202 or equivalent.
Recommended: Programming background at the level of COMP 250 or higher. Mathematics background at the level of MATH 240. Basic calculus and linear algebra will be helpful but not critical.
Grading
Note that the details below are subject to change. There is no exam.
- Warm-Up Problem Sets 1 and 2 (20%): These are a warm-up for getting used to programming in Python and submitting assignments (10% each).
- Problem Sets (80%): 4 problem sets equally weighted (20% each).
Topics (Tentative)
Language data and applications
- Data (Web documents, Reviews, Social Networks)
- Applications (Information retrieval/extraction, Sentiment analysis, Recommendation systems)
Searching through data
- Regular expressions and corpus query language
- Symbolic and distributional representations
- Tree regular expressions and semantic regular expressions
- Hashing
How to make sense of data
- Mutual information, collocation
- Keywords
- Vector space models
- Distributionally similar words
- Topic models
Language Modeling
- Spell checkers
- Detecting fake content
- Language generation
Language to decisions
- Feature-based models (logistic regression)
- Black-box models (neural)
- Sentiment analysis
- Robustness of models (build it and break it)
Information Retrieval
- Indexing
- PageRank
- Reading comprehension systems
Information Extraction
- Knowledge representation
- Knowledge bases
- Question answering
Social Networks (Twitter and Facebook data)
- Representations
- Search through structured data
- Applications
Ethics
- Biases in data
- Biases in machine learning models
- Applications and ethical questions
Schedule
Lecture | Date | Topic | Instructor | Notes |
---|---|---|---|---|
1 | Jan 4 | Introduction | Reddy, Sonderegger | |
2 | Jan 9 | Regular Expressions | Reddy | Assignment 1 (Reddy) – Python basics – 10% (Exercise, Advanced) |
3 | Jan 11 | Keywords, Association metrics | Reddy | |
4 | Jan 16 | Vector space model | Reddy | Add or drop deadline |
5 | Jan 18 | Vector space model, LSA | Reddy | |
6 | Jan 23 | LSA / Word embeddings | Reddy | Assignment 1 due; Assignment 2 (Reddy) – Corpus Query Language – 10% |
7 | Jan 25 | Compositionality / Sentence Representations | Reddy | |
8 | Jan 30 | Document Representation / Topic Models / Contextuality | Reddy | |
9 | Feb 1 | Contextuality | Reddy | |
10 | Feb 6 | Classification and Regression Models | Sonderegger | Assignment 2 due; Assignment 3 (Reddy) – Vector space model, topic models – 20% |
11 | Feb 8 | Classification and Regression Models | Sonderegger | |
12 | Feb 13 | Classification and Regression Models | Sonderegger | |
13 | Feb 15 | Classification and Regression Models | Sonderegger | |
14 | Feb 20 | Classification and Regression Models | Sonderegger | |
15 | Feb 22 | Classification and Regression Models | Sonderegger | |
16 | Feb 27 | Unsupervised learning | Sonderegger | Assignment 3 due; Assignment 4 (Sonderegger) – Data analysis and regression – 20% |
17 | Feb 29 | Unsupervised learning | Sonderegger | |
18 | Mar 5 | Reading week | | |
19 | Mar 7 | Reading week | | |
20 | Mar 12 | Unsupervised learning | Sonderegger | |
21 | Mar 14 | Unsupervised learning | Sonderegger | Assignment 4 due; Assignment 5 (Sonderegger) – clustering, language phylogeny – 20% |
22 | Mar 19 | Language Modeling | Reddy | |
23 | Mar 21 | Language Modeling | Reddy | |
24 | Mar 26 | Dialogue Systems/Semantic Parsing | Reddy | |
25 | Mar 28 | Ethics and bias | Reddy | |
26 | Apr 2 | TBD | Reddy (At Barbados Workshop) | Assignment 5 due; Assignment 6 (Reddy) – language modeling, semantic parsing, information retrieval, and bias – 20% |
27 | Apr 4 | Data Science for Speech | Sonderegger | |
28 | Apr 9 | Neural Networks | Reddy | Assignment 6 due Apr 13 |
Language of Submission
In accord with McGill University’s Charter of Students’ Rights, students in this course have the right to submit in English or in French any written work that is to be graded.
Academic Integrity
McGill University values academic integrity. Therefore, all students must understand the meaning and consequences of cheating, plagiarism and other academic offences under the Code of Student Conduct and Disciplinary Procedures (see www.mcgill.ca/students/srr/honest/ for more information).
Inclusivity
As the instructor of this course, I endeavor to provide an inclusive learning environment. However, if you experience barriers to learning in this course, do not hesitate to discuss them with me or the Office for Students with Disabilities.