This course offers a comprehensive study of Natural Language Processing (NLP). We will delve into word representations, the construction of language models, the application of deep learning in NLP, and Large Language Models (LLMs). Specifically, we will discuss the architectural engineering, data engineering, prompt engineering, training techniques, and efficiency enhancements of LLMs. Students will gain insights into the application of NLP and LLMs in various domains, such as text mining, search engines, human-machine interaction, medical-legal consulting, low-resource languages, and AI for Science, as well as how to handle issues of data privacy, bias, and ethics. The course will also investigate the limitations of NLP and LLMs, such as challenges in alignment. The curriculum includes guest lectures on advanced topics and in-class presentations to stimulate practical understanding. This course is ideal for anyone seeking to master the use of NLP and LLMs in their field.

Teaching team


Instructor
Benyou Wang

Benyou Wang is an assistant professor in the School of Data Science, The Chinese University of Hong Kong, Shenzhen. He has received several notable awards, including a Best Paper Nomination at SIGIR 2017, the Best Explainable NLP Paper award at NAACL 2019, Best Paper at NLPCC 2022, a Marie Curie Fellowship, and a Huawei Spark Award. His primary research focus is large language models.

Logistics


Course Information


This comprehensive course on Natural Language Processing (NLP) offers a deep dive into the field, providing students with the knowledge and skills to understand, design, and implement NLP systems. Starting with an overview of NLP and foundational linguistic concepts, the course moves on to word representation and language modeling, essential for understanding text data. It explores how deep learning, from basic neural networks to advanced transformer models, has revolutionized NLP and its diverse applications, such as text mining, information extraction, and machine translation. The course emphasizes large language models (LLMs): their scaling laws, emergent abilities, training strategies, and associated knowledge representation and reasoning. Students will apply what they learn in final projects, for example exploring NLP beyond text with multi-modal LLMs, AI for Science, vertical applications, and agents. Guest lectures and in-class paper discussions expose students to cutting-edge research. The course concludes with an examination of NLP's limitations and ethical considerations. In particular, the topics include:

  • Introduction to NLP
  • Linguistics and Word Embeddings
  • Language Models
  • Deep Learning in NLP
  • Large Language Models (LLMs)
  • Prompt Engineering
  • LLM Agents
  • Training Large Language Models
  • Final Project Introduction and Research Sharing
  • Multimodal Learning
  • LLM Reasoning and Guest Lecture
  • LLM Applications and Guest Lecture
  • Project Presentations (Part 1)
  • Project Presentations (Part 2)

Prerequisites

Learning Outcomes

Grading Policy

Assignments (40%)

  • Assignment 1 (10%): Training word vectors.
  • Assignment 2 (15%): Using APIs to test prompt engineering and LLM agents.
  • Assignment 3 (15%): Training an NLP model with SFT and RLHF.
  • Each assignment requires a report, plus a code attachment if it involves coding; a minimal illustration of the kind of code involved appears below. The evaluation criteria mirror those of the final project.
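
To make the coding component concrete, here is a minimal sketch of the kind of code Assignments 1 and 2 involve. The specific libraries, model name, and hyperparameters (gensim's Word2Vec, the OpenAI Python client, "gpt-4o-mini") are illustrative assumptions, not requirements from the assignment handouts.

```python
# Assignment 1 flavor (assumed setup): train skip-gram word vectors on a toy
# corpus with gensim. The real assignment may instead ask you to implement
# word2vec from scratch.
from gensim.models import Word2Vec

corpus = [
    ["natural", "language", "processing", "is", "fun"],
    ["large", "language", "models", "learn", "word", "representations"],
]
model = Word2Vec(
    sentences=corpus,  # tokenized sentences
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # keep every token in this tiny corpus
    sg=1,              # 1 = skip-gram, 0 = CBOW
)
print(model.wv.most_similar("language"))  # nearest neighbors by cosine similarity
```

For Assignment 2, the basic loop is to send different prompts to a hosted model through an API and compare the responses. A sketch with the OpenAI Python client (the course may provide a different endpoint or key) could look like this:

```python
# Assignment 2 flavor (assumed setup): compare two prompt styles via the
# OpenAI chat completions API. Requires OPENAI_API_KEY in the environment;
# the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
prompts = [
    "Explain the difference between skip-gram and CBOW.",
    "You are an NLP tutor. In two sentences, contrast skip-gram with CBOW.",
]
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    print(prompt, "->", response.choices[0].message.content, sep="\n")
```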

Final project (55%)

The project may be done in a group, but each individual is evaluated separately. You need to write a project report (max 6 pages) for the final project. Here is the report template. You are also expected to give a project poster presentation. After the final project deadline, feel free to make your project open source; we would appreciate an acknowledgement of this course.

  • Project poster (10%): Your poster presentation will be rated by other groups and TAs. The average rating will be the final credit.
    • Poster quality (1%): We all like well-presented posters.
    • Oral presentation (4%): Presenters are encouraged to speak clearly and with enthusiasm.
    • Overall subjective assessment (5%): Although subjective assessment can be biased, it happens everywhere!
  • Project report (45%): The project report will be publicly available after the final poster session. Please let us know if you prefer that it not be released.
    • Technical excitement (10%): You are encouraged to do something that is either interesting or useful! Part of this concerns the problem being solved and part concerns the solution itself.
    • Technical soundness (15%): A) Discuss the motivation for working on this project and for your algorithm or approach; even if you are reproducing a published paper, you should have your own motivation. B) Cite existing related work. C) Present the algorithms or systems used in your project, with enough key information for reviewers to judge whether they are technically correct. D) Provide a reasonable evaluation protocol, detailed enough to contextualize your results. E) Report quantitative results and include qualitative evaluation. Analyze and understand your system by inspecting key outputs and intermediate results; discuss how it works, when it succeeds and when it fails, and try to interpret why it works and why it does not.
    • Clarity in writing (15%): The report is written in a precise and concise manner so that it can be easily understood.
    • Individual contribution (5%): This is based on individual contribution, assessed partly on a subjective basis.
  • Bonus and penalty: note that the total project credit is capped at 55%.
    • TA favorites (2%): If one of the TAs nominates the project as his or her favorite, the involved students receive a 1% bonus. Each TA may nominate one project or reserve the nomination. This credit can only be obtained once.
    • Instructor favorites (1%): If the instructor nominates the project as his or her favorite, the involved students receive a 1% bonus. The instructor may nominate at most three projects. A project may receive both the TA and instructor favorite bonuses.
    • Project early-bird bonus (2%): If you submit the project report by the early submission due date, you receive a 2% bonus.
    • Code reproducibility bonus (1%): Awarded if the TAs judge that your results can be easily reproduced from the provided material.
    • Ethics concerns (-1%): If the ethics committee (the instructor and all TAs) raises any serious ethics concerns, the project receives a 1% penalty.

Participation (5%)

Here are some ways to earn the participation credit, which is capped at 5%.

  • Attending guest lectures: In the second half of the course, we have invited speakers. We encourage students to attend the guest lectures and participate in Q&A. Students earn 1.5% per guest lecture (3% in total), either by attending in person or by writing a guest-lecture report if they attend remotely or watch the recording.
  • Completing feedback surveys: We will send out two feedback surveys during the semester to evaluate the course and teaching.
  • User study: Students are welcome to conduct a user study based on their interests; this is not mandatory (and does not affect final marks).
  • Course and Teaching Evaluation (CTE): The school will send requests for CTE to all students. The CTE is worth 1% credit.
  • Volunteer credit (1%): TAs or the instructor can nominate students for a volunteer credit for helping to organize the poster session or answering questions from other students (not writing their assignments).

Late Policy

The penalty is 20% off the assignment grade for each late day.

Schedule


Each entry lists the date and topic, followed by the recommended reading, pre-lecture question, coding material, and deadlines where applicable.

Sept. 1-4 - Warmup Tutorial 0: GitHub, LaTeX, Colab, and ChatGPT API
  • Recommended reading: OpenAI's blog; LaTeX and Overleaf; Colab; GitHub

Sept. 5th - Lecture 1: Introduction to NLP
  • Recommended reading: Hugging Face NLP Course (a course to get into NLP with roadmaps and Colab notebooks); LLM-Course; On the Opportunities and Risks of Foundation Models; Sparks of Artificial General Intelligence: Early experiments with GPT-4
  • Pre-lecture question: What is NLP?
  • Coding: [Phoenix]

Sept. 12th - Lecture 2: Basics of Linguistics and Word Representation
  • Recommended reading: Universal Stanford Dependencies: A cross-linguistic typology; Insights between NLP and Linguistics; End-to-end Neural Coreference Resolution; Efficient Estimation of Word Representations in Vector Space (original word2vec paper); Evaluation methods for unsupervised word embeddings
  • Pre-lecture questions: What is the structure of language (a string of words)? How do we model language and the words inside it?
  • Coding: [Linguistics repo]
  • Deadlines: Assignment 1 out

Sept. 19th - Lecture 3: Language Modeling
  • Recommended reading: A Neural Probabilistic Language Model; BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • Pre-lecture question: What is a language model and how does it work in natural language processing?
  • Coding: [BERT]

Sept. 19th - Tutorial 1: Introduction to Overleaf, GitHub, Python, and PyTorch

Sept. 26th - Lecture 4: Deep Learning in NLP
  • Recommended reading: Attention Is All You Need; HuggingFace's course on Transformers; Scaling Laws for Neural Language Models; The Transformer Family Version 2.0; On Position Embeddings in BERT
  • Pre-lecture question: How can we better compose words semantically into language?
  • Coding: [Transformer]

Sept. 26th - Tutorial 2: Training word embeddings
  • Coding: [Colab]

Oct. 10th - Lecture 5: Large Language Models (LLMs)
  • Recommended reading: Training language models to follow instructions with human feedback; LLaMA: Open and Efficient Foundation Language Models; Llama 2: Open Foundation and Fine-Tuned Chat Models; OpenAI's blog; Qwen2.5 Technical Report; DeepSeek-V3 Technical Report
  • Pre-lecture question: What are LARGE language models and why LARGE?
  • Coding: [Fine-tune Llama 2]
  • Deadlines: Assignment 1 due (11:59pm); Assignment 2 out

Oct. 17th - Lecture 6: Prompt Engineering
  • Recommended reading: Best practices for prompt engineering with OpenAI API; prompt engineering
  • Pre-lecture question: How can we better prompt LLMs?
  • Coding: [Prompt_engineer]

Oct. 17th - Tutorial 3: Prompt Engineering
  • Coding: [Colab]

Oct. 24th - Lecture 7: LLMs as Agents
  • Recommended reading: ToolBench; AgentBench; Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks; LLM Powered Autonomous Agents; Mobile Agent; Cline; Roo-Cline
  • Pre-lecture question: How can we make LLMs more useful?

Oct. 24th - Tutorial 4: Agent

Oct. 31st - Lecture 8: Training Large Language Models
  • Recommended reading: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness; Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity; GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  • Pre-lecture question: How do we train LLMs from scratch?
  • Coding: [HuatuoGPT]; [LLMZoo]
  • Deadlines: Assignment 3 out; Assignment 2 due (11:59pm)

Oct. 31st - Tutorial 5: Train your own LLMs
  • Pre-lecture question: Are you ready to train your own LLMs?
  • Coding: [LLMZoo]; [nanoGPT]; [LLMFactory]

Nov. 7th - Lecture 9: Final Projects and Research Sharing
  • Pre-lecture question: What are the current research topics in NLP?
  • Deadlines: Final Project out

Nov. 14th - Lecture 10: NLP and Beyond NLP
  • Recommended reading: Blog post: Generalized Visual Language Models
  • Pre-lecture question: Can large models speak, see, and perform actions?
  • Coding: [NExT-GPT]

Nov. 14th - Tutorial 6: RLHF: Reinforcement Learning from Human Feedback
  • Pre-lecture question: How can we further improve LLMs?

Nov. 21st - Lecture 11: LLM Reasoning
  • Recommended reading: OpenAI's o1; DeepSeek-R1-Lite-Preview; Qwen-32B-Preview; HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  • Pre-lecture question: How can we improve LLMs' reasoning?
  • Deadlines: Assignment 3 due (11:59pm); Final Project Proposal due (11:59pm)

Nov. 21st - Lecture 12: Guest Lecture

Nov. 28th - Lecture 13: LLM Applications and Future
  • Recommended reading: Large Language Models Encode Clinical Knowledge; Survey of Hallucination in Natural Language Generation; Superalignment; GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
  • Pre-lecture question: What is the future of LLMs?

Dec. 5th - Final Project Presentation: Section 1
  • Pre-lecture question: How can we solve real-world problems using LLMs?
  • Deadlines: Final Project Poster due (17:00)

Dec. 12th - Final Project Presentation: Section 2
  • Pre-lecture question: How can we solve real-world problems using LLMs?

Acknowledgement

We borrowed some concepts and the website template from [CSC3160/MDS6002] where Prof. Zhizheng Wu is the instructor.

The website's GitHub repo is [here].