This course offers a comprehensive study of Natural Language Processing (NLP). We will delve into word representations, the construction of language models, the application of deep learning in NLP, and Large Language Models (LLMs). Specifically, we will discuss the architectural engineering, data engineering, prompt engineering, training techniques, and efficiency enhancements of LLMs. Students will gain insights into the application of NLP and LLMs in various domains, such as text mining, search engines, human-machine interaction, medical-legal consulting, low-resource languages, and AI for Science, as well as how to handle issues of data privacy, bias, and ethics. The course will also investigate the limitations of NLP and LLMs, such as challenges in alignment. The curriculum includes guest lectures on advanced topics and in-class presentations to stimulate practical understanding. This course is ideal for anyone seeking to master the use of NLP and LLMs in their field.

Teaching team


Instructor
Benyou Wang

Benyou Wang is an assistant professor in the School of Data Science, The Chinese University of Hong Kong, Shenzhen. He has received several notable awards, including the Best Paper Nomination Award at SIGIR 2017, the Best Explainable NLP Paper award at NAACL 2019, the Best Paper award at NLPCC 2022, a Marie Curie Fellowship, and the Huawei Spark Award. His primary research focus is large language models.

Logistics


Course Information


This comprehensive course on Natural Language Processing (NLP) offers a deep dive into the field, providing students with the knowledge and skills to understand, design, and implement NLP systems. Starting with an overview of NLP and foundational linguistic concepts, the course moves on to word representation and language modeling, essential for understanding text data. It explores how deep learning, from basic neural networks to advanced transformer models, has revolutionized NLP and its diverse applications, such as text mining, information extraction, and machine translation. The course emphasizes large language models (LLMs): their scaling laws, emergent abilities, training strategies, and the associated knowledge representation and reasoning. Students will apply their learning in final projects, for example exploring NLP beyond text with multi-modal LLMs, AI for Science, vertical applications, and agents. Guest lectures and in-class paper discussions expose students to cutting-edge research. The course concludes with an examination of NLP's limitations and ethical considerations. In particular, the topics include:

  • Introduction to NLP
  • Linguistics and Word Embeddings
  • Language Models
  • Deep Learning in NLP
  • Large Language Models (LLMs)
  • Prompt Engineering
  • LLM Agents
  • Training Large Language Models
  • Final Project Introduction and Research Sharing
  • Multimodal Learning
  • LLM Reasoning and Guest Lecture
  • LLM Applications and Guest Lecture
  • Project Presentations (Part 1)
  • Project Presentations (Part 2)

Prerequisites

Learning Outcomes

Grading Policy

This grading policy is for students in CSC4100; students in CSC6052/DDA6307/MDS6002 should refer to the section below.

Assignments (50%)

  • Assignment 1 (15%): Training word vectors (a minimal training sketch follows this list).
  • Assignment 2 (15%): Using APIs to test prompt engineering and LLM agents.
  • Assignment 3 (20%): Training an NLP model with SFT and RLHF.
  • Every assignment requires a report, plus a code attachment if it involves coding. The evaluation criteria are the same as for the final project.
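
To make Assignment 1 concrete, here is a minimal sketch of training word vectors with gensim's skip-gram Word2Vec. This is only an illustration: the corpus file name, hyperparameters, and probe word are placeholder assumptions, and the assignment handout is authoritative.

```python
# Minimal word2vec training sketch (illustrative only; not the official assignment solution).
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical corpus: a plain-text file with one document per line.
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=5,      # drop rare words
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=5,
)

# Sanity check: nearest neighbours of a probe word (assumes it survived min_count).
print(model.wv.most_similar("language", topn=5))
```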

Final project (40%)

The project may be done in a group, but each individual is evaluated separately. You need to write a project report (max 6 pages) for the final project. Here is the report template. You are also expected to give a project poster presentation. After the final project deadline, feel free to make your project open source; we would appreciate an acknowledgment of this course.

  • Project poster (10%): Your poster presentation will be rated by other groups and TAs. The average rating will be the final credit.
    • Poster quality (1%): We all like well-presented posters.
    • Oral presentation (4%): Presenters are encouraged to speak clearly and with enthusiasm.
    • Overall subjective assessment (5%): Although subjective assessment might be biased, it happens everywhere!
  • Project report (30%): The project report will be publicly available after the final poster session. Please let us know if you don't wish so.
    • Technical excitement (7%): You are encouraged to do something that is either interesting or useful!
    • Technical soundness (10%)
    • Clarity in writing (10%)
    • Individual contribution (3%)
  • Bonus and penalty: note that the total project credit is capped at 40%.
    • TA favorites (2%): If one of the TAs nominates the project as their favorite, the involved students get a 1% bonus credit. Each TA may nominate one project or choose to reserve their nomination. This credit can only be obtained once.
    • Instructor favorites (1%): If the instructor nominates the project as their favorite, the involved students get a 1% bonus credit. The instructor may nominate at most three projects. A project can receive both TA favorites and Instructor favorites.
    • Project early-bird bonus (2%): If you submit the project report by the early submission due date, you receive a 2% bonus credit.
    • Code reproducibility bonus (1%): Awarded if the TAs think they could easily reproduce your results from the provided material.
    • Ethics concerns (-1%): If the ethics committee (the instructor and all TAs) raises any serious ethics concerns, the project receives a 1% penalty.

    Participation (10%)

    Here are some ways to earn the participation credit, which is capped at 10%.

    • Attending guest lectures: In the second half of the course, we have invited speakers. We encourage students to attend the guest lectures and participate in the Q&A. Students earn 2% per guest lecture (4% in total), either by attending in person or by writing a guest lecture report if they attend remotely or watch the recording.
    • Attending tutorials: Students are expected to attend the tutorials; each tutorial accounts for 1% credit. You may miss 1 of the 6 tutorials; this component is capped at 5%.
    • Completing feedback surveys: We will send out two feedback surveys during the semester to evaluate the course and teaching.
    • User study: Students are welcome to conduct a user study if they are interested; this is not mandatory (and thus does not affect final marks).
    • Course and Teaching Evaluation (CTE): The school will send requests for CTE to all students. The CTE is worth 1% credit.
    • Volunteer credit (1%): The TAs/instructor can nominate students for a volunteer credit if they help organize the poster session or help answer questions from other students (not by writing their assignments).

    This grading policy is for students in CSC6052/DDA6307/MDS6002; students in CSC4100 should refer to the section above.

    Assignments (40%)

    • Assignment 1 (10%): Training word vectors.
    • Assignment 2 (15%): Using APIs to test prompt engineering and LLM agents (an API-call sketch follows this list).
    • Assignment 3 (15%): Training an NLP model with SFT and RLHF.
    • Every assignment requires a report, plus a code attachment if it involves coding. The evaluation criteria are the same as for the final project.
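
As a rough illustration of the API usage in Assignment 2, the sketch below compares a plain prompt with a "think step by step" prompt via the OpenAI Python client. The model name, prompts, and settings are placeholder assumptions; follow the assignment handout for the required API, models, and evaluation protocol.

```python
# Prompt-engineering sketch using the OpenAI Python SDK (>= 1.0).
# Model name and prompts are examples only, not the assignment specification.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(system_prompt: str, user_prompt: str) -> str:
    """Send a single chat completion request and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; substitute whatever the assignment requires
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.0,  # near-deterministic output makes prompt comparisons fairer
    )
    return response.choices[0].message.content


question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Baseline prompt vs. a simple chain-of-thought style prompt.
print(ask("You are a helpful assistant.", question))
print(ask("You are a careful reasoner. Think step by step before answering.", question))
```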

    Final project (55%)

    The project may be done in a group, but each individual is evaluated separately. You need to write a project report (max 6 pages) for the final project. Here is the report template. You are also expected to give a project poster presentation. After the final project deadline, feel free to make your project open source; we would appreciate an acknowledgment of this course.

    • Project poster (10%): Your poster presentation will be rated by other groups and TAs. The average rating will be the final credit.
      • Poster quality (1%): We all like well-presented posters.
      • Oral presentation (4%): Presenters are encouraged to speak clearly and with enthusiasm.
      • Overall subjective assessment (5%): Although subjective assessment might be biased, it happens everywhere!
    • Project report (45%): The project report will be publicly available after the final poster session. Please let us know if you don't wish so.
      • Technical excitement (10%): You are encouraged to do something that is either interesting or useful! Part of this criterion concerns the problem to be solved, and the rest concerns the solution itself.
      • Technical soundness (15%): A) Discuss your motivation for the project and for your algorithm or approach; even if you are reproducing a published paper, you should have your own motivation. B) Cite existing related work. C) Present the algorithms or systems in your project, providing the key information reviewers need to judge whether they are technically correct. D) Provide a reasonable evaluation protocol that is detailed enough to contextualize your results. E) Report quantitative results and include a qualitative evaluation. Analyze and understand your system by inspecting key outputs and intermediate results; discuss how it works, when it succeeds and when it fails, and try to interpret why it works or does not.
      • Clarity in writing (15%): The report is written in a precise and concise manner so the report can be easily understood.
      • Individual contribution (5%): This is based on individual contribution, probably on a subjective basis.
    • Bonus and penalty: note that the total project credit is capped at 55%.
      • TA favorites (2%): If one of the TAs nominates the project as their favorite, the involved students get a 1% bonus credit. Each TA may nominate one project or choose to reserve their nomination. This credit can only be obtained once.
      • Instructor favorites (1%): If the instructor nominates the project as their favorite, the involved students get a 1% bonus credit. The instructor may nominate at most three projects. A project can receive both TA favorites and Instructor favorites.
      • Project early-bird bonus (2%): If you submit the project report by the early submission due date, you receive a 2% bonus credit.
      • Code reproducibility bonus (1%): Awarded if the TAs think they could easily reproduce your results from the provided material.
      • Ethics concerns (-1%): If the ethics committee (the instructor and all TAs) raises any serious ethics concerns, the project receives a 1% penalty.

    Participation (5%)

    Here are some ways to earn the participation credit, which is capped at 5%.

    • Attending guest lectures: In the second half of the course, we have invited speakers. We encourage students to attend the guest lectures and participate in the Q&A. Students earn 1.5% per guest lecture (3% in total), either by attending in person or by writing a guest lecture report if they attend remotely or watch the recording.
    • Completing feedback surveys: We will send out two feedback surveys during the semester to evaluate the course and teaching.
    • User study: Students are welcome to conduct a user study if they are interested; this is not mandatory (and thus does not affect final marks).
    • Course and Teaching Evaluation (CTE): The school will send requests for CTE to all students. The CTE is worth 1% credit.
    • Volunteer credit (1%): The TAs/instructor can nominate students for a volunteer credit if they help organize the poster session or help answer questions from other students (not by writing their assignments).

    Late Policy

    The penalty is 0.5% off the final course grade for each late day.

    Schedule


    Date | Topics | Recommended Reading | Pre-Lecture Questions | Lecture Note | Coding | Events & Deadlines
    Jan 6-8 Warmup Tutorial 0: GitHub, LaTeX, Colab, and ChatGPT API OpenAI's blog
    LaTeX and Overleaf
    Colab
    GitHub
    Jan. 9th Lecture 1: Introduction to NLP Hugging Face NLP Course
    Course to get into NLP with roadmaps and Colab notebooks.
    LLM-Course
    What is NLP? [slide] [Phoenix]
    Jan. 10th Lecture 2: Introduction to NLP Cont. On the Opportunities and Risks of Foundation Models
    Sparks of Artificial General Intelligence: Early experiments with GPT-4
    What is NLP? [slide]
    Jan. 16th Lecture 3: Basics of Linguistics Universal Stanford Dependencies: A cross-linguistic typology
    Insights between NLP and Linguistics
    End-to-end Neural Coreference Resolution
    What is structure of language (string of words)? [slide] [Linguistics repo]
    Jan. 17th Lecture 4: Word Representation Efficient Estimation of Word Representations in Vector Space (original word2vec paper)
    Evaluation methods for unsupervised word embeddings
    How to model language and the inside words? [slide] [ word2vec ] Assignment 1 out
    Feb. 13th Lecture 5: Language Modeling A Neural Probabilistic Language Model
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    What is a language model and how does it work in natural language processing? [slide] [ BERT ]
    Feb. 14th Lecture 6: Language Modeling Cont. A Neural Probabilistic Language Model
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    What is a language model and how does it work in natural language processing? [slide]
    Feb. 14th Tutorial 1: Introduction to Overleaf, GitHub, Python, and Pytorch
    Feb. 20th Lecture 7: Deep Learning in NLP Attention Is All You Need
    HuggingFace's course on Transformers
    Scaling Laws for Neural Language Models
    How to better compose words semantically as language? [slide] [Transformer]
    Feb. 21st Lecture 8: Deep Learning in NLP Cont. The Transformer Family Version 2.0
    On Position Embeddings in BERT
    How to better compose words semantically as language? [slide]
    Feb. 21st Tutorial 2: Training word embeddings [Colab]
    Feb. 27th Lecture 9: Large Language Models (LLMs) Training language models to follow instructions with human feedback
    LLaMA: Open and Efficient Foundation Language Models
    Llama 2: Open Foundation and Fine-Tuned Chat Models
    OpenAI's blog
    What are LARGE language models and why LARGE? [slide] [Fine-tune Llama 2] Assignment 1 due (11:59pm) Assignment 2 out
    Feb. 28th Lecture 10: Large Language Models (LLMs) Cont. Qwen2.5 Technical Report
    DeepSeek-V3 Technical Report
    What are LARGE language models and why LARGE? [slide]
    Mar. 6th Lecture 11: Prompt Engineering Best practices for prompt engineering with OpenAI API
    prompt engineering
    How to better prompt LLMs? [slide] [Prompt_engineer]
    Mar. 7th Lecture 12: Prompt Engineering Cont. How to better prompt LLMs? [slide]
    Mar. 7th Tutorial 3: Prompt Engineering [Colab]
    Mar. 13th Lecture 13: LLMs as agents ToolBench
    AgentBench
    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    LLM Powered Autonomous Agents
    How to make LLMs more useful? [slide]
    Mar. 14th Lecture 14: LLMs as agents Cont. Mobile Agent
    Cline
    Roo-Cline
    How to make LLMs more useful? [slide]
    Mar. 14th Tutorial 4: Agent
    Mar. 20th Lecture 15: Training Large Language Models FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    How to train LLMs from scratch? [HuatuoGPT]
    [LLMZoo]
    Assignment 3 out
    Mar. 21st Lecture 16: Training Large Language Models Cont. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    How to train LLMs from scratch? [HuatuoGPT]
    [LLMZoo]
    Mar. 21st Tutorial 5: Train your own LLMs Are you ready to train your own LLMs? [LLMZoo], [nanoGPT], [LLMFactory]
    Mar. 27th Lecture 17: Final Projects and Research Sharing What are the current research topics in NLP? Final Project out Assignment 2 due (11:59pm)
    Mar. 28th Lecture 18: Research Sharing What are the current research topics in NLP?
    Apr. 3rd Lecture 19: NLP and Beyond NLP Blog post: Generalized Visual Language Models
    Can large models speak, see, and perform actions? [NExT-GPT]
    Apr. 3rd Tutorial 6: RLHF: Reinforcement Learning from Human Feedback How to further improve LLMs?
    Apr. 10th Lecture 20: NLP and Beyond NLP Cont. Can large models speak, see, and perform actions?
    Apr. 11th Lecture 21: LLM Reasoning OpenAI's O1 DeepSeek-R1-Lite-Preview Qwen-32B-Preview HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs How to improve LLMs' reasoning? Assignment 3 due (11:59pm) Final Project Proposal due (11:59pm)
    Apr. 17th Lecture 22: Guest Lecture
    Apr. 18th Lecture 23: LLM Applications and Future Large Language Models Encode Clinical Knowledge Survey of Hallucination in Natural Language Generation
    Superalignment
    GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
    What is the future of LLMs?
    Apr. 24th Lecture 24: LLM Applications and Future Cont. What is the future of LLMs? Final Project Poster due (17:00)
    Apr. 25th Final Project Presentation: Section 1 How to solve real-world problems using LLMs?
    Apr. 27th Final Project Presentation: Section 2 How to solve real-world problems using LLMs?
    May 8th Final Project Presentation: Section 3 How to solve real-world problems using LLMs?
    May 9th Final Project Presentation: Section 4 How to solve real-world problems using LLMs? Final Project Report due (11:59pm)

    Acknowledgement

    We borrowed some concepts and the website template from [CSC3160/MDS6002], taught by Prof. Zhizheng Wu.

    The website's GitHub repo is [here].