A6 Project Option A: Language Model Building
CSE 473: Introduction to Artificial Intelligence
The University of Washington, Seattle, Spring 2023

Overview

This project option continues the exploration of large language models (LLMs) that started with Assignment 1.

In this option you will build and explore a language model that can predict the next word or character in a textual sequence. You'll use an N-gram structure for your conditional probability model. Other features such as embeddings are optional.
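
To make the N-gram idea concrete, here is a rough sketch (the helper names are hypothetical, and this is an illustration rather than a required design): a character-level N-gram model estimates the probability of the next character given the previous N-1 characters by counting occurrences in the training text.

    import random
    from collections import defaultdict, Counter

    def build_ngram_counts(text, n=3):
        """Count how often each character follows each (n-1)-character context."""
        counts = defaultdict(Counter)
        for i in range(len(text) - n + 1):
            context, nxt = text[i:i + n - 1], text[i + n - 1]
            counts[context][nxt] += 1
        return counts

    def sample_char(counts, context):
        """Sample the next character in proportion to its observed frequency."""
        options = counts.get(context)
        if not options:
            context = random.choice(list(counts))  # unseen context: reuse a seen one
            options = counts[context]
        chars, weights = zip(*options.items())
        return random.choices(chars, weights=weights)[0]

    # Train on a tiny string and generate 40 characters from the seed "to".
    counts = build_ngram_counts("to be or not to be, that is the question", n=3)
    out = "to"
    for _ in range(40):
        out += sample_char(counts, out[-2:])
    print(out)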

Suggested Resources

The starting point for this assignment is a Jupyter notebook prepared by Andrej Karpathy, a computer scientist who currently works at (and was a founding member of) OpenAI. The notebook accompanies a video by Karpathy on how to build your own Generatively Pretrained Transformer (GPT). The video can be accessed through this link: Video Link

The video is about 2 hours long. It is not required viewing, but it may be very helpful, and it provides a good introduction to how ML is used in language generation. The page with the video contains a detailed table of contents with links that will bring you to specific sections of the video.

While this assignment is not sufficient to turn you into an ML expert, it should help you start to get familiar with the components and setup of an ML model. As you work through this assignment, don't feel as though you need to understand every step of the process. Instead, try to get an overall sense of what the GPT is doing, and target your explanations to an interested, but not necessarily technical, audience.

In this notebook, you will be asked to explain certain steps that are being done in the GPT. It might be helpful to include screenshots or other images and links (especially useful for providing your sources) in your responses. If you are new to Jupyter notebooks, an optional resource section has been included at the end of this notebook that is intended to show you how to add links and figures to your notebooks.

Assignment Tasks

Download this Jupyter notebook and run it in a Jupyter session. (If your browser displays this as a page of JSON text, go ahead and download it and copy it to your Jupyter folder; if your browser changes the extension of the file from ".ipynb" to ".ipynb.txt" then manually remove the ".txt" part; then open it from Jupyter.) Please answer the questions after each section within the notebook. For some questions, screenshots might be useful. If you are running the notebook locally, you can include images of the screenshots in your notebook. If you are running the notebook in Google Colab, you may copy and paste the output into your notebook.

1. Introduction

In this assignment, you will use the starter code provided by Dr. Karpathy to learn more about how large language models such as Generatively Pretrained Transformers work.
  1. Read the NYT article linked here: NYT Article (or access a PDF of it here).
  2. Visit this YouTube page: "Let's build GPT: from scratch, in code, spelled out". Watch as much of the video as you feel motivated to. Remember, you can pick and choose sections of the video to watch with the table of contents provided.
  3. Get a copy of the Jupyter notebook that accompanies the video. You can get a version to use locally from Andrej Karpathy's GitHub repo or you can use the Google Colab notebook he provides. Links to both are provided under the YouTube video.

Part 1 Questions:

  1. According to the NYT article, how many rounds of training were required before BabyGPT was able to 'babble'?
  2. According to the article, what is 'loss' when mentioned in the context of a machine-learning model?
  3. Using just the NYT article as a reference, explain how BabyGPT works "behind the scenes" in your own words.
  4. Ask ChatGPT how a GPT works and include the response in a new Markdown cell. Compare your response to Question 3 with ChatGPT's response. Which explanation do you think would be easier to understand for someone who is a complete novice with respect to machine learning?

2. Try Out a Large Language Model

Run the code in the Jupyter notebook created by Andrej Karpathy.

Part 2 Questions:

  1. What data set is being used with the GPT in the notebook?
  2. List some similarities and differences between what this notebook is doing and what you did with Shakespeare's sonnets in A1. Based on your observations, before looking closely at any of the GPT's output, do you expect the new text generated by the GPT to be better, worse, or about the same (in terms of resemblance to real language) as the Shakespeare text generator from A1?
  3. Change the number of rounds of training, similar to what was described in the NYT article, but stopping at a maximum of 5000 iterations. Choose at least 5 different training levels and describe your findings; a training-loop sketch follows this list. Questions to address:
    • If we're interested in how a large language model's capabilities develop, should the levels chosen be equally spaced, or would it make sense to have more observations at the lower (or higher) levels of training?
    • When do recognizable words begin appearing?
    • How good do you consider the GPT to be after 5000 rounds of training? Explain your response.
  4. Discuss the effect of increasing levels of training, including screenshots of the output (or copy/paste text into your notebook) to illustrate the points you are making. Note: Be Selective! Don't include the entire output.
  5. Try 30,000 rounds (this took about an hour running locally on my machine -- Google Colab would likely be faster) -- does this make a noticeable difference to the quality of the output text generated by the GPT? Include a sample (screenshot or copy & paste) and discuss the evidence supporting your determination. If the text does seem noticeably better, speculate as to why that might be. Note: Be Selective! Don't include the entire output.
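
For Question 3 above, one possible pattern is to re-run the notebook's training loop at several iteration counts. The sketch below is a hedged example, not the notebook's exact code: it assumes the lecture code's names (BigramLanguageModel, get_batch, decode, device, learning_rate) and that the notebook's earlier cells have already been run. Adjust the names (and any constructor arguments) to match your copy.

    import torch

    # Hypothetical sweep over training levels for Part 2, Question 3.
    # Assumes the notebook's earlier cells defined BigramLanguageModel,
    # get_batch, decode, device, and learning_rate (lecture-code names).
    training_levels = [100, 500, 1000, 2500, 5000]

    for max_iters in training_levels:
        model = BigramLanguageModel().to(device)  # fresh model so runs are comparable
        optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
        for step in range(max_iters):
            xb, yb = get_batch('train')           # one batch of (input, target) chunks
            logits, loss = model(xb, yb)
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()
        print(f"--- after {max_iters} iterations, loss = {loss.item():.4f} ---")
        context = torch.zeros((1, 1), dtype=torch.long, device=device)
        print(decode(model.generate(context, max_new_tokens=200)[0].tolist()))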

3. Change the Block Size

See this section of the video to get an understanding of the function of block size: data loader: batches of chunks of data (clicking this item on the table of contents will bring you to the relevant part of the video).
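
For orientation before the questions: in the notebook, block_size sets the maximum context length, so each training example is a chunk of block_size tokens, and the targets are the same chunk shifted one position to the right. The toy sketch below is self-contained and only mirrors the lecture code's naming; it is an illustration, not the notebook itself.

    import torch

    block_size = 8   # maximum context length per training example
    batch_size = 4   # number of chunks processed in parallel

    # Toy "dataset": token ids 0..99 standing in for an encoded text.
    data = torch.arange(100)

    def get_batch(data):
        # Pick random starting offsets, then slice out block_size-long chunks.
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
        y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets (shifted by one)
        return x, y

    xb, yb = get_batch(data)
    print(xb.shape, yb.shape)  # both torch.Size([4, 8])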

Part 3 Questions

  1. What is block size?
  2. What is the significance of block size in an LLM?
  3. Change the block size in the Jupyter notebook to be 8. What do you think the effect of this change will be in terms of speed, loss, and output (and why)? Run the code and examine your output. Does it back up your prediction? Discuss.

4. Change Training Data

Find an author on Project Gutenberg who has multiple books available for download. Create datasets of different sizes to train the GPT, starting from one book and increasing the number of books used. Create at least 3 different datasets, with the last containing at least 5 books. You should create your datasets programmatically using files downloaded from Project Gutenberg; feel free to recycle and modify code from A1 for this purpose, or adapt the sketch below.
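
One possible way to script the downloads is sketched below. The ebook IDs are placeholders to replace with your chosen author's books, and the plain-text URL pattern shown is a common Project Gutenberg layout, but you should verify the actual file URL on each book's page (and consider stripping the Gutenberg header/footer boilerplate, as in A1).

    import urllib.request

    # Placeholder Gutenberg ebook IDs; substitute your chosen author's books.
    book_ids = [1342, 1260, 145, 768, 4300]

    def fetch_book(book_id):
        # Common Project Gutenberg plain-text URL pattern; confirm per book.
        url = f"https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt"
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="ignore")

    # Build progressively larger datasets: 1 book, 3 books, 5 books.
    for n_books in (1, 3, 5):
        text = "".join(fetch_book(b) for b in book_ids[:n_books])
        fname = f"dataset_{n_books}_books.txt"
        with open(fname, "w", encoding="utf-8") as f:
            f.write(text)
        print(fname, len(text), "characters")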

Part 4 Questions

  1. Which author did you choose?
  2. Describe the datasets you created: number of books, overall size of the file created.
  3. Using screenshots or copy & paste of illustrative examples of the GPT output, describe the effect of larger training sets on GPT output. Note: Be Selective! Don't include the entire output.

5. Finishing Up

In the Extra Credit portion of A1, you may have proposed some ways in which the text generator you created could be improved. Some of those suggestions involved programming the text generator with features that take the rules of grammar into account. When we interacted with ChatGPT in class, it seemed to have a firm grasp of proper grammar; the GPT built in Karpathy's notebook clearly doesn't. Both this notebook and your A1 experience may cause you to think that character-level large language models are not worth pursuing -- but it might be too early to dismiss them. This article discusses some of the pros and cons of such models: Character Level NLP

Part 5 Questions

  1. List some of the advantages of character-level models, according to the article linked.
  2. Discuss your experiences with this assignment: What were the main challenges you encountered? How did having done A1 help with your understanding of this model?

To sum up, Dr. Andrej Karpathy has created a series of YouTube videos that give a detailed introduction to the development of language models through deep learning. He demonstrates the use of Python in Jupyter notebooks to implement code for language modeling.

He makes extensive use of the Python library PyTorch, which was originally developed at Facebook (now Meta) but has been open-sourced and is freely available.

Special Requirements for Option A

Rather than a Python code file, submit your Jupyter notebook file as a representation of your code. Any separate data files should be zipped up with the code, unless they are larger than 20 MB, in which case your notebook file should include a link to the data file(s).

In the Report

For the section described as "Any option-specific report requirements mentioned in those options' details", write "See our answers to the spec's questions within our Jupyter notebook."

Credit: This project specification was provided by Emilia Gan, © 2023.