A6 Project Option A: Language Model Building
CSE 473: Introduction to Artificial Intelligence, The University of Washington, Seattle, Spring 2023
Overview

This project option continues the exploration of large language models (LLMs) that started with Assignment 1. In this option you will build and explore a language model that can predict the next word or character in a textual sequence. You'll use an N-gram structure for your conditional probability model. Other features, such as embeddings, are optional.
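To make the N-gram idea concrete, here is a small sketch of a character-level trigram model: it counts how often each character follows each two-character context and predicts the most frequent successor. This is our illustrative example (the corpus string and function names are ours), not code from the assignment.

```python
from collections import Counter, defaultdict

def build_ngram_model(text, n=3):
    """Count how often each character follows each (n-1)-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context][nxt] += 1
    return model

def predict_next(model, context):
    """Return the most likely next character for a context, or None if unseen."""
    counts = model.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

corpus = "the theory of the thing is that the theme repeats"
model = build_ngram_model(corpus, n=3)
print(predict_next(model, "th"))  # 'e' follows "th" more often than 'i' or 'a' here
```

A real model would sample from the conditional distribution rather than always taking the most frequent character, but the counting structure is the same.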
Suggested Resources

The starting point for this assignment is a Jupyter notebook prepared by Andrej Karpathy, a computer scientist who currently works at (and was a founding member of) OpenAI. The notebook accompanies a video by Karpathy on how to build your own Generatively Pretrained Transformer (GPT). The video can be accessed through this link: Video Link

The video is about 2 hours long. It is not required viewing, but it may be very helpful, and it is a good introduction to how ML is used in language generation. The page with the video contains a detailed table of contents with links that will bring you to specific sections of the video. While this assignment is not sufficient to turn you into an ML expert, it should help you start to get familiar with the components and set-up of an ML model.

As you work through this assignment, don't feel as though you need to understand every step of the process. Instead, try to get an overall sense of what the GPT is doing, and target your explanations to an interested, but not necessarily technical, audience. In this notebook, you will be asked to explain certain steps that are being done in the GPT. It might be helpful to include screenshots or other images and links (especially useful for citing your sources) in your responses. If you are new to Jupyter notebooks, an optional resource section has been included at the end of this notebook that shows how to add links and figures to your notebooks.

Assignment Tasks

Download this Jupyter notebook and run it in a Jupyter session. (If your browser displays this as a page of JSON text, download it and copy it to your Jupyter folder; if your browser changes the extension of the file from ".ipynb" to ".ipynb.txt", manually remove the ".txt" part; then open it from Jupyter.) Please answer the questions after each section within the notebook. For some questions, screenshots might be useful.
If you are running the notebook locally, you can include images of the screenshots in your notebook. If you are running the notebook in Google Colab, you may copy and paste the output into your notebook.

1. Introduction

In this assignment, you will use the starter code provided by Dr. Karpathy to learn more about how large language models such as Generatively Pretrained Transformers work.
Part 1 Questions:
2. Try Out a Large Language Model

Run the code in the Jupyter notebook created by Andrej Karpathy.

Part 2 Questions:
3. Change the Block Size

See this section of the video to get an understanding of the function of block size: data loader: batches of chunks of data (clicking this item in the table of contents will bring you to the relevant part of the video).

Part 3 Questions:
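The role of block size can be sketched in plain Python. Karpathy's notebook implements this with PyTorch tensors; the version below is our simplified sketch of the same idea (the names `get_batch`, `block_size`, and `batch_size` mirror the video's terminology):

```python
import random

def get_batch(data, block_size, batch_size):
    """Sample batch_size chunks of length block_size, plus their shifted targets.

    For each chunk, x is the input sequence and y is the same sequence shifted
    one position ahead: y[t] is the "next token" the model should learn to
    predict after seeing x[:t+1].
    """
    starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + 1 + block_size] for i in starts]
    return x, y

data = list(range(100))  # stand-in for a text encoded as a list of token ids
x, y = get_batch(data, block_size=8, batch_size=4)
```

Increasing the block size gives the model longer contexts to condition on, at the cost of more computation per training step.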
4. Change Training Data

Find an author on Project Gutenberg who has multiple books available for download. Create differently sized datasets to train the GPT, starting from one book and increasing the number of books used to create the dataset. Create at least 3 different datasets, with the last containing at least 5 books. You should create your datasets programmatically using files downloaded from Project Gutenberg. Feel free to recycle and modify code from A1 for this purpose.

Part 4 Questions:
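One possible shape for the dataset-building step is sketched below: given plain-text files already downloaded from Project Gutenberg (e.g., with code adapted from A1), it concatenates the first 1, 3, and 5 books into progressively larger training files. The function name, file layout, and sizes are our assumptions, not requirements of the spec; also remember to strip the Gutenberg header/footer boilerplate from each file before training on it.

```python
from pathlib import Path

def build_datasets(book_paths, sizes=(1, 3, 5), out_dir="datasets"):
    """Concatenate the first k books into one training file for each k in sizes.

    book_paths: plain-text book files already downloaded from Project Gutenberg.
    Returns the list of dataset files written, smallest first.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for k in sizes:
        text = "\n".join(Path(p).read_text(encoding="utf-8") for p in book_paths[:k])
        target = out / f"train_{k}_books.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written

# Usage (paths are placeholders):
# datasets = build_datasets(["book1.txt", "book2.txt", "book3.txt",
#                            "book4.txt", "book5.txt"])
```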
5. Finishing Up

In the Extra Credit portion of A1, you may have proposed some ways in which the text generator you created could be improved. Some of the suggestions discussed programming the text generator to take the rules of grammar into consideration. When we interacted with ChatGPT in class, it seemed to have a firm grasp of proper grammar. The GPT built in Karpathy's notebook clearly doesn't. Both this notebook and your A1 experience may lead you to think that character-level large language models are a dead end -- but it might be too early to dismiss character-level LLMs. This article discusses some of the pros and cons of such models: Character Level NLP

Part 5 Questions:
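One trade-off the article discusses can be seen by tokenizing the same text both ways: a character-level model works with a small vocabulary that stays fixed as the corpus grows, but its sequences are several times longer than word-level sequences of the same text. A toy illustration (our example, not from the notebook or the article):

```python
text = "To be, or not to be, that is the question."

# Character-level tokenization: every character is a token.
char_tokens = list(text)
char_vocab = sorted(set(char_tokens))

# Word-level tokenization: split on whitespace (real tokenizers are smarter).
word_tokens = text.split()
word_vocab = sorted(set(word_tokens))

# The character vocabulary is bounded (letters, digits, punctuation), while a
# word vocabulary keeps growing with the corpus; the price is longer sequences.
print(f"chars: vocab={len(char_vocab)}, sequence length={len(char_tokens)}")
print(f"words: vocab={len(word_vocab)}, sequence length={len(word_tokens)}")
```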
To sum up, Dr. Andrej Karpathy has created a series of YouTube videos that give a detailed introduction to the development of language models through deep learning. He demonstrates the use of Python in Jupyter notebooks to implement code for language modeling. He makes extensive use of the Python library PyTorch, which was originally developed at Facebook (now Meta) but has been open-sourced and is freely available.

Special Requirements for Option A

Rather than a Python code file, submit your Jupyter notebook file as a representation of your code. Any separate data files should be zipped up with the code, unless they are larger than 20 MB, in which case your notebook file should include a link to the data file(s).

In the Report

For the section described as "Any option-specific report requirements mentioned in those options' details", write "See our answers to the spec's questions within our Jupyter notebook."

Credit: This project specification was provided by Emilia Gan, (c) 2023.