Large Language Models¶
In this lesson, we'll learn about the newest advances in large language models since the advent of the transformer architecture. By the end of this lesson, students will be able to:
- Become familiar with some techniques for improving large language models.
- Discuss how the number of model parameters relates to model performance and to other kinds of impact (e.g. environmental, financial).
How do LLMs work?¶
Below are two videos from Prof. Steve Seitz's YouTube Channel "Graphics in 5 Minutes":
%%html
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/lnA9DMvHtfI?si=QJRk0fEHZNbuKf8b" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
%%html
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/YDiSFS-yHwk?si=KY34lWBCaoIzNCEW" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
LLMs are neural networks¶
Like the other ML models we have learned about, LLMs must first be trained and can then be used to predict outputs. GPT in 60 Lines of Numpy by Jay Mody shows how one can replicate a tiny version of GPT-2 in just 60 lines of Python code. Let's play with this demo for a bit.
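To give a flavor of what that demo contains, here is a minimal sketch (in the spirit of the post, not Jay Mody's actual code) of the masked scaled dot-product attention at the heart of GPT-style models, written in plain NumPy. The shapes and random inputs are just illustrative.

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability, then normalize to sum to 1
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask):
    # scaled dot-product attention: each position mixes the values v
    # according to how well its query q matches the keys k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores + mask) @ v

n, d = 4, 8                                    # 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))

# a causal mask keeps each token from attending to future tokens
mask = np.triu(np.full((n, n), -1e10), k=1)

out = attention(q, k, v, mask)
print(out.shape)                               # (4, 8): one mixed vector per token
```

A full GPT stacks this attention (with multiple heads) together with feed-forward layers and layer normalization, but the core computation is no more exotic than the matrix multiplications above.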
Techniques for Improving LLMs¶
Let's now look at techniques that either improve how LLMs are trained or improve their performance at inference time.
Instruction fine-tuning¶
In the world of neural networks, fine-tuning means taking an already-trained network and running additional training iterations on a second dataset, updating the network's parameters so that the fine-tuned version performs better on a task related to that second dataset. The same idea works well for transformers too.
Fine-tune Gemma models in Keras using LoRA (colab notebook) or the doc version
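The linked notebook walks through the full workflow. Roughly, it looks like the sketch below (abbreviated here; the preset name, hyperparameters, and toy dataset are taken from or modeled on the tutorial, and you should follow the notebook itself for a working version that downloads the Gemma weights):

```python
import keras
import keras_nlp

# load a pretrained Gemma model (preset name as used in the linked tutorial)
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# LoRA freezes the original weights and learns small low-rank updates instead,
# which makes instruction fine-tuning far cheaper than updating every parameter
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.preprocessor.sequence_length = 256

# a toy instruction/response dataset; a real run would use thousands of examples
data = [
    "Instruction:\nWhat is LoRA?\n\n"
    "Response:\nA parameter-efficient way to fine-tune large models."
]

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.AdamW(learning_rate=5e-5, weight_decay=0.01),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=1, batch_size=1)
```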
Reinforcement learning from human feedback¶
Reinforcement learning is a kind of machine learning in which an agent learns a policy that maps states to actions so as to maximize its accumulated reward. The "reinforcement" refers to the fact that the agent's behavior is guided by the reward: rewards received for desired behavior reinforce the learned policy, so the agent tends to choose actions that lead to greater rewards. For LLMs, human-in-the-loop feedback supplies exactly this kind of reward signal, which is where the "human feedback" in RLHF comes from.
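One central ingredient in RLHF is a reward model trained on human preference data: annotators choose which of two candidate responses they prefer, and the reward model is trained so that the preferred response scores higher. Below is a minimal NumPy sketch of that pairwise (Bradley-Terry-style) loss; the function name and toy scores are just illustrative, not from any particular implementation.

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    # pairwise loss commonly used to train reward models:
    # the loss shrinks as the human-preferred response out-scores the rejected one
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

# toy scores a reward model might assign to two candidate responses
print(preference_loss(reward_chosen=2.0, reward_rejected=0.5))  # small loss: agrees with the human
print(preference_loss(reward_chosen=0.5, reward_rejected=2.0))  # large loss: disagrees with the human
```

The trained reward model then provides the reward signal that a reinforcement learning algorithm (commonly PPO) uses to update the LLM's policy.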
Few-shot learning¶
Unlike the previous two techniques, which improve LLM training, few-shot learning does not change the model parameters at all: it simply employs a prompting strategy that provides examples in the prompt so that the LLM can "learn" the task through its context window.
Language Models are Few-Shot Learners: "While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches."
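Concretely, a few-shot prompt just packs a handful of labeled examples into the context before the new input. A small illustrative sketch (the task and examples are made up):

```python
# a few labeled examples, followed by the input we actually want classified
examples = [
    ("The food was amazing and the staff were friendly.", "positive"),
    ("I waited an hour and my order was wrong.", "negative"),
    ("Great atmosphere, will definitely come back!", "positive"),
]
query = "The music was too loud to hold a conversation."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # send this string to an LLM; no model parameters change
```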
Chain-of-thought prompting¶
Another interesting strategy is to include the intermediate reasoning steps in the prompt's examples, a kind of few-shot learning that takes the idea one step further (see the sketch after the excerpts below).
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: "We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier."
Tree of Thoughts: Deliberate Problem Solving with Large Language Models: "ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices."
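To make the difference concrete, here is a chain-of-thought exemplar adapted from the style of examples in the paper: the prompt demonstrates the intermediate reasoning, not just the final answer, so the model is nudged to produce its own step-by-step reasoning for the new question.

```python
# a chain-of-thought exemplar spells out the intermediate reasoning steps
cot_prompt = """Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more.
How many apples do they have?
A: They started with 23 apples. They used 20, leaving 23 - 20 = 3.
They bought 6 more, so 3 + 6 = 9. The answer is 9.

Q: Leah had 32 chocolates and her sister had 42. If they ate 35,
how many pieces do they have left?
A:"""

print(cot_prompt)  # the model is expected to continue with its own reasoning steps
```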
Model Parameters¶
Training Compute-Optimal Large Language Models: "By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled."
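As a back-of-the-envelope illustration of that "scale them equally" finding: the starting numbers below are hypothetical, and the roughly 20-tokens-per-parameter ratio is a commonly quoted rule of thumb derived from the paper's compute-optimal ("Chinchilla") results.

```python
# hypothetical starting point: a compute-optimal 10B-parameter model
params = 10e9
tokens = 20 * params          # ~20 training tokens per parameter (rule of thumb)

# doubling the model size means doubling the training tokens too
for _ in range(3):
    params *= 2
    tokens *= 2
    print(f"{params / 1e9:.0f}B parameters -> {tokens / 1e9:.0f}B training tokens")
```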
Training language models to follow instructions with human feedback: "Making language models bigger does not inherently make them better at following a user's intent....In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.... Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent."
Energy and Policy Considerations for Deep Learning in NLP reports the following estimates:
| Model | Date of original paper | Energy consumption (kWh) | Carbon footprint (lbs of CO2e) | Cloud compute cost (USD) |
|---|---|---|---|---|
| GPT-2 | Feb, 2019 | - | - | $12,902-$43,008 |
| Transformer (213M parameters) | Jun, 2017 | 201 | 192 | $289-$981 |
| BERT (110M parameters) | Oct, 2018 | 1,507 | 1,438 | $3,751-$12,571 |
| Transformer (65M parameters) | Jun, 2017 | 27 | 26 | $41-$140 |
| ELMo | Feb, 2018 | 275 | 262 | $433-$1,472 |
| Transformer (213M parameters) w/ neural architecture search | Jan, 2019 | 656,347 | 626,155 | $942,973-$3,201,722 |
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜: "In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models."
Looking ahead...¶
Now that we have access to LLMs, what's next?