Language Models¶
Inspired by Jeff Dean's talk, Exciting Trends in Machine Learning.
In this lesson, we'll apply what we know about neural networks toward the study of language models: probabilistic models of a natural (human) language, such as English. The goal of today is to become more conversational in language modeling techniques by studying recent natural language processing methods (rather than applications or implications). By the end of this lesson, students will be able to:
- Explain how unigram and bigram language models determine the probability of the next word.
- Explain how embedding spaces can provide a semantic, distributed representation for concepts.
- Explain the benefits of self-attention mechanisms and hidden states in an RNN.
import os
import random
import re
from collections import Counter
from typing import Any
def clean(token: str, pattern: re.Pattern[str] = re.compile(r"\W+")) -> str:
    """
    Returns all the characters in the token lowercased and without matches to the given pattern.

    >>> clean("Hello!")
    'hello'
    """
    return pattern.sub("", token.lower())
def sample(frequencies: dict[Any, float], k: int = 1) -> list[Any]:
    """
    Returns a list of k randomly sampled keys from the given frequencies with replacement.

    >>> sample({"test": 1})
    ['test']
    >>> sample({"test": 1}, k=3)
    ['test', 'test', 'test']
    """
    return random.choices(list(frequencies), weights=frequencies.values(), k=k)
Statistical models¶
In its simplest form, a language model is similar to the Document class that we defined in the Search assessment, which consisted of a term-frequency dictionary. A unigram language model guesses the next word based on the term-frequency dictionary alone. In the document doggos/doc1.txt, each unique word appears once, so each word has a term frequency of 1.
terms = {
    "dogs": 10,
    "are": 1,
    "the": 1,
    "greatest": 5,
    "pets": 2,
}
sample(terms, 20)
['pets', 'greatest', 'dogs', 'greatest', 'greatest', 'dogs', 'pets', 'greatest', 'dogs', 'greatest', 'dogs', 'dogs', 'pets', 'pets', 'greatest', 'dogs', 'pets', 'dogs', 'dogs', 'dogs']
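As a sketch of where such a dictionary might come from, we could count the cleaned words of an actual document with Counter (imported above). This assumes a file like doggos/doc1.txt from the Search assessment is available in the working directory.

with open("doggos/doc1.txt") as f:
    # Count each cleaned word; a Counter is just a dictionary of term frequencies.
    doc_terms = Counter(clean(word) for word in f.read().split() if clean(word))

sample(doc_terms, 20)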
Unigram models are simple but not particularly useful. For one, there's no notion of context: each word is sampled entirely independently from every other word. Large language models like ChatGPT learn from internet-scale training datasets and consider the preceding words (tokens) when determining the probability of the next word.
A bigram language model, for instance, takes the immediately preceding word to determine the probabilities for the next word. We use <s> to indicate the start token. Modify this code snippet so that each prev term (key) maps to a dictionary counting the number of times each curr term appears immediately after it.
os.chdir("assessments")
terms = {}
for filename in os.listdir("small_wiki"):
if filename.endswith(".html"):
with open(os.path.join("small_wiki", filename)) as f:
words = ["<s>"] + [clean(word) for word in f.read().split() if clean(word)]
for prev, curr in zip(words, words[1:]):
if prev not in terms:
terms[prev] = {}
if curr not in terms[prev]:
terms[prev][curr] = 0
terms[prev][curr] += 1
terms["<s>"] # Should be {"metadata": 70} to indicate all 70 documents start with "metadata"
{'metadata': 70}
terms["the"]
To generate a sequence of a given length, repeatedly use the last word to sample the next word and append it to the result.
n_words = 20
result = "what is the best".split()
for _ in range(n_words):
    last = result[-1]
    result += sample(terms[last])
result
['what', 'is', 'the', 'best', 'selling', 'machines', 'learn', 'and', 'medical', 'procedure', 'that', 'prior', 'to', 'a', 'hrefwikijon_snow_character', 'titlejon', 'heinjon', 'heina', 'appraising', 'music', 'awards1995', 'mtv', 'choosing', 'to']
Word embeddings¶
Unigram and bigram language models are more generally known as n-gram language models. With larger context windows (n), n-gram models can produce more understandable results. But the approach has a fundamental limitation: it's sensitive to the exact words and the order in which they appeared in the training set.
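To illustrate what a larger context window looks like, here is a sketch of a trigram model that keys on the two preceding words. It reuses the pattern of the bigram loop above and assumes words holds the cleaned tokens of one document, as inside that loop; it is not part of the original lesson code.

# A sketch of a trigram model: the context (key) is the pair of preceding words.
trigram_terms: dict[tuple[str, str], dict[str, int]] = {}
for prev2, prev1, curr in zip(words, words[1:], words[2:]):
    context = (prev2, prev1)
    if context not in trigram_terms:
        trigram_terms[context] = {}
    if curr not in trigram_terms[context]:
        trigram_terms[context][curr] = 0
    trigram_terms[context][curr] += 1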
Let's address the first problem, learning word meaning: how might a computer even learn the meaning of a word? Strings in Python are sequences of characters where each character is just a number. Or, if you recall how we handled the city names "NY" and "SF" to predict the location of a home in model evaluation, strings could also be represented as "dummy variables" or boolean categories.
|   | beds | bath | year_built | sqft | price_per_sqft | elevation | city_NY | city_SF |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1.0 | 1960 | 1000 | 999 | 10 | True | False |
| 1 | 2.0 | 2.0 | 2006 | 1418 | 1939 | 0 | True | False |
| 2 | 2.0 | 2.0 | 1900 | 2150 | 628 | 9 | True | False |
| 3 | 1.0 | 1.0 | 1903 | 500 | 1258 | 9 | True | False |
| 4 | 0.0 | 1.0 | 1930 | 500 | 878 | 10 | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 487 | 5.0 | 2.5 | 1890 | 3073 | 586 | 76 | False | True |
| 488 | 2.0 | 1.0 | 1923 | 1045 | 665 | 106 | False | True |
| 489 | 3.0 | 2.0 | 1922 | 1483 | 1113 | 106 | False | True |
| 490 | 1.0 | 1.0 | 1983 | 850 | 764 | 163 | False | True |
| 491 | 3.0 | 2.0 | 1956 | 1305 | 762 | 216 | False | True |
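In the same spirit, here is a sketch (not part of the original lesson) of what dummy variables would look like for words themselves: every word gets its own boolean column, so no two words share any part of their representation.

import pandas as pd

# A sketch: represent words as dummy (one-hot) variables. Note that "dogs" and
# "pets" end up just as different from each other as "dogs" and "the".
pd.get_dummies(pd.Series(["dogs", "are", "the", "greatest", "pets"]))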
Word2vec refers to a technique that can learn a word embedding (semantic representation) by guessing how to fill in the blank from the immediate context. Words that tend to appear in similar contexts probably have similar meanings, so an algorithm can learn the meaning of human words by examining a word's favorite neighbors. Words with similar neighbors should have similar representations. What word could appear in the following blank?
The city of _________ has an oceanic climate.
This enables us to find synonyms, as the TensorFlow Embedding Projector shows. But the authors also pointed out some interesting examples of the learned relationships, such as how, in the embedding space, Paris - France + Italy = Rome.
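As a toy sketch of that arithmetic (the 3-dimensional vectors below are made up for illustration; real word2vec embeddings are learned from data and have hundreds of dimensions), the analogy is vector subtraction and addition followed by a nearest-neighbor search using cosine similarity.

import numpy as np

# Hypothetical embeddings chosen only to illustrate the idea.
embeddings = {
    "paris": np.array([0.9, 0.1, 0.8]),
    "france": np.array([0.9, 0.1, 0.1]),
    "italy": np.array([0.1, 0.9, 0.1]),
    "rome": np.array([0.1, 0.9, 0.8]),
    "dog": np.array([0.5, 0.5, 0.0]),
}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Returns the cosine similarity between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Paris - France + Italy should land closest to Rome in this embedding space.
query = embeddings["paris"] - embeddings["france"] + embeddings["italy"]
max(embeddings, key=lambda word: cosine(embeddings[word], query))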
Recurrent neural networks¶
Although word embeddings produce a semantic representation for the meaning of words, there still remains a challenge of how we might combine these word embeddings to form meaningful sentences. How might we use machine learning to combine word embeddings in a way that is sensitive to context?
Earlier, we learned two ways that neural networks could be used to classify handwritten digit images. But both approaches (scikit-learn MLPClassifier and Keras Conv2D) involved learning weights and biases only from the input pixel values. In language modeling, where the possible sequences of words are infinite, it becomes much harder to train neural network weights and biases to directly handle every possibility.
Recurrent neural networks (RNNs) represent a different way of organizing a neural network by calculating the output of a neuron based not only on the inputs but also on a hidden state that represents previously generated outputs. Unlike hidden layers in a neural network, hidden states provide additional information about previously generated steps to the current step. In other words, an RNN learns to predict the next word based on the current input as well as information obtained from previous words.
For example, we can use recurrent neural networks for language modeling by considering how they might generate a response sequence. Given an input X of one or more words, a recurrent neural network generates an output O one word at a time while considering the hidden state V.
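Here is a minimal sketch of a single recurrent step in NumPy. The weights are random placeholders rather than learned values; the point is that each step combines the current input with the hidden state carried over from previous steps.

import numpy as np

rng = np.random.default_rng(seed=0)
embedding_size, hidden_size = 4, 3

# Placeholder weights; a real RNN learns these values during training.
W_xh = rng.normal(size=(hidden_size, embedding_size))  # input to hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))     # hidden to hidden
b_h = np.zeros(hidden_size)

def rnn_step(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Returns the new hidden state given the current input x and the previous hidden state h."""
    return np.tanh(W_xh @ x + W_hh @ h + b_h)

# Process a sequence of (made-up) word embeddings one step at a time.
sequence = [rng.normal(size=embedding_size) for _ in range(5)]
h = np.zeros(hidden_size)  # initial hidden state
for x in sequence:
    h = rnn_step(x, h)  # the hidden state summarizes everything seen so far
h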
Example code using Keras: character-level text generation with LSTM
Sequence-to-sequence framework¶
The seq2seq framework uses two RNNs to implement "sequence-to-sequence" tasks:
- An encoder that learns a machine representation for an input sequence, or context.
- A decoder that can take a machine representation (context) and decode it to a target sequence.
Originally, these models were used for machine translation, so the problem was framed as reading an input sentence "ABC" and producing "WXYZ" as output. By changing the training set to give the encoder context and have the decoder predict the expected reply, the seq2seq framework can be used to model conversations. How does this framework differ from using a single recurrent neural network for language modeling?
Example code using Keras: character-level sequence-to-sequence modeling. For word-level tasks, we can use word embeddings to provide more information to the model about the meaning of each word.
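For a sense of the wiring, here is a sketch of the encoder-decoder setup in Keras. The vocabulary size and latent dimension below are assumed values, and this is a simplified version of the character-level example linked above rather than its exact code.

from tensorflow import keras
from tensorflow.keras import layers

num_tokens = 64   # assumed vocabulary size (characters or words)
latent_dim = 256  # assumed size of the machine representation (context)

# Encoder: reads the input sequence and keeps only its final states as the context.
encoder_inputs = keras.Input(shape=(None, num_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: starts from the encoder's states and predicts the target sequence.
decoder_inputs = keras.Input(shape=(None, num_tokens))
decoder_outputs = layers.LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=encoder_states
)
decoder_outputs = layers.Dense(num_tokens, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()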
Transformer architecture¶
With more data, the seq2seq approach can produce good results, but at a high computational cost because the hidden state needs to be updated on each step. Recurrent neural networks end up spending significant amounts of resources computing hidden states sequentially.
The transformer architecture addresses this by throwing out the RNN and instead utilizing a mechanism called self-attention to learn context without relying on hidden states. The goal of self-attention is to identify associations or relationships between different words in a sequence. Consider the sentence:
The animal didn't cross the street because it was too tired
What does "it" refer to—the animal or the street? Let's read about why this makes a difference in translation on the Google Research blog on the seminal work "Attention is all you need" and study the Tensor2Tensor Intro notebook to see for ourselves how attention heads learn different relationships between words.
Although the transformer architecture was not the first to introduce the idea of an attention mechanism, it has since become the most popular way to not only use attention but to define language models in general.