Make-Up Homework
In this homework, we will look at character-level language models. These models take a sequence of characters as input and predict the next character.
This assignment, as with all of the homework assignments, should be completed individually without sharing solutions, models, or ideas with other students. See details at the bottom of this page.
Starter Code and Data
Starter code for this assignment is available here. The starter code zip for this assignment includes the dataset you will need.
For this assignment, we will use the transcripts of several speeches by Barack Obama. (We chose Obama because more of his speech transcripts are publicly available than for most other presidents.) We’ll be using a simplified character set consisting of only 26 letters (we do not distinguish between capital and lower-case letters), a space, and a period.
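For reference, you can think of this reduced character set as a fixed 28-character string. The ordering below is illustrative only; the starter code defines the canonical mapping. The code sketches later on this page refer to it as vocab:

# Illustrative only; the canonical vocabulary lives in the starter code.
vocab = 'abcdefghijklmnopqrstuvwxyz .'
assert len(vocab) == 28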
Inside models.py you will find the LanguageModel class. This is an abstract class representing the language modeling functionality you will be developing. Our language models work by predicting 28 probabilities – one for each character in our reduced character set. Effectively, the language model is a classifier over our character set.
The LanguageModel class has two methods. predict_all takes a string s of length n and predicts the next character for each prefix s[:i] for i from 0 to n. Note that this includes predicting next characters for both the empty string '' and the whole string s, so this method should return n+1 values. The predict_next method takes a string and predicts just the next character for the entire string. This method is already implemented; it simply calls predict_all.
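As a rough sketch, the interface looks something like this. The shape convention (a column of 28 log-probabilities per prefix, matching the one-hot encoding described later) is an assumption on our part; models.py has the authoritative definitions:

class LanguageModel:
    def predict_all(self, some_text):
        # Returns a (28, len(some_text)+1) tensor of log-probabilities:
        # one next-character distribution per prefix of some_text.
        raise NotImplementedError

    def predict_next(self, some_text):
        # The last column of predict_all is the distribution over the
        # character following the full string.
        return self.predict_all(some_text)[:, -1]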
As a starting point, the starter code includes a fully implemented Bigram model and an AdjacentLanguageModel that favors characters close in the alphabet to the current character. These models provide baselines against which you can compare your network, and they are useful for testing your work.
You will also find utils.py, which includes some useful functionality. The SpeechDataset class loads the speech data, and the one_hot function converts a string into a one-hot encoding of characters with shape (28, len(s)). You can create a one-hot encoded dataset by calling SpeechDataset('data/train.txt', transform=one_hot).
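For example, a quick sanity check of the data pipeline might look like the following. The exact return type of indexing is an assumption here; check utils.py:

from .utils import SpeechDataset, one_hot

# Hypothetical sanity check; run from within the homework package.
data = SpeechDataset('data/train.txt', transform=one_hot)
x = data[0]
print(x.shape)  # expected: torch.Size([28, <length of the first utterance>])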
The parts of this homework are largely independent, so feel free to skip around and complete them in whatever order you like.
Log-likelihood of Text (10 pts)
In this section, you will implement log_likelihood in language.py. This function should take a model and a string as input and return the log-likelihood of generating that string under that model. You can use Bigram or AdjacentLanguageModel to test your function:
python -m homework.language -m Bigram
Hints:
- Remember that the language model can take an empty string as input.
- Remember that the model’s predict_all method returns the log-probabilities of the next character for each prefix of the string.
- The log-likelihood is the sum of the log-likelihoods of the individual characters, not the average or the product.
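Putting the hints together, a minimal sketch might look like the following. It assumes predict_all returns a (28, n+1) tensor of log-probabilities as described above and reuses one_hot from utils.py; treat it as an outline, not the definitive implementation:

from .utils import one_hot

def log_likelihood(model, some_text):
    # (28, n+1) log-probabilities; drop the last column, which predicts
    # the character *after* the end of the string.
    log_probs = model.predict_all(some_text)[:, :-1]
    # one_hot(some_text) has shape (28, n); multiplying and summing picks
    # out the log-probability of each actual character in the string.
    return float((one_hot(some_text) * log_probs).sum())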
Generating Text (10 pts)
Now, we will implement sample_random in language.py. This function takes a language model as input and generates a string by repeatedly sampling according to the distribution produced by the model. (This is what we called “sequential sampling” in class.) The model should stop generating when max_length characters are produced or when the model produces a period.
Hint: take a look at torch.distributions for some useful functionality related to sampling from distributions. The Categorical class may be particularly useful.
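A minimal sketch, assuming vocab is the 28-character vocabulary string and predict_next returns a length-28 tensor of log-probabilities:

import torch

def sample_random(model, max_length=100):
    s = ''
    while len(s) < max_length:
        # Categorical accepts unnormalized log-probabilities as logits.
        dist = torch.distributions.Categorical(logits=model.predict_next(s))
        c = vocab[int(dist.sample())]  # vocab: assumed 28-character string
        s += c
        if c == '.':
            break
    return s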
Note that the text you generate may not look very good. Here is one output sampled from the temporal convolutional network in the master solution: “some of a backnown my but or the understand thats why weve hardships not around work since there one they will be begin with consider daughters some as more a new but jig go atkeeral westedly.”
Beam Search (20 pts)
Here we will implement a beam search in the beam_search function in language.py. Once again, the search terminates when the sequence reaches max_length characters or when a period is generated. The beam search should keep the top beam_size results at each step and return the top n_results sequences at the end of the search. Sequences should be scored by one of two criteria: their overall log-likelihood (as in log_likelihood) if average_log_likelihood=False, or the average per-character log-likelihood if average_log_likelihood=True. In general, the per-character average will result in longer sequences, while the overall log-likelihood will terminate after a relatively small number of words.
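At a high level, the search can be organized around two heaps of scored candidates, as in the simplified outline below. It leans on the TopNHeap helper described in the hint that follows, assumes TopNHeap exposes an add() method and an elements list, and re-scores each candidate from scratch with log_likelihood, which is slow but keeps the sketch short:

def beam_search(model, beam_size, n_results=10, max_length=100,
                average_log_likelihood=False):
    def score(s):
        ll = float(log_likelihood(model, s))
        return ll / len(s) if average_log_likelihood else ll

    beam = ['']
    finished = TopNHeap(n_results)  # best terminated sequences seen so far
    for _ in range(max_length):
        expansions = TopNHeap(beam_size)
        for s in beam:
            for c in vocab:  # vocab: assumed 28-character string
                t = s + c
                if c == '.' or len(t) >= max_length:
                    finished.add((score(t), t))
                else:
                    expansions.add((score(t), t))
        beam = [s for _, s in expansions.elements]
    return [s for _, s in sorted(finished.elements, reverse=True)]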
Hint: Inside language.py you will find TopNHeap. This class will be useful for keeping track of the best sequences during your search.
Here are a couple of samples from the master solution with average_log_likelihood=False:
thats.
today.
in.
now.
And here are a few with average_log_likelihood=True
:
and we will continue to make sure that we will continue to the united states of american.
and we will continue to make sure that we will continue to the united states of the united states.
and we will continue to make sure that we will continue to the united states of america.
and thats why were going to make sure that will continue to the united states of america.
Temporal Convolutional Model (20 pts)
In this section you will define a temporal convolutional network (TCN) to use for language modeling. For this, you will first need to define the CausalConv1dBlock, which combines a causal convolution with a nonlinearity. The main TCN model should use a stack of causal blocks with dilation to build a complete network. At the end of the model, you’ll want to add a convolutional layer with a window size of 1 to produce the output. Then TCN.predict_all should use TCN.forward to predict the log-probabilities for an input sequence.
Hints:
- Make sure TCN.forward works with batches of data.
- Make sure TCN.predict_all returns log-probabilities, not logits.
- Store the distribution over the first character in the sentence as a parameter of the model using torch.nn.Parameter.
- Try to keep the model relatively small. The master solution takes about 15 minutes to train on a pretty good GPU. Yours may take a little longer, but it shouldn’t need hours of training.
- Try adding a residual connection to the block, as in the sketch after this list.
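As a concrete starting point for the block, here is a minimal sketch. The choice of ReLU and the residual handling are assumptions, not requirements:

import torch

class CausalConv1dBlock(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        # Left-only padding makes the convolution causal: the output at
        # position t only depends on inputs at positions <= t.
        self.pad = torch.nn.ConstantPad1d(((kernel_size - 1) * dilation, 0), 0)
        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size, dilation=dilation)

    def forward(self, x):
        out = torch.relu(self.conv(self.pad(x)))
        # Optional residual connection (per the hint above); only valid
        # when the input and output shapes match.
        return out + x if out.shape == x.shape else out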
TCN Training (40 pts)
Train your TCN using train.py. You may want to reuse some training code from previous assignments.
Hint: You may find SGD works better than Adam in this case, but you may also need a large learning rate (around 0.1).
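For example, an optimizer setup along the lines of the hint might be (the momentum value is a guess, not part of the assignment):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)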
Grading
The validation grader can be run with
python -m grader homework -v
Note that this grader uses a different set of test data than our final grader, so there may be some difference between the two scores. The data distributions are the same, though, so your grade should be similar between the two graders unless your model is massively overfit to the validation set.
Submission
Once you are ready to submit, create a bundle by running
python bundle.py homework <eid>
then submit the resulting zip file on Canvas. Note that the grader has a maximum file size limit of 20MB. You shouldn’t run into this limit unless your models are much larger than they need to be. You can check that your homework was bundled properly by grading it again with
python -m grader <eid>.zip
Online Grader
The online grading system uses a slightly modified version of Python, so please make sure your code follows these rules:
- Do not use exit or sys.exit, as this will likely crash the grader.
- Do not try to access files except for the ones provided with the homework zip file. Writing files is disabled.
- Network access is disabled in the grader. Make sure you are reading your dataset from the data folder rather than from a network connection somewhere.
- Do not fork or spawn new processes.
- print and sys.stdout.write are ignored by the grader.
Honor Code
You should do this homework individually. You are allowed to discuss high-level ideas and general structure with each other, but not specific details about code, architecture, or hyperparameters. You may consult online sources, but don’t copy code directly from any posts you find. Cite any ideas you take from online sources in your code (include the full URL where you found the idea). You may refer to your own solutions or the master solutions to prior homework assignments from this class as well as any iPython notebooks from this class. Do not put your solution in a public place (e.g., a public GitHub repo).
Acknowledgements
This assignment is very lightly modified from one created by Philipp Krähenbühl.