Recurrent Neural Networks

RNN Theory

We introduce recurrent neural networks (RNNs), a neural network architecture designed for machine learning on sequential data.

What are RNNs?

  • A type of artificial neural network designed for processing sequences of data.
  • Unlike traditional neural networks, RNNs have connections that form directed cycles, allowing information to persist.

The above figure shows an RNN architecture. The block \(A\) can be viewed as one stage of the RNN. \(A\) accepts as input \(x_t\) and outputs a value \(\hat{y}_t\). The loop with the hidden state \(h_{t-1}\) illustrates how information passes from one step of the network to the next.

Why Use RNNs?

  • Sequential Data: Ideal for tasks where data points are dependent on previous ones.
  • Memory: Capable of retaining information from previous inputs, making them powerful for tasks involving context. This is achieved through hidden states and feedback loops.

RNN Applications

  • Natural Language Processing (NLP): Language translation, sentiment analysis, and text generation.
  • Speech Recognition: Converting spoken language into text.
  • Time Series Prediction: Forecasting stock prices, weather, and other temporal data.
  • Music Generation: Creating a sequence of musical notes.

Outline

  1. Basic RNN Architecture
  2. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)
  3. Examples

RNN Architecture

Forward Propagation

The forward pass in the RNN architecture is given by the following operations

  1. \(h_t = g_h(W_{hh} h_{t-1} + W_{hx} x_t + b_h)\)
  2. \(\hat{y}_t = g_y(W_{yh} h_t + b_y)\)

The vector \(x_t\) is the \(t\)-th element of the input sequence. The vector \(h_t\) is the hidden state at the \(t\)-th point of the sequence. The dimension of \(h_t\) is a hyperparameter of the RNN.

The vector \(\hat{y}_t\) is the \(t\)-th output of the sequence. The functions \(g_h\) and \(g_y\) are nonlinear activation functions.

The weight matrices \(W_{hh}\), \(W_{hx}\), \(W_{yh}\), and biases \(b_h\), \(b_y\) are trainable parameters that are reused in each time step. Note that we must define the vector \(h_0\). A common choice is \(h_0 = 0\).
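
To make these operations concrete, below is a minimal sketch of the forward pass in PyTorch. The dimensions, the random parameters, and the choice of tanh for \(g_h\) (with an identity \(g_y\)) are arbitrary assumptions for illustration only.

import torch

torch.manual_seed(0)

# Example dimensions (arbitrary choices for illustration)
input_size, hidden_size, output_size, seq_len = 3, 5, 2, 4

# Trainable parameters, reused at every time step
W_hh = torch.randn(hidden_size, hidden_size)
W_hx = torch.randn(hidden_size, input_size)
b_h = torch.zeros(hidden_size)
W_yh = torch.randn(output_size, hidden_size)
b_y = torch.zeros(output_size)

x = torch.randn(seq_len, input_size)   # input sequence x_1, ..., x_T
h = torch.zeros(hidden_size)           # common choice h_0 = 0

outputs = []
for t in range(seq_len):
    # h_t = g_h(W_hh h_{t-1} + W_hx x_t + b_h), with g_h = tanh
    h = torch.tanh(W_hh @ h + W_hx @ x[t] + b_h)
    # y_hat_t = g_y(W_yh h_t + b_y), with g_y taken as the identity here
    outputs.append(W_yh @ h + b_y)

y_hat = torch.stack(outputs)           # shape (seq_len, output_size)
print(y_hat.shape)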

RNN Cell

Model Types

Depending on the application, there may be a varying number of outputs \(\hat{y}_t\). The figure below illustrates how the architecture can change.

  1. One-to-one, i.e., a regular feed-forward neural network.
  2. One-to-many, e.g., image captioning (1 image and output a sequence of words).
  3. Many-to-one, e.g., sentiment analysis from a sequence of words or stock price prediction.
  4. Many-to-many, e.g., machine translation or video frame-by-frame action classification.

Our subsequent illustrations of RNNs all display many-to-many architectures.

Stacked RNNs

It is also possible to stack multiple RNNs on top of each other. This is illustrated in the figure below.

Each layer has its own set of weights and biases.
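
As a brief sketch, stacking is usually expressed through a layer-count argument; in PyTorch this is the num_layers parameter of nn.RNN (or nn.LSTM). The sizes below are arbitrary choices for illustration.

import torch
import torch.nn as nn

# Two stacked RNN layers; each layer has its own weights and biases
stacked_rnn = nn.RNN(input_size=3, hidden_size=8, num_layers=2, batch_first=True)

x = torch.randn(1, 10, 3)   # (batch, seq_len, input_size)
out, h_n = stacked_rnn(x)
print(out.shape)            # (1, 10, 8): outputs of the top layer at every time step
print(h_n.shape)            # (2, 1, 8): final hidden state of each layer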

Loss Calculation for Sequential Data

For each prediction in the sequence \(\hat{y}_t\) and corresponding true value \(y_t\) we calculate the loss \(\mathcal{L}_t(y_t, \hat{y}_t)\).

The total loss is the sum of all the individual losses, \(\mathcal{L} = \sum_t \mathcal{L}_t\). This is illustrated in the figure above.

For categorical data we use the cross-entropy loss \(\mathcal{L}_t(y_t, \hat{y}_t) = -\sum_k y_{t,k}\log(\hat{y}_{t,k})\), where the sum runs over the output classes.

For continuous data we use a mean-square error loss.
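
As a small sketch with made-up per-step predictions and targets, the total sequence loss could be accumulated as follows; for categorical outputs one would swap the MSE criterion for a cross-entropy criterion.

import torch
import torch.nn as nn

criterion = nn.MSELoss()   # continuous targets; use nn.CrossEntropyLoss() for categorical data

# Made-up example: predictions and true values for a sequence of length T
T = 5
y_hat = [torch.randn(1) for _ in range(T)]
y = [torch.randn(1) for _ in range(T)]

# Total loss is the sum of the per-step losses L_t(y_t, y_hat_t)
total_loss = sum(criterion(y_hat_t, y_t) for y_hat_t, y_t in zip(y_hat, y))
print(total_loss)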

Vanishing Gradients

The key idea of RNNs is that earlier items in the sequence influence the more recent outputs. This is highlighted in blue in the figure below.

However, for longer sequences the influence can be significantly reduced. This is illustrated in red in the figure below.

Due to the potential for long sequences of data, the gradient at a given step is multiplied by one Jacobian factor for each earlier step during backpropagation through time. If each factor scales the gradient by roughly 0.9, then after 100 steps the contribution is scaled by \(0.9^{100} \approx 2.7\times 10^{-5}\). As a result, RNNs are prone to vanishing gradients.
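
The effect can be seen empirically with a small experiment: run an untrained RNN over a long sequence and compare the gradient of the final output with respect to the first and last inputs. The sequence length, hidden size, and seed below are arbitrary, and the exact values depend on the random initialization, but the early-step gradient is typically orders of magnitude smaller.

import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, hidden_size = 200, 16

rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
x = torch.randn(1, seq_len, 1, requires_grad=True)

out, _ = rnn(x)
out[:, -1, :].sum().backward()   # backpropagate from the final output

grads = x.grad[0, :, 0].abs()
print(f"gradient magnitude w.r.t. x_1: {grads[0].item():.2e}")
print(f"gradient magnitude w.r.t. x_T: {grads[-1].item():.2e}")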

RNN Limitations

  • Vanishing Gradients:
    • During training, gradients can become very small (vanish), making it difficult for the network to learn long-term dependencies.
  • Long-Term Dependencies:
    • RNNs struggle to capture dependencies that are far apart in the sequence, leading to poor performance on tasks requiring long-term memory.
  • Difficulty in Parallelization:
    • The sequential processing of data in RNNs makes it challenging to parallelize training, leading to slower training times.
  • Limited Context:
    • Standard RNNs have a limited ability to remember context over long sequences, which can affect their performance on complex tasks.
  • Bottleneck Problem:
    • The hidden state is the only carrier of past information, so its fixed size can limit how expressive the network can be.

Addressing RNN Limitations

Some proposed solutions for mitigating the previous issues are

  • LSTM (Long Short-Term Memory)
  • GRU (Gated Recurrent Units)

These are more advanced variants of RNNs designed to address some of these limitations by improving the ability to capture long-term dependencies and addressing gradient issues.

LSTM

Below is an illustration of the LSTM architecture.


The new components are:

  • The input cell state \(c_{t-1}\) and output cell state \(c_{t}\).
  • Yellow blocks consisting of neural network layers with sigmoid \(\sigma\) or tanh activation functions.
  • Red circles indicating point-wise operations.

Cell State

The cell state \(c_{t-1}\) is the input to the LSTM block.

This value moves through the LSTM block, where it is modified by point-wise multiplication and addition interactions.

After these operations, the modified cell state \(c_t\) is sent to the next LSTM block. In addition, a tanh of \(c_t\), gated by the output layer described below, is used to form the new hidden state \(h_t\).

Forget Gate Layer

The forget gate layer computes the value

\[ f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \]

where \(\sigma(x) = \frac{1}{1 + e^{-x}}\) is the sigmoid function.

The output \(f_t\) is a vector of numbers between 0 and 1. These values multiply the corresponding vector values in \(c_{t-1}\).

A value of 0 says throw that component of \(c_{t-1}\) away. A value of 1 says keep that component of \(c_{t-1}\).

This operation tells us which old information in the cell state we should keep or remove.

Input Gate Layer

The input gate layer computes values

\[ \begin{align} i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i), \\ \tilde{c}_{t} &= \operatorname{tanh}(W_c[h_{t-1}, x_t] + b_c). \end{align} \]

The value that is added to the cell state \(c_{t-1}\) is \(i_t \cdot \tilde{c}_t\).

This value tells us what new information to add to the cell state.

At this stage of the process, the cell state now has the formula

\[ c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t. \]

Output Layer

The output gate layer computes values

\[ \begin{align} o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o), \\ h_{t} &= o_t \cdot \operatorname{tanh}(c_t). \end{align} \]

The vector \(o_t\) from the sigmoid layer tells us what parts of the cell state we use for output. The output is a filtered version of the hyperbolic tangent of cell state \(c_t\).

The output of the block is \(\hat{y}_t\). It is the same as the hidden state \(h_t\) that is sent to the \((t+1)\)-th LSTM block.
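
To tie the gate equations together, here is a minimal single-step LSTM cell sketch written directly from the formulas above. The dimensions and random parameters are arbitrary assumptions for illustration; in practice one would use a library implementation such as nn.LSTM.

import torch

torch.manual_seed(0)
input_size, hidden_size = 3, 4

# One weight matrix and bias per gate, each acting on the concatenation [h_{t-1}, x_t]
W_f = torch.randn(hidden_size, hidden_size + input_size); b_f = torch.zeros(hidden_size)
W_i = torch.randn(hidden_size, hidden_size + input_size); b_i = torch.zeros(hidden_size)
W_c = torch.randn(hidden_size, hidden_size + input_size); b_c = torch.zeros(hidden_size)
W_o = torch.randn(hidden_size, hidden_size + input_size); b_o = torch.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    hx = torch.cat([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f @ hx + b_f)      # forget gate
    i_t = torch.sigmoid(W_i @ hx + b_i)      # input gate
    c_tilde = torch.tanh(W_c @ hx + b_c)     # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = torch.sigmoid(W_o @ hx + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)              # new hidden state (also the output y_hat_t)
    return h_t, c_t

h, c = torch.zeros(hidden_size), torch.zeros(hidden_size)
h, c = lstm_step(torch.randn(input_size), h, c)
print(h.shape, c.shape)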

Weight Summary

For the LSTM architecture, we have the following sets of weights

  • \(W_f\) and \(b_f\) (forget layer),
  • \(W_i, W_c\) and \(b_i, b_c\) (input gate layer),
  • \(W_o\) and \(b_o\) (output layer).

Advantages of LSTMs

  • Long-term Dependencies: LSTMs can capture long-term dependencies in sequential data, making them effective for tasks like language modeling and time series prediction.
  • Avoiding Vanishing Gradient: The architecture of LSTMs helps mitigate the vanishing gradient problem, which is common in traditional RNNs.
  • Flexibility: LSTMs can handle variable-length sequences and are versatile for different types of sequential data.
  • Memory: They have a memory cell that can maintain information over long periods, which is crucial for tasks requiring context retention.

Disadvantages of LSTMs

  • Complexity: LSTMs are more complex than simpler models like traditional RNNs, leading to longer training times and higher computational costs.
  • Overfitting: Due to their complexity, LSTMs are prone to overfitting, especially with small datasets.
  • Resource Intensive: They require more computational resources and memory, which can be a limitation for large-scale applications.
  • Hyperparameter Tuning: LSTMs have many hyperparameters that need careful tuning, which can be time-consuming and challenging.
  • Bottleneck Problem: The hidden and cell states are the only carriers of past information, and their fixed size can limit how expressive the model can be.

GRU

A variant of the LSTM architecture is the Gated Recurrent Unit (GRU).

It combines the forget and input gates into a single update gate \(z_t\) and merges the cell state and hidden state. A reset gate \(r_t\) controls how much of the previous hidden state enters the candidate state \(\tilde{h}_t\). The operations that are now performed are given below:

\[ \begin{align} z_t &= \sigma(W_z[h_{t-1}, x_t] + b_z),\\ r_t &= \sigma(W_r[h_{t-1}, x_t] + b_r), \\ \tilde{h}_t &= \operatorname{tanh}(W_{\tilde{h}}[r_t \cdot h_{t-1}, x_t] + b_{\tilde{h}}), \\ h_t &= (1- z_t)\cdot h_{t-1} + z_t\cdot \tilde{h}_t. \end{align} \]
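
A corresponding single-step GRU sketch, again with arbitrary dimensions and random parameters purely for illustration:

import torch

torch.manual_seed(0)
input_size, hidden_size = 3, 4

W_z = torch.randn(hidden_size, hidden_size + input_size); b_z = torch.zeros(hidden_size)
W_r = torch.randn(hidden_size, hidden_size + input_size); b_r = torch.zeros(hidden_size)
W_h = torch.randn(hidden_size, hidden_size + input_size); b_h = torch.zeros(hidden_size)

def gru_step(x_t, h_prev):
    hx = torch.cat([h_prev, x_t])                                     # [h_{t-1}, x_t]
    z_t = torch.sigmoid(W_z @ hx + b_z)                               # update gate
    r_t = torch.sigmoid(W_r @ hx + b_r)                               # reset gate
    h_tilde = torch.tanh(W_h @ torch.cat([r_t * h_prev, x_t]) + b_h)  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                         # new hidden state

h = gru_step(torch.randn(input_size), torch.zeros(hidden_size))
print(h.shape)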

Advantages of GRUs

  • Simpler Architecture: GRUs have a simpler architecture compared to LSTMs, with fewer gates, making them easier to implement and train.
  • Faster Training: Due to their simpler structure, GRUs often train faster and require less computational power than LSTMs.
  • Effective Performance: GRUs can perform comparably to LSTMs on many tasks, especially when dealing with shorter sequences.
  • Less Prone to Overfitting: With fewer parameters, GRUs are generally less prone to overfitting compared to LSTMs.

Disadvantages of GRUs

  • Less Expressive Power: The simpler architecture of GRUs might not capture complex patterns in data as effectively as LSTMs.
  • Limited Long-term Dependencies: GRUs might struggle with very long-term dependencies in sequential data compared to LSTMs.
  • Less Flexibility: GRUs offer less flexibility in terms of controlling the flow of information through the network.
  • Bottleneck Problem: Like vanilla RNNs, the hidden state is the only carrier of past information and can limit how expressive the model can be.

RNN Examples

Appliance Energy Prediction

In this example we will use a dataset containing the energy usage in Watt hours (Wh) of appliances in a low energy building.

This is another example of time series analysis.

In addition to the appliance energy information, the dataset includes the house temperature and humidity conditions. We will build an LSTM model that predicts the energy usage of the appliances.

Load the Data

The code cells below load the dataset.

# Imports used in this and the following code cells
import os

import matplotlib.pyplot as plt
import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import Dataset, DataLoader

file_path = 'energydata_complete.csv'
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv"

if os.path.exists(file_path):
    data = pd.read_csv(file_path)
else:
    data = pd.read_csv(url)
    data.to_csv(file_path, index=False)

data.head()
date Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 ... T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2
0 2016-01-11 17:00:00 60 30 19.89 47.596667 19.2 44.790000 19.79 44.730000 19.000000 ... 17.033333 45.53 6.600000 733.5 92.0 7.000000 63.000000 5.3 13.275433 13.275433
1 2016-01-11 17:10:00 60 30 19.89 46.693333 19.2 44.722500 19.79 44.790000 19.000000 ... 17.066667 45.56 6.483333 733.6 92.0 6.666667 59.166667 5.2 18.606195 18.606195
2 2016-01-11 17:20:00 50 30 19.89 46.300000 19.2 44.626667 19.79 44.933333 18.926667 ... 17.000000 45.50 6.366667 733.7 92.0 6.333333 55.333333 5.1 28.642668 28.642668
3 2016-01-11 17:30:00 50 40 19.89 46.066667 19.2 44.590000 19.79 45.000000 18.890000 ... 17.000000 45.40 6.250000 733.8 92.0 6.000000 51.500000 5.0 45.410389 45.410389
4 2016-01-11 17:40:00 60 40 19.89 46.333333 19.2 44.530000 19.79 45.000000 18.890000 ... 17.000000 45.40 6.133333 733.9 92.0 5.666667 47.666667 4.9 10.084097 10.084097

5 rows × 29 columns

Resample Data

We’re interested in the Appliances column, which is the energy use of the appliances in Wh.

First, we’ll resample the data to hourly resolution and fill missing values using the forward fill method.

Code
data['date'] = pd.to_datetime(data['date'])
data.set_index('date', inplace=True)

data = data['Appliances'].resample('h').mean().ffill() # Resample and fill missing

data.head()
date
2016-01-11 17:00:00     55.000000
2016-01-11 18:00:00    176.666667
2016-01-11 19:00:00    173.333333
2016-01-11 20:00:00    125.000000
2016-01-11 21:00:00    103.333333
Freq: h, Name: Appliances, dtype: float64

Preparing the Data

We create train-test splits and scale the data accordingly.

In addition, we create our own dataset class. In this class we create lagged sequences of the energy usage data. The amount of lag, or look-back sequence length, is set to 24.

Code
train_size = int(len(data) * 0.8)
test_size = len(data) - train_size

train_data = data[:train_size]
test_data = data[train_size:]

# Normalize data
scaler = MinMaxScaler()
train_data_scaled = scaler.fit_transform(train_data.values.reshape(-1, 1))
test_data_scaled = scaler.transform(test_data.values.reshape(-1, 1))

# Prepare data for LSTM
class TimeSeriesDataset(Dataset):
    def __init__(self, data, seq_length):
        self.data = data
        self.seq_length = seq_length

    def __len__(self):
        return len(self.data) - self.seq_length

    def __getitem__(self, index):
        X = self.data[index:index + self.seq_length]
        y = self.data[index + self.seq_length]
        return torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

seq_length = 24
train_dataset = TimeSeriesDataset(train_data_scaled, seq_length)
test_dataset = TimeSeriesDataset(test_data_scaled, seq_length)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

LSTM Architecture

We define our LSTM architecture in the code cell below.

# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, input_size=1, hidden_size=16, output_size=1):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.fc(x[:, -1, :])  # Use the output of the last time step
        return x

model = LSTMModel()
print(model)
LSTMModel(
  (lstm): LSTM(1, 16, batch_first=True)
  (fc): Linear(in_features=16, out_features=1, bias=True)
)

Loss and Training

We use a mean-square error loss function and the Adam optimizer. We train the model for 20 epochs and display the decrease in the loss during training. Based on the plot, did the optimizer converge?

Code
# Set criterion and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

epochs = 20
train_losses = []
for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for X, y in train_loader:
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
    train_losses.append(train_loss)

# Plot the training losses over epochs
plt.figure(figsize=(10, 5))
plt.plot(range(1, epochs + 1), train_losses, marker='o', linestyle='-', color='b')
plt.title('Training Loss over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.xticks(range(1, epochs + 1))  # Set x-tick marks to be integers
plt.grid(True)
plt.show()

Evaluation

We’ll evaluate the model’s performance by plotting the predicted values from the test set on top of the actual test set values.

How did the model perform? Are there any obvious problems?

Code
# Evaluate the model
model.eval()
predictions = []
trues = []
with torch.no_grad():
    for X, y in test_loader:
        preds = model(X)
        predictions.extend(preds.numpy())
        trues.extend(y.numpy())

# Rescale predictions and true to original scale
predictions_rescaled = scaler.inverse_transform(predictions)
true_rescaled = scaler.inverse_transform(trues)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(true_rescaled, label='True Values')
plt.plot(predictions_rescaled, label='Predicted Values', alpha=0.7)
plt.legend()
plt.show()

Other Applications

Andrej Karpathy has an excellent blog post titled The Unreasonable Effectiveness of Recurrent Neural Networks. We will describe some of the applications below.

Shakespeare

Using a 3-layer RNN with 512 hidden nodes on each layer, Karpathy trained a model on the complete works of Shakespeare. With this model he was able to generate text similar to the sample seen on the right.

Observe that there are some typos and the syntax is not perfect, but overall the style appears Shakespearean.

Wikipedia

Using a 100MB dataset of raw Wikipedia data, Karpathy trained RNNs which generated Wikipedia content. An example of the generated markdown is shown to the right.

The text is nonsense, but is structured like a Wikipedia post. Interestingly, the model hallucinated and fabricated a url that does not exist.

Baby Names

Using a list of 8000 baby names from this link, Karpathy trained an RNN to predict baby names.

To the right is a list of generated names not on the lists. More of the names can be found here.

Some interesting additional names the model generated were Baby, Char, Mars, Hi, and With.

Recap

We discussed the basic RNN architecture. We then discussed the LSTM and GRU modifications. These modifications allowed the RNNs to handle long-term dependencies.

We then considered an example application of energy consumption prediction.

We also discussed the unreasonable effectiveness of RNNs.

What’s Next?

As we saw, the RNN architecture has evolved to include LSTMs and GRUs. However, the RNN architecture also evolved to add attention. This was introduced by Bahdanau, Cho, and Bengio (2014).

Attention in language models is a mechanism that allows the model to focus on relevant parts of the input sequence by assigning different weights to different words, enabling it to capture long-range dependencies and context more effectively.

In addition, recall that RNNs are only able to process data sequentially. For large scale natural language processing applications, this is a major computational bottleneck. This motivated the development of more advanced neural network architectures that could process sequential data in parallel and utilize attention.

The transformer architecture was introduced by Vaswani et al. (2017) and combines both of these desirable features.

Transformers form the basis for modern large language models (LLMs) and we will discuss them in our NLP lectures.

References

  1. Understanding LSTMs, Colah’s blog, 2015, https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  2. Speech and Language Processing. Daniel Jurafsky & James H. Martin. Draft of January 5, 2024. – Chapter 9, RNNs and LSTMs, https://web.stanford.edu/~jurafsky/slpdraft/9.pdf
  3. The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy, 2015, https://karpathy.github.io/2015/05/21/rnn-effectiveness/

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv Preprint arXiv:1409.0473. http://arxiv.org/abs/1409.0473.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 11. Long Beach, CA, USA. https://arxiv.org/abs/1706.03762.