Recurrent Neural Networks

RNN Theory

We introduce recurrent neural networks (RNNs), a neural network architecture designed for machine learning on sequential data.

What are RNNs?

  • A type of artificial neural network designed for processing sequences of data.
  • Unlike traditional neural networks, RNNs have connections that form directed cycles, allowing information to persist.

The above figure shows an RNN architecture. The block \(A\) can be viewed as one stage of the RNN. \(A\) accepts as input \(x_t\) and outputs a value \(\hat{y}_t\). The loop with the hidden state \(h_{t-1}\) illustrates how information passes from one step of the network to the next.

Why Use RNNs?

  • Sequential Data: Ideal for tasks where data points are dependent on previous ones.
  • Memory: Capable of retaining information from previous inputs, making them powerful for tasks involving context. This is achieved through hidden states and feedback loops.

RNN Applications

  • Natural Language Processing (NLP): Language translation, sentiment analysis, and text generation.
  • Speech Recognition: Converting spoken language into text.
  • Time Series Prediction: Forecasting stock prices, weather, and other temporal data.
  • Music Generation: Creating a sequence of musical notes.

Outline

  1. Basic RNN Architecture
  2. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)
  3. Examples

RNN Architecture

Forward Propagation

The forward pass in the RNN architecture is given by the following operations

  1. \(h_t = g_h(W_{hh} h_{t-1} + W_{hx} x_t + b_h)\)
  2. \(\hat{y}_t = g_y(W_{yh} h_t + b_y)\)

The vector \(x_t\) is the \(t\)-th element of the input sequence. The vector \(h_t\) is the hidden state at the \(t\)-th point of the sequence. The dimension of \(h_t\) is a hyperparameter of the RNN.

The vector \(\hat{y}_t\) is the \(t\)-th output of the sequence. The functions \(g_h\) and \(g_y\) are nonlinear activation functions.

The weight matrices \(W_{hh}\), \(W_{hx}\), \(W_{yh}\), and biases \(b_h\), \(b_y\) are trainable parameters that are reused in each time step. Note that we must define the vector \(h_0\). A common choice is \(h_0 = 0\).
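
To make these operations concrete, below is a minimal sketch of the forward pass in PyTorch. The dimensions, the random parameters, and the choice of tanh for \(g_h\) (with an identity \(g_y\)) are arbitrary assumptions for illustration only.

import torch

torch.manual_seed(0)

# Example dimensions (arbitrary choices for illustration)
input_size, hidden_size, output_size, seq_len = 3, 5, 2, 4

# Trainable parameters, reused at every time step
W_hh = torch.randn(hidden_size, hidden_size)
W_hx = torch.randn(hidden_size, input_size)
b_h = torch.zeros(hidden_size)
W_yh = torch.randn(output_size, hidden_size)
b_y = torch.zeros(output_size)

x = torch.randn(seq_len, input_size)   # input sequence x_1, ..., x_T
h = torch.zeros(hidden_size)           # common choice h_0 = 0

outputs = []
for t in range(seq_len):
    # h_t = g_h(W_hh h_{t-1} + W_hx x_t + b_h), with g_h = tanh
    h = torch.tanh(W_hh @ h + W_hx @ x[t] + b_h)
    # y_hat_t = g_y(W_yh h_t + b_y), with g_y taken as the identity here
    outputs.append(W_yh @ h + b_y)

y_hat = torch.stack(outputs)           # shape (seq_len, output_size)
print(y_hat.shape)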

RNN Cell

Model Types

Depending on the application, there may be a varying number of outputs \(\hat{y}_t\). The figure below illustrates how the architecture can change.

  1. One-to-one, i.e., a regular feed-forward neural network.
  2. One-to-many, e.g., image captioning (1 image and output a sequence of words).
  3. Many-to-one, e.g., sentiment analysis from a sequence of words or stock price prediction.
  4. Many-to-many, e.g., machine translation or video frame-by-frame action classification.

Our subsequent illustrations of RNNs all display many-to-many architectures.

Stacked RNNs

It is also possible to stack multiple RNNs on top of each other. This is illustrated in the figure below.

Each layer has its own set of weights and biases.
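
As a brief sketch, stacking is usually expressed through a layer-count argument; in PyTorch this is the num_layers parameter of nn.RNN (or nn.LSTM). The sizes below are arbitrary choices for illustration.

import torch
import torch.nn as nn

# Two stacked RNN layers; each layer has its own weights and biases
stacked_rnn = nn.RNN(input_size=3, hidden_size=8, num_layers=2, batch_first=True)

x = torch.randn(1, 10, 3)   # (batch, seq_len, input_size)
out, h_n = stacked_rnn(x)
print(out.shape)            # (1, 10, 8): outputs of the top layer at every time step
print(h_n.shape)            # (2, 1, 8): final hidden state of each layer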

Loss Calculation for Sequential Data

For each prediction in the sequence \(\hat{y}_t\) and corresponding true value \(y_t\) we calculate the loss \(\mathcal{L}_t(y_t, \hat{y}_t)\).

The total loss is the sum of all the individual losses, \(\mathcal{L} = \sum_t \mathcal{L}_t\). This is illustrated in the figure above.

For categorical data we use the cross-entropy loss \(\mathcal{L}_t(y_t, \hat{y}_t) = -\sum_k y_{t,k}\log(\hat{y}_{t,k})\), where the sum runs over the output classes.

For continuous data we use a mean-square error loss.
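
As a small sketch with made-up per-step predictions and targets, the total sequence loss could be accumulated as follows; for categorical outputs one would swap the MSE criterion for a cross-entropy criterion.

import torch
import torch.nn as nn

criterion = nn.MSELoss()   # continuous targets; use nn.CrossEntropyLoss() for categorical data

# Made-up example: predictions and true values for a sequence of length T
T = 5
y_hat = [torch.randn(1) for _ in range(T)]
y = [torch.randn(1) for _ in range(T)]

# Total loss is the sum of the per-step losses L_t(y_t, y_hat_t)
total_loss = sum(criterion(y_hat_t, y_t) for y_hat_t, y_t in zip(y_hat, y))
print(total_loss)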

Vanishing Gradients

The key idea of RNNs is that earlier items in the sequence influence the more recent outputs. This is highlighted in blue in the figure below.

However, for longer sequences the influence can be significantly reduced. This is illustrated in red in the figure below.

Due to the potential for long sequences of data, the gradient at a given step is multiplied by one Jacobian factor for each earlier step during backpropagation through time. If each factor scales the gradient by roughly 0.9, then after 100 steps the contribution is scaled by \(0.9^{100} \approx 2.7\times 10^{-5}\). As a result, RNNs are prone to vanishing gradients.
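
The effect can be seen empirically with a small experiment: run an untrained RNN over a long sequence and compare the gradient of the final output with respect to the first and last inputs. The sequence length, hidden size, and seed below are arbitrary, and the exact values depend on the random initialization, but the early-step gradient is typically orders of magnitude smaller.

import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, hidden_size = 200, 16

rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
x = torch.randn(1, seq_len, 1, requires_grad=True)

out, _ = rnn(x)
out[:, -1, :].sum().backward()   # backpropagate from the final output

grads = x.grad[0, :, 0].abs()
print(f"gradient magnitude w.r.t. x_1: {grads[0].item():.2e}")
print(f"gradient magnitude w.r.t. x_T: {grads[-1].item():.2e}")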

RNN Limitations

  • Vanishing Gradients:
    • During training, gradients can become very small (vanish), making it difficult for the network to learn long-term dependencies.
  • Long-Term Dependencies:
    • RNNs struggle to capture dependencies that are far apart in the sequence, leading to poor performance on tasks requiring long-term memory.
  • Difficulty in Parallelization:
    • The sequential processing of data in RNNs makes it challenging to parallelize training, leading to slower training times.
  • Limited Context:
    • Standard RNNs have a limited ability to remember context over long sequences, which can affect their performance on complex tasks.
  • Bottleneck Problem:
    • The hidden state is the only carrier of past information, so its fixed size can limit how expressive the network can be.

Addressing RNN Limitations

Some proposed solutions for mitigating the previous issues are

  • LSTM (Long Short-Term Memory)
  • GRU (Gated Recurrent Units)

These are more advanced variants of RNNs designed to address some of these limitations by improving the ability to capture long-term dependencies and addressing gradient issues.

LSTM

Below is an illustration of the LSTM architecture.


The new components are:

  • The input cell state \(c_{t-1}\) and output cell state \(c_{t}\).
  • Yellow blocks consisting of neural network layers with sigmoid \(\sigma\) or tanh activation functions.
  • Red circles indicating point-wise operations.

Cell State

The cell state \(c_{t-1}\) is the input to the LSTM block.

This value moves through the LSTM block, where it is modified by point-wise multiplication and addition interactions.

After these operations, the modified cell state \(c_t\) is sent to the next LSTM block. In addition, a tanh of \(c_t\), gated by the output layer described below, is used to form the new hidden state \(h_t\).

Forget Gate Layer

The forget gate layer computes the value

\[ f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \]

where \(\sigma(x) = \frac{1}{1 + e^{-x}}\) is the sigmoid function.

The output \(f_t\) is a vector of numbers between 0 and 1. These values multiply the corresponding vector values in \(c_{t-1}\).

A value of 0 says throw that component of \(c_{t-1}\) away. A value of 1 says keep that component of \(c_{t-1}\).

This operation tells us which old information in the cell state we should keep or remove.

Input Gate Layer

The input gate layer computes values

\[ \begin{align} i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i), \\ \tilde{c}_{t} &= \operatorname{tanh}(W_c[h_{t-1}, x_t] + b_c). \end{align} \]

The value that is added to the cell state \(c_{t-1}\) is \(i_t \cdot \tilde{c}_t\).

This value tells us what new information to add to the cell state.

At this stage of the process, the cell state now has the formula

\[ c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t. \]

Output Layer

The output gate layer computes values

\[ \begin{align} o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o), \\ h_{t} &= o_t \cdot \operatorname{tanh}(c_t). \end{align} \]

The vector \(o_t\) from the sigmoid layer tells us what parts of the cell state we use for output. The output is a filtered version of the hyperbolic tangent of cell state \(c_t\).

The output of the block is \(\hat{y}_t\). It is the same as the hidden state \(h_t\) that is sent to the \((t+1)\)-th LSTM block.
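
To tie the gate equations together, here is a minimal single-step LSTM cell sketch written directly from the formulas above. The dimensions and random parameters are arbitrary assumptions for illustration; in practice one would use a library implementation such as nn.LSTM.

import torch

torch.manual_seed(0)
input_size, hidden_size = 3, 4

# One weight matrix and bias per gate, each acting on the concatenation [h_{t-1}, x_t]
W_f = torch.randn(hidden_size, hidden_size + input_size); b_f = torch.zeros(hidden_size)
W_i = torch.randn(hidden_size, hidden_size + input_size); b_i = torch.zeros(hidden_size)
W_c = torch.randn(hidden_size, hidden_size + input_size); b_c = torch.zeros(hidden_size)
W_o = torch.randn(hidden_size, hidden_size + input_size); b_o = torch.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    hx = torch.cat([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f @ hx + b_f)      # forget gate
    i_t = torch.sigmoid(W_i @ hx + b_i)      # input gate
    c_tilde = torch.tanh(W_c @ hx + b_c)     # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = torch.sigmoid(W_o @ hx + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)              # new hidden state (also the output y_hat_t)
    return h_t, c_t

h, c = torch.zeros(hidden_size), torch.zeros(hidden_size)
h, c = lstm_step(torch.randn(input_size), h, c)
print(h.shape, c.shape)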

Weight Summary

For the LSTM architecture, we have the following sets of weights

  • \(W_f\) and \(b_f\) (forget layer),
  • \(W_i, W_c\) and \(b_i, b_c\) (input gate layer),
  • \(W_o\) and \(b_o\) (output layer).

Advantages of LSTMs

  • Long-term Dependencies: LSTMs can capture long-term dependencies in sequential data, making them effective for tasks like language modeling and time series prediction.
  • Avoiding Vanishing Gradient: The architecture of LSTMs helps mitigate the vanishing gradient problem, which is common in traditional RNNs.
  • Flexibility: LSTMs can handle variable-length sequences and are versatile for different types of sequential data.
  • Memory: They have a memory cell that can maintain information over long periods, which is crucial for tasks requiring context retention.

Disadvantages of LSTMs

  • Complexity: LSTMs are more complex than simpler models like traditional RNNs, leading to longer training times and higher computational costs.
  • Overfitting: Due to their complexity, LSTMs are prone to overfitting, especially with small datasets.
  • Resource Intensive: They require more computational resources and memory, which can be a limitation for large-scale applications.
  • Hyperparameter Tuning: LSTMs have many hyperparameters that need careful tuning, which can be time-consuming and challenging.
  • Bottleneck Problem: The hidden and cell states are the only carriers of past information, and their fixed size can limit how expressive the model can be.

GRU

A variant of the LSTM architecture is the Gated Recurrent Unit (GRU).

It combines the forget and input gates into a single update gate \(z_t\) and merges the cell state and hidden state. A reset gate \(r_t\) controls how much of the previous hidden state enters the candidate state \(\tilde{h}_t\). The operations that are now performed are given below:

\[ \begin{align} z_t &= \sigma(W_z[h_{t-1}, x_t] + b_z),\\ r_t &= \sigma(W_r[h_{t-1}, x_t] + b_r), \\ \tilde{h}_t &= \operatorname{tanh}(W_{\tilde{h}}[r_t \cdot h_{t-1}, x_t] + b_{\tilde{h}}), \\ h_t &= (1- z_t)\cdot h_{t-1} + z_t\cdot \tilde{h}_t. \end{align} \]
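
A corresponding single-step GRU sketch, again with arbitrary dimensions and random parameters purely for illustration:

import torch

torch.manual_seed(0)
input_size, hidden_size = 3, 4

W_z = torch.randn(hidden_size, hidden_size + input_size); b_z = torch.zeros(hidden_size)
W_r = torch.randn(hidden_size, hidden_size + input_size); b_r = torch.zeros(hidden_size)
W_h = torch.randn(hidden_size, hidden_size + input_size); b_h = torch.zeros(hidden_size)

def gru_step(x_t, h_prev):
    hx = torch.cat([h_prev, x_t])                                     # [h_{t-1}, x_t]
    z_t = torch.sigmoid(W_z @ hx + b_z)                               # update gate
    r_t = torch.sigmoid(W_r @ hx + b_r)                               # reset gate
    h_tilde = torch.tanh(W_h @ torch.cat([r_t * h_prev, x_t]) + b_h)  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                         # new hidden state

h = gru_step(torch.randn(input_size), torch.zeros(hidden_size))
print(h.shape)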

Advantages of GRUs

  • Simpler Architecture: GRUs have a simpler architecture compared to LSTMs, with fewer gates, making them easier to implement and train.
  • Faster Training: Due to their simpler structure, GRUs often train faster and require less computational power than LSTMs.
  • Effective Performance: GRUs can perform comparably to LSTMs on many tasks, especially when dealing with shorter sequences.
  • Less Prone to Overfitting: With fewer parameters, GRUs are generally less prone to overfitting compared to LSTMs.

Disadvantages of GRUs

  • Less Expressive Power: The simpler architecture of GRUs might not capture complex patterns in data as effectively as LSTMs.
  • Limited Long-term Dependencies: GRUs might struggle with very long-term dependencies in sequential data compared to LSTMs.
  • Less Flexibility: GRUs offer less flexibility in terms of controlling the flow of information through the network.
  • Bottleneck Problem: Like vanilla RNNs, the hidden state is the only carrier of past information and can limit how expressive the model can be.

RNN Examples

Appliance Energy Prediction

In this example we will use a dataset containing the energy usage in Watt hours (Wh) of appliances in a low energy building.

This is another example of time series analysis.

In addition to the appliance energy information, the dataset includes the house temperature and humidity conditions. We will build an LSTM model that predicts the energy usage of the appliances.

Load the Data

The code cells below load the dataset.

# Imports used in this and the following code cells
import os

import matplotlib.pyplot as plt
import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import Dataset, DataLoader

file_path = 'energydata_complete.csv'
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv"

if os.path.exists(file_path):
    data = pd.read_csv(file_path)
else:
    data = pd.read_csv(url)
    data.to_csv(file_path, index=False)

data.head()
date Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 ... T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2
0 2016-01-11 17:00:00 60 30 19.89 47.596667 19.2 44.790000 19.79 44.730000 19.000000 ... 17.033333 45.53 6.600000 733.5 92.0 7.000000 63.000000 5.3 13.275433 13.275433
1 2016-01-11 17:10:00 60 30 19.89 46.693333 19.2 44.722500 19.79 44.790000 19.000000 ... 17.066667 45.56 6.483333 733.6 92.0 6.666667 59.166667 5.2 18.606195 18.606195
2 2016-01-11 17:20:00 50 30 19.89 46.300000 19.2 44.626667 19.79 44.933333 18.926667 ... 17.000000 45.50 6.366667 733.7 92.0 6.333333 55.333333 5.1 28.642668 28.642668
3 2016-01-11 17:30:00 50 40 19.89 46.066667 19.2 44.590000 19.79 45.000000 18.890000 ... 17.000000 45.40 6.250000 733.8 92.0 6.000000 51.500000 5.0 45.410389 45.410389
4 2016-01-11 17:40:00 60 40 19.89 46.333333 19.2 44.530000 19.79 45.000000 18.890000 ... 17.000000 45.40 6.133333 733.9 92.0 5.666667 47.666667 4.9 10.084097 10.084097

5 rows × 29 columns

Resample Data

We’re interested in the Appliances column, which is the energy use of the appliances in Wh.

First, we’ll resample the data to hourly resolution and fill missing values using the forward fill method.

Code
data['date'] = pd.to_datetime(data['date'])
data.set_index('date', inplace=True)

data = data['Appliances'].resample('h').mean().ffill() # Resample and fill missing

data.head()
date
2016-01-11 17:00:00     55.000000
2016-01-11 18:00:00    176.666667
2016-01-11 19:00:00    173.333333
2016-01-11 20:00:00    125.000000
2016-01-11 21:00:00    103.333333
Freq: h, Name: Appliances, dtype: float64

Preparing the Data

We create train-test splits and scale the data accordingly.

In addition, we create our own dataset class. In this class we create lagged sequences of the energy usage data. The amount of lag, or look-back sequence length, is set to 24.

Code
train_size = int(len(data) * 0.8)
test_size = len(data) - train_size

train_data = data[:train_size]
test_data = data[train_size:]

# Normalize data
scaler = MinMaxScaler()
train_data_scaled = scaler.fit_transform(train_data.values.reshape(-1, 1))
test_data_scaled = scaler.transform(test_data.values.reshape(-1, 1))

# Prepare data for LSTM
class TimeSeriesDataset(Dataset):
    def __init__(self, data, seq_length):
        self.data = data
        self.seq_length = seq_length

    def __len__(self):
        return len(self.data) - self.seq_length

    def __getitem__(self, index):
        X = self.data[index:index + self.seq_length]
        y = self.data[index + self.seq_length]
        return torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

seq_length = 24
train_dataset = TimeSeriesDataset(train_data_scaled, seq_length)
test_dataset = TimeSeriesDataset(test_data_scaled, seq_length)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

LSTM Architecture

We define our LSTM architecture in the code cell below.

# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, input_size=1, hidden_size=16, output_size=1):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.fc(x[:, -1, :])  # Use the output of the last time step
        return x

model = LSTMModel()
print(model)
LSTMModel(
  (lstm): LSTM(1, 16, batch_first=True)
  (fc): Linear(in_features=16, out_features=1, bias=True)
)

Loss and Training

We use a mean-square error loss function and the Adam optimizer. We train the model for 20 epochs and display the decrease in the loss during training. Based on the plot, did the optimizer converge?

Code
# Set criterion and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

epochs = 20
train_losses = []
for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for X, y in train_loader:
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
    train_losses.append(train_loss)

# Plot the training losses over epochs
plt.figure(figsize=(10, 5))
plt.plot(range(1, epochs + 1), train_losses, marker='o', linestyle='-', color='b')
plt.title('Training Loss over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.xticks(range(1, epochs + 1))  # Set x-tick marks to be integers
plt.grid(True)
plt.show()

Evaluation

We’ll evaluate the model’s performance by plotting the predicted values from the test set on top of the actual test set values.

How did the model perform? Are there any obvious problems?

Code
# Evaluate the model
model.eval()
predictions = []
trues = []
with torch.no_grad():
    for X, y in test_loader:
        preds = model(X)
        predictions.extend(preds.numpy())
        trues.extend(y.numpy())

# Rescale predictions and true to original scale
predictions_rescaled = scaler.inverse_transform(predictions)
true_rescaled = scaler.inverse_transform(trues)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(true_rescaled, label='True Values')
plt.plot(predictions_rescaled, label='Predicted Values', alpha=0.7)
plt.legend()
plt.show()

Other Applications

Andrej Karpathy has an excellent blog post titled The Unreasonable Effectiveness of Recurrent Neural Networks. We will describe some of the applications below.

Shakespeare

Using a 3-layer RNN with 512 hidden nodes on each layer, Karpathy trained a model on the complete works of Shakespeare. With this model he was able to generate text similar to the sample seen on the right.

Observe that there are some typos and the syntax is not perfect, but overall the style appears Shakespearean.

Wikipedia

Using a 100MB dataset of raw Wikipedia data, Karpathy trained RNNs which generated Wikipedia content. An example of the generated markdown is shown to the right.

The text is nonsense, but is structured like a Wikipedia post. Interestingly, the model hallucinated and fabricated a url that does not exist.

Baby Names

Using a list of 8000 baby names from this link, Karpathy trained an RNN to predict baby names.

To the right is a list of generated names not on the lists. More of the names can be found here.

Some interesting additional names the model generated were Baby, Char, Mars, Hi, and With.

Recap

We discussed the basic RNN architecture. We then discussed the LSTM and GRU modifications. These modifications allowed the RNNs to handle long-term dependencies.

We then considered an example application of energy consumption prediction.

We also discussed the unreasonable effectiveness of RNNs.

What’s Next?

As we saw, the RNN architecture has evolved to include LSTMs and GRUs. However, the RNN architecture also evolved to add attention. This was introduced by Bahdanau, Cho, and Bengio (2014).

Attention in language models is a mechanism that allows the model to focus on relevant parts of the input sequence by assigning different weights to different words, enabling it to capture long-range dependencies and context more effectively.

In addition, recall that RNNs are only able to process data sequentially. For large scale natural language processing applications, this is a major computational bottleneck. This motivated the development of more advanced neural network architectures that could process sequential data in parallel and utilize attention.

The transformer architecture was introduced by Vaswani et al. (2017) and combines both of these desirable features.

Transformers form the basis for modern large language models (LLMs) and we will discuss them in our NLP lectures.

References

  1. Understanding LSTMs, Colah’s blog, 2015, https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  2. Speech and Language Processing. Daniel Jurafsky & James H. Martin. Draft of January 5, 2024. – Chapter 9, RNNs and LSTMs, https://web.stanford.edu/~jurafsky/slpdraft/9.pdf
  3. The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy, 2015, https://karpathy.github.io/2015/05/21/rnn-effectiveness/

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv Preprint arXiv:1409.0473. http://arxiv.org/abs/1409.0473.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 11. Long Beach, CA, USA. https://arxiv.org/abs/1706.03762.