Natural Language Processing

Natural language processing (NLP) is a subfield of artificial intelligence that allows computers to understand, process, and manipulate human language.

History of NLP

  • The Dawn of NLP (1950-1970s)
  • The Statistical Revolution (1980s-2000s)
  • The Deep Learning Era (2000s-present)

The Dawn of NLP

NLP has its origins in the 1950s. Two figures were especially important in the development of computational NLP.

  • Alan Turing. British mathematician, logician, and computer scientist. Considered the father of modern computer science.
    • Computing Machinery and Intelligence. Turing proposed a criterion for determining a machine’s intelligence.
    • This criterion is known as the Turing Test.
    • This test involves the interpretation and generation of natural language by a computer.


Turing Test

  • An interrogator communicates with a computer and a human.
  • The computer and human attempt to convince the interrogator that they are human.
  • If the interrogator cannot determine who is human, the computer wins the test.

Do you think LLMs are able to pass the Turing Test?


  • Noam Chomsky. An American professor and public intellectual known for his work in linguistics, political activism, and social criticism. He worked at MIT, retiring in 2002.
    • His book, Syntactic Structures, revolutionized the scientific study of language.
    • He proposed a mathematical theory of language that introduces a generative model which enumerates the sentences in a language.
    • This work helped lay the foundation for modern NLP techniques.

Rule-Based Systems

The NLP research of the 1950s focused on rule-based systems. Linguists would craft large sets of rules to capture the grammatical syntax and vocabulary of specific languages.

Though important at the time, these systems were quite limited. They could not capture the nuances and many exceptions of natural languages, and they handled slang poorly.

The creation and maintenance of such rule-based systems for every language was not scalable.

These systems focused predominantly on syntax and vocabulary. They were not able to capture the deeper meaning and context of the texts they were analyzing.

See, for example, WordNet (Wikipedia: WordNet).

The Statistical Revolution

Due to the limitations of rule-based systems, coupled with a steady increase in computational power and the availability of large collections of data, this period saw the advent of machine learning algorithms applied to language processing.

In contrast to rule-based systems, these statistical models learned patterns from data. This allowed them to better handle the variations and complexities of natural language. IBM research developed a set of important machine translation models, called the IBM alignment models.

The concept of recurrent neural networks was introduced in 1986 in the paper “Learning Representations by Back-Propagating Errors” (Rumelhart, Hinton, and Williams 1986).

In addition, the use of n-gram models became more formalized and widely adopted. An n-gram language model predicts the probability of a word based on the previous n-1 words in a sequence, making it a fundamental tool in natural language processing for tasks like text prediction and speech recognition.

The Deep Learning Era

The 2000s ushered in the era of deep learning. This is when we saw the application of neural networks to NLP.

Here are some of the notable developments.

  • Bengio, Ducharme, and Vincent (2000). Used feed-forward neural networks as a language model. Significantly outperformed n-gram models.
  • Sutskever (2014). Used LSTMs with an encoder-decoder architecture, producing (at the time) state-of-the-art results for machine translation.
  • Bahdanau, Cho, and Bengio (2014). Proposed the concept of attention in RNNs. Attention is a mechanism that allows models to focus on specific parts of the input sequence when generating each part of the output sequence, improving the handling of long-range dependencies and context.
  • Vaswani et al. (2017). Introduced the Transformer architecture, which allows parallel processing of sequential data with attention. Transformers became the basis for LLMs.

Lecture Outline

For the rest of this lecture we will cover:

  • Numerical representations of words
  • Language models
    • N-gram models
    • Transformers
  • Transformer Architectures

Numerical Representation of Words

Numerical Representations of Words

Machine learning models for NLP are not able to process text in the form of characters and strings. Characters and strings must be converted to numbers in order to train our language models.

There are a number of ways to do this. These include

  • sparse representations, like one-hot encodings and TF-IDF encodings
  • word embeddings.

However, prior to creating a numerical representation of text, we need to tokenize the text.

Tokenization

Tokenization is the process of splitting raw text into smaller pieces, called (drum-roll please), tokens. Tokens can be individual characters, words, or sentences.

Examples of character and word tokenization are shown for the following raw text

Show me the money

Character tokenization:

['S', 'h', 'o', 'w', 'm', 'e', 't', 'h', 'e', 'm', 'o', 'n', 'e', 'y'].

Word tokenization:

['Show', 'me', 'the', 'money']


This code block demonstrates both of these tokenization techniques.

# Character and word tokenization

sentence = "Show me the money"
word_tokens = sentence.split()
print(word_tokens)
character_tokens = [char for char in sentence if char != ' ']
print(character_tokens)
['Show', 'me', 'the', 'money']
['S', 'h', 'o', 'w', 'm', 'e', 't', 'h', 'e', 'm', 'o', 'n', 'e', 'y']

There are advantages and disadvantages to different tokenization methods. We showed two very simple strategies.

However, there are other strategies, such as subword and sentence tokenization, see for example Byte-Pair Encoding, and SentencePiece.

With tokenization, our goal is not to lose meaning in the tokens. With character-based tokenization, especially for a language like English that is not character-based, we certainly lose meaning.

Here is a demo of how to tokenize using the transformers package from Hugging Face.

from transformers import AutoTokenizer, logging

logging.set_verbosity_warning()

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize(sentence)
print(tokens)

# Try a more advanced sentence
sentence2 = "Let's try to see if we can get this transformer to tokenize."
tokens2 = tokenizer.tokenize(sentence2)
print(tokens2)
['Show', 'me', 'the', 'money']
['Let', "'", 's', 'try', 'to', 'see', 'if', 'we', 'can', 'get', 'this', 'transform', '##er', 'to', 'token', '##ize', '.']

Tokens and Token IDs

Associated with each token is a unique token ID. The total number of unique tokens that a model can recognize and process is the vocabulary size. The vocabulary is the collection of all the unique tokens.
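
Here is a small demonstration (reusing the tokenizer and tokens from the code above) of mapping tokens to their integer IDs and checking the vocabulary size; the exact IDs and size depend on the pretrained tokenizer chosen.

# Map tokens to their integer IDs using the BERT tokenizer loaded earlier
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# The vocabulary size is the total number of unique tokens the tokenizer knows
print(tokenizer.vocab_size)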

The tokens (and token IDs) alone hold no semantic information. What is needed is a numerical representation that encodes this information.

There are different ways to achieve this. One encoding technique that we already considered is one-hot encoding. Another, more powerful, encoding method is the creation of word embeddings.

Sparse Representations

We have previously considered the following sparse representations of textual data.

One-Hot Encoding

  • Each word is represented as a vector of zeros and a single one.
  • Simple but inefficient for large vocabularies.

Example

Given the words cat, dog, and emu, here are sample one-hot encodings:

\[ \begin{align*} \text{cat} &= [1, 0, 0]^{T}, \\ \text{dog} &= [0, 1, 0]^{T}, \\ \text{emu} &= [0, 0, 1]^{T}. \\ \end{align*} \]
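
Here is a minimal NumPy sketch of these one-hot encodings (the three-word vocabulary and its ordering are just the example above).

import numpy as np

# Each word maps to a distinct row of the 3x3 identity matrix
vocab = ["cat", "dog", "emu"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}
print(one_hot["dog"])   # [0. 1. 0.]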


Bag of Words (BoW)

  • Represents text as a collection of word counts.
  • Ignores grammar and word order.

Example

Suppose we have the following sentences

  1. The cat sat on the mat.
  2. The dog sat on the log.
  3. The emu sat on the mat.
Sentence                     the   cat   sat   on   mat   dog   log   emu
“The cat sat on the mat.”     2     1     1    1     1     0     0     0
“The dog sat on the log.”     2     0     1    1     0     1     1     0
“The emu sat on the mat.”     2     0     1    1     1     0     0     1
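
These counts can be reproduced with scikit-learn's CountVectorizer (a sketch; note that columns are ordered alphabetically rather than as in the table, and lowercasing merges "The" and "the" by default).

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The emu sat on the mat.",
]

# Bag-of-words counts for each sentence
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())
print(bow.toarray())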

TF-IDF (Term Frequency-Inverse Document Frequency)

  • Combines term frequency (how often a word appears in a document) with inverse document frequency (how rare the word is across documents).
  • Gives a weighted alternative to raw counts.

Example

Here are the TF-IDF representations corresponding to the previous sentences.

Sentence      cat      dog      log      mat      on       emu      sat      the
Sentence 1    0.4698   0.0000   0.0000   0.4698   0.3546   0.0000   0.3546   0.7093
Sentence 2    0.0000   0.4698   0.4698   0.0000   0.3546   0.0000   0.3546   0.7093
Sentence 3    0.0000   0.0000   0.0000   0.4698   0.3546   0.4698   0.3546   0.7093
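
A similar sketch with scikit-learn's TfidfVectorizer; its default smoothing and normalization conventions may produce values that differ somewhat from the table above.

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The emu sat on the mat.",
]

# TF-IDF weights for the three sentences
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(sentences)
print(tfidf.get_feature_names_out())
print(X.toarray().round(4))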

Word Embeddings

Word embeddings represent words as dense vectors in high-dimensional spaces.

The individual values of the vector may be difficult to interpret, but the overall pattern is that words with similar meanings are close to each other, in the sense that their vectors have small angles with each other.

The similarity of two word embeddings is the cosine of the angle between the two vectors. Recall that for two vectors \(v_1, v_2\in\mathbb{R}^{n}\), the formula for the cosine of the angle between them is

\[ \cos{(\theta)} = \frac{v_1 \cdot v_2}{\Vert v_1 \Vert_2 \Vert v_2 \Vert_2}. \]
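
Here is a small NumPy sketch of this formula; the three-dimensional "embeddings" are made up purely for illustration.

import numpy as np

def cosine_similarity(v1, v2):
    # Cosine of the angle between two embedding vectors
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Toy vectors (not real embeddings)
king = np.array([0.8, 0.65, 0.1])
queen = np.array([0.75, 0.7, 0.2])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # close to 1: similar direction
print(cosine_similarity(king, apple))  # much smaller: less similar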

Word embeddings can be static or contextual. A static embedding is when each word has a single embedding, e.g., Word2Vec. A contextual embedding (used by more complex language model embedding algorithms) allows the embedding for a word to change depending on its context in a sentence.

Language Models

Language Models

A language model is a statistical tool that predicts the probability of a sequence of words. It helps in understanding and generating human language by learning patterns and structures from large text corpora.

  1. N-gram Models:
    • Predict the next word based on the previous \(n-1\) words.
    • Simple and effective for many tasks but limited by fixed context size.
  2. Neural Language Models:
    • Use neural networks to capture more complex patterns.
    • Examples include RNNs, LSTMs, and Transformers.

We previously covered RNNs and LSTMs.

We’ll discuss N-grams briefly followed by a deep dive on Transformers.

N-gram Models

N-gram models

  • Definition: An n-gram model is a type of probabilistic language model used in natural language processing.
    • Purpose: It predicts the next item in a sequence based on the previous \(n-1\) items.
  • Types:
    • Unigram (n=1): Considers each word independently.
    • Bigram (n=2): Considers pairs of consecutive words.
    • Trigram (n=3): Considers triples of consecutive words.

How N-gram Models Work

  • Example: Let’s consider a bigram model.
  • Training Data: “I love machine learning. Machine learning is fun.”
  • Bigrams:
    • “I love”
    • “love machine”
    • “machine learning”
    • “learning Machine”
    • “Machine learning”
    • “learning is”
    • “is fun”
  • Probability Calculation:
    • P(“learning” | “machine”) = Count(“machine learning”) / Count(“machine”)

Example of N-gram Model in Action

  • Sentence Completion:
    • Given the sequence “machine learning”, predict the next word.
    • Using the bigram model:
      • P(“is” | “learning”) = Count(“learning is”) / Count(“learning”)
      • P(“fun” | “learning”) = Count(“learning fun”) / Count(“learning”)
  • Prediction:
    • If “learning is” appears more frequently than “learning fun” in the training data, the model predicts “is” as the next word.
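
Here is a minimal bigram model sketch in Python for this training text; it lowercases and strips punctuation, so counts are merged across the two sentences (unlike the capitalized bigrams listed above).

from collections import Counter

text = "I love machine learning. Machine learning is fun."

# Simple whitespace tokenization with lowercasing and trailing periods removed
tokens = [w.strip(".").lower() for w in text.split()]

bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def bigram_prob(w_next, w_prev):
    # P(w_next | w_prev) = Count(w_prev w_next) / Count(w_prev)
    return bigram_counts[(w_prev, w_next)] / unigram_counts[w_prev]

print(bigram_prob("learning", "machine"))  # 1.0: "machine" is always followed by "learning"
print(bigram_prob("is", "learning"))       # 0.5
print(bigram_prob("fun", "learning"))      # 0.0: "learning fun" never occurs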

Transformers

Transformers

Transformers are a deep learning model for processing sequential (text) data (Vaswani et al. 2017).

  • Rely on a mechanism called Attention.
  • Revolutionized the field of natural language processing (NLP) and artificial intelligence (AI).
  • The model is easy to scale across GPUs.
  • The building blocks for large language models (LLMs) such as:
    • ChatGPT (OpenAI),
    • Llama (Meta),
    • BERT (Google),
    • Megatron (NVIDIA).

To introduce the Transformer architecture we will consider machine translation. This is an application of sequence-to-sequence modeling.

Transformer Architecture



Encoder-Decoder Blocks

Encoder

  • Function: Processes input data (e.g., text) and converts it into a set of continuous representations.

Decoder

  • Function: Generates output sequences (e.g., translated text) from the encoded representations.

The encoder and decoder work together to transform input sequences into meaningful output sequences.

Encoder Input

The inputs to the encoder are word embeddings.

  • Words are embedded into an \(n\)-dimensional space.
  • The length \(n\) of the vector is the embedding dimension.
  • Let \(m\) denote the number of words in a sentence.

Positional Encodings

To establish the order of the words, we use positional encodings.

  • This is achieved by adding a vector \(\mathbf{t}^{(i)}\) to the word embedding \(\mathbf{x}^{(i)}\).
  • The positionally encoded word embeddings are the inputs to the transformer.

Generating Position Vectors

The authors of Vaswani et al. (2017) proposed the following function for positional encodings \[ \mathbf{t}_{2j}^{(i)} = \sin{\left(\frac{i}{10000^{2j/n}}\right)}, \qquad \mathbf{t}_{2j+1}^{(i)} = \cos{\left(\frac{i}{10000^{2j/n}}\right)}, \] i.e., the sine is used for the even components of the vector and the cosine for the following odd components.


Here is sample code to generate an illustration of the positional encodings for embedding dimension \(n=64\) and \(m=10\) tokens.

Code
import numpy as np
import matplotlib.pyplot as plt

def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)
  
  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  
  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
  pos_encoding = angle_rads[np.newaxis, ...]
    
  return pos_encoding


tokens = 10
dimensions = 64

pos_encoding = positional_encoding(tokens, dimensions)

plt.figure(figsize=(7, 5))
plt.pcolormesh(pos_encoding[0], cmap='viridis')
plt.xlabel('Embedding Dimensions', fontsize=16)
plt.xlim((0, dimensions))
plt.ylim((tokens, 0))
plt.ylabel('Token Position', fontsize=16)
plt.colorbar()
plt.show()

From Inputs to the Encoder

The positionally encoded vectors are transposed and stacked to form an input matrix. This input matrix is fed into the bottom of the encoder block.

The number of rows \(m\) is the number of words in the sentence and the number of columns \(n\) is the embedding dimension: \[ X = \begin{bmatrix} {\mathbf{x}^{(1)}}^{T} \\ {\mathbf{x}^{(2)}}^{T} \\ \vdots \\ {\mathbf{x}^{(m)}}^{T} \\ \end{bmatrix} \in\mathbb{R}^{m\times n}. \]

Attention

Attention

Attention in language models is a mechanism that allows the model to focus on relevant parts of the input sequence by assigning different weights to different words, enabling it to capture long-range dependencies and context more effectively.

Attention is needed in order to understand sentences such as,

The elephant didn't cross the river because it was tired.

Attention allows a language model to understand correctly that it refers to the elephant and not the river.

Queries, Keys, and Values

The building blocks of the attention mechanism are query \(\mathbf{q}\), key \(\mathbf{k}\), and value \(\mathbf{v}\) vectors.

The query and key vectors satisfy \(\mathbf{q}, \mathbf{k}\in\mathbb{R}^{d}\), and the value vector \(\mathbf{v}\in\mathbb{R}^{n}\).

At a high level

  • query vectors determine which parts of the input to focus on,
  • key vectors represent the input features, and
  • value vectors contain the actual data to be attended to.

Each \(\mathbf{q}^{(i)}\), \(\mathbf{k}^{(i)}\), and \(\mathbf{v}^{(i)}\) is associated with \(\mathbf{x}^{(i)}\). The query, key, and value vectors are computed as \[ \begin{align*} \mathbf{q}^{(i)} &= {\mathbf{x}^{(i)}}^{T}W^{Q}, \\ \mathbf{k}^{(i)} &= {\mathbf{x}^{(i)}}^{T}W^{K}, \\ \mathbf{v}^{(i)} &= {\mathbf{x}^{(i)}}^{T}W^{V}, \\ \end{align*} \]

where \(W^{Q}, W^{K}\in\mathbb{R}^{n\times d}\) and \(W^{V}\in\mathbb{R}^{n\times n}\) are trainable matrices of weights.


This operation can be vectorized to compute \[ \begin{align*} Q &= XW^{Q}, \\ K &= XW^{K}, \\ V &= XW^{V}. \\ \end{align*} \]

Computing Attention

Given a query, key, and value vector, we compute attention in the following sequence of operations

  1. Compute \(\mathbf{q}^{(i)}\cdot \mathbf{k}^{(j)} = s_{i,j}\) for each \(j\).
  2. Compute \(s_{i,j}=s_{i,j}/\sqrt{d}\).
  3. Compute \(\tilde{s}_{i,:} = \operatorname{softmax}(s_{i,:})\).
  4. Compute \(\mathbf{\tilde{v}}^{(j)}=\tilde{s}_{i, j} \mathbf{v}^{(j)}\) for each \(j\).
  5. Compute \(\mathbf{z}^{(i)} = \sum_{j} \mathbf{\tilde{v}}^{(j)}\).

For a vector \(\mathbf{s}\), recall that \(\operatorname{softmax}(\mathbf{s})_i = \frac{e^{s_i}}{\sum_{j} e^{s_{j}}}\).

Vectorized Attention

Attention can be easily vectorized. The procedure, sketched in code below, is:

  1. Compute \(QK^T = S\).
  2. Compute \(S=\frac{1}{\sqrt{d}}S\).
  3. Compute the softmax across rows \(\tilde{S} = \operatorname{softmax}(S)\).
  4. Compute \(Z = \tilde{S}V\).
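
Here is a NumPy sketch of the vectorized computation, using random matrices in place of trained weights; the dimensions \(m\), \(n\), and \(d\) follow the notation of this section.

import numpy as np

def softmax(S, axis=-1):
    # Row-wise softmax, shifted by the max for numerical stability
    S = S - S.max(axis=axis, keepdims=True)
    e = np.exp(S)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (m, n) positionally encoded embeddings; W_Q, W_K: (n, d); W_V: (n, n)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[1]
    S = Q @ K.T / np.sqrt(d)        # (m, m) scaled attention scores
    S_tilde = softmax(S, axis=1)    # softmax across rows
    return S_tilde @ V              # (m, n) outputs Z

# Toy example: m = 4 tokens, embedding dimension n = 8, query/key dimension d = 4
rng = np.random.default_rng(0)
m, n, d = 4, 8, 4
X = rng.normal(size=(m, n))
Z = self_attention(X, rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, n)))
print(Z.shape)  # (4, 8)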

Attention Visualized

Recall the original sentence to be translated, “The elephant did not cross the river because it was tired.”

A visualization of the attention weights would show the links from each word of the original sentence to the word it.

The darker the color, the more this word attends to the word it.

Attention Summary

  1. Compute an attention score matrix \(S_{ij} = \mathbf{q}^{(i)}\cdot \mathbf{k}^{(j)}/\sqrt{d}\).
  2. The softmax function is applied to each row of the matrix \(S\).
  3. For a given row, the values in the columns of this matrix are the weights of the linear combination of the value vectors \(\mathbf{v}^{(i)}\).
  4. These weights tell us how much (or how little) each value vector contributes in the output \(\mathbf{z}^{(i)}\).
  5. When there is only one set of \(W^{Q}, W^{K},\) and \(W^{V}\) matrices this process is called self-attention.

Layer Normalization and Feed-Forward Neural Network

Layer Normalization

  • Training with features on different scales takes longer and can cause exploding gradients.
  • Layer normalization ensures all values along the embedding dimension have the same distribution.
  • Layer normalization is calculated by a modified Z-score equation.

Proposed in Ba, Kiros, and Hinton (2016).


  • Sum \(\mathbf{x}+\mathbf{z}= \mathbf{u}\).
  • Compute mean \(\mu\) and the variance \(\sigma^{2}\) of \(\mathbf{u}\).
  • Compute \[ \bar{\mathbf{u}} = \frac{\mathbf{u}-\mu}{\sqrt{\sigma^{2}+\varepsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}. \]
  • \(\varepsilon\) is a small number.
  • \(\boldsymbol{\gamma}, \boldsymbol{\beta}\) are trainable parameter vectors of length \(n\).
  • The notation \(\odot\) indicates entry-wise multiplication.
  • Layer normalization can be vectorized to produce \(\bar{U}\).
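
Here is a minimal NumPy sketch of this vectorized layer normalization (random values stand in for the residual sum \(\mathbf{x}+\mathbf{z}\)).

import numpy as np

def layer_norm(U, gamma, beta, eps=1e-5):
    # Normalize each row (token) across the embedding dimension,
    # then rescale and shift with the trainable gamma and beta
    mu = U.mean(axis=-1, keepdims=True)
    var = U.var(axis=-1, keepdims=True)
    return (U - mu) / np.sqrt(var + eps) * gamma + beta

m, n = 4, 8
rng = np.random.default_rng(1)
U = rng.normal(size=(m, n))                 # stand-in for the residual sum X + Z
U_bar = layer_norm(U, np.ones(n), np.zeros(n))
print(U_bar.mean(axis=1).round(6))          # approximately 0 per row
print(U_bar.std(axis=1).round(6))           # approximately 1 per row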

Feed-Forward Neural Network (FFNN)

  • The vectorized output of the layer normalization layer is \(\bar{U}\).
  • The tensor \(\bar{U}\) is the input to the neural network.
  • There is 1 hidden layer.
  • \(FFNN(\bar{U}) = W_2\operatorname{max}(0, W_1\bar{U} + b_1) + b_2\).
  • The nonlinearity introduced by the activation function of this layer allows the model to further differentiate the attention of each word.
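
A short sketch of this position-wise feed-forward network in NumPy; the hidden width and random weights are arbitrary choices for illustration.

import numpy as np

def ffnn(U_bar, W1, b1, W2, b2):
    # Two-layer MLP with a ReLU, applied independently to every row (token) of U_bar;
    # this is the row-wise form of W2 max(0, W1 u + b1) + b2
    return np.maximum(0, U_bar @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
m, n, hidden = 4, 8, 32                     # hidden width chosen arbitrarily here
U_bar = rng.normal(size=(m, n))             # stand-in for the layer-normalized input
W1, b1 = rng.normal(size=(n, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n)), np.zeros(n)
print(ffnn(U_bar, W1, b1, W2, b2).shape)    # (4, 8)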

Encoder Output

  • A 2nd layer normalization is applied to the output of the FFNN.
  • There is a residual connection to the output of the 1st layer normalization layer.
  • The output from the 2nd layer normalization is the output of the encoder.
  • At this stage, attention has been incorporated into the output.
  • The output is then sent to the decoder blocks.

Decoder Blocks

The architecture of a decoder block (in an encoder-decoder transformer) is nearly identical to that of an encoder block.

A decoder block consists of

  • self-attention layer
  • layer normalization
  • feed-forward neural network
  • layer normalization

The major differences in the decoder are:

  1. An encoder-decoder attention layer. This layer attends to the output of the encoder with cross attention (i.e., keys and values come from the encoder, while queries come from the decoder).
  2. The self-attention layer only attends to earlier positions (not future) in the output sequence.

Decoder Output

The final output of the decoder blocks is passed through a linear layer followed by a softmax layer.

The linear layer outputs a vector that is fed into a softmax layer.

The softmax layer outputs a probability for each word in the vocabulary.

The word with the maximum probability is chosen and output by the model.

Transformer Architectures

3 Types of Transformer Models

  1. Encoder-Decoder – used in sequence-to-sequence tasks, where one text string is converted to another (e.g., machine translation)

  2. Encoder – transforms text embeddings into representations that support a variety of tasks (e.g., sentiment analysis, classification). Model Example: BERT

  3. Decoder – predicts the next token to continue the input text (e.g., ChatGPT, AI assistants). Model Examples: GPT-3, GPT-4

Encoder Model Example: BERT (2019)

Bidirectional Encoder Representations from Transformers

  • Hyperparameters
    • 30,000 token vocabulary
    • 1024-dimensional word embeddings
    • 24x transformer layers
    • 16 heads in self-attention mechanism
    • 4096 hidden units in middle of MLP
  • ~340 million parameters
  • Pre-trained in a self-supervised manner
  • Can be adapted to task with one additional layer and fine-tuned

Proposed in Devlin et al. (2019).

Encoder Pre-Training

  • A small percentage of the input embeddings are replaced with a generic mask token
  • The missing tokens are predicted from the output embeddings
  • An added linear layer and softmax generate probabilities over the vocabulary
  • Trained on BooksCorpus (800M words) and English Wikipedia (2.5B words)

Encoder Fine-Tuning

  • Extra layer(s) appended to convert output vectors to desired output format.
  • Example: Text span prediction – predict the start and end location of the answer to a question in a passage of Wikipedia, see this link.

Decoder Model Example: GPT3 (2020)

Generative Pre-trained Transformer

  • One purpose: generate the next token in a sequence.
  • This is an autoregressive model.
  • Factors the probability of a sentence of tokens \(t_1, t_2, \ldots t_N\) as \[ P(t_1, t_2, \ldots, t_N) = P(t_1)\prod_{n=2}^{N} P(t_n | t_1, t_2, \ldots t_{n-1}). \]

Proposed in Brown et al. (2020).

Decoder: Masked Self-Attention

  • During training we want to maximize the log probability of the input text under the autoregressive model.
  • We want to make sure the model doesn’t “cheat” during training by looking ahead at the next token.
  • Therefore, we mask the self-attention weights corresponding to the current and right context to negative infinity.
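
Here is a NumPy sketch of this masking, using the common convention that each position may attend to itself and to earlier positions, with strictly future positions masked to negative infinity (exact conventions vary with how positions are offset during training).

import numpy as np

m = 5                                                    # sequence length
scores = np.random.default_rng(3).normal(size=(m, m))    # raw attention scores
mask = np.triu(np.ones((m, m), dtype=bool), k=1)         # True above the diagonal (future positions)
scores[mask] = -np.inf                                   # masked scores become 0 after the softmax

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(weights.round(2))   # row i has nonzero weight only on positions 0..i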

Decoder: Text Generation

  • Prompt with token string “ It takes great”
  • Generate the next token for the sequence by one of the following strategies (a sketch of the first two follows):
    • picking the most likely token (greedy decoding)
    • sampling from the probability distribution
    • using beam search – selecting the most likely sequence rather than picking tokens in a greedy fashion
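
Here is a sketch of the first two strategies, using a tiny made-up vocabulary and hypothetical next-token logits; a real model would produce logits over its full vocabulary, and beam search would additionally keep several of the most likely partial sequences at each step.

import numpy as np

# Hypothetical next-token logits for the prompt (for illustration only)
vocab = ["courage", "skill", "effort", "luck"]
logits = np.array([2.0, 1.2, 0.8, -0.5])

# Convert logits to a probability distribution with a softmax
probs = np.exp(logits - logits.max())
probs /= probs.sum()

greedy = vocab[int(np.argmax(probs))]                      # pick the most likely token
sampled = np.random.default_rng(4).choice(vocab, p=probs)  # sample from the distribution
print(greedy, sampled)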

Course Recap, Evaluations, and References

DS701 Recap

In this course we introduced you to the Tools for Data Science.

We covered:

  • Important Python packages like Pandas, Scikit-Learn, PyTorch, statsmodels, and others.
  • Mathematical foundations of data science, including linear algebra, probability and statistics, and optimization
  • Unsupervised learning
    • Clustering
    • Dimensionality reduction
  • Supervised learning
    • Classification
    • Regression
  • Neural Networks and NLP
    • CNNs, RNNs, Transformers
  • Graphs
  • Recommender systems

Course Evaluations

Please be sure to fill out the course evaluations. They can be found at the following links:

Section A1

Follow this link to submit a course evaluation for lecture section A1.

Follow this link to submit a course evaluation for discussion section A2.

Section C1

Follow this link to submit a course evaluation for Section C1.


References

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. “Layer Normalization.” https://arxiv.org/abs/1607.06450.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv Preprint arXiv:1409.0473. http://arxiv.org/abs/1409.0473.
Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. 2000. “A Neural Probabilistic Language Model.” Advances in Neural Information Processing Systems 13.
Brown, Tom B, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” arXiv Preprint arXiv:2005.14165. https://doi.org/10.48550/arXiv.2005.14165.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv Preprint arXiv:1810.04805. https://doi.org/10.48550/arXiv.1810.04805.
Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6068): 533–36.
Sutskever, I. 2014. “Sequence to Sequence Learning with Neural Networks.” arXiv Preprint arXiv:1409.3215.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 11. Long Beach, CA, USA. https://arxiv.org/abs/1706.03762.