# Programming #13: Recurrent Neural Networks

Last time, we explored how a Convolutional Neural Network could be trained to recognize and classify patterns in an image. With a slight modification, a CNN could also be trained to generate new images. But what if we were given a series of frames in an animation and wanted our CNN to predict the next frame? We could feed it a bunch of two-frame pairs and see if it could learn that frame ‘a’ is usually followed by frame ‘b’, but this wouldn’t work very well.

What we really need is a neural network that is able to learn from longer sequences of data. For example, if all the previous frames show a ball flying in an arc, the neural network might be able to learn how quickly the ball is moving in each subsequent time period and predict the next frame based on that. This is where Recurrent Neural Networks (RNNs) come in.

Today, we’ll be conceptualizing and exploring RNNs by building a deep neural network that functions as part of an end-to-end machine translation pipeline. Our completed pipeline will accept English text as input and return the French translation as output. You can follow along with the code here.

### Where Vanilla Networks Fall Short

Say we want to predict a person’s height based on age and gender. We could feed a vanilla network [age, gender] or [gender, age]. There is no difference because there is no structure to the order of the input data in a normal neural network. But what if our inputs did have a naturally ordered structure that we wanted to exploit? For example, the phrase “my favorite color is red” is different than “color favorite my red is”. This is the problem that RNNs solve.
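To make this concrete, here is a minimal sketch of a single dense unit. The `dense` helper and all of the numbers are made up for illustration and are not part of the project code. Swapping the order of the inputs, together with their matching weights, gives exactly the same output, which is why a vanilla network has no built-in notion of input order.

```python
def dense(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias (a single dense unit, no activation)."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Hypothetical feature values and weights, chosen only for illustration.
age, gender = 30.0, 1.0
w_age, w_gender, bias = 0.5, 2.0, 1.0

out_age_first = dense([age, gender], [w_age, w_gender], bias)
out_gender_first = dense([gender, age], [w_gender, w_age], bias)

# The column order is arbitrary as long as each weight follows its feature.
print(out_age_first, out_gender_first)
```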

### Adding Memory to our Networks

If we give a trained MLP or CNN an input ‘X’, it will always return the same output ‘Y’. For example, if the network learned that the word “nail” in English corresponds to the word “ongle” in French, it will always return “ongle”. But what if we gave the network three words to translate: “hammer and nail”? The network would look at “hammer” and give its prediction, then look at “and” and give its prediction, and finally look at “nail”, where it would respond as it always does with “ongle”. This time, however, the correct translation would be “clou”.

The problem is that “ongle” is correct in the case of “finger nail” but not in the case of “hammer and nail”. Our network has no memory of what comes before “nail”, so it always gives the same answer regardless of context. This is referred to as the network being stateless. To solve the problem, we need to add memory to our network and make it stateful.

In practice, RNNs accomplish this by adjusting their calculations based on the inputs that they’ve recently seen. So when our translator receives “nail” as an input, it will also receive the memory of “hammer” and make the correct prediction.
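The difference between stateless and stateful can be sketched with a toy word-by-word translator. This is only an illustration: the tiny vocabulary, the function names, and the hand-written “if we have seen hammer” rule are all assumptions standing in for what an RNN would have to learn from data.

```python
# Hypothetical toy vocabulary, for illustration only.
WORD_TABLE = {"hammer": "marteau", "and": "et", "nail": "ongle"}

def translate_stateless(words):
    """Each word is translated on its own, so 'nail' always becomes 'ongle'."""
    return [WORD_TABLE[word] for word in words]

def translate_with_memory(words):
    """Keeps a memory of earlier words and uses it when translating 'nail'."""
    seen = set()
    output = []
    for word in words:
        if word == "nail" and "hammer" in seen:
            output.append("clou")  # context says this is the metal kind of nail
        else:
            output.append(WORD_TABLE[word])
        seen.add(word)
    return output

phrase = ["hammer", "and", "nail"]
print(translate_stateless(phrase))
print(translate_with_memory(phrase))
```

A real RNN does not store a set of words or follow hand-written rules; its memory is a vector of numbers that is updated at every step.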

### Where the Magic Happens

An RNN is just like other neural networks, but instead of ignoring previous data, it uses recurrent layers that feed their previous state back in as one of the inputs to their next set of calculations. This change allows RNNs to learn patterns in sequences of data that otherwise could not be learned. This could be used to predict the next word in a sequence of words, the next stock price in a sequence of stock prices, and many other sequence-based patterns.
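As a rough sketch of that idea, here is a single recurrent unit written in plain Python, using the common simple-RNN update `h = tanh(w_x * x + w_h * h_prev + b)`. The weights and the numeric word encodings are arbitrary assumptions, not trained values. The point is only that the same input (“nail”) leaves the unit in a different state depending on what came before it.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar recurrent unit: the new state mixes the input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_sequence(inputs, h0=0.0):
    """Feed the inputs in order, carrying the state forward, and return every state."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states

# Hypothetical numeric encodings for three words.
word_ids = {"finger": -1.0, "hammer": 1.0, "nail": 0.5}

state_finger_nail = run_sequence([word_ids["finger"], word_ids["nail"]])[-1]
state_hammer_nail = run_sequence([word_ids["hammer"], word_ids["nail"]])[-1]

# Same final input, different final state, because the history differs.
print(state_finger_nail, state_hammer_nail)
```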

### Natural Language Processing

One field where RNNs have made a huge impact is natural language processing. Natural languages are languages that contain large vocabularies of words with several different meanings and complex interactions. In other words, messy language. For example, the programming language Python wouldn’t be considered a natural language, but English would.

Researchers have spent decades trying to get computers to understand natural language. A simple dictionary would never work since there is an effectively infinite number of possible sentences. Instead, researchers have tried breaking down sentences into parts of speech and incorporating grammar rules into their programs. This works for simple phrases like “Siri buy an apple” but doesn’t generalize well to complicated phrases.

Researchers quickly moved away from these manually crafted rules and toward machine learning techniques that could learn automatically from large datasets. The most successful of these techniques have been RNNs, which are able to learn context in a phrase.

The natural language problem we will be working on today is machine translation. We will train an RNN on a small dataset to translate English sentences into French. The code can be found here. We can break this problem down into 4 steps:

1. Text Preprocessing
2. Feature Extraction
3. Modeling
4. Translation

### Text Preprocessing

Normally, the text we are given will be messy. For example, if we scrape a website for our vocabulary, we’ll end up with a bunch of HTML tags and markup that aren’t useful inputs for our objective. For this reason, text preprocessing is usually our first step in Natural Language Processing. Common text preprocessing steps include:

• Cleaning – removing unwanted symbols, tags, stopwords, etc., so that we are left with plain text.
• Normalization – making everything lowercase, removing punctuation, etc.

To make things easier, the data we will be working with is already pretty clean. Everything has been converted to lowercase and punctuation has been delimited using spaces.

### Feature Extraction

Now that we have a bunch of plain text, we need to convert it into a format that a neural network can use. Recall that a neural network is just a series of multiplication and addition operations. It’s not that easy to perform math on words, so we’ll need to transform our words into numbers. We accomplish this through two steps:

• Tokenization – converting words to numbers.
• Padding – making each array of numbers the same size.

When we tokenize a series of words we end up with a dictionary that matches each word to a number. For example, “The quick brown fox jumps over the lazy dog” turns into:

```
# Tokenize dictionary
{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5,
 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9}

Input:  The quick brown fox jumps over the lazy dog .
Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
```

When we pad, we just add zeros to the end of each tokenized sequence until every sequence is the same length.
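Both steps can be sketched in a few lines of plain Python. This is a simplified stand-in for whatever tokenizer the project actually uses: the helper names are hypothetical, and ids are assigned in order of first appearance, so they will not match the dictionary shown above. Ids start at 1 so that 0 stays free for padding.

```python
def build_vocab(sentences):
    """Map each distinct lowercase word to an integer id, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def tokenize(sentence, vocab):
    """Replace every word in the sentence with its id."""
    return [vocab[word] for word in sentence.lower().split()]

def pad(sequences, pad_value=0):
    """Append pad_value to each sequence until all match the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

sentences = ["The quick brown fox jumps over the lazy dog", "A quick study"]
vocab = build_vocab(sentences)
token_ids = [tokenize(s, vocab) for s in sentences]
padded = pad(token_ids)

print(padded[0])  # [1, 2, 3, 4, 5, 6, 1, 7, 8]
print(padded[1])  # [9, 2, 10, 0, 0, 0, 0, 0, 0]
```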

### Modeling – Basic RNN

Now that our data is ready to be fed into a neural network, we need to decide on our network architecture. Let’s start with a simple three-layer RNN. We’ll use Gated Recurrent Units as our recurrent layers:

```
from keras.models import Sequential
from keras.losses import sparse_categorical_crossentropy

# Initiate model
model = Sequential()

# Return sequences set to True to remember full sequence and not just last output
# Add modest recurrent dropout to prevent overfitting

# Add fully connected layer and softmax activation
# Add a Time distributed wrapper to dense layer

# Compile
learning_rate = .001
model.compile(loss=sparse_categorical_crossentropy,
              metrics=['accuracy'])
```

After training our network we get a validation accuracy of about 87%. Testing on the phrase “new jersey is sometimes quiet during autumn and it is snowy in april” we get:

```
Prediction: new jersey est parfois calme en l' automne l' il est neigeux en avril
Actual:     new jersey est parfois calme pendant l' automne et il est neigeux en avril
```
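For the curious, here is a rough sketch of what a single Gated Recurrent Unit computes at each step, reduced to scalars so it fits in plain Python. The parameter names and values are assumptions for illustration, and libraries differ on small details (for example, some swap the roles of `z` and `1 - z` in the last line), so treat this as one common formulation rather than the exact computation Keras performs.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One scalar GRU step: gates decide how much old state to keep or overwrite."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate controls how much history feeds into it.
    h_candidate = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # Blend the old state with the candidate according to the update gate.
    return (1.0 - z) * h_prev + z * h_candidate

# Hypothetical, untrained parameter values.
params = {"w_z": 0.4, "u_z": 0.3, "b_z": 0.0,
          "w_r": 0.6, "u_r": 0.2, "b_r": 0.0,
          "w_h": 0.9, "u_h": 0.7, "b_h": 0.0}

h = 0.0
for x in [1.0, 0.5, -1.0]:
    h = gru_step(x, h, params)
print(h)

# With the update gate forced shut, the unit simply remembers its old state.
closed = dict(params, b_z=-50.0)
h_kept = gru_step(1.0, 0.25, closed)
```

The gates are what let a GRU hold on to context, such as an earlier “hammer”, across many time steps instead of overwriting it with every new word.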

### Modeling – Complex RNN

Let’s try to improve on our model by adding:

1. Embedding – We’ve turned our words into single-number ids, but there’s an even better representation of a word called an embedding. An embedding is a vector representation of a word that sits close to the vectors of similar words in n-dimensional space, where n is the size of the embedding vectors. In other words, we take our words and run them through a separate neural network that outputs however many features we want. Words that are similar in meaning will be closer together.
2. Bidirectional Information – One limitation of a standard RNN is that it can’t see future input in the sequence, only the past. A Bidirectional RNN, however, also reads the sequence in reverse, so at each step the network has access to information from later inputs as well. This allows us to find context information not only from the words preceding our target, but also from the words following it.
3. Encoder-Decoder – As the name suggests, this model is made up of an encoder and decoder. The encoder creates a matrix representation of the sentence. The decoder takes this matrix as input and predicts the translation as output. Think of this as a two-step neural network. One network comes up with an encoding; the other comes up with the decoding.
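The first two ideas in the list above can be sketched in plain Python. Everything here is illustrative: the three-dimensional embedding vectors are hand-picked rather than learned, and the recurrent unit is a bare scalar with made-up weights. The sketch shows that similar words score a higher cosine similarity, and that the backward pass of a bidirectional reader makes later inputs visible at earlier positions.

```python
import math

# Hypothetical hand-picked embeddings; real ones are learned during training.
EMBEDDINGS = {
    "autumn": [0.9, 0.1, 0.0],
    "spring": [0.8, 0.2, 0.1],
    "hammer": [0.0, 0.9, 0.4],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rnn_states(inputs, w_x=0.5, w_h=0.8):
    """Run a scalar recurrent unit over the inputs and return the state at each step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair a left-to-right state with a right-to-left state for every position."""
    forward = rnn_states(inputs)
    backward = rnn_states(inputs[::-1])[::-1]  # re-align to the original positions
    return list(zip(forward, backward))

seasons = cosine_similarity(EMBEDDINGS["autumn"], EMBEDDINGS["spring"])
unrelated = cosine_similarity(EMBEDDINGS["autumn"], EMBEDDINGS["hammer"])
print(seasons > unrelated)

# Changing only the last input alters the backward state at position 0,
# while the forward state at position 0 stays the same.
first = bidirectional_states([1.0, 0.5, -1.0])
second = bidirectional_states([1.0, 0.5, 1.0])
print(first[0], second[0])
```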

Putting this all together:

```
# Initiate model
model = Sequential()

# Add Gated Recurrent Layers, Bidirectional, RepeatVector as encoder

# Add fully connected layer and softmax activation
# Add a Time distributed wrapper to dense layer

# Compile
learning_rate = .001
model.compile(loss=sparse_categorical_crossentropy,
              metrics=['accuracy'])
```

Testing on the same phrase as before, we get:

```
Prediction: new jersey est parfois calme pendant l' automne et il est neigeux en avril
```