Programming #15: NLP Classification

Previously, we explored how a Recurrent Neural Network could be used to translate French text into English text and how a Convolutional Neural Network could be used to predict a dog’s breed based on a picture. Today, we’ll be playing around with combining the two in order to solve a difficult natural language processing problem:

Given a user comment from the internet, classify whether the comment is:

  • toxic
  • severe_toxic
  • obscene
  • threat
  • insult
  • identity_hate

As always, the full code for this project can be found on my GitHub.

This project will be completed as part of the Toxic Comment Classification Kaggle competition. From the competition page:

In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s current models. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.

Step 1: Gathering the Data

The data we’ll be using consists of a large number of Wikipedia comments which have been labeled by humans according to their relative toxicity. The data can be found here.

Note: The data is only as good as the human-supplied labels.

Our data contains 10,734,904 words, 532,299 of which are unique. The 10 most common are: “the”, “to”, “of”, “and”, “a”, “I”, “is”, “you”, “that”, and “in”. For comparison, the Oxford English Dictionary contains 171,476 full entries. One reason our count is so inflated is that we are treating uppercase and lowercase words as different, and we are also counting punctuation and other symbols that aren’t really useful for our goal. We’ll clean this up in the next step.
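
To see how case and punctuation inflate the unique-word count, here is a minimal sketch on a toy list of comments. The `comments` list and the `word_counts` helper are illustrative names made up for this example, not part of the project code.

```python
from collections import Counter

def word_counts(comments, lowercase=False):
    """Count whitespace-separated tokens across all comments."""
    counts = Counter()
    for comment in comments:
        if lowercase:
            comment = comment.lower()
        counts.update(comment.split())
    return counts

# Toy stand-in for the Wikipedia comments (hypothetical data).
comments = ["The edit was fine.", "the edit was FINE", "Thanks for the edit!"]

raw = word_counts(comments)
lowered = word_counts(comments, lowercase=True)

print("total words:", sum(raw.values()))
print("unique words (raw):", len(raw))
print("unique words (lowercased):", len(lowered))
print("most common:", raw.most_common(3))
```

Even after lowercasing, "fine." and "fine" still count as different words here, which is why punctuation also has to be stripped.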

Examples of our data:

Comment #1: Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.
Label #1: [0 0 0 0 0 0]

Comment #2: D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)
Label #2: [0 0 0 0 0 0]

Comment #3: Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.
Label #3: [0 0 0 0 0 0]

Step 2: Preprocessing the Data

Before continuing, we’ll have to preprocess our data a bit so that it’s in a format we can input into a neural network. Let’s:

  1. Remove irrelevant characters (!”#$%&()*+,-./:;?@[\\]^_`{|}~\t\n).
  2. Convert all letters to lowercase (HeLlO -> hello).
  3. Tokenize our words (hi how are you -> [23, 1, 5, 13]).
  4. Standardize our input length with padding (hi how are you -> [23, 1, 5, 13, 0, 0, 0]).

We could go further and consider combining misspellings, slang, or different word inflections into single base words. However, one benefit of neural networks is that they do well with raw input, so we’ll stick with what we have listed.
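
To make the four steps concrete, here is a rough pure-Python sketch of the same pipeline on two toy sentences. The helper names (`clean`, `build_word_index`, `encode`), the filter set, and the toy data are all assumptions made for illustration. The integer ids it produces will not match the ones in the list above.

```python
import string

# Characters to strip: ASCII punctuation except the apostrophe, plus tab/newline.
# (An assumption for this sketch; the exact filter set is a design choice.)
FILTERS = string.punctuation.replace("'", "") + "\t\n"

def clean(text):
    """Replace filtered characters with spaces and lowercase the result."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.translate(table).lower()

def build_word_index(texts):
    """Assign each word an integer id starting at 1, most frequent first."""
    counts = {}
    for text in texts:
        for word in clean(text).split():
            counts[word] = counts.get(word, 0) + 1
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def encode(text, word_index, max_len):
    """Tokenize, drop unknown words, then truncate or zero-pad at the end."""
    ids = [word_index[w] for w in clean(text).split() if w in word_index]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))

toy_texts = ["Hi, how are you?", "How ARE you doing"]
toy_index = build_word_index(toy_texts)

print(toy_index)
print(encode("Hi, how are you?", toy_index, max_len=7))
```

Id 0 is reserved for padding, which is why word ids start at 1.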

Lucky for us, Keras makes all of this pretty simple to do.

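from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Create tokenizer (by default it strips punctuation and lowercases the text)
tokenizer = Tokenizer(num_words=None,
                      split=" ")

# Fit and run tokenizer (fitting on the training comments is assumed here)
tokenizer.fit_on_texts(X_train)
tokenized_train = tokenizer.texts_to_sequences(X_train)
tokenized_test = tokenizer.texts_to_sequences(X_test)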
word_index = tokenizer.word_index

# Pad sequences
processed_X_train = pad_sequences(tokenized_train, maxlen=max_len, padding='post', truncating='post')
processed_X_test = pad_sequences(tokenized_test, maxlen=max_len, padding='post', truncating='post')

After preprocessing, our vocabulary size drops to a more manageable 210,337.

Step 3: Embedding the Data

The most obvious data representation for our vocabulary is one-hot encoding, where every word is transformed into a vector with a 1 in its corresponding location. For example, if our vocabulary is [hi, how, are, you] and the word we are looking at is “you”, the input vector for “you” would just be [0, 0, 0, 1]. This works fine unless our vocabulary is huge – in this case, roughly 210,000 words – which means every word vector would have roughly 210,000 entries, all but one of them 0.
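
Here is a minimal sketch of one-hot encoding for the four-word example. The `one_hot` helper is a made-up name for illustration.

```python
def one_hot(word, vocabulary):
    """Return a list with a 1 at the word's position and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

vocabulary = ["hi", "how", "are", "you"]
print(one_hot("you", vocabulary))  # [0, 0, 0, 1]

# Entries needed per word, using the sizes quoted in this post.
vocab_size, embedding_dim = 210337, 300
print(vocab_size, "one-hot entries vs", embedding_dim, "embedding entries")
```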

Instead, we can use a Word2Vec-style technique to find continuous embeddings for our words. Here, we’ll be using the pretrained FastText embeddings from Facebook to produce a 300-dimension vector for each word in our vocabulary.

The benefit of this continuous embedding is that words with similar meanings sit closer together in the embedding space, so words with similar predictive power tend to be treated alike, making training easier. The downside is that this creates more of a black box, where the individual words with the most predictive power get lost in the numbers.
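
"Closer together" is usually measured with cosine similarity. The sketch below uses three invented 3-dimensional vectors purely for illustration. Real FastText vectors have 300 dimensions, and their values are not shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example only.
toy_embeddings = {
    "stupid": [0.9, 0.1, 0.0],
    "dumb":   [0.8, 0.2, 0.1],
    "thanks": [0.0, 0.1, 0.9],
}

similar = cosine_similarity(toy_embeddings["stupid"], toy_embeddings["dumb"])
different = cosine_similarity(toy_embeddings["stupid"], toy_embeddings["thanks"])
print("stupid vs dumb:   {:.2f}".format(similar))
print("stupid vs thanks: {:.2f}".format(different))
```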

import numpy as np

embedding_dim = 300

# Get embeddings
embeddings_index = {}
with open('X:\\utility_data\\wiki.en.vec', encoding="utf8") as f:
    for line in f:
        values = line.rstrip().rsplit(' ', embedding_dim)
        if len(values) != embedding_dim + 1:
            continue  # skip malformed lines such as the .vec header
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found {} word vectors.'.format(len(embeddings_index)))

# Build embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
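
Since words without a pretrained vector end up as all-zero rows, it can be worth checking how much of the vocabulary is actually covered. This is a small sketch with toy dictionaries standing in for `word_index` and `embeddings_index`. The `embedding_coverage` helper and the toy data are assumptions for illustration.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return (fraction of vocabulary with a pretrained vector, missing words)."""
    missing = [w for w in word_index if w not in embeddings_index]
    covered = len(word_index) - len(missing)
    fraction = covered / len(word_index) if word_index else 0.0
    return fraction, missing

# Toy stand-ins with the same structure as word_index and embeddings_index.
toy_word_index = {"the": 1, "edit": 2, "d'aww": 3, "talk": 4}
toy_embeddings_index = {"the": [0.1], "edit": [0.2], "talk": [0.3]}

fraction, missing = embedding_coverage(toy_word_index, toy_embeddings_index)
print("{:.0%} covered, missing: {}".format(fraction, missing))
```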

Step 4: Building the Model

For learning purposes, we’ll build a neural network architecture that is more complicated than it needs to be. We’ll stack four layers:

  1. Embedding layer – word vector representations.
  2. Bidirectional GRU layer – extract sequential context, such as the words that came before and after the current word.
  3. Convolutional layer – run multiple filters over that temporal data.
  4. Fully connected layer – classify input based on filters.
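
To get a feel for what each layer hands to the next, here is a rough shape trace for a single comment. The layer sizes (300 GRU units, 128 filters, 50 dense units, 6 outputs) come from the model listing in this post. The `trace_shapes` helper, the `max_len=100` value, and the global max pooling step between the convolution and the dense layers are assumptions made for this sketch.

```python
def trace_shapes(max_len, embedding_dim=300, gru_units=300,
                 conv_filters=128, dense_units=50, n_labels=6):
    """Return (layer name, per-comment output shape) pairs."""
    return [
        ("embedding", (max_len, embedding_dim)),
        # Bidirectional concatenates the forward and backward states.
        ("bidirectional_gru", (max_len, 2 * gru_units)),
        # padding='same' with stride 1 keeps the sequence length.
        ("conv1d", (max_len, conv_filters)),
        # Assumed step: max over time, leaving one value per filter.
        ("global_max_pool", (conv_filters,)),
        ("dense", (dense_units,)),
        ("output", (n_labels,)),
    ]

shapes = trace_shapes(max_len=100)  # max_len=100 is a hypothetical value
for name, shape in shapes:
    print("{:<18} {}".format(name, shape))
```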

The idea is that our recurrent layer will find temporal data which it will pass to our convolutional layer, where filters will be learned to detect toxicity. Let’s code this up:

from keras.models import Sequential
from keras.layers import CuDNNGRU, Dense, Conv1D, MaxPooling1D
from keras.layers import Dropout, GlobalMaxPooling1D, BatchNormalization
from keras.layers import Bidirectional
from keras.layers.embeddings import Embedding
from keras.optimizers import Nadam

# Initialize model
model = Sequential()

# Add Embedding layer (input size must match the row count of embedding_matrix)
model.add(Embedding(len(word_index) + 1, embedding_dim, weights=[embedding_matrix],
                    input_length=max_len, trainable=True))

# Add Recurrent layers
model.add(Bidirectional(CuDNNGRU(300, return_sequences=True)))

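# Add Convolutional layer
model.add(Conv1D(filters=128, kernel_size=5, padding='same', activation='relu'))

# Collapse the time dimension so each comment yields one vector
# (needed for a per-comment output; the placement is assumed)
model.add(GlobalMaxPooling1D())

# Add fully connected layers
model.add(Dense(50, activation='relu'))
model.add(Dense(6, activation='sigmoid'))

# Summarize the model
model.summary()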

Step 5: Evaluation

To evaluate our model, we’ll be looking at its AUC ROC score (area under the receiver operating characteristic curve). This is a fancy way to say we will be looking at the probability that our model ranks a randomly chosen positive instance higher than a randomly chosen negative one. With data that mostly consists of negative labels (no toxicity), our model could just learn to always predict negative and end up with a pretty high accuracy. AUC ROC corrects for this because it only rewards ranking positive examples above negative ones: a model that gives every comment the same score gets an AUC ROC of just 50%.
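
The ranking definition can be computed directly by comparing every positive with every negative. The sketch below does that on toy data and contrasts it with accuracy. The `roc_auc` and `accuracy` helpers, the labels, and the scores are all made up for illustration. The pairwise loop is fine for a demo but too slow for a real dataset.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of labels matched after thresholding the scores."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Toy data: 1 toxic comment out of 10 (hypothetical labels and scores).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0.0] * 10
useful_model = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.1, 0.45]

for name, scores in [("always negative", always_negative),
                     ("useful model", useful_model)]:
    print("{:<16} accuracy={:.2f} auc={:.2f}".format(
        name, accuracy(labels, scores), roc_auc(labels, scores)))
```

Both toy models have the same accuracy, but only the one that scores the toxic comment highest gets a high AUC ROC.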

After training our model, we end up with an AUC ROC score of 98.26%. Not bad for a first run. To do even better we can:

  • Add more layers.
  • Experiment with different dropout and normalization techniques.
  • Experiment with different layers and parameters.
  • Experiment with cleaning the data more (translation, label adjustments, etc).

Step 6: Use Case

Finally, let’s build an app pipeline that can be put into production for toxic comment classification:

def toxicity_level(string):
    """Return toxicity probability based on input string."""
    # Process string
    new_string = [string]
    new_string = tokenizer.texts_to_sequences(new_string)
    new_string = pad_sequences(new_string, maxlen=max_len, padding='post', truncating='post')
    # Predict
    prediction = model.predict(new_string)
    # Print output
    print("Toxicity levels for '{}':".format(string))
    print('Toxic:         {:.0%}'.format(prediction[0][0]))
    print('Severe Toxic:  {:.0%}'.format(prediction[0][1]))
    print('Obscene:       {:.0%}'.format(prediction[0][2]))
    print('Threat:        {:.0%}'.format(prediction[0][3]))
    print('Insult:        {:.0%}'.format(prediction[0][4]))
    print('Identity Hate: {:.0%}'.format(prediction[0][5]))

Some examples:

Toxicity levels for 'go jump off a bridge jerk':
Toxic:         99%
Severe Toxic:  12%
Obscene:       91%
Threat:        2%
Insult:        93%
Identity Hate: 4%

Toxicity levels for 'i will kill you':
Toxic:         87%
Severe Toxic:  8%
Obscene:       41%
Threat:        63%
Insult:        59%
Identity Hate: 12%

Toxicity levels for 'have a nice day':
Toxic:         0%
Severe Toxic:  0%
Obscene:       0%
Threat:        0%
Insult:        0%
Identity Hate: 0%

Toxicity levels for 'hola, como estas':
Toxic:         0%
Severe Toxic:  0%
Obscene:       0%
Threat:        0%
Insult:        0%
Identity Hate: 0%

Toxicity levels for 'hola mierda joder':
Toxic:         16%
Severe Toxic:  0%
Obscene:       9%
Threat:        0%
Insult:        1%
Identity Hate: 0%
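
Inside an application, it is usually handier to get the scores back as data than to print them. Here is one possible sketch of that idea. The `label_scores` helper and the 0.5 threshold are assumptions for illustration, and the input is the rounded output from the 'i will kill you' example above rather than a live model call.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def label_scores(probabilities, threshold=0.5):
    """Pair each label with its probability and a boolean flag."""
    if len(probabilities) != len(LABELS):
        raise ValueError("expected {} probabilities".format(len(LABELS)))
    return {
        label: {"probability": float(p), "flagged": float(p) >= threshold}
        for label, p in zip(LABELS, probabilities)
    }

# Rounded values from the 'i will kill you' example output.
example = [0.87, 0.08, 0.41, 0.63, 0.59, 0.12]
result = label_scores(example)
flagged = [label for label, info in result.items() if info["flagged"]]
print(flagged)
```

The threshold is a product decision: a lower value catches more toxic comments at the cost of more false alarms.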


