Natural Language Processing (NLP) with TensorFlow - Live Blog Post
This notebook demonstrates Natural Language Processing using TensorFlow.
- Introduction to Natural Language Processing (NLP) Fundamentals in TensorFlow
Some common applications of NLP:
- Classification of articles into labels
- Text Generation
- Machine Translation
- Voice Assistants
All of these are also referred to as sequence problems.
Different types of sequence problems include one-to-one, one-to-many, many-to-one and many-to-many.
This Notebook covers:
- Downloading and preparing a text dataset
- How to prepare text data for modelling (tokenization and embedding)
- Setting up multiple modelling experiments with recurrent neural networks (RNNs)
- Building a text feature extraction model using TensorFlow Hub
- Finding the most wrong prediction examples
- Using a model we've built to make predictions on text from the wild.
Architecture of an RNN:
Hyperparameter/Layer type | What does it do? | Typical values |
---|---|---|
Input text(s) | Target texts/sequences you'd like to discover patterns in | Whatever you can represent as a text or a sequence |
Input layer | Takes in a target sequence | input_shape = [batch_size, embedding_size] or [batch_size, sequence_shape] |
Text vectorization layer | Maps input sequences to numbers | Multiple; can create with tf.keras.layers.experimental.preprocessing.TextVectorization |
Embedding | Turns the mapping of text vectors into an embedding matrix (a representation of how words relate) | Multiple; can create with tf.keras.layers.Embedding |
RNN cell(s) | Finds patterns in sequences | SimpleRNN, LSTM, GRU |
Hidden activation | Adds non-linearity to learned features (non-straight lines) | Usually tanh (hyperbolic tangent), tf.keras.activations.tanh |
Pooling layer | Reduces the dimensionality of learned sequence features (usually used in Conv1D models) | Average (tf.keras.layers.GlobalAveragePooling1D) or max (tf.keras.layers.GlobalMaxPool1D) |
Fully connected layer | Further refines learned features from recurrent layers | tf.keras.layers.Dense |
Output layer | Takes learned features and outputs them in the shape of the target labels | output_shape = [number_of_classes] (e.g. 2 for the Disaster/Not Disaster example) |
Output activation | Adds non-linearities to the output layer | tf.keras.activations.sigmoid (binary classification) or tf.keras.activations.softmax (multi-class classification) |
Example TensorFlow code for an RNN model:
# 1. Create an LSTM model (text_vectorizer and embedding are the preprocessing and embedding layers created later in this notebook)
import tensorflow as tf
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs) # turn the input sequence into numbers
x = embedding(x) # create an embedding of the numberised inputs
x = layers.LSTM(64, activation="tanh")(x) # RNN cell to find patterns in the sequence
outputs = layers.Dense(1, activation="sigmoid")(x) # binary output
model = tf.keras.Model(inputs, outputs, name="LSTM_model")
# 2. Compile the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])
# 3. Fit the model
history = model.fit(train_sentences, train_labels, epochs=5)
!nvidia-smi -L # check which GPU we have access to
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
# Import a series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys
# Download the text dataset (tweets labelled as disaster or not disaster)
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
# Unzip the dataset
unzip_data("nlp_getting_started.zip")
import pandas as pd
# Read in the training and test datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()
# Shuffle the training dataframe so the order of samples doesn't influence training
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()
test_df.head()
# How many examples of each class?
train_df.target.value_counts()
# How many samples in total?
len(train_df), len(test_df)
import random
random_index = random.randint(0, len(train_df)-5) # create random indexes
# Visualize 5 random training samples
for row in train_df_shuffled[["text", "target"]][random_index: random_index+5].itertuples():
    _, text, target = row
    print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
    print(f"Text:\n {text} \n")
    print("---\n")
from sklearn.model_selection import train_test_split
# Split the training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, # use 10% of the training data for validation
                                                                            random_state=42)
len(train_sentences), len(val_sentences), len(train_labels), len(val_labels)
train_sentences[:10], train_labels[:10]
Converting text into numbers
When dealing with a text problem, one of the first things you'll have to do before you can build a model is to convert your text to numbers.
There are a few ways to do this, namely:
- Tokenization: a straight mapping from token to number (can be modelled but quickly gets too big)
- Embedding: a richer representation of the relationships between tokens (can limit size + can be learned); a small code sketch follows the example below
Tokenization vs Embedding
E.g. I am a Human
I = 0
am = 1
a = 2
Human = 3
or using one-hot encoding:
[[1, 0, 0, 0],
 [0, 1, 0, 0],
 [0, 0, 1, 0],
 [0, 0, 0, 1]]
or by creating an embedding (a dense vector of learnable values per token):
[[0.492, 0.005, 0.019],
 [0.060, 0.233, 0.899],
 [0.741, 0.983, 0.567]]
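As a minimal runnable sketch of the difference (the toy vocabulary, the output_dim of 3 and the printed values are purely illustrative; the embedding values will differ every run because the layer starts from random weights):
import tensorflow as tf
# Toy vocabulary for the "I am a Human" example above (illustrative only)
vocab = {"I": 0, "am": 1, "a": 2, "Human": 3}
# 1. Tokenization: straight token -> number mapping
token_ids = [vocab[word] for word in "I am a Human".split()]  # [0, 1, 2, 3]
# 2. One-hot encoding: each token becomes a vector as long as the vocabulary
one_hot = tf.one_hot(token_ids, depth=len(vocab))  # shape (4, 4)
# 3. Embedding: each token becomes a small dense vector of learnable values
embed_layer = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=3)
embeddings = embed_layer(tf.constant(token_ids))  # shape (4, 3), values start random
print(token_ids)
print(one_hot.numpy())
print(embeddings.numpy())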
train_sentences[:5]
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
# Use the default TextVectorization parameters (just to demonstrate the default values of this instance)
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (None means no cap; an out-of-vocabulary token is added automatically)
                                    standardize="lower_and_strip_punctuation", # how to clean the text
                                    split="whitespace", # how to split the text into tokens
                                    ngrams=None, # create groups of n-words
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None) # how long the output sequences should be
# Note: pad_to_max_tokens=True is not valid when using max_tokens=None
# How many words are in the first training sentence?
len(train_sentences[0].split())
# Find the average number of words per tweet in the training set
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words of a tweet our model sees; roughly the average tweet length found above)
# Setup the text vectorizer with our custom variables
text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text: \n {random_sentence} \
\n \n Vectorized version: ")
text_vectorizer([random_sentence])
words_in_vocab = text_vectorizer.get_vocabulary() # get all of the unique words in our training data
top_5_words = words_in_vocab[:5] # get the most common words
bottom_5_words= words_in_vocab[-5:] # get the least common words
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"5 most common words: {top_5_words}")
print(f"5 least common words: {bottom_5_words}")
Creating an Embedding using an Embedding Layer
To make our embedding, we're going to use TensorFlow's Embedding layer.
The parameters we care most about for our embedding layer:
- input_dim = the size of our vocabulary
- output_dim = the size of the output embedding vector; for example, a value of 100 would mean each token gets represented by a vector 100 long
- input_length = the length of the sequences being passed to the embedding layer
from tensorflow.keras import layers
embedding = layers.Embedding(input_dim=max_vocab_length, # set input size (size of our vocabulary)
                             output_dim=128, # size of the embedding vector for each token
                             embeddings_initializer="uniform", # how to initialize the embedding matrix
                             input_length=max_length) # how long is each input
embedding
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n {random_sentence}\
\n \nEmbedded version: ")
# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed
# Check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence[0]
Modelling a text dataset
Experiment Number | Model |
---|---|
0 | Naive Bayes with TF-IDF encoder (baseline) |
1 | Feed-forward neural network (dense model) |
2 | LSTM (RNN) |
3 | GRU (RNN) |
4 | Bidirectional LSTM (RNN) |
5 | 1D Convolutional Neural Network |
6 | TensorFlow Hub Pretrained Feature Extractor |
7 | TensorFlow Hub Pretrained Feature Extractor (10% of data) |
Standard steps involved in running modelling experiments:
- Create a model
- Build a model
- Fit a model
- Evaluate the model
Model 0: Naive Bayes with TF-IDF encoder
To create our baseline, we'll use Scikit-Learn's Multinomial Naive Bayes with the TF-IDF formula to convert our words to numbers.
Note: It's common practice to use non-deep-learning algorithms as a baseline because of their speed; later we can use deep learning algorithms to see if we can improve upon them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Create tokenization and modelling pipeline
model_0 = Pipeline([
("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
("clf", MultinomialNB()) # Model the text
])
# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of : {baseline_score*100:.2f}%")
train_df.target.value_counts()
So our baseline model is doing better than random guessing, since each of the two label types makes up close to 50% of the dataset.
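As a quick sanity check on that claim, we can compare the baseline's accuracy to the accuracy of always predicting the majority class (a minimal sketch reusing the train_df_shuffled, val_labels and baseline_score variables defined above):
import numpy as np
# Proportion of each class in the training data
print(train_df.target.value_counts(normalize=True))
# Accuracy of always predicting the most common class on the validation set
majority_class = train_df_shuffled["target"].mode()[0]
majority_accuracy = np.mean(val_labels == majority_class)
print(f"Always-predict-{majority_class} accuracy: {majority_accuracy*100:.2f}%")
print(f"Baseline (TF-IDF + Naive Bayes) accuracy: {baseline_score*100:.2f}%")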
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]
Creating an evaluation function for our model experiments
Rather than evaluating each model's predictions with fresh metric code every time, we can write a single function and reuse it for all of the modelling experiments.
The function should output the following evaluation metrics:
- Accuracy
- Precision
- Recall
- F1-Score
Resource: Metrics and scoring: quantifying the quality of predictions
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def calculate_results(y_true, y_pred):
    """
    Calculates the accuracy, precision, recall and f1-score of a binary classification model.
    """
    # Calculate the model accuracy
    model_accuracy = accuracy_score(y_true, y_pred) * 100
    # Calculate model precision, recall and f1-score using a "weighted" average
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    model_results = {"accuracy": model_accuracy,
                     "precision": model_precision,
                     "recall": model_recall,
                     "f1": model_f1}
    return model_results
baseline_results = calculate_results(y_true = val_labels,
y_pred = baseline_preds)
baseline_results
from helper_functions import create_tensorboard_callback
# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the numberised inputs
x = layers.GlobalAveragePooling1D()(x) # lower the dimensionality of the embedding
outputs = layers.Dense(1, activation="sigmoid")(x) # create the output layer, want binary outputs so use sigmoid activation
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model
model_1.summary()
model_1.compile(loss = "binary_crossentropy",
optimizer = tf.keras.optimizers.Adam(),
metrics = ["accuracy"])
model_1_history = model_1.fit(train_sentences,
train_labels,
epochs = 5,
validation_data = (val_sentences,val_labels),
callbacks = [create_tensorboard_callback(dir_name = SAVE_DIR,
experiment_name = "model_1_dense" )])
model_1.evaluate(val_sentences, val_labels)
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape
model_1_pred_probs[1]
These are the prediction probabilities output by the sigmoid output layer. To compare them with the truth labels, we round them into binary predictions.
# Convert prediction probabilities to label format (0 or 1)
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]
model_1_results = calculate_results(y_true = val_labels,
y_pred = model_1_preds)
model_1_results
import numpy as np
# Is our simple dense model performing better than the baseline on each metric?
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))
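For a more readable comparison, we could put both result dictionaries into a pandas DataFrame (a small sketch using the baseline_results and model_1_results dictionaries created above):
import pandas as pd
# Put the baseline and model_1 results side by side
results_df = pd.DataFrame({"baseline": baseline_results,
                           "model_1_dense": model_1_results})
print(results_df)
# How much did model_1 improve (or regress) on each metric?
print(results_df["model_1_dense"] - results_df["baseline"])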
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]
model_1.summary()
# Get the weight matrix of the embedding layer
# (these are the numerical representations of each token in our training data, learned over 5 epochs of training)
embed_weights = model_1.get_layer("embedding_1").get_weights()[0] # the layer name may differ (e.g. "embedding") depending on how many Embedding layers have been created in the session
embed_weights # same size as vocab size and embedding_dim
print(embed_weights.shape) # (vocab_size, embedding_dim)
Now that we've got the embedding matrix our model has learned to represent our tokens, let's see how we can visualize it. To do so, TensorFlow has a tool called the Embedding Projector: https://projector.tensorflow.org/
TensorFlow also has an incredible guide on word embeddings: https://www.tensorflow.org/text/guide/word_embeddings
import io
# Create output writers for the embedding vectors and their metadata (the words)
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')
for index, word in enumerate(words_in_vocab):
    if index == 0:
        continue  # skip 0, it's padding
    vec = embed_weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
out_v.close()
out_m.close()
# Download the files locally when running in Google Colab (so they can be uploaded to the projector)
try:
    from google.colab import files
    files.download('vectors.tsv')
    files.download('metadata.tsv')
except Exception:
    pass
For reference, the signature of tf.keras.utils.text_dataset_from_directory, an alternative way to load a text dataset when the text lives in a directory of files rather than a CSV:
tf.keras.utils.text_dataset_from_directory(
    directory, labels='inferred', label_mode='int',
    class_names=None, batch_size=32, max_length=None, shuffle=True, seed=None,
    validation_split=None, subset=None, follow_links=False
)
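And a hedged sketch of how it could be used if the tweets were stored as individual text files in class-named folders (the directory layout and path below are hypothetical; our data comes from a CSV, so this function isn't used in this notebook):
import tensorflow as tf
# Hypothetical layout:
# disaster_tweets/
#     disaster/       <- one .txt file per disaster tweet
#     not_disaster/   <- one .txt file per non-disaster tweet
train_data = tf.keras.utils.text_dataset_from_directory(
    "disaster_tweets",        # hypothetical path
    labels="inferred",        # labels come from the folder names
    label_mode="int",
    batch_size=32,
    validation_split=0.1,
    subset="training",
    seed=42)
val_data = tf.keras.utils.text_dataset_from_directory(
    "disaster_tweets",
    labels="inferred",
    label_mode="int",
    batch_size=32,
    validation_split=0.1,
    subset="validation",
    seed=42)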