[TensorFlow Learning Notes] 5: Text Classification with RNNs

Using an LSTM

First, import the libraries:

import tensorflow_datasets as tfds
import tensorflow as tf
print(tf.__version__)

Load the dataset:

# Get the data
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
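
The text in this dataset is already stored as subword ids produced by a pre-built SubwordTextEncoder. As a quick sanity check, here is a minimal sketch that round-trips a string through the encoder (the sample string is just an illustration):

encoder = info.features['text'].encoder
print(encoder.vocab_size)      # 8185 subwords for subwords8k

sample = 'TensorFlow is cool.'
ids = encoder.encode(sample)   # text -> list of subword ids
print(ids)
print(encoder.decode(ids))     # ids -> original text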

Preprocess the data: pad the sequences and group them into batches.

# The dataset ships with a pre-built subword tokenizer
tokenizer = info.features['text'].encoder

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
# Pad every sequence in a batch to the length of the longest element
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_dataset))
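
Note that tf.compat.v1.data.get_output_shapes comes from the TF1 compatibility layer. In TF 2.2 and later, padded_batch can infer the padded shapes on its own, so an equivalent form (a sketch, not verified against the exact TF version used here) is:

# padded_shapes defaults to the longest element in each batch,
# so the compat.v1 helper is unnecessary in newer TF 2.x
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)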

Define the model:

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

The output is as follows:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 523840
_________________________________________________________________
bidirectional (Bidirectional (None, 128) 66048
_________________________________________________________________
dense (Dense) (None, 64) 8256
_________________________________________________________________
dense_1 (Dense) (None, 1) 65
=================================================================
Total params: 598,209
Trainable params: 598,209
Non-trainable params: 0
_________________________________________________________________
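
As a sanity check on these numbers: the Embedding layer holds 8185 × 64 = 523,840 weights, and the Bidirectional LSTM holds 2 directions × 4 gates × (64 input + 64 recurrent + 1 bias) × 64 units = 66,048 parameters.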

Compile and train the model:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
NUM_EPOCHS = 10
history = model.fit(train_dataset, epochs=NUM_EPOCHS, validation_data=test_dataset)

Plot the training accuracy and loss curves, as in the sketch below.
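
A minimal plotting sketch, assuming the history object returned by model.fit above (plot_graphs is a helper defined here, not a Keras built-in):

import matplotlib.pyplot as plt

def plot_graphs(history, metric):
    # Draw a training metric together with its validation counterpart
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric])
    plt.xlabel('Epochs')
    plt.ylabel(metric)
    plt.legend([metric, 'val_' + metric])
    plt.show()

plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')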

If you stack multiple LSTM layers, every LSTM except the last must set return_sequences=True so that it emits the full output sequence for the next layer to consume.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()

The model structure is printed as follows:

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 64) 523840
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 128) 66048
_________________________________________________________________
bidirectional_2 (Bidirection (None, 64) 41216
_________________________________________________________________
dense_2 (Dense) (None, 64) 4160
_________________________________________________________________
dense_3 (Dense) (None, 1) 65
=================================================================
Total params: 635,329
Trainable params: 635,329
Non-trainable params: 0
_________________________________________________________________

You can also use a one-dimensional convolution; the GlobalAveragePooling1D layer then collapses the variable-length time dimension into a fixed-size vector for the dense layers.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()

The model structure is printed as follows:

Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, None, 64) 523840
_________________________________________________________________
conv1d (Conv1D) (None, None, 128) 41088
_________________________________________________________________
global_average_pooling1d (Gl (None, 128) 0
_________________________________________________________________
dense_4 (Dense) (None, 64) 8256
_________________________________________________________________
dense_5 (Dense) (None, 1) 65
=================================================================
Total params: 573,249
Trainable params: 573,249
Non-trainable params: 0
_________________________________________________________________
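
The Conv1D parameter count can be verified by hand: 5 (kernel width) × 64 (input channels) × 128 (filters) + 128 biases = 41,088.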

In all of the model summaries above the sequence length is None, because the initial Embedding layer does not specify input_length.

Below is an example that uses a GRU and specifies input_length:

model = tf.keras.Sequential([
    # vocab_size comes from a tokenizer fitted on a text corpus
    # (see the practical example below)
    tf.keras.layers.Embedding(vocab_size, 100, input_length=16),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

The model structure is printed as follows:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 16, 100) 13811100
_________________________________________________________________
bidirectional (Bidirectional (None, 64) 25728
_________________________________________________________________
dense (Dense) (None, 6) 390
_________________________________________________________________
dense_1 (Dense) (None, 1) 7
=================================================================
Total params: 13,837,225
Trainable params: 13,837,225
Non-trainable params: 0
_________________________________________________________________

A practical example

Below is a binary sentiment-classification problem.

Dataset download link: https://storage.googleapis.com/laurencemoroney-blog.appspot.com/training_cleaned.csv

We use pretrained GloVe word vectors from https://nlp.stanford.edu/projects/glove/; each line of the file holds a word followed by the components of its 100-dimensional vector. GloVe vector download link: https://storage.googleapis.com/laurencemoroney-blog.appspot.com/glove.6B.100d.txt

First, import the libraries and set the hyperparameters:

import json
import tensorflow as tf
import csv
import random
import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import regularizers


embedding_dim = 100
max_length = 16
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
training_size=160000
test_portion=.1

corpus = []

Download the data and read out the text and labels.

# Note that I cleaned the Stanford dataset to remove LATIN1 encoding to make it easier for Python CSV reader
# You can do that yourself with:
# iconv -f LATIN1 -t UTF8 training.1600000.processed.noemoticon.csv -o training_cleaned.csv
# I then hosted it on my site to make it easier to use in this notebook

!wget --no-check-certificate https://storage.googleapis.com/laurencemoroney-blog.appspot.com/training_cleaned.csv -O training_cleaned.csv

num_sentences = 0

with open("training_cleaned.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        list_item = []
        # Column 5 holds the tweet text
        list_item.append(row[5])
        # Column 0 holds the label
        this_label = row[0]
        if this_label == '0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        corpus.append(list_item)

Print the data:

print(num_sentences)
print(len(corpus))
print(corpus[1])

The output is as follows:

1600000
1600000
["is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!", 0]

Tokenize the text and split it into training and test sets.

sentences = []
labels = []
random.shuffle(corpus)
for x in range(training_size):
    # Each element of the corpus is [text, label]
    sentences.append(corpus[x][0])
    labels.append(corpus[x][1])


tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
vocab_size = len(word_index)

# Tokenize the sentences
sequences = tokenizer.texts_to_sequences(sentences)
# Pad and truncate to max_length
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

split = int(test_portion * training_size)

# Split into training and test sets
test_sequences = padded[0:split]
training_sequences = padded[split:training_size]
test_labels = labels[0:split]
training_labels = labels[split:training_size]

Download the word-vector file and build the embedding matrix.

# Note this is the 100 dimension version of GloVe from Stanford
# I unzipped and hosted it on my site to make this notebook easier
!wget --no-check-certificate https://storage.googleapis.com/laurencemoroney-blog.appspot.com/glove.6B.100d.txt -O glove.6B.100d.txt

# Build a dict mapping each word to its vector
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        # The first element on each line is the word
        word = values[0]
        # The remaining elements are the vector components
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# vocab_size + 1 rows because Keras word indices start at 1;
# row 0 is reserved for padding
embeddings_matrix = np.zeros((vocab_size + 1, embedding_dim))
# Note that the loop unpacks (word, index) pairs
for word, i in word_index.items():
    # Look up the pretrained vector for this word
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Place it at the word's row in the matrix
        embeddings_matrix[i] = embedding_vector
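
Optionally, you can check how many vocabulary words actually received a pretrained vector; hits below is an illustrative name, and rows left all-zero correspond to words missing from GloVe:

# Count non-zero rows, i.e. words covered by GloVe
hits = sum(1 for i in range(1, vocab_size + 1) if embeddings_matrix[i].any())
print(f'{hits}/{vocab_size} words covered by GloVe')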

Print the number of rows in the embedding matrix (vocab_size + 1):

print(len(embeddings_matrix))
# Expected Output
# 138859

Define the model. Note that the Embedding layer sets weights=[embeddings_matrix] so that it is initialized with the pretrained vectors, while trainable=False keeps the word vectors fixed during training.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length, weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

The model structure is as follows:

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 16, 100) 13811200
_________________________________________________________________
dropout (Dropout) (None, 16, 100) 0
_________________________________________________________________
conv1d (Conv1D) (None, 12, 64) 32064
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 3, 64) 0
_________________________________________________________________
lstm (LSTM) (None, 64) 33024
_________________________________________________________________
dense_2 (Dense) (None, 1) 65
=================================================================
Total params: 13,876,353
Trainable params: 65,153
Non-trainable params: 13,811,200
_________________________________________________________________
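
Note that only 65,153 of the 13,876,353 parameters are trainable; the remaining 13,811,200 embedding weights are frozen by trainable=False.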

Train the model:

num_epochs = 50

# Keras expects NumPy arrays rather than Python lists
training_padded = np.array(training_sequences)
training_labels = np.array(training_labels)
testing_padded = np.array(test_sequences)
testing_labels = np.array(test_labels)

history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

print("Training Complete")

Plot the accuracy and loss curves recorded during training.

import matplotlib.pyplot as plt

#-----------------------------------------------------------
# Retrieve a list of list results on training and test data
# sets for each training epoch
#-----------------------------------------------------------
acc=history.history['accuracy']
val_acc=history.history['val_accuracy']
loss=history.history['loss']
val_loss=history.history['val_loss']

epochs=range(len(acc)) # Get number of epochs

#------------------------------------------------
# Plot training and validation accuracy per epoch
#------------------------------------------------
plt.plot(epochs, acc, 'r')
plt.plot(epochs, val_acc, 'b')
plt.title('Training and validation accuracy')
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend(["Accuracy", "Validation Accuracy"])

plt.figure()

#------------------------------------------------
# Plot training and validation loss per epoch
#------------------------------------------------
plt.plot(epochs, loss, 'r')
plt.plot(epochs, val_loss, 'b')
plt.title('Training and validation loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(["Loss", "Validation Loss"])

plt.figure()


# Expected Output
# A chart where the validation loss does not increase sharply!

[Figure: training and validation accuracy and loss curves]


