TensorFlow / [TensorFlow Learning Notes] 4 Text Data Preprocessing

Starting with this article, we move on to the NLP module.

Since text cannot be fed into a network directly, we start with text data preprocessing.

First, download the dataset.

wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv -O /tmp/bbc-text.csv

Import the libraries and define the stopword list.

import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



#Stopwords list from https://github.com/Yoast/YoastSEO.js/blob/develop/src/config/stopwords.js
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

Read the file and remove the stopwords.

sentences = []
labels = []
with open("/tmp/bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
        sentence = sentence.replace("  ", " ")  # collapse the double spaces left behind
        sentences.append(sentence)


print(len(sentences))
print(sentences[0])

The output is:

2225
tv future hands ...

Define the Tokenizer. If the num_words argument is passed, it sets the maximum number of words to keep, based on word frequency; only the num_words most frequent words are kept.

The Tokenizer's word_index is a dict whose keys are words and whose values are integers, i.e. the mapping between words and indices.

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))

The output is:

29714
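
The Tokenizer above was created without num_words, so the full vocabulary is used. Below is a minimal sketch, using made-up toy sentences, of how num_words interacts with the <OOV> token (this toy_tokenizer is purely illustrative):

# Toy example: with num_words=3 only the most frequent words (indices below 3) survive
# in texts_to_sequences; word_index itself is always built over the full vocabulary.
toy_tokenizer = Tokenizer(num_words=3, oov_token="<OOV>")
toy_tokenizer.fit_on_texts(["the cat sat", "the cat ran", "a dog ran"])
print(toy_tokenizer.word_index)
print(toy_tokenizer.texts_to_sequences(["the dog sat"]))  # rarer words map to the <OOV> index 1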

Convert the texts to sequences of integers and pad them.

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

The output is:

[  96  176 1158 ...    0    0    0]
(2225, 2442)
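
By default pad_sequences pads and truncates at the front; padding='post' appends zeros at the end instead, which is what produces the trailing zeros above. A tiny illustration with hand-made sequences:

print(pad_sequences([[1, 2, 3], [4, 5]]))                  # default 'pre': zeros added at the front
print(pad_sequences([[1, 2, 3], [4, 5]], padding='post'))  # 'post': zeros added at the end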

The labels are tokenized as well.

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_word_index = label_tokenizer.word_index
label_seq = label_tokenizer.texts_to_sequences(labels)
print(label_seq)
print(label_word_index)

The output is:

# Expected Output
# [[4], [2], [1], ...]
# {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}
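
Note that the Tokenizer assigns indices starting from 1, so the label sequences above are 1-based. One possible way (a sketch, not part of the original notebook) to turn them into 0-based class ids for training:

import numpy as np

# label_seq is a list of single-element lists such as [[4], [2], ...];
# subtracting 1 gives class ids 0..4, e.g. for sparse_categorical_crossentropy
label_array = np.array(label_seq) - 1
print(label_array[:3].flatten())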

IMDB sentiment classification

IMDB is a movie-review dataset with two classes, positive and negative.

Let's see how to build a network that performs sentiment classification.

First, import the libraries.

import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

Read the training and test sets into lists, then convert the labels to arrays.

import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

# s is a tf.Tensor of bytes; call .numpy() and decode it to get a Python string
for s, l in train_data:
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())

for s, l in test_data:
    testing_sentences.append(s.numpy().decode('utf8'))
    testing_labels.append(l.numpy())

# Convert the label lists to numpy arrays
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

Preprocess the text data.

# Keep only the 10000 most frequent words; everything else becomes OOV
vocab_size = 10000
embedding_dim = 16
# Sequences longer than max_length are truncated
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(training_sentences)
# Sequences longer than max_length are truncated at the end ('post')
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)

Convert the padded sequences back to text.

# Build the reverse dict: keys are indices, values are words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

# Print the decoded sentence; out-of-vocabulary words show up as <OOV>, padding as '?'
print(decode_review(padded[3]))
# Print the original sentence
print(training_sentences[3])

Define the model:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

The model summary:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 120, 16) 160000
_________________________________________________________________
flatten (Flatten) (None, 1920) 0
_________________________________________________________________
dense (Dense) (None, 6) 11526
_________________________________________________________________
dense_1 (Dense) (None, 1) 7
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0

Train the model:

num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

Retrieve the weights of the first layer (the Embedding layer).

e = model.layers[0]
# e.get_weights() returns a list with a single element: the embedding matrix
weights = e.get_weights()[0]
print(weights.shape)  # shape: (vocab_size, embedding_dim)

The output is (10000, 16).
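
Row i of weights is the learned vector for the word whose index is i in word_index. A minimal sketch of looking up a single word's embedding (assuming the word 'movie' is inside the 10000-word vocabulary):

# Hypothetical lookup of one word's 16-dimensional vector
idx = word_index.get('movie')
if idx is not None and idx < vocab_size:
    print(weights[idx])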

Save the weights to files: meta.tsv holds the words and vecs.tsv holds the word vectors.

import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
# Index 0 is reserved for padding, so start from 1
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    # Save the word
    out_m.write(word + "\n")
    # Save its embedding vector
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

Open https://projector.tensorflow.org/, click the Load button, and upload the two files in the dialog that appears. You can then visualize the word vectors.
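
If you are running in Colab, the two files can be downloaded to your machine first; this sketch only applies inside the Colab environment:

# Only works inside Google Colab; harmless to skip elsewhere
try:
    from google.colab import files
    files.download('vecs.tsv')
    files.download('meta.tsv')
except ImportError:
    pass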


Below is a tokenization example.

sentence = "I really think this is amazing. honest."
sequence = tokenizer.texts_to_sequences([sentence])
print(sequence)

The output is:

[[11, 64, 102, 12, 7, 478, 1200]]
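
As a sanity check, the indices can be decoded back with the decode_review helper defined earlier; any word outside the 10000-word vocabulary would come back as <OOV>:

print(decode_review(sequence[0]))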

Binary classification on the sarcasm dataset

First, import the libraries.

import json
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Define the hyperparameters.

vocab_size = 10000
embedding_dim = 16
max_length = 100
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
training_size = 20000

Download the data.

!wget --no-check-certificate \
https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json -O sarcasm.json

Read the dataset and the labels.

with open("sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []

for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])

Split the data into training and validation sets.

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

Preprocess the data.

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

Convert the lists to arrays.

# Need this block to get it to work with TensorFlow 2.x
import numpy as np
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

Define the model.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

The model summary:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 100, 16) 160000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16) 0
_________________________________________________________________
dense (Dense) (None, 24) 408
_________________________________________________________________
dense_1 (Dense) (None, 1) 25
=================================================================
Total params: 160,433
Trainable params: 160,433
Non-trainable params: 0

Train the model.

num_epochs = 30
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

Then plot the training curves.

import matplotlib.pyplot as plt


def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

The resulting plots show training and validation accuracy and loss over the epochs.



To predict on new sentences, the same preprocessing has to be applied first.

sentence = ["granny starting to fear spiders in the garden might be real", "game of thrones season finale showing this sunday night"]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(model.predict(padded))

The predictions are:

[[9.5331246e-01]
[2.6024200e-04]]
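
The sigmoid output is the probability of the positive class (sarcastic). A small sketch of turning the probabilities into class labels with a 0.5 threshold (the threshold itself is a modelling choice):

probs = model.predict(padded)
# 1 = sarcastic, 0 = not sarcastic; given the probabilities above this prints [1 0]
print((probs > 0.5).astype(int).flatten())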

Using the IMDB subword encoding

TensorFlow Datasets provides a subword-encoded version of the IMDB dataset; below we use this encoding.

import tensorflow as tf

# If the import fails, run this
# !pip install -q tensorflow-datasets

import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)

Get the training and test sets, along with the tokenizer (a subword encoder).

train_data, test_data = imdb['train'], imdb['test']
tokenizer = info.features['text'].encoder
print(tokenizer.subwords)

The output is:

['the_', ', ', '. ', 'a_', 'and_'...]
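
The encoder also exposes its vocabulary size, which the Embedding layer below relies on:

print(tokenizer.vocab_size)  # 8185 for imdb_reviews/subwords8k (8185 * 64 = 523,840 embedding parameters below)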

Tokenize a sample sentence.

sample_string = 'TensorFlow, from basics to mastery'

tokenized_string = tokenizer.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print ('The original string: {}'.format(original_string))

The output is:

Tokenized string is [6307, 2327, 4043, 2120, 2, 48, 4249, 4429, 7, 2652, 8050]
The original string: TensorFlow, from basics to mastery

Look at the subword that each number corresponds to.

for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer.decode([ts])))

The output is:

6307 ----> Ten
2327 ----> sor
4043 ----> Fl
2120 ----> ow
2 ----> ,
48 ----> from
4249 ----> basi
4429 ----> cs
7 ----> to
2652 ----> master
8050 ----> y

Shuffle the training data and create padded batches.

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_data.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_data.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_data))
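
On recent TF 2.x versions padded_batch can infer the padded shapes itself, so the same pipeline can be written more simply (version-dependent, shown here as an alternative sketch):

# Each batch is padded to the length of its longest element
train_dataset = train_data.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
test_dataset = test_data.padded_batch(BATCH_SIZE)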

Define the model.

embedding_dim = 64
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

The model summary:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 523840
_________________________________________________________________
global_average_pooling1d (Gl (None, 64) 0
_________________________________________________________________
dense (Dense) (None, 6) 390
_________________________________________________________________
dense_1 (Dense) (None, 1) 7
=================================================================
Total params: 524,237
Trainable params: 524,237
Non-trainable params: 0
_________________________________________________________________

Train the model:

num_epochs = 10

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train on the batched datasets built above
history = model.fit(train_dataset, epochs=num_epochs, validation_data=test_dataset)
