[TensorFlow Study Notes] 6: Text Generation with an RNN

In this post, we look at how to generate text with an RNN.

First, import the libraries.

import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np

Next, define the data, with lines separated by \n.

Here total_words = len(tokenizer.word_index) + 1. The +1 is needed because the tokenizer's word indices start at 1 (index 0 is reserved for padding), so the number of classes must be one larger than the vocabulary size; total_words is used later as the number of output classes.

tokenizer = Tokenizer()
# the data
data="In the town of Athy one Jeremy Lanigan \n Battered away til he hadnt a pound. \nHis father died and made him a man again \n Left him a farm and ten acres of ground. \nHe gave a grand party for friends and relations \nWho didnt forget him when come to the wall, \nAnd if youll but listen Ill make your eyes glisten \nOf the rows and the ructions of Lanigans Ball. \nMyself to be sure got free invitation, \nFor all the nice girls and boys I might ask, \nAnd just in a minute both friends and relations \nWere dancing round merry as bees round a cask. \nJudy ODaly, that nice little milliner, \nShe tipped me a wink for to give her a call, \nAnd I soon arrived with Peggy McGilligan \nJust in time for Lanigans Ball. \nThere were lashings of punch and wine for the ladies, \nPotatoes and cakes; there was bacon and tea, \nThere were the Nolans, Dolans, OGradys \nCourting the girls and dancing away. \nSongs they went round as plenty as water, \nThe harp that once sounded in Taras old hall,\nSweet Nelly Gray and The Rat Catchers Daughter,\nAll singing together at Lanigans Ball. \nThey were doing all kinds of nonsensical polkas \nAll round the room in a whirligig. \nJulia and I, we banished their nonsense \nAnd tipped them the twist of a reel and a jig. \nAch mavrone, how the girls got all mad at me \nDanced til youd think the ceiling would fall. \nFor I spent three weeks at Brooks Academy \nLearning new steps for Lanigans Ball. \nThree long weeks I spent up in Dublin, \nThree long weeks to learn nothing at all,\n Three long weeks I spent up in Dublin, \nLearning new steps for Lanigans Ball. \nShe stepped out and I stepped in again, \nI stepped out and she stepped in again, \nShe stepped out and I stepped in again, \nLearning new steps for Lanigans Ball. \nBoys were all merry and the girls they were hearty \nAnd danced all around in couples and groups, \nTil an accident happened, young Terrance McCarthy \nPut his right leg through miss Finnertys hoops. \nPoor creature fainted and cried Meelia murther, \nCalled for her brothers and gathered them all. \nCarmody swore that hed go no further \nTil he had satisfaction at Lanigans Ball. \nIn the midst of the row miss Kerrigan fainted, \nHer cheeks at the same time as red as a rose. \nSome of the lads declared she was painted, \nShe took a small drop too much, I suppose. \nHer sweetheart, Ned Morgan, so powerful and able, \nWhen he saw his fair colleen stretched out by the wall, \nTore the left leg from under the table \nAnd smashed all the Chaneys at Lanigans Ball. \nBoys, oh boys, twas then there were runctions. \nMyself got a lick from big Phelim McHugh. \nI soon replied to his introduction \nAnd kicked up a terrible hullabaloo. \nOld Casey, the piper, was near being strangled. \nThey squeezed up his pipes, bellows, chanters and all. \nThe girls, in their ribbons, they got all entangled \nAnd that put an end to Lanigans Ball."

# split on \n
corpus = data.lower().split("\n")

tokenizer.fit_on_texts(corpus)
# add 1: word indices start at 1 (0 is reserved for padding); total_words is the number of classes
total_words = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(total_words)

There are 263 words in total.

Next, process the data to build the training sequences and labels.

input_sequences = []
for line in corpus:
    # tokenizer.texts_to_sequences([line]) returns e.g.
    # [[1, 26, 61, 60, 262, 13, 9, 10]]
    # [0] takes the first element, which is itself a list
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        # every prefix of length 2 up to the full line becomes one sequence
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In other words, each line is tokenized into a list of word indices, and every prefix of at least two tokens becomes one training sequence (an n-gram).
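As a quick illustration (assuming the tokenizer fitted above; the token IDs come from the comment in the code), the first line expands into prefixes like this:

line = corpus[0]
token_list = tokenizer.texts_to_sequences([line])[0]
print(token_list)  # [1, 26, 61, 60, 262, 13, 9, 10]
for i in range(1, len(token_list)):
    print(token_list[:i+1])  # [1, 26], then [1, 26, 61], ... up to the full line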
Then pad the sequences. Note padding='pre': this right-aligns every sequence, so the last word always lands at the end, which makes it easy to split it off as the label below.

# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])
# convert the padded list of sequences into a NumPy array
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
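A minimal sanity check (my addition, not from the original post) of what pre-padding produces; the real tokens sit at the end of each row, so the label is always the last column:

print(max_sequence_len)    # length of the longest n-gram
print(input_sequences[0])  # e.g. [0 0 ... 0 1 26]: zeros in front, tokens at the end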


Then split each sequence into predictors and a label.

# create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]


And one-hot encode the labels.

ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
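A quick shape check (my addition): each predictor row has max_sequence_len-1 tokens, and each label is a one-hot vector over the vocabulary:

print(xs.shape)  # (num_sequences, max_sequence_len - 1)
print(ys.shape)  # (num_sequences, total_words)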

Next, define the model:

Note input_length=max_sequence_len-1: the last word of each sequence serves as the label, so the model input is one word shorter.

model = Sequential()
# note input_length=max_sequence_len-1: the last word serves as the label, so the input is one word shorter
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(20)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(xs, ys, epochs=500, verbose=1)
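It can help to confirm the layer shapes before training; for instance, the Embedding layer should report an output shape of (None, max_sequence_len-1, 64):

model.summary()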

Finally, plot how the accuracy changes over training.

import matplotlib.pyplot as plt


def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

plot_graphs(history, 'accuracy')
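The same helper works for the loss curve:

plot_graphs(history, 'loss')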

[Figure: training accuracy over epochs]
Now use the model to generate text.

seed_text = "Laurence went to dublin"
next_words = 100

for _ in range(next_words):
# 进行 tokenization
token_list = tokenizer.texts_to_sequences([seed_text])[0]
# 进行 padding
token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
# 预测,predicted 是索引。
predicted = model.predict_classes(token_list, verbose=0)
output_word = ""
# 根据索引得到词
for word, index in tokenizer.word_index.items():
if index == predicted:
output_word = word
break
# 将预测的词添加到句子后面
seed_text += " " + output_word

print(seed_text)
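Greedy argmax decoding like the loop above tends to repeat itself. A common variation (my sketch, not from the original post) is to sample the next word from the predicted distribution with a temperature parameter; tokenizer.index_word is the Tokenizer's built-in reverse lookup:

import numpy as np

def sample_next_word(seed_text, temperature=0.8):
    # hypothetical helper: predicts a distribution over the vocabulary and samples from it
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    probs = model.predict(token_list, verbose=0)[0]
    # sharpen (low temperature) or flatten (high temperature) the distribution, then renormalize
    logits = np.log(probs + 1e-9) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    index = np.random.choice(len(probs), p=probs)
    return tokenizer.index_word.get(index, "")

# usage: seed_text += " " + sample_next_word(seed_text)

Lower temperatures stay close to argmax; higher ones produce more varied (and more error-prone) text.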

Since the training data contains only 263 distinct words, the generated text is not very good.

Next, let's train on a larger dataset.

Data download: https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt

Or an alternative dataset: https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sonnets.txt

The code:

import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
# for regularization
from tensorflow.keras import regularizers
import numpy as np

!wget --no-check-certificate https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt -O /tmp/irish-lyrics-eof.txt

tokenizer = Tokenizer()

data = open('/tmp/irish-lyrics-eof.txt').read()

corpus = data.lower().split("\n")

tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(total_words)

input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]

ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)


model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(learning_rate=0.01)  # 'lr' was renamed to 'learning_rate' in newer Keras
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1)
model.summary()
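To actually use the commented-out EarlyStopping callback, import it and pass it to fit(); a sketch (it monitors the training loss here, since the example has no validation split; that choice is my assumption):

from tensorflow.keras.callbacks import EarlyStopping

# stop once the training loss has not improved for 5 consecutive epochs
earlystop = EarlyStopping(monitor='loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1, callbacks=[earlystop])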

Finally, generate text:

seed_text = "I've got a bad feeling about this"
next_words = 100

for _ in range(next_words):
token_list = tokenizer.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
predicted = model.predict_classes(token_list, verbose=0)
output_word = ""
for word, index in tokenizer.word_index.items():
if index == predicted:
output_word = word
break
seed_text += " " + output_word
print(seed_text)

You can also apply regularization to the fully connected layer.

from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dropout

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
# return_sequences=True so the next LSTM layer receives the full sequence
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
# integer division: Dense expects an integer number of units
model.add(Dense(total_words//2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
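The Dropout layer and the L2 kernel penalty both fight overfitting, which matters on a corpus this small, and return_sequences=True on the first LSTM is what allows the second LSTM to be stacked on top of it.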

Generating Shakespeare plays with an RNN

See the official TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/text_generation
