Why write this article
First of all, all of the code in this article comes from the open-source code provided by Datawhale. I have added my own notes to help beginners understand the code better.
Where does the code provided by Datawhale need supplementing?
The code provided by Datawhale covers data processing and the complete process of building the model from scratch. Unlike the baselines provided earlier, however, it contains a great many data-processing details, and the model is made up of 3 parts, so the difficulty rises sharply.
Second, there are very few comments in the code, and the overall flow of the model is never explained.
Finally, the many data-transformation steps in the code, and the way the 3 sub-models are connected, are quite a headache for beginners. I am a beginner myself: I spent a whole day carefully tracing how the data is transformed at each step, and for the code I found hard to understand, I asked in the DingTalk group and received patient answers from Datawhale members. Only then did I finally understand the code.
 
What improvements did I make?
So, to make the code easier for beginners to read, I added some material.
First, I walked through the whole pipeline, which consists of two major parts: data processing and the model.
The code is not read from top to bottom in order, so a more approachable way to understand it is to first give a macro-level overview of the data transformations, including the shape of the data at each step and the transformations involved. With that framework in mind, the reader can then go through the details with much more confidence.

Second, even with the overall flow in mind, readers may still get stuck on a small piece of logic in the actual code. So, on top of the original code, I added many comments to lower the barrier to understanding it.
 
 
Data processing
Split the data into 10 folds
The data first goes through the all_data2fold function, which converts the original DataFrame into a list of 10 elements, representing the 10 folds used for cross-validation. Each element is a dict containing label and text.
1. First, the row indices are grouped by label to build label2id. label2id is a dict whose keys are labels and whose values are lists of the row indices belonging to that label. Then, based on label2id, the data of each label is distributed across the 10 folds.

2. Finally, the first 9 folds are used as the training set train_data, the last fold is used as the validation set dev_data, and the test set test_data is read in.
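To make the stratified split easier to picture, here is a minimal sketch with made-up toy labels (the real all_data2fold below additionally balances the fold sizes and shuffles inside each fold):

```python
# toy labels: 3 classes with different frequencies
labels = ['0', '0', '0', '0', '1', '1', '2', '2', '2', '2']
fold_num = 2

# label2id: label -> list of row indices that carry this label
label2id = {}
for i, label in enumerate(labels):
    label2id.setdefault(label, []).append(i)

# give every fold (roughly) the same number of rows of each label
all_index = [[] for _ in range(fold_num)]
for label, data in label2id.items():
    batch_size = len(data) // fold_num          # rows of this label per fold
    other = len(data) - batch_size * fold_num   # leftover rows go to the first folds
    for i in range(fold_num):
        cur_batch_size = batch_size + 1 if i < other else batch_size
        all_index[i].extend(data[i * batch_size: i * batch_size + cur_batch_size])

print(all_index)  # [[0, 1, 4, 6, 7], [2, 3, 5, 8, 9]]
```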
Define and create the Vocab
The Vocab does the following:

1. Create the word-to-index dictionaries. There are 2 of them: _id2word and _id2extword.
   - _id2word is built from the news text; words with a frequency below 5 are replaced by UNK. It corresponds to the model input batch_inputs1.
   - _id2extword is built from word2vec.txt and contains 5976 words. It corresponds to the model input batch_inputs2.
   - Later there will be two embedding layers: the embedding for _id2word is learnable, while the embedding for _id2extword is loaded from the file and kept fixed.
2. Create the label-to-index dictionary.

All of the dictionaries above are built from train_data.
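As a quick toy illustration (not the real Vocab class, which appears in full below), this is the lookup behaviour those dictionaries provide, with unknown words falling back to UNK:

```python
# toy vocabulary: index 0 and 1 are always [PAD] and [UNK]
_id2word = ['[PAD]', '[UNK]', '3750', '648', '900']
_word2id = {w: i for i, w in enumerate(_id2word)}
unk = 1

def word2id(words):
    # words that are not in the vocabulary map to UNK (index 1)
    return [_word2id.get(w, unk) for w in words]

print(word2id(['3750', '648', '999999']))  # [2, 3, 1]
```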
 
Model
Split articles into sentences
The 3 datasets obtained in the data-processing step above (train_data, dev_data, and test_data) are each a dict containing label and text. These 3 datasets go through the get_examples function, which calls sentence_split to split every article into sentences.
Then, using vocab, each word is converted to its index. Both dictionaries are used here, producing 2 sets of indices: word_ids and extword_ids. The returned data is a list in which each element is a tuple: (label, number of sentences, doc), where doc is itself a list whose elements are tuples: (sentence length, word_ids, extword_ids).
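For example, one element of the returned list might look like this (hand-made values, shown only to illustrate the structure):

```python
# (label, number of sentences, doc)
example = (
    3,        # label id
    2,        # this document was split into 2 sentences
    [         # doc: one entry per sentence
        (4, [2, 15, 7, 30], [5, 18, 9, 42]),   # (sentence length, word_ids, extword_ids)
        (3, [6, 1, 88], [7, 1, 95]),
    ],
)
```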
 
During training, the data_iter function is called to generate batch_data for each iteration; inside data_iter, batch_slice is called to produce each batch. data_iter also sorts documents by a slightly noisy document length, so that documents within a batch have similar lengths. After obtaining batch_data, every example still has the format described above; it is then passed to the batch2tensor function, described in the next section.
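The sorting trick inside data_iter is worth a small sketch: documents are ordered by a slightly noisy length so that each batch holds documents of similar length (toy data below):

```python
import numpy as np

# toy "examples": (label, doc_len); data_iter sorts by doc_len plus a little noise
data = [('a', 10), ('b', 3), ('c', 9), ('d', 2)]
noise = 1.0

lengths = [example[1] for example in data]
# negate so that longer documents come first; the noise randomly breaks near-ties
noisy_lengths = [-(l + np.random.uniform(-noise, noise)) for l in lengths]
sorted_data = [data[i] for i in np.argsort(noisy_lengths)]

print(sorted_data)  # roughly longest first, e.g. [('a', 10), ('c', 9), ('b', 3), ('d', 2)]
```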
 
 
Generate the training data
The data finally returned by batch2tensor is ((batch_inputs1, batch_inputs2, batch_masks), batch_labels). batch_inputs1, batch_inputs2, and batch_masks all have shape (batch_size, doc_len, sent_len), where doc_len is the number of sentences in each news article and sent_len is the number of words in each sentence.
batch_masks is 1 at positions that contain a word and 0 everywhere else. It is used later when computing the Attention, to set the attention at positions without words to 0.
batch_inputs1, batch_inputs2, and batch_masks, of shape (batch_size, doc_len, sent_len), are then reshaped to (batch_size * doc_len, sent_len).
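This reshape is just a view; a minimal sketch with a toy tensor:

```python
import torch

batch_size, doc_len, sent_len = 2, 3, 5
batch_inputs1 = torch.zeros((batch_size, doc_len, sent_len), dtype=torch.int64)

# flatten the documents into one big batch of sentences for the word-level CNN
flat = batch_inputs1.view(batch_size * doc_len, sent_len)
print(flat.shape)  # torch.Size([6, 5])
```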
The network
Now we finally come to the network. The model structure diagram is shown below:
WordCNNEncoder
A diagram of the WordCNNEncoder structure is shown below:
#### 1. Embedding
Both batch_inputs1 and batch_inputs2 are fed into WordCNNEncoder. WordCNNEncoder contains two embedding layers: the one for batch_inputs1 is learnable and produces word_embed; the one for batch_inputs2 loads externally pretrained word vectors, is therefore not learnable, and produces extword_embed. The two word vectors are added together to get the final word representation batch_embed, with shape (batch_size * doc_len, sent_len, 100). A dimension is then added, giving (batch_size * doc_len, 1, sent_len, 100), which corresponds to (B, C, H, W) for images in PyTorch.
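A minimal sketch of this step with toy sizes (the vocabulary sizes here are made up):

```python
import torch
import torch.nn as nn

sen_num, sent_len, word_dims = 6, 5, 100   # sen_num = batch_size * doc_len

word_embed = nn.Embedding(50, word_dims, padding_idx=0)      # learnable, built from the training vocab
extword_embed = nn.Embedding(80, word_dims, padding_idx=0)   # loaded from word2vec.txt, frozen
extword_embed.weight.requires_grad = False

word_ids = torch.randint(0, 50, (sen_num, sent_len))
extword_ids = torch.randint(0, 80, (sen_num, sent_len))

# add the two embeddings, then add a channel dimension for Conv2d: (B, C, H, W)
batch_embed = word_embed(word_ids) + extword_embed(extword_ids)   # (6, 5, 100)
batch_embed = batch_embed.unsqueeze(1)                            # (6, 1, 5, 100)
print(batch_embed.shape)
```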
#### 2. CNN
Then 3 convolution kernels are defined, each with 100 output channels.
The first kernel has size [2, 100]; its output has shape (batch_size * doc_len, 100, sent_len - 2 + 1, 1). A pooling layer of size [sent_len - 2 + 1, 1] is applied, and after squeeze() the final output has shape (batch_size * doc_len, 100).
Similarly, the 2nd kernel has size [3, 100] and the 3rd kernel has size [4, 100]; convolution plus pooling again produces outputs of shape (batch_size * doc_len, 100).
Finally, the 3 vectors are concatenated along the 2nd dimension, giving an output of shape (batch_size * doc_len, 300).
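A toy shape walk-through for the first kernel ([2, 100]), under the same toy sizes as above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

sen_num, sent_len, word_dims, out_channel = 6, 5, 100, 100
batch_embed = torch.randn(sen_num, 1, sent_len, word_dims)    # (B, C, H, W)

conv = nn.Conv2d(1, out_channel, (2, word_dims))              # kernel size [2, 100]
hidden = F.relu(conv(batch_embed))                            # (6, 100, sent_len - 2 + 1, 1)

pool = nn.MaxPool2d((sent_len - 2 + 1, 1))                    # pool over the whole feature map
pooled = pool(hidden).squeeze()                               # (6, 100)
print(hidden.shape, pooled.shape)
```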
Shape conversion
The output of the previous step is reshaped to (batch_size, doc_len, 300) and named sent_reps. Then the mask is processed.
batch_masks has shape (batch_size, doc_len, sent_len) and is the word-level mask. sent_masks = batch_masks.bool().any(2).float() turns it into the sentence-level mask: along the last dimension, if a sentence contains at least 1 word, the mask for that whole sentence is 1. sent_masks has shape (batch_size, doc_len).
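A toy example of how the sentence-level mask is derived from the word-level mask:

```python
import torch

# word-level mask: (batch_size=1, doc_len=3, sent_len=4)
batch_masks = torch.tensor([[[1., 1., 0., 0.],    # sentence with 2 words
                             [1., 0., 0., 0.],    # sentence with 1 word
                             [0., 0., 0., 0.]]])  # padded (empty) sentence

# a sentence is "real" if any position in it contains a word
sent_masks = batch_masks.bool().any(2).float()
print(sent_masks)  # tensor([[1., 1., 0.]])
```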
SentEncoder
A diagram of the SentEncoder structure is shown below:
SentEncoder contains a 2-layer bidirectional LSTM. The input sent_reps has shape (batch_size, doc_len, 300). The LSTM hidden_size is 256, and since it is bidirectional, the output of the LSTM has shape (batch_size, doc_len, 512). It is then multiplied element-wise with the mask, setting the positions of sentences without words to 0. The final output sent_hiddens still has shape (batch_size, doc_len, 512).
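A minimal shape check of this step with toy tensors (hidden_size 256, 2 layers, bidirectional, as in the code below):

```python
import torch
import torch.nn as nn

batch_size, doc_len, sent_rep_size = 2, 3, 300
sent_reps = torch.randn(batch_size, doc_len, sent_rep_size)
sent_masks = torch.tensor([[1., 1., 0.],
                           [1., 0., 0.]])          # (batch_size, doc_len)

lstm = nn.LSTM(input_size=300, hidden_size=256, num_layers=2,
               batch_first=True, bidirectional=True)

sent_hiddens, _ = lstm(sent_reps)                       # (2, 3, 512): 2 directions * 256
sent_hiddens = sent_hiddens * sent_masks.unsqueeze(2)   # zero out padded sentences
print(sent_hiddens.shape)  # torch.Size([2, 3, 512])
```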
Attention
Next comes the Attention. Its inputs are sent_hiddens and sent_masks. Inside Attention, sent_hiddens first goes through a linear transformation to produce key; the dimensions stay the same, (batch_size, doc_len, 512).
Then key is multiplied with query to get outputs. query has dimension 512, so outputs has shape (batch_size, doc_len); this is the attention we need, representing the weight assigned to each sentence. Next, using sent_masks, the scores of sentences without words are set to -1e32, a softmax is applied over the scores, and the weights of those empty sentences are then set back to 0, giving masked_attn_scores.
Finally, masked_attn_scores is multiplied with key to get batch_outputs, with shape (batch_size, 512).
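A toy walk-through of these three steps with random weights (the real Attention module appears in full later):

```python
import torch
import torch.nn.functional as F

batch_size, doc_len, hidden_size = 2, 3, 512
sent_hiddens = torch.randn(batch_size, doc_len, hidden_size)
sent_masks = torch.tensor([[1., 1., 0.],
                           [1., 0., 0.]])

weight = torch.randn(hidden_size, hidden_size) * 0.05
query = torch.randn(hidden_size) * 0.05

key = torch.matmul(sent_hiddens, weight)                # (2, 3, 512)
outputs = torch.matmul(key, query)                      # (2, 3): one score per sentence

# mask padded sentences before softmax, then zero them out after softmax
masked_outputs = outputs.masked_fill((1 - sent_masks).bool(), float(-1e32))
attn_scores = F.softmax(masked_outputs, dim=1)
masked_attn_scores = attn_scores.masked_fill((1 - sent_masks).bool(), 0.0)

# weighted sum of the keys: (2, 1, 3) x (2, 3, 512) -> (2, 512)
batch_outputs = torch.bmm(masked_attn_scores.unsqueeze(1), key).squeeze(1)
print(batch_outputs.shape)  # torch.Size([2, 512])
```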
FC
Finally, the FC layer produces the vector of class scores (logits).
Full code with comments
Data processing
Import packages
```python
import random
import numpy as np
import torch
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# fix the random seeds for reproducibility
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)

gpu = 0
use_cuda = gpu >= 0 and torch.cuda.is_available()
if use_cuda:
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
else:
    device = torch.device("cpu")

logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)
```
 
```
2020-08-13 17:12:16,510 INFO: Use cuda: False, gpu id: 0.
```
Split the data into 10 folds
```python
fold_num = 10
data_file = 'train_set.csv'
import pandas as pd


def all_data2fold(fold_num, num=10000):
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]

    total = len(labels)
    index = list(range(total))
    # shuffle the rows
    np.random.shuffle(index)

    all_texts = []
    all_labels = []
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])

    # label2id: key is the label, value is the list of row indices with that label
    label2id = {}
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)

    # all_index: fold_num lists, each holding the row indices assigned to that fold
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        # batch_size: rows of this label per fold
        batch_size = int(len(data) / fold_num)
        # other: leftover rows after even division
        other = len(data) - batch_size * fold_num
        for i in range(fold_num):
            # the first `other` folds get one extra row
            cur_batch_size = batch_size + 1 if i < other else batch_size
            batch_data = [data[i * batch_size + b] for b in range(cur_batch_size)]
            all_index[i].extend(batch_data)

    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    # balance every fold to exactly batch_size rows
    for fold in range(fold_num):
        num = len(all_index[fold])
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]

        if num > batch_size:
            # move the extra rows into the `other` pool
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        elif num < batch_size:
            # take rows from the `other` pool to fill up this fold
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels

        assert batch_size == len(fold_labels)

        # shuffle within the fold
        index = list(range(batch_size))
        np.random.shuffle(index)

        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])

        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)

    logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))
    return fold_data


fold_data = all_data2fold(10)
```
 
```
2020-08-13 17:12:45,012 INFO: Fold lens [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
```
Split into training and validation sets, read the test set
```python
fold_id = 9

# the last fold is used as the validation set
dev_data = fold_data[fold_id]

# the first 9 folds form the training set
train_texts = []
train_labels = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
    train_labels.extend(data['label'])

train_data = {'label': train_labels, 'text': train_texts}

# read the test set; its labels are placeholders (all 0)
test_data_file = 'test_a.csv'
f = pd.read_csv(test_data_file, sep='\t', encoding='UTF-8')
texts = f['text'].tolist()
test_data = {'label': [0] * len(texts), 'text': texts}
```
 
Create the Vocab
```python
from collections import Counter
from transformers import BasicTokenizer

basic_tokenizer = BasicTokenizer()


class Vocab():
    def __init__(self, train_data):
        self.min_count = 5
        self.pad = 0
        self.unk = 1
        self._id2word = ['[PAD]', '[UNK]']
        self._id2extword = ['[PAD]', '[UNK]']

        self._id2label = []
        self.target_names = []

        self.build_vocab(train_data)

        reverse = lambda x: dict(zip(x, range(len(x))))
        # word -> id
        self._word2id = reverse(self._id2word)
        # label -> id
        self._label2id = reverse(self._id2label)

        logging.info("Build vocab: words %d, labels %d." % (self.word_size, self.label_size))

    def build_vocab(self, data):
        self.word_counter = Counter()
        # count word frequencies
        for text in data['text']:
            words = text.split()
            for word in words:
                self.word_counter[word] += 1

        # keep only words whose frequency >= min_count
        for word, count in self.word_counter.most_common():
            if count >= self.min_count:
                self._id2word.append(word)

        label2name = {0: '科技', 1: '股票', 2: '体育', 3: '娱乐', 4: '时政', 5: '社会', 6: '教育', 7: '财经',
                      8: '家居', 9: '游戏', 10: '房产', 11: '时尚', 12: '彩票', 13: '星座'}

        self.label_counter = Counter(data['label'])

        for label in range(len(self.label_counter)):
            count = self.label_counter[label]
            self._id2label.append(label)
            self.target_names.append(label2name[label])

    def load_pretrained_embs(self, embfile):
        with open(embfile, encoding='utf-8') as f:
            lines = f.readlines()
            items = lines[0].split()
            # the first line of word2vec.txt stores the word count and the embedding dim
            word_count, embedding_dim = int(items[0]), int(items[1])

        index = len(self._id2extword)
        embeddings = np.zeros((word_count + index, embedding_dim))
        # read one word vector per line
        for line in lines[1:]:
            values = line.split()
            self._id2extword.append(values[0])
            vector = np.array(values[1:], dtype='float64')
            embeddings[self.unk] += vector
            embeddings[index] = vector
            index += 1

        # the UNK vector is the mean of all word vectors
        embeddings[self.unk] = embeddings[self.unk] / word_count
        embeddings = embeddings / np.std(embeddings)

        reverse = lambda x: dict(zip(x, range(len(x))))
        self._extword2id = reverse(self._id2extword)

        assert len(set(self._id2extword)) == len(self._id2extword)
        return embeddings

    def word2id(self, xs):
        if isinstance(xs, list):
            return [self._word2id.get(x, self.unk) for x in xs]
        return self._word2id.get(xs, self.unk)

    def extword2id(self, xs):
        if isinstance(xs, list):
            return [self._extword2id.get(x, self.unk) for x in xs]
        return self._extword2id.get(xs, self.unk)

    def label2id(self, xs):
        if isinstance(xs, list):
            return [self._label2id.get(x, self.unk) for x in xs]
        return self._label2id.get(xs, self.unk)

    @property
    def word_size(self):
        return len(self._id2word)

    @property
    def extword_size(self):
        return len(self._id2extword)

    @property
    def label_size(self):
        return len(self._id2label)


vocab = Vocab(train_data)
```
 
Model
Define the Attention
```python
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.weight = nn.Parameter(torch.Tensor(hidden_size, hidden_size))
        self.weight.data.normal_(mean=0.0, std=0.05)

        self.bias = nn.Parameter(torch.Tensor(hidden_size))
        b = np.zeros(hidden_size, dtype=np.float32)
        self.bias.data.copy_(torch.from_numpy(b))

        self.query = nn.Parameter(torch.Tensor(hidden_size))
        self.query.data.normal_(mean=0.0, std=0.05)

    def forward(self, batch_hidden, batch_masks):
        # batch_hidden: (batch_size, doc_len, 512)
        # batch_masks:  (batch_size, doc_len)

        # linear transformation: key = hidden * W + b -> (batch_size, doc_len, 512)
        key = torch.matmul(batch_hidden, self.weight) + self.bias

        # attention score per sentence: key . query -> (batch_size, doc_len)
        outputs = torch.matmul(key, self.query)

        # set scores of padded sentences to -1e32 before softmax
        masked_outputs = outputs.masked_fill((1 - batch_masks).bool(), float(-1e32))

        attn_scores = F.softmax(masked_outputs, dim=1)

        # zero out scores of padded sentences after softmax
        masked_attn_scores = attn_scores.masked_fill((1 - batch_masks).bool(), 0.0)

        # weighted sum of keys: (batch_size, 1, doc_len) x (batch_size, doc_len, 512) -> (batch_size, 512)
        batch_outputs = torch.bmm(masked_attn_scores.unsqueeze(1), key).squeeze(1)

        return batch_outputs, attn_scores
```
 
Define the WordCNNEncoder
```python
word2vec_path = '../emb/word2vec.txt'
dropout = 0.15
```
 
```python
class WordCNNEncoder(nn.Module):
    def __init__(self, vocab):
        super(WordCNNEncoder, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.word_dims = 100

        # learnable embedding, built from the training-data vocab
        self.word_embed = nn.Embedding(vocab.word_size, self.word_dims, padding_idx=0)

        extword_embed = vocab.load_pretrained_embs(word2vec_path)
        extword_size, word_dims = extword_embed.shape
        logging.info("Load extword embed: words %d, dims %d." % (extword_size, word_dims))

        # pretrained word2vec embedding, frozen
        self.extword_embed = nn.Embedding(extword_size, word_dims, padding_idx=0)
        self.extword_embed.weight.data.copy_(torch.from_numpy(extword_embed))
        self.extword_embed.weight.requires_grad = False

        input_size = self.word_dims

        self.filter_sizes = [2, 3, 4]
        self.out_channel = 100
        self.convs = nn.ModuleList([nn.Conv2d(1, self.out_channel, (filter_size, input_size), bias=True)
                                    for filter_size in self.filter_sizes])

    def forward(self, word_ids, extword_ids):
        # word_ids / extword_ids: (batch_size * doc_len, sent_len)
        sen_num, sent_len = word_ids.shape

        # look up both embeddings and add them: (sen_num, sent_len, 100)
        word_embed = self.word_embed(word_ids)
        extword_embed = self.extword_embed(extword_ids)
        batch_embed = word_embed + extword_embed

        if self.training:
            batch_embed = self.dropout(batch_embed)

        # add a channel dimension: (sen_num, 1, sent_len, 100), i.e. (B, C, H, W)
        batch_embed.unsqueeze_(1)

        pooled_outputs = []
        for i in range(len(self.filter_sizes)):
            # height of the feature map after convolution
            filter_height = sent_len - self.filter_sizes[i] + 1
            conv = self.convs[i](batch_embed)
            hidden = F.relu(conv)

            # max-pool over the whole feature map: (sen_num, out_channel, 1, 1)
            mp = nn.MaxPool2d((filter_height, 1))
            pooled = mp(hidden).reshape(sen_num, self.out_channel)

            pooled_outputs.append(pooled)

        # concatenate the 3 pooled outputs: (sen_num, 300)
        reps = torch.cat(pooled_outputs, dim=1)

        if self.training:
            reps = self.dropout(reps)
        return reps
```
 
Define the SentEncoder
```python
sent_hidden_size = 256
sent_num_layers = 2


class SentEncoder(nn.Module):
    def __init__(self, sent_rep_size):
        super(SentEncoder, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.sent_lstm = nn.LSTM(
            input_size=sent_rep_size,
            hidden_size=sent_hidden_size,
            num_layers=sent_num_layers,
            batch_first=True,
            bidirectional=True
        )

    def forward(self, sent_reps, sent_masks):
        # sent_reps:  (batch_size, doc_len, 300)
        # sent_masks: (batch_size, doc_len)

        # bidirectional LSTM -> (batch_size, doc_len, 512)
        sent_hiddens, _ = self.sent_lstm(sent_reps)
        # zero out positions of padded (empty) sentences
        sent_hiddens = sent_hiddens * sent_masks.unsqueeze(2)

        if self.training:
            sent_hiddens = self.dropout(sent_hiddens)
        return sent_hiddens
```
 
Define the full model
This connects WordCNNEncoder, SentEncoder, Attention, and the FC layer together.
```python
class Model(nn.Module):
    def __init__(self, vocab):
        super(Model, self).__init__()
        self.sent_rep_size = 300
        self.doc_rep_size = sent_hidden_size * 2
        self.all_parameters = {}
        parameters = []

        self.word_encoder = WordCNNEncoder(vocab)
        # collect only trainable parameters for the optimizer
        parameters.extend(list(filter(lambda p: p.requires_grad, self.word_encoder.parameters())))

        self.sent_encoder = SentEncoder(self.sent_rep_size)
        self.sent_attention = Attention(self.doc_rep_size)
        parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_encoder.parameters())))
        parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_attention.parameters())))

        self.out = nn.Linear(self.doc_rep_size, vocab.label_size, bias=True)
        parameters.extend(list(filter(lambda p: p.requires_grad, self.out.parameters())))

        if use_cuda:
            self.to(device)

        if len(parameters) > 0:
            self.all_parameters["basic_parameters"] = parameters

        logging.info('Build model with cnn word encoder, lstm sent encoder.')
        para_num = sum([np.prod(list(p.size())) for p in self.parameters()])
        logging.info('Model param num: %.2f M.' % (para_num / 1e6))

    def forward(self, batch_inputs):
        # batch_inputs1 / batch_inputs2 / batch_masks: (batch_size, doc_len, sent_len)
        batch_inputs1, batch_inputs2, batch_masks = batch_inputs
        batch_size, max_doc_len, max_sent_len = batch_inputs1.shape[0], batch_inputs1.shape[1], batch_inputs1.shape[2]

        # flatten documents into sentences: (batch_size * doc_len, sent_len)
        batch_inputs1 = batch_inputs1.view(batch_size * max_doc_len, max_sent_len)
        batch_inputs2 = batch_inputs2.view(batch_size * max_doc_len, max_sent_len)
        batch_masks = batch_masks.view(batch_size * max_doc_len, max_sent_len)

        # sentence representations from the word-level CNN: (batch_size * doc_len, 300)
        sent_reps = self.word_encoder(batch_inputs1, batch_inputs2)

        # reshape back to documents: (batch_size, doc_len, 300)
        sent_reps = sent_reps.view(batch_size, max_doc_len, self.sent_rep_size)
        batch_masks = batch_masks.view(batch_size, max_doc_len, max_sent_len)

        # sentence-level mask: 1 if the sentence contains at least one word
        sent_masks = batch_masks.bool().any(2).float()

        # document-level hidden states: (batch_size, doc_len, 512)
        sent_hiddens = self.sent_encoder(sent_reps, sent_masks)

        # attention over sentences: doc_reps (batch_size, 512)
        doc_reps, atten_scores = self.sent_attention(sent_hiddens, sent_masks)

        # classification logits: (batch_size, num_labels)
        batch_outputs = self.out(doc_reps)

        return batch_outputs


model = Model(vocab)
```
 
Define the Optimizer
```python
learning_rate = 2e-4
decay = .75
decay_step = 1000


class Optimizer:
    def __init__(self, model_parameters):
        self.all_params = []
        self.optims = []
        self.schedulers = []

        for name, parameters in model_parameters.items():
            if name.startswith("basic"):
                optim = torch.optim.Adam(parameters, lr=learning_rate)
                self.optims.append(optim)

                # decay the learning rate by `decay` every `decay_step` steps
                l = lambda step: decay ** (step // decay_step)
                scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=l)
                self.schedulers.append(scheduler)
                self.all_params.extend(parameters)
            else:
                raise Exception("no named parameters.")

        self.num = len(self.optims)

    def step(self):
        for optim, scheduler in zip(self.optims, self.schedulers):
            optim.step()
            scheduler.step()
            optim.zero_grad()

    def zero_grad(self):
        for optim in self.optims:
            optim.zero_grad()

    def get_lr(self):
        lrs = tuple(map(lambda x: x.get_lr()[-1], self.schedulers))
        lr = ' %.5f' * self.num
        res = lr % lrs
        return res
```
 
Define sentence_split, which splits an article into sentences
```python
def sentence_split(text, vocab, max_sent_len=256, max_segment=16):
    # split the document into words
    words = text.strip().split()
    document_len = len(words)

    # start index of every segment
    index = list(range(0, document_len, max_sent_len))
    index.append(document_len)

    segments = []
    for i in range(len(index) - 1):
        # take up to max_sent_len words as one "sentence"
        segment = words[index[i]: index[i + 1]]
        assert len(segment) > 0
        # replace words not in the vocab with '<UNK>'
        segment = [word if word in vocab._id2word else '<UNK>' for word in segment]
        segments.append([len(segment), segment])

    assert len(segments) > 0
    # if there are too many segments, keep only the first and last max_segment/2
    if len(segments) > max_segment:
        segment_ = int(max_segment / 2)
        return segments[:segment_] + segments[-segment_:]
    else:
        return segments
```
 
Define get_examples
It calls sentence_split internally.
```python
def get_examples(data, vocab, max_sent_len=256, max_segment=8):
    label2id = vocab.label2id
    examples = []

    for text, label in zip(data['text'], data['label']):
        # label -> id
        id = label2id(label)

        # split the document into sentences: [[sent_len, [words]], ...]
        sents_words = sentence_split(text, vocab, max_sent_len, max_segment)
        doc = []
        for sent_len, sent_words in sents_words:
            # word -> id (training vocab)
            word_ids = vocab.word2id(sent_words)
            # word -> id (pretrained word2vec vocab)
            extword_ids = vocab.extword2id(sent_words)
            doc.append([sent_len, word_ids, extword_ids])
        examples.append([id, len(doc), doc])

    logging.info('Total %d docs.' % len(examples))
    return examples
```
 
Define batch_slice
```python
def batch_slice(data, batch_size):
    batch_num = int(np.ceil(len(data) / float(batch_size)))
    for i in range(batch_num):
        # the last batch may be smaller than batch_size
        cur_batch_size = batch_size if i < batch_num - 1 else len(data) - batch_size * i
        docs = [data[i * batch_size + b] for b in range(cur_batch_size)]
        yield docs
```
 
Define data_iter
It calls batch_slice internally.
```python
def data_iter(data, batch_size, shuffle=True, noise=1.0):
    """
    randomly permute data, then sort by source length, and partition into batches,
    so that the documents in each batch have similar lengths
    """
    batched_data = []
    if shuffle:
        # shuffle the examples
        np.random.shuffle(data)

        # sort by (slightly noisy) document length
        lengths = [example[1] for example in data]
        noisy_lengths = [- (l + np.random.uniform(- noise, noise)) for l in lengths]
        sorted_indices = np.argsort(noisy_lengths).tolist()
        sorted_data = [data[i] for i in sorted_indices]
    else:
        sorted_data = data

    batched_data.extend(list(batch_slice(sorted_data, batch_size)))

    if shuffle:
        # shuffle the order of the batches
        np.random.shuffle(batched_data)

    for batch in batched_data:
        yield batch
```
 
Define the metric computation
```python
from sklearn.metrics import f1_score, precision_score, recall_score


def get_score(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    f1 = f1_score(y_true, y_pred, average='macro') * 100
    p = precision_score(y_true, y_pred, average='macro') * 100
    r = recall_score(y_true, y_pred, average='macro') * 100
    # (precision, recall, f1) as a string, plus f1 on its own
    return str((reformat(p, 2), reformat(r, 2), reformat(f1, 2))), reformat(f1, 2)


def reformat(num, n):
    # keep n decimal places
    return float(format(num, '0.' + str(n) + 'f'))
```
 
Define the training and testing methods
Including batch2tensor.
```python
import time
from sklearn.metrics import classification_report

clip = 5.0
epochs = 1
early_stops = 3
log_interval = 50

test_batch_size = 128
train_batch_size = 128

save_model = './cnn.bin'
save_test = './cnn.csv'


class Trainer():
    def __init__(self, model, vocab):
        self.model = model
        self.report = True

        # convert the raw data into examples: [[label, doc_len, doc], ...]
        self.train_data = get_examples(train_data, vocab)
        self.batch_num = int(np.ceil(len(self.train_data) / float(train_batch_size)))
        self.dev_data = get_examples(dev_data, vocab)
        self.test_data = get_examples(test_data, vocab)

        # loss criterion
        self.criterion = nn.CrossEntropyLoss()

        # label names for the classification report
        self.target_names = vocab.target_names

        # optimizer
        self.optimizer = Optimizer(model.all_parameters)

        # counters
        self.step = 0
        self.early_stop = -1
        self.best_train_f1, self.best_dev_f1 = 0, 0
        self.last_epoch = epochs

    def train(self):
        logging.info('Start training...')
        for epoch in range(1, epochs + 1):
            train_f1 = self._train(epoch)
            dev_f1 = self._eval(epoch)

            if self.best_dev_f1 <= dev_f1:
                logging.info(
                    "Exceed history dev = %.2f, current dev = %.2f" % (self.best_dev_f1, dev_f1))
                torch.save(self.model.state_dict(), save_model)

                self.best_train_f1 = train_f1
                self.best_dev_f1 = dev_f1
                self.early_stop = 0
            else:
                self.early_stop += 1
                if self.early_stop == early_stops:
                    logging.info(
                        "Eearly stop in epoch %d, best train: %.2f, dev: %.2f" % (
                            epoch - early_stops, self.best_train_f1, self.best_dev_f1))
                    self.last_epoch = epoch
                    break

    def test(self):
        self.model.load_state_dict(torch.load(save_model))
        self._eval(self.last_epoch + 1, test=True)

    def _train(self, epoch):
        self.optimizer.zero_grad()
        self.model.train()

        start_time = time.time()
        epoch_start_time = time.time()
        overall_losses = 0
        losses = 0
        batch_idx = 1
        y_pred = []
        y_true = []
        for batch_data in data_iter(self.train_data, train_batch_size, shuffle=True):
            torch.cuda.empty_cache()
            # batch_inputs: (batch_inputs1, batch_inputs2, batch_masks), each (batch_size, doc_len, sent_len)
            # batch_labels: (batch_size,)
            batch_inputs, batch_labels = self.batch2tensor(batch_data)
            # batch_outputs: (batch_size, num_labels)
            batch_outputs = self.model(batch_inputs)
            loss = self.criterion(batch_outputs, batch_labels)
            loss.backward()

            loss_value = loss.detach().cpu().item()
            losses += loss_value
            overall_losses += loss_value

            # predicted label = argmax of the logits
            y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
            y_true.extend(batch_labels.cpu().numpy().tolist())

            # gradient clipping, then optimizer and scheduler steps
            nn.utils.clip_grad_norm_(self.optimizer.all_params, max_norm=clip)
            for optimizer, scheduler in zip(self.optimizer.optims, self.optimizer.schedulers):
                optimizer.step()
                scheduler.step()
            self.optimizer.zero_grad()

            self.step += 1

            if batch_idx % log_interval == 0:
                elapsed = time.time() - start_time
                lrs = self.optimizer.get_lr()
                logging.info(
                    '| epoch {:3d} | step {:3d} | batch {:3d}/{:3d} | lr{} | loss {:.4f} | s/batch {:.2f}'.format(
                        epoch, self.step, batch_idx, self.batch_num, lrs,
                        losses / log_interval,
                        elapsed / log_interval))

                losses = 0
                start_time = time.time()

            batch_idx += 1

        overall_losses /= self.batch_num
        during_time = time.time() - epoch_start_time

        # keep 4 decimal places
        overall_losses = reformat(overall_losses, 4)
        score, f1 = get_score(y_true, y_pred)

        logging.info(
            '| epoch {:3d} | score {} | f1 {} | loss {:.4f} | time {:.2f}'.format(epoch, score, f1,
                                                                                  overall_losses,
                                                                                  during_time))
        if set(y_true) == set(y_pred) and self.report:
            report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
            logging.info('\n' + report)

        return f1

    def _eval(self, epoch, test=False):
        self.model.eval()
        start_time = time.time()
        data = self.test_data if test else self.dev_data
        y_pred = []
        y_true = []
        with torch.no_grad():
            for batch_data in data_iter(data, test_batch_size, shuffle=False):
                torch.cuda.empty_cache()
                batch_inputs, batch_labels = self.batch2tensor(batch_data)
                batch_outputs = self.model(batch_inputs)
                y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
                y_true.extend(batch_labels.cpu().numpy().tolist())

            score, f1 = get_score(y_true, y_pred)
            during_time = time.time() - start_time

            if test:
                # write predictions for the test set
                df = pd.DataFrame({'label': y_pred})
                df.to_csv(save_test, index=False, sep=',')
            else:
                logging.info(
                    '| epoch {:3d} | dev | score {} | f1 {} | time {:.2f}'.format(epoch, score, f1,
                                                                                  during_time))
                if set(y_true) == set(y_pred) and self.report:
                    report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
                    logging.info('\n' + report)

        return f1

    def batch2tensor(self, batch_data):
        '''
        batch_data: [[label, doc_len, [[sent_len, [sent_id0, ...], [sent_id1, ...]], ...]], ...]
        '''
        batch_size = len(batch_data)
        doc_labels = []
        doc_lens = []
        doc_max_sent_len = []
        for doc_data in batch_data:
            doc_labels.append(doc_data[0])
            doc_lens.append(doc_data[1])
            sent_lens = [sent_data[0] for sent_data in doc_data[2]]
            # longest sentence in this document
            doc_max_sent_len.append(max(sent_lens))

        # longest document and longest sentence in the batch
        max_doc_len = max(doc_lens)
        max_sent_len = max(doc_max_sent_len)

        batch_inputs1 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
        batch_inputs2 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
        batch_masks = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.float32)
        batch_labels = torch.LongTensor(doc_labels)

        for b in range(batch_size):
            for sent_idx in range(doc_lens[b]):
                sent_data = batch_data[b][2][sent_idx]
                for word_idx in range(sent_data[0]):
                    # training-vocab word id
                    batch_inputs1[b, sent_idx, word_idx] = sent_data[1][word_idx]
                    # pretrained extword id
                    batch_inputs2[b, sent_idx, word_idx] = sent_data[2][word_idx]
                    # 1 at positions that contain a word
                    batch_masks[b, sent_idx, word_idx] = 1

        if use_cuda:
            batch_inputs1 = batch_inputs1.to(device)
            batch_inputs2 = batch_inputs2.to(device)
            batch_masks = batch_masks.to(device)
            batch_labels = batch_labels.to(device)

        return (batch_inputs1, batch_inputs2, batch_masks), batch_labels
```
 
```python
trainer = Trainer(model, vocab)
trainer.train()
```