2020-08-22 · Data Competitions

Data Competitions / Alibaba Tianchi NLP Beginner Competition, Bert Approach, Part 3: Bert Pre-training and Classification

Preface

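This article documents the Alibaba Tianchi NLP beginner competition. It walks through the whole data-processing pipeline and shows how to build a model from scratch, so it is suitable for newcomers.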

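The competition uses news data, which can be viewed and downloaded after registering. The data consists of news text that has been anonymized at the character level. It is divided into 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, politics, sports, horoscope, games, and entertainment. In essence, this is a 14-class classification problem.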

The data consists of the following parts: a training set of 200,000 samples, test set A with 50,000 samples, and test set B with 50,000 samples.

Competition page: https://tianchi.aliyun.com/competition/entrance/531810/introduction

The data can be downloaded from the link above.

Code: https://github.com/zhangxiann/Tianchi-NLP-Beginner

The walkthrough is split across 3 articles; this is the third.

The previous article walked through the Bert source code.

This article looks at how to pre-train Bert and how to use Bert for classification.

Training Bert

Having gone through the Bert source code, we now look at how to train Bert.

The code for training Bert is in run_pretraining.py.

Script

The training script is run_pretraining.sh, with the following contents:

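#   --input_file                            preprocessed TFRecord files
#   --output_dir                            where the trained model is saved
#   --do_train / --do_eval                  enable training / evaluation
#   --bert_config_file                      Bert config file
#   --train_batch_size / --eval_batch_size  batch sizes for training and evaluation
#   --max_seq_length                        maximum sequence length
#   --max_predictions_per_seq               maximum number of masked tokens per sequence
#   --learning_rate                         learning rate
python run_pretraining.py \
  --input_file=./records/*.tfrecord \
  --output_dir=./bert-mini \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./bert-mini/bert_config.json \
  --train_batch_size=128 \
  --eval_batch_size=128 \
  --max_seq_length=256 \
  --max_predictions_per_seq=32 \
  --learning_rate=1e-4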

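Training is driven mainly by the TensorFlow Estimator API (here, TPUEstimator). An Estimator takes a user-defined model_fn that describes the training procedure and, once it is given the training input, runs the training loop automatically.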

The main functions in run_pretraining.py are model_fn_builder(), get_masked_lm_output(), and get_next_sentence_output().

model_fn_builder()

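This function creates the Bert model, obtains its outputs, and then calls get_masked_lm_output() to compute the loss for predicting the masked tokens. The original code also calls get_next_sentence_output() to compute the next-sentence prediction (NSP) loss; here the sentence order is not predicted, so that loss is not computed.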

def model_fn_builder(bert_config, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):
    """Returns `model_fn` closure for TPUEstimator."""

    def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
        """The `model_fn` for TPUEstimator."""
    
        tf.logging.info("*** Features ***")
        for name in sorted(features.keys()):
            tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
        # input_ids: [batch_size, seq_length]
        input_ids = features["input_ids"]
        # input_mask: [batch_size, seq_length]
        input_mask = features["input_mask"]
        # segment_ids: [batch_size, seq_length]
        segment_ids = features["segment_ids"]
        # masked_lm_positions: [batch_size, max_predictions_per_seq]
        masked_lm_positions = features["masked_lm_positions"]
        # masked_lm_ids: [batch_size, max_predictions_per_seq]
        masked_lm_ids = features["masked_lm_ids"]
        # masked_lm_weights: [batch_size, max_predictions_per_seq]
        masked_lm_weights = features["masked_lm_weights"]
        # NSP is not used here, so this variable is unused
        next_sentence_labels = features["next_sentence_labels"]
    
        is_training = (mode == tf.estimator.ModeKeys.TRAIN)
        # Build the Bert model
        model = modeling.BertModel(
            config=bert_config,
            is_training=is_training,
            input_ids=input_ids,
            input_mask=input_mask,
            token_type_ids=segment_ids,
            use_one_hot_embeddings=use_one_hot_embeddings)
    
        # Call get_masked_lm_output to compute the masked-LM loss
        (masked_lm_loss,
         masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
            bert_config, model.get_sequence_output(), model.get_embedding_table(),
            masked_lm_positions, masked_lm_ids, masked_lm_weights)
    
        total_loss = masked_lm_loss
    
        # No NSP
        # (next_sentence_loss, next_sentence_example_loss,
        #  next_sentence_log_probs) = get_next_sentence_output(
        #     bert_config, model.get_pooled_output(), next_sentence_labels)
        #
        # total_loss = masked_lm_loss + next_sentence_loss
    
        tvars = tf.trainable_variables()
    
        initialized_variable_names = {}
        scaffold_fn = None
        # If a previously trained checkpoint is given, load its parameters
        if init_checkpoint:
            (assignment_map, initialized_variable_names
             ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
            if use_tpu:
    
                def tpu_scaffold():
                    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
                    return tf.train.Scaffold()
    
                scaffold_fn = tpu_scaffold
            else:
                tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
    
        tf.logging.info("**** Trainable Variables ****")
        for var in tvars:
            init_string = ""
            if var.name in initialized_variable_names:
                init_string = ", *INIT_FROM_CKPT*"
            tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                            init_string)
    
        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:
            # Training
            # Define the optimizer
            train_op = optimization.create_optimizer(
                total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
            # Bundle the loss and the train op into the spec that TPUEstimator runs
            output_spec = tf.contrib.tpu.TPUEstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op,
                scaffold_fn=scaffold_fn)
        elif mode == tf.estimator.ModeKeys.EVAL:
            # Evaluation
            def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
                          masked_lm_weights):
                """Computes the loss and accuracy of the model."""
                # masked_lm_log_probs: [batch_size * max_predictions_per_seq, vocab_size]
                masked_lm_log_probs = tf.reshape(masked_lm_log_probs,
                                                 [-1, masked_lm_log_probs.shape[-1]])
                # Take the index of the maximum to get the predicted id: [batch_size * max_predictions_per_seq]
                masked_lm_predictions = tf.argmax(
                    masked_lm_log_probs, axis=-1, output_type=tf.int32)
                # masked_lm_example_loss: [batch_size * max_predictions_per_seq]
                masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
                masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
                masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
                # Compute the weighted mean accuracy
                masked_lm_accuracy = tf.metrics.accuracy(
                    labels=masked_lm_ids,
                    predictions=masked_lm_predictions,
                    weights=masked_lm_weights)
                # Compute the weighted mean loss, in the same way as masked_lm_loss
                masked_lm_mean_loss = tf.metrics.mean(
                    values=masked_lm_example_loss, weights=masked_lm_weights)
                # Return the accuracy and the loss
                return {
                    "masked_lm_accuracy": masked_lm_accuracy,
                    "masked_lm_loss": masked_lm_mean_loss,
                }
    
            eval_metrics = (metric_fn, [
                masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
                masked_lm_weights
            ])
            output_spec = tf.contrib.tpu.TPUEstimatorSpec(
                mode=mode,
                loss=total_loss,
                eval_metrics=eval_metrics,
                scaffold_fn=scaffold_fn)
        else:
            raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))
    
        return output_spec
    
    return model_fn
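
The evaluation branch above reports a weighted accuracy over the masked positions. As a rough illustration of what metric_fn computes for a single batch, here is a minimal pure-Python sketch. The function name masked_lm_accuracy and the toy numbers are assumptions made for this example; the real tf.metrics.accuracy is a streaming metric that also accumulates across batches.

```python
def masked_lm_accuracy(log_probs, label_ids, label_weights):
    """Weighted accuracy over flattened masked positions (single batch).

    log_probs:     [num_positions][vocab_size]
    label_ids:     [num_positions] true token ids
    label_weights: [num_positions] 1.0 for real predictions, 0.0 for padding
    """
    correct, total = 0.0, 0.0
    for row, label, weight in zip(log_probs, label_ids, label_weights):
        # argmax over the vocabulary axis gives the predicted token id
        prediction = max(range(len(row)), key=row.__getitem__)
        correct += weight * (prediction == label)
        total += weight
    return correct / total if total else 0.0


# Toy data (hypothetical): 3 masked slots, vocab_size = 2, last slot is padding.
log_probs = [[-0.1, -2.3], [-1.6, -0.2], [-0.7, -0.7]]
accuracy = masked_lm_accuracy(log_probs, label_ids=[0, 0, 1],
                              label_weights=[1.0, 1.0, 0.0])
```

The padded slot has weight 0.0, so it affects neither the numerator nor the denominator; only the two real predictions are counted.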

get_masked_lm_output()

get_masked_lm_output() computes the loss for the masked-token predictions.

Input parameters:

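input_tensor: the output of the last BertModel layer, with shape [batch_size, seq_length, hidden_size].
output_weights: the word embedding table, with shape [vocab_size, hidden_size].
positions: the positions of the masked tokens, with shape [batch_size, max_predictions_per_seq].
label_ids: the true token at each masked position, with shape [batch_size, max_predictions_per_seq].
label_weights: the weight of each masked position, with shape [batch_size, max_predictions_per_seq].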

The steps are:

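From input_tensor, gather the outputs at the masked positions given by positions; the result has shape [batch_size * max_predictions_per_seq, hidden_size].
Pass the result through a fully connected layer and a layer_norm layer; the shape stays [batch_size * max_predictions_per_seq, hidden_size].
Multiply it by the transpose of output_weights and add output_bias to get logits of shape [batch_size * max_predictions_per_seq, vocab_size], then apply log_softmax to get log_probs of the same shape.
Compute the weighted loss from log_probs and the one-hot true labels one_hot_labels.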
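
To make the shapes and the weighting concrete, the steps above can be sketched in pure Python on toy lists. This is only an illustration under stated assumptions: it skips the fully connected and layer_norm transform, takes label_ids and label_weights already flattened, and the helper names log_softmax and masked_lm_loss as well as all the numbers are made up for this example.

```python
import math


def gather_indexes(sequence, positions):
    """Pick the hidden vectors at the masked positions of each example.

    sequence:  [batch_size][seq_length][hidden_size]
    positions: [batch_size][max_predictions_per_seq]
    returns:   [batch_size * max_predictions_per_seq][hidden_size]
    """
    return [sequence[b][p] for b, row in enumerate(positions) for p in row]


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_lm_loss(hidden, embedding_table, output_bias, label_ids, label_weights):
    """Weighted average negative log-likelihood over the masked positions."""
    numerator, denominator = 0.0, 1e-5
    for h, label, weight in zip(hidden, label_ids, label_weights):
        # logits[v] = <h, embedding_table[v]> + output_bias[v]
        logits = [sum(a * b for a, b in zip(h, row)) + bias
                  for row, bias in zip(embedding_table, output_bias)]
        per_example_loss = -log_softmax(logits)[label]
        numerator += weight * per_example_loss
        denominator += weight
    return numerator / denominator


# Toy data: batch_size=1, seq_length=3, hidden_size=2, vocab_size=2.
sequence = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
positions = [[1, 0]]  # the second slot stands in for zero-padding
embedding_table = [[1.0, 0.0], [0.0, 1.0]]
output_bias = [0.0, 0.0]
hidden = gather_indexes(sequence, positions)
loss = masked_lm_loss(hidden, embedding_table, output_bias,
                      label_ids=[1, 0], label_weights=[1.0, 0.0])
```

Because the padded slot has weight 0.0, the loss reduces to the negative log-probability of the one real prediction, divided by a denominator that is 1 plus the small 1e-5 term.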

# input_tensor: [batch_size, seq_length, hidden_size]
# output_weights: [vocab_size, hidden_size]
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
    """Get loss and log probs for the masked LM."""
    # Gather the elements at the masked positions
    # input_tensor before: [batch_size, seq_length, hidden_size]
    # input_tensor after:  [batch_size * max_predictions_per_seq, hidden_size]
    input_tensor = gather_indexes(input_tensor, positions)

    with tf.variable_scope("cls/predictions"):
        # We apply one more non-linear transformation before the output layer.
        # This matrix is not used after pre-training.
        # Pass the masked elements through a fully connected layer and layer_norm
        with tf.variable_scope("transform"):
            input_tensor = tf.layers.dense(
                input_tensor,
                units=bert_config.hidden_size,
                activation=modeling.get_activation(bert_config.hidden_act),
                kernel_initializer=modeling.create_initializer(
                    bert_config.initializer_range))
            input_tensor = modeling.layer_norm(input_tensor)
    
        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        # output_bias: [vocab_size]
        output_bias = tf.get_variable(
            "output_bias",
            shape=[bert_config.vocab_size],
            initializer=tf.zeros_initializer())
    
        # transpose_b=True transposes the second argument
        # logits: [batch_size * max_predictions_per_seq, vocab_size]
        logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        # log_probs: [batch_size * max_predictions_per_seq, vocab_size]
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        # [batch_size, max_predictions_per_seq] -> batch_size * max_predictions_per_seq
        label_ids = tf.reshape(label_ids, [-1])
        # [batch_size, max_predictions_per_seq] -> batch_size * max_predictions_per_seq
        label_weights = tf.reshape(label_weights, [-1])
        # one_hot_labels: [batch_size * max_predictions_per_seq, vocab_size]
        one_hot_labels = tf.one_hot(label_ids, depth=bert_config.vocab_size, dtype=tf.float32)
    
        # The `positions` tensor might be zero-padded (if the sequence is too
        # short to have the maximum number of predictions). The `label_weights`
        # tensor has a value of 1.0 for every real prediction and 0.0 for the
        # padding predictions.
        # per_example_loss: [batch_size * max_predictions_per_seq]
        # (element-wise product, then sum over the vocab axis)
        per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
        numerator = tf.reduce_sum(label_weights * per_example_loss)  # weighted sum of the losses
        denominator = tf.reduce_sum(label_weights) + 1e-5  # sum of the weights, plus a small epsilon
        # Compute the weighted mean loss
        loss = numerator / denominator
    
    return (loss, per_example_loss, log_probs)

After training finishes, the trained model is saved to output_dir.

Converting to a PyTorch model

We trained the model with Tensorflow, but our text classification model uses PyTorch, so the Tensorflow model has to be converted into a PyTorch model.

Here we use the conversion code provided by HuggingFace.

The code file is convert_checkpoint.py and the script file is convert_checkpoint.sh. The script is as follows:

# --bert_config_file: Bert config file; --tf_checkpoint: Tensorflow checkpoint name;
# --config: config file (the same bert_config.json); --pytorch_dump_path: output PyTorch model file
export BERT_BASE_DIR=./bert-mini   # model directory
python convert_checkpoint.py \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --tf_checkpoint $BERT_BASE_DIR/bert_model.ckpt-100000 \
  --config $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin

Note that you need to install tensorflow, pytorch, and transformers first.

Fine-tuning the Bert model

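In an earlier article, "Alibaba Tianchi NLP Beginner Competition: TextCNN Approach, Detailed Code Comments and Walkthrough", we trained the model with TextCNN. The model architecture diagram is as follows: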

The WordCNNEncoder in the diagram is the TextCNN.

We replace TextCNN with Bert.

The resulting model architecture diagram is as follows:

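We focus only on how WordBertEncoder is used. The rest of the model is the same as in the TextCNN article; for those details, see "Alibaba Tianchi NLP Beginner Competition: TextCNN Approach, Detailed Code Comments and Walkthrough".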

The WordBertEncoder code is shown below.

First, load the converted PyTorch model.

In forward(), input_ids and token_type_ids are fed into the Bert model.

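This yields sequence_output (the hidden states output by the last Encoder layer) and pooled_output (the hidden state of the first token of the last Encoder layer, passed through a linear layer and a Tanh activation).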

The code contains detailed comments.

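import logging
import os.path as osp

import torch.nn as nn
from transformers import BertModel

# build word encoder
# `dir` and WhitespaceTokenizer are assumed to be defined earlier in the original script
bert_path = osp.join(dir, './bert/bert-mini/')
dropout = 0.15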


class WordBertEncoder(nn.Module):
    def __init__(self):
        super(WordBertEncoder, self).__init__()
        self.dropout = nn.Dropout(dropout)

        self.tokenizer = WhitespaceTokenizer()
        # Load the Bert model
        self.bert = BertModel.from_pretrained(bert_path)
    
        self.pooled = False
        logging.info('Build Bert encoder with pooled {}.'.format(self.pooled))
    
    def encode(self, tokens):
        tokens = self.tokenizer.tokenize(tokens)
        return tokens
    
    # Parameters whose names contain 'bias' or 'LayerNorm.weight' get no weight decay;
    # all other parameters get a weight decay of 0.01
    def get_bert_parameters(self):
        no_decay = ['bias', 'LayerNorm.weight']
        optimizer_parameters = [
            {'params': [p for n, p in self.bert.named_parameters() if not any(nd in n for nd in no_decay)],
             'weight_decay': 0.01},
            {'params': [p for n, p in self.bert.named_parameters() if any(nd in n for nd in no_decay)],
             'weight_decay': 0.0}
        ]
        return optimizer_parameters
    
    def forward(self, input_ids, token_type_ids):
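        # bert_len is the sentence length
        # input_ids: sen_num * bert_len
        # token_type_ids: sen_num * bert_len

        # 256 is hidden_size
        # sequence_output: sen_num * bert_len * 256. The hidden states output by the last Encoder layer.
        # pooled_output: sen_num * 256. Takes the hidden state at the first position of the last
        # Encoder layer's output (the CLS token), a 256-dimensional vector, and passes it through
        # a linear layer and a Tanh activation to produce the final 256-dimensional vector.
        # It can be used directly for classification.
        # The Bert output holds up to 4 elements: last_hidden_state, pooler_output, hidden_states,
        # attentions (the last two only when output_hidden_states / output_attentions are enabled).
        # Index the output instead of unpacking it, so this works whether the installed
        # transformers version returns a tuple or a ModelOutput.
        outputs = self.bert(input_ids=input_ids, token_type_ids=token_type_ids)
        sequence_output, pooled_output = outputs[0], outputs[1]

        if self.pooled:
            reps = pooled_output             # pooled CLS representation: sen_num * 256
        else:
            reps = sequence_output[:, 0, :]  # hidden state of the first token (CLS): sen_num * 256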
    
        if self.training:
            reps = self.dropout(reps)
    
        return reps # sen_num * 256
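
The name-matching rule used in get_bert_parameters() can be tried out on plain strings, without loading a model. The sketch below is only an illustration: split_by_weight_decay is a hypothetical helper, and the parameter names are written in the style that named_parameters() produces for a Bert model rather than taken from the trained checkpoint.

```python
NO_DECAY = ("bias", "LayerNorm.weight")


def split_by_weight_decay(names, no_decay=NO_DECAY):
    """Split parameter names into (decayed, not_decayed) by substring match."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, not_decayed


# Illustrative names only (assumed, not read from the checkpoint).
names = [
    "embeddings.word_embeddings.weight",
    "embeddings.LayerNorm.weight",
    "embeddings.LayerNorm.bias",
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.0.attention.self.query.bias",
]
decayed, not_decayed = split_by_weight_decay(names)
```

In the encoder itself, the two groups become the 'params' lists of the dictionaries returned by get_bert_parameters(), with weight_decay set to 0.01 and 0.0 respectively, which is the per-parameter-group format that PyTorch optimizers accept.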