1 - Sequence to Sequence Learning with Neural Networks

在本系列中，我们将使用PyTorch和TorchText构建一个机器学习模型，从一个序列到另一个序列的转换。这将在德语到英语翻译中完成，但模型可以应用于涉及从一个序列到另一个序列的任何问题，例如摘要。

在第一本笔记本中，我们将通过实现神经网络的序列到序列学习论文中的模型，开始简单地理解一般概念。本文使用的是pytorch 1.0 和 torchtext 0.3，在python3.6环境下。

简介

最常见的序列到序列（seq2seq）模型是 编码器 - 解码器(encoder-decoder) 模型，其（通常）使用递归神经网络（RNN）来将源（输入）句子编码成单个向量。在这个笔记本中，我们将这个单个向量称为上下文向量(context vector)。您可以将上下文向量视为整个输入句子的抽象表示。然后，该矢量由第二RNN解码，该第二RNN通过一次生成一个字来学习输出目标（输出）句子。

上图显示了一个示例翻译。输入语句 “guten morgen” 一次一个单词地输入编码器（绿色）。我们还将序列开头（<sos>）和序列结尾（<eos>）标记的开头分别附加到句子的开头和结尾。在每一步里，编码器RNN的输入都是当前单词$ x_t $，以及来自前一时间步的隐藏状态$ h_ {t-1}$ ，并且编码器RNN输出新的隐藏状态$ h_t $。到目前为止，您可以将隐藏状态(hidden state)视为句子的向量表示。

$h_t = \text{EncoderRNN}(x_t, h_{t-1})$

我们这里通常使用RNN架构，它可以是任何循环架构，例如LSTM （长短期存储器）或 GRU(门控循环单元）。

在这里，我们有

$X = \{ x_1，x_2，...，x_T\}\ ,$

其中

$x_1 = \ text\ {<sos>}\ ，x_2 = \ text\ \{guten\} \ \text {等}$

初始隐藏状态—$h_0$ 通常初始化为零或学习参数。

一旦最后一个字$ x_T $被传递到RNN，我们使用最终隐藏状态$ h_T $作为上下文向量(context vector)，即$ h_T = z $。这是整个源句子的向量表示。

现在我们有了我们的上下文向量$ z $，我们可以开始解码它来获得目标句子——“早上好”。同样，我们将序列标记的开始和结束附加到目标句子。在每个时间步骤，解码器RNN（蓝色）的输入是当前单词，$ y_t $，以及来自前一时间步的隐藏状态，$ s_{t-1} $。其中初始解码器隐藏状态 $ s_0 $是上下文向量(context vector)，$ s_0 = z = h_T $，即初始解码器隐藏状态是最终编码器隐藏状态。

解码器的公式如下：

$s_t = \text{DecoderRNN}(y_t, s_{t-1})$

在解码器中，我们需要从隐藏状态转换为实际单词，因此在每个时间步骤我们使用$ s_t $进行预测（通过传递给它的线性全连接层，以紫色显示）我们认为是序列中的下一个单词，$ \hat{y}_t $。

$\hat{y}_t = f(s_t)$

我们总是使用<sos>作为解码器的第一个输入，$ y_1 $，但是对于后续输入，$ y_ {t>1} $，我们有时会使用序列中实真实的下一个单词，$ y_t $。有时也会使用我们的解码器预测的单词$ \hat{y}_{t-1} $。这被称为teacher forcing，你可以在这里阅读更多内容。

在训练/测试我们的模型时，我们总是知道目标句子中有多少单词，所以一旦我们达到那么多，我们就会停止生成单词。在推理（即现实世界使用）期间，通常保持生成单词直到模型输出<eos>标记或者在生成一定量的单词之后。

一旦我们得到了预测的目标句子，

$\hat{Y} = \{\hat{y}_1，\hat{y}_2，...，\hat{y}_T \} \ ，$

我们将它与我们的比较实际目标句子，

$Y = \{y_1，y_2，...，y_T\}\ ，$

以计算我们的损失。然后我们使用此损失来更新模型中的所有参数。

数据处理

我们将在pytorch中编写模型并使用torchtext帮助我们完成所需的所有预处理。我们还将使用spacy来协助数据的标记化。

import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator

import spacy

import random
import math
import os
import time

SEED = 1
random.seed(SEED)  # set random seed
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

接下来，我们将创建tokenizer。 Tokenizer用于将包含句子的字符串转换为构成该字符串的各个token的列表，例如， “Good morning!” 转换成为[“早上”，“好”，“!”]。我们从现在开始讨论的句子是一系列token，而不是说它们是一系列单词。

有什么不同？好吧，“Good”和“morning”都是单词和token。但是“!” 是一个token，而不是一个单词。

spacy有每种语言的模型（德语为“de”，英语为“en”）需要加载，因此我们可以访问每个模型的标记器。

注意：必须首先使用命令行下载en,de模型：

python -m spacy download en
python -m spacy download de

我们这样加载模型：

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

接下来，我们创建tokenizer函数。 tokenizer函数可以传递给TorchText，并将字符串(句子)作为函数的输入并将句子转换成的token列表作为输出。

在我们正在实现的论文中，他们发现扭转输入顺序是有益的，他们认为“在数据中引入了许多短期依赖关系，使优化问题变得更加容易”。

def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings (tokens) and reverses it
    """
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

这里我们使用TorchText的Field函数处理数据。您可以阅读所有可能的参数此处。

我们将tokenize参数设置为每个的正确标记化函数，德语是SRC（源）字段，英语是TRG（目标）字段。该字段还通过init_token和eos_token参数附加“序列开始”和“序列结束”标记，并将所有单词转换为小写。

SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

接下来，我们下载并加载训练集，验证集和测试集数据。我们将使用的数据集是Multi30k数据集。这是一个包含约30,000个并行英语、德语和法语句子的数据集，每个句子大约12个单词。exts指定使用哪些语言作为源和目标（源首先）和fields指定用于源和目标的字段。

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

接下来，我们将为源语言和目标语言构建词汇表(vocabulary)。词汇表用于将每个唯一的token与索引（整数）相关联，这用于为每一个token构建一个热编码(one-hot encoding)。源语言和目标语言的词汇表是截然不同的。 使用min_freq参数，我们只允许至少出现2次的token出现在我们的词汇表中。仅出现一次令牌转换成<UNK>（未知）token。重要的是要注意，您的词汇表只能从训练集而不是验证/测试集构建。这可以防止“信息泄漏”进入您的模型，为您提供人为夸大的验证/测试分数。

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

准备数据的最后一步是创建迭代器。这些可以被迭代以返回一批具有src属性的数据（包含一批数字化源句的PyTorch张量）和trg属性的数据（包含一批数字化目标句子的PyTorch张量）。数字化只是一种奇特的方式，即使用词汇表将它们从可读token序列转换为相应的索引序列。

我们还需要定义一个torch.device。这用于告诉TorchText将张量放在GPU上。我们使用torch.cuda.is_available（）函数，如果在我们的计算机上检测到GPU，它将返回True。我们将这个device传递给迭代器。

当我们使用迭代器获得一批示例时，我们需要确保所有源句子都填充到相同的长度，与目标句子相同。幸运的是，TorchText迭代器为我们处理这个问题！我们使用BucketIterator而不是标准的Iterator，因为它创建批次的方式可以最大限度地减少源句和目标句子中的填充量。

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)

构建Seq2Seq Model

我们将分三个部分构建我们的模型。编码器，解码器和封装编码器和解码器的seq2seq模型，并且将提供与每个模块的接口。

Encoder

首先，编码器，2层LSTM。我们正在实现的论文使用了4层LSTM，但为了训练时间，我们将其减少到2层。多层RNN的概念很容易从2层扩展到4层。对于多层RNN，输入句子$ X $进入RNN的第一层(最底层)，而第一层的隐藏状态输出，$ H = {h_1，h_2，…，h_T } $，作为上一层的RNN的输入。因此，用上标表示每个层，第一层中的隐藏状态由下式给出：

$h_t^1 = \text{EncoderRNN}^1(x_t, h_{t-1}^1)$

第二层中的隐藏状态由下式给出：

$h_t^2 = \text{EncoderRNN}^2(h_t^1, h_{t-1}^2)$

使用多层RNN也意味着我们还需要一个初始隐藏状态作为每层输入，$ h_0 ^ 1 $，我们还将输出每层的上下文向量，$ z ^ 1$。如果你想了解更多关于LSTM的信息，请参阅LSTM博客文章，我们只需要知道的是，它是另一种类型的RNN——不是仅仅处读取一个隐藏状态，在每个时间步返回一个新的隐藏状态。LSTM在每个时间步也接收并返回单元状态，$ c_t $。

$\begin{align*} h_t &= \text{RNN}(x_t, h_{t-1})\\ (h_t, c_t) &= \text{LSTM}(x_t, (h_{t-1}, c_{t-1})) \end{align*}$

您可以将$ c_t $视为另一种隐藏状态。类似于$ h_0^1$，$ c_0^1$将被初始化为全零的张量。此外，我们的上下文向量现在是最终隐藏状态和最终单元状态，即$ z^1 =（h_T^1，c_T^1）$。

将我们的多层方程扩展到LSTM，我们得到：

$\begin{align*} (h_t^1, c_t^1) &= \text{EncoderLSTM}^1(x_t, (h_{t-1}^1, c_{t-1}^1))\\ (h_t^2, c_t^2) &= \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2)) \end{align*}$

请注意，我们只将第一层的隐藏状态作为输入传递给第二层，而不是单元状态。

LSTM

We create this in code by making an Encoder module, which requires we inherit from torch.nn.Module and use the super().__init__() as some boilerplate code. The encoder takes the following arguments:

input_dim is the size/dimensionality of the one-hot vectors that will be input to the encoder. This is equal to the input (source) vocabulary size.
emb_dim is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with emb_dim dimensions.
hid_dim is the dimensionality of the hidden and cell states.
n_layers is the number of layers in the RNN.
dropout is the amount of dropout to use. This is a regularization parameter to prevent overfitting. Check out this for more details about dropout.

我们不打算在这些教程中讨论嵌入层。您需要知道的是，在单词之前还有一个步骤上——word embedding。单词的索引被传递到RNN，其中单词被转换为向量。如果您想了解嵌入词向量，我建议这些文章1，2，3，4。

The embedding layer is created using nn.Embedding, the LSTM with nn.LSTM and a dropout layer with nn.Dropout. Check the PyTorch documentation for more about these.

在forward方法中，我们传入源句子$ X $，使用embedding层将其转换为密集向量，然后应用dropout。然后将这些嵌入传递到RNN。当我们将整个序列传递给RNN时，它会自动为整个序列重复计算隐藏状态！您可能会注意到我们没有将初始隐藏或单元状态传递给RNN。这是因为，如文档所述，如果没有将隐藏/单元状态传递给RNN，它将会自动创建一个初始隐藏/单元格状态作为全零的张量。

RNN返回：outputs（每个时间步骤的顶层隐藏状态），hidden（每层的最终隐藏状态，$ h_T $，堆叠在彼此之上）和cell（每层的最终单元状态，$ c_T $，相互叠加）。

因为我们只需要最终的隐藏和单元格状态（使我们的上下文向量），forward只返回hidden和cell。

每个张量的大小在代码中留作注释。在此实现中，n_directions将始终为1，但请注意，双向RNN（后续介绍）将具有n_directions为2。

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.dropout = dropout
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src sent len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src sent len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src sent len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

Decoder

接下来，我们将构建我们的解码器，它也将是一个2层（论文中为4个）LSTM。

LSTM

Decoder类只执行一个解码步骤。第一层将从前一个时间步骤(previous time step)，接收隐藏和单元状态$ s_ {t-1} ^ 1，c_ {t-1} ^ 1）$。LSTM使用当前token, $ y_t $，产生一个新的隐藏和单元状态，$（s_t ^ 1，c_t ^ 1）$。后续层将使用前面层中的隐藏状态，$ s_t ^ {l-1} $，以及来自当前层的之前的隐藏和单元状态，$（s_ {t-1} ^ l，c_ {t-1}^ l）$。这提供了与编码器中的方程非常相似的方程。

$\begin{align*} (s_t^1, c_t^1) = \text{DecoderLSTM}^1(y_t, (s_{t-1}^1, c_{t-1}^1))\\ (s_t^2, c_t^2) = \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2)) \end{align*}$

请记住，解码器的初始隐藏和单元状态是我们的上下文向量，它们是来自同一层的编码器的最终隐藏和单元状态，即$（s_0 ^ 1，c_0 ^ 1）= z ^ 1 =（h_T^1，C_T^1）$。

然后我们通过线性层$ f $传递来自RNN顶层的隐藏状态$ s_t ^L $，以预测目标（输出）序列中的下一个token应该是什么，$ \hat{Y}_ {T +1}$。

$\hat{y}_{t+1} = f(s_t^L)$

参数和初始化类似于Encoder类，除了我们现在有一个output_dim，它是输入到Decoder的one-hot矢量的维度大小——这些等于输出/目标的词汇量大小。我们还增加了线性连接层，用于从顶层隐藏状态进行预测。

在forward方法中，我们接受一批输入的token，先前的隐藏状态和先前的单元状态。我们压缩输入标记以添加1的句子长度维度。然后，类似于编码器，我们通过嵌入层并应用dropout。然后将这批嵌入式token传递到具有先前隐藏和单元状态的RNN。这会产生一个output（来自RNN最顶层的隐藏状态），一个新的hidden状态（每层一个，堆叠在一起）和一个新的cell状态（每层一个，堆叠在彼此之上）。然后我们通过线性层传递output（在删除句子长度维度之后）以接收我们的prediction。然后我们返回prediction，新的hidden状态和新的cell状态。

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.output_dim = output_dim
        self.n_layers = n_layers
        self.dropout = dropout
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        
        self.out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will both always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #context = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [sent len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #sent len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

Seq2Seq

对于实现的最后部分，我们将实现seq2seq模型。这将处理：

接收输入/源语句
使用编码器生成上下文向量
使用解码器产生预测的输出/目标句子

我们的完整模型将如下所示：

seq2seq

Seq2Seq模型包含Encoder，Decoder和device（用于在GPU上放置张量，如果存在的话）。

对于这个实现，我们必须确保Encoder和Decoder中的层数和隐藏（和单元）维度相等。然而情况并非总是如此。在序列到序列模型中，您不一定需要相同数量的层或相同的隐藏层维度。但是，如果您执行的操作具有不同数量的层，则需要决定如何处理这些图层。例如，如果您的编码器有2层而您的解码器只有1，那么这是如何处理的？你对解码器输出的两个上下文向量求平均值吗？你是否通过线性层？您是否仅使用最高层的上下文向量？等等。

我们的forward方法需要输入源语句，目标语句和teacher-forcing ratio(不知道怎么翻译？)。在训练我们的模型时使用teacher-forcing ratio。解码时在每个时间步骤，我们将使用来自先前解码的标记，$ \ hat {y} _ {t + 1} = f（s_t ^ L）$预测目标序列中的下一个token。当概率等于teacher_forcing_ratio时，我们将使用序列中的实际的下一个token作为下一个时间步骤期间解码器的输入。但是，在概率为1 - teacher_forcing_ratio的情况下，我们将使用模型预测的token作为模型的下一个输入，即使它与序列中的实际下一个token不匹配。

我们在forward方法中做的第一件事就是创建一个outputs张量，它将存储我们所有的预测，$ \ hat {Y} $。

然后，我们将输入/源语句$ X $ /src提供给编码器，并接收最终的隐藏和单元状态。

解码器的第一个输入是序列的开始<sos>token。因为我们的trg张量已经附加了<sos>标记（追溯到我们在TRG字段中定义init_token），我们通过切入它得到$ y_1 $。我们知道我们的目标句子应该是多长max_len，所以我们循环max_len次。

在循环的每次迭代期间，我们：

将输入、先前隐藏和先前的单元格状态（$ y_t，s_ {t-1}，c_ {t-1} $）传递到解码器；
从解码器接收预测，下一个隐藏状态和下一个单元状态（$ \hat {y} _ {t + 1}，s_ {t}，c_ {t} $）；
在我们的预测张量中放置我们的预测，$ \hat {y} _ {t + 1} $/ output，$ \hat {Y} $/ outputs；
决定我们是否要去“teaching force”:
- 如果我们这样做，下一个input是将会是序列中实际的下一个token，$ y_ {t + 1} $/ trg[t];
- 如果我们不这样做，下一个input是序列中预测的下一个token，$ \hat {y} _ {t + 1} $ /top1;

一旦我们完成了所有的预测，我们就会返回充满预测的张量，$ \hat {Y} $ /outputs。

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        
        #src = [src sent len, batch size]
        #trg = [trg sent len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, max_len):
            
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            input = (trg[t] if teacher_force else top1)
        
        return outputs

训练seq2seq模型

现在我们已经实施了模型，我们可以开始训练它。首先，我们将初始化我们的模型。如前所述，输入和输出维度由词汇表的大小定义。编码器和解码器的嵌入尺寸和dropout可以不同，但是层数和隐藏/单元状态的大小必须相同。然后我们定义编码器，解码器，接着定义我们放置在指定device上的Seq2Seq模型。

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

我们还定义了一个函数，用于计算模型中可训练参数的数量。

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 13,898,501 trainable parameters.

我们定义了我们的优化器，我们用它来更新模型中的参数。查看此帖子，了解有关不同优化器的信息。在这里，我们将使用Adam optimizer。

optimizer = optim.Adam(model.parameters())

接下来，我们定义我们的损失函数。 CrossEntropyLoss函数计算log softmax以及我们预测的负对数似然。我们的损失函数计算每个标记的平均损失，但是通过将<pad>标记的token作为ignore_index参数传递，只要目标token是padding token，我们就会忽略其损失。

pad_idx = TRG.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

接下来，我们将定义我们的训练循环。首先，我们将使用model.train（）将模型设置为traning mode。这将打开dropout（和batch normalization），然后遍历我们的数据迭代器。

在每次迭代时：

从批处理中获取源和目标句子，$ X $和$ Y $
将最后一个batch计算的梯度初始化为零
将源语句和目标语句输入模型以获得输出$ \hat{Y} $
由于损失函数仅适用于具有1d目标函数(1维)的2d(2维)输入，我们需要用.view（）来扁平化
- 我们也不想测量<sos>token的损失，因此我们切掉了输出的第一列和目标语句的第一列张量
用loss.backward（）计算梯度
剪切梯度以防止产生梯度爆炸（RNN中的常见问题!!!）
通过执行优化步骤更新模型的参数
计算运行过程中的总体损失

最后，我们返回所有批次的平均损失。

def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg sent len, batch size]
        #output = [trg sent len, batch size, output dim]
        
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        
        #trg = [(trg sent len - 1) * batch size]
        #output = [(trg sent len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

我们的评估循环类似于我们的训练循环，但是由于我们没有更新任何参数，所以我们不需要传递优化器或剪切值(clip value)。我们必须记住使用model.eval（）将模型设置为评估模式。这将关闭dropout（和batch normalization）。我们使用with torch.no_grad（）块来确保在块内没有计算梯度，这可以减少内存消耗并加快速度。迭代循环类似（没有参数更新），但是我们必须确保使用teaching force进行评估。这将导致模型仅使用它自己的预测的单词来进行下一个单词的预测，真实地反映了它在部署中的使用性能。

def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg sent len, batch size]
            #output = [trg sent len, batch size, output dim]

            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            #trg = [(trg sent len - 1) * batch size]
            #output = [(trg sent len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

接下来，我们将创建一个函数，我们将用它来告诉我们一个epoch所需的时间。

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

我们终于可以开始训练我们的模型！在每个epcoh，我们将检查我们的模型到目前为止是否已达到最佳的验证集损失。如果有，我们将更新我们的最佳验证损失并保存我们模型的参数（在PyTorch中称为state_dict）。然后，当我们来测试我们的模型时，我们将使用保存的参数来实现最佳的验证损失。我们将在每个epoch打印出loss和困惑度(perplexity)。由于数字更大，更容易看到困惑度的变化而不是损失的变化。

N_EPOCHS = 10
CLIP = 1
SAVE_DIR = 'models'
MODEL_SAVE_PATH = os.path.join(SAVE_DIR, 'tut1_model.pt')

best_valid_loss = float('inf')

if not os.path.isdir(f'{SAVE_DIR}'):
    os.makedirs(f'{SAVE_DIR}')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_SAVE_PATH)
    
    print(f'| Epoch: {epoch+1:03} | Time: {epoch_mins}m {epoch_secs}s| Train Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f} | Val. Loss: {valid_loss:.3f} | Val. PPL: {math.exp(valid_loss):7.3f} |')

| Epoch: 001 | Time: 0m 28s| Train Loss: 5.009 | Train PPL: 149.782 | Val. Loss: 4.892 | Val. PPL: 133.192 |
| Epoch: 002 | Time: 0m 27s| Train Loss: 4.450 | Train PPL:  85.609 | Val. Loss: 4.682 | Val. PPL: 108.018 |
| Epoch: 003 | Time: 0m 28s| Train Loss: 4.205 | Train PPL:  66.994 | Val. Loss: 4.479 | Val. PPL:  88.129 |
| Epoch: 004 | Time: 0m 28s| Train Loss: 4.018 | Train PPL:  55.571 | Val. Loss: 4.345 | Val. PPL:  77.091 |
| Epoch: 005 | Time: 0m 27s| Train Loss: 3.842 | Train PPL:  46.624 | Val. Loss: 4.281 | Val. PPL:  72.325 |
| Epoch: 006 | Time: 0m 27s| Train Loss: 3.689 | Train PPL:  39.989 | Val. Loss: 4.155 | Val. PPL:  63.763 |
| Epoch: 007 | Time: 0m 28s| Train Loss: 3.572 | Train PPL:  35.582 | Val. Loss: 3.992 | Val. PPL:  54.143 |
| Epoch: 008 | Time: 0m 27s| Train Loss: 3.444 | Train PPL:  31.320 | Val. Loss: 3.972 | Val. PPL:  53.100 |
| Epoch: 009 | Time: 0m 27s| Train Loss: 3.343 | Train PPL:  28.294 | Val. Loss: 3.905 | Val. PPL:  49.631 |
| Epoch: 010 | Time: 0m 26s| Train Loss: 3.248 | Train PPL:  25.745 | Val. Loss: 3.836 | Val. PPL:  46.326 |

我们将加载参数（state_dict），为我们的模型提供最佳验证损失，并在测试集上运行模型。

model.load_state_dict(torch.load(MODEL_SAVE_PATH))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.850 | Test PPL:  47.015 |

在后续的博文中，我们将实现一个实现改进的模型，模型效果更好但仅在编码器和解码器中使用单个层。

seq2seq

用pytorch实现seq2seq