Let's start with the core idea of the Seq2Seq model
The Seq2Seq model is mainly used to convert one sequence into another, for example in French-to-English translation. It consists of two deep neural networks; each of them can be a recurrent architecture such as an RNN (recurrent neural network) or an LSTM (long short-term memory) network. One network maps the input sequence to a fixed-dimensional vector, which is the encoding process; another network then maps this vector to the target sequence, which is the decoding process. The model structure of Seq2Seq is shown in Fig 1: the model reads the input sentence "ABC" and generates "WXYZ" as the output sentence.
Fig 1 Model structure of Seq2Seq
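As a quick illustration, here is a minimal toy sketch of that interface in NumPy: an `encode` step folds the source symbols into one fixed-dimensional vector, and a `decode` step unfolds it into output symbols. The names, sizes, and random weights below are hypothetical placeholders, not a trained translator.

```python
import numpy as np

# Toy sketch of the Seq2Seq interface (hypothetical, untrained): an encoder folds
# the source sequence "A B C" into one fixed-dimensional vector c, and a decoder
# unfolds c into a target sequence of a chosen length.

rng = np.random.default_rng(0)
hidden_size = 8
vocab = ["A", "B", "C", "W", "X", "Y", "Z"]
embed = {tok: rng.standard_normal(hidden_size) for tok in vocab}  # toy embeddings
W = rng.standard_normal((hidden_size, hidden_size)) * 0.1          # toy recurrence

def encode(source_tokens):
    """Read the source symbols one by one; the final hidden state is the summary c."""
    h = np.zeros(hidden_size)
    for tok in source_tokens:
        h = np.tanh(W @ h + embed[tok])
    return h

def decode(c, length):
    """Generate `length` output symbols conditioned on the summary c."""
    h, out = c.copy(), []
    for _ in range(length):
        h = np.tanh(W @ h + c)
        out.append(vocab[int(np.argmax(h)) % len(vocab)])  # toy "symbol" choice
    return out

c = encode(["A", "B", "C"])
print(decode(c, length=4))   # four output symbols, standing in for "W X Y Z"
```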
Encoding and decoding
Encoding and decoding are the core of the Seq2Seq model. Here we take an RNN as the example network to explain how encoding and decoding work. The structure of encoding and decoding is shown in Fig 2.
Fig 2 Structure of encoding and decoding
Encoding: the RNN reads each symbol of the input sequence $x$ sequentially. As each symbol is read, the hidden state $h^{\langle t \rangle}$ of the RNN is updated according to the formula below. After reading the end of the sequence, the hidden state of the RNN is a summary $c$ of the entire input sequence.
$$h^{\langle t \rangle} = f\left(h^{\langle t-1 \rangle}, x_t\right)$$
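A minimal NumPy sketch of this recurrence, assuming $f$ is a simple tanh layer with weight matrices `W` and `U`; the sizes and random values are illustrative assumptions, not values from the text:

```python
import numpy as np

# Encoder recurrence h<t> = f(h<t-1>, x_t), sketched with f = tanh(W x_t + U h<t-1>).

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 6
W = rng.standard_normal((hidden_size, input_size)) * 0.1
U = rng.standard_normal((hidden_size, hidden_size)) * 0.1

def encoder_step(h_prev, x_t):
    return np.tanh(W @ x_t + U @ h_prev)

xs = rng.standard_normal((3, input_size))   # a toy source sequence of 3 symbols
h = np.zeros(hidden_size)
for x_t in xs:
    h = encoder_step(h, x_t)

c = h                                        # summary c of the entire input sequence
print(c)
```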
Decoding: another RNN is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h^{\langle t \rangle}$ of the decoder at time $t$, which is calculated as follows. Unlike in the encoder, both $y_t$ and $h^{\langle t \rangle}$ are also conditioned on $y_{t-1}$ and on the summary $c$.
$$h^{\langle t \rangle} = f\left(h^{\langle t-1 \rangle}, y_{t-1}, c\right)$$
The conditional distribution for the next symbol is
$$P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, c) = g\left(h^{\langle t \rangle}, y_{t-1}, c\right)$$
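The decoder step can be sketched the same way, assuming $f$ is a tanh layer and $g$ is a softmax over a linear output layer; all weights, sizes, and the toy embedding of the previous symbol are illustrative assumptions:

```python
import numpy as np

# Decoder sketch: h<t> = f(h<t-1>, y_{t-1}, c) and P(y_t | ...) = g(h<t>, y_{t-1}, c).

rng = np.random.default_rng(2)
hidden_size, embed_size, vocab_size = 6, 4, 10
U  = rng.standard_normal((hidden_size, hidden_size)) * 0.1
V  = rng.standard_normal((hidden_size, embed_size)) * 0.1
Cw = rng.standard_normal((hidden_size, hidden_size)) * 0.1
Wo = rng.standard_normal((vocab_size, hidden_size)) * 0.1

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def decoder_step(h_prev, y_prev_embed, c):
    h = np.tanh(U @ h_prev + V @ y_prev_embed + Cw @ c)   # f(h<t-1>, y_{t-1}, c)
    p = softmax(Wo @ h)                                    # g(...): distribution over y_t
    return h, p

c = rng.standard_normal(hidden_size)          # summary vector from the encoder
h = c.copy()                                  # initialise the decoder state from c
y_prev = np.zeros(embed_size)                 # embedding of the start symbol
for _ in range(4):
    h, p = decoder_step(h, y_prev, c)
    y_t = int(np.argmax(p))                   # greedy choice of the next symbol
    y_prev = rng.standard_normal(embed_size)  # toy stand-in for embed(y_t)

print("last distribution over the vocabulary:", p)
```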
The encoder and decoder RNNs are trained together to maximise the conditional log-likelihood:

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\left(y_n \mid x_n\right)$$
Where $\theta$ is the set of model parameters, and each $(x_n, y_n)$ is a pair of an input sequence and an output sequence from the training set.
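As a small illustration of this objective, the sketch below evaluates the average conditional log-likelihood for a toy batch, assuming the per-step probabilities the model assigned to the correct target symbols have already been computed by the decoder (the numbers are made up):

```python
import numpy as np

# Training objective: maximise (1/N) * sum_n log p_theta(y_n | x_n).

def sequence_log_likelihood(step_probs):
    """log p(y | x) = sum_t log p(y_t | y_{t-1}, ..., y_1, c)."""
    return float(np.sum(np.log(step_probs)))

# Two toy "training pairs", each represented by the probability the model assigned
# to the correct target symbol at every decoding step.
batch = [np.array([0.7, 0.6, 0.9]),
         np.array([0.5, 0.8, 0.4, 0.9])]

avg_ll = np.mean([sequence_log_likelihood(p) for p in batch])
print("average conditional log-likelihood:", avg_ll)   # training pushes this up
```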
The specific structure of the hidden unit of the RNN is shown in Fig 3; it includes an update gate $z$ and a reset gate $r$. The update gate selects whether the hidden state is updated with the new hidden state. The reset gate determines whether the previous hidden state is ignored.
Fig 3 Structure of hidden unit
The activation of the $j$-th hidden unit of the RNN is calculated as follows.
First, the reset gate $r_j$ is computed:
$$r_j = \sigma\left(\left[W_r x\right]_j + \left[U_r h^{\langle t-1 \rangle}\right]_j\right)$$
Where $\sigma$ is the logistic sigmoid function, and $[\cdot]_j$ denotes the $j$-th element of a vector. $x$ and $h^{\langle t-1 \rangle}$ are the input and the previous hidden state, respectively. $W_r$ and $U_r$ are learned weight matrices.
The update gate $z_j$ is then computed in the same way:
$$z_j = \sigma\left(\left[W_z x\right]_j + \left[U_z h^{\langle t-1 \rangle}\right]_j\right)$$
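The two gate equations can be sketched directly in NumPy. The vector version below computes every hidden unit at once, so the $j$-th elements are simply `r[j]` and `z[j]`; sizes and random weights are illustrative assumptions:

```python
import numpy as np

# Reset gate r = sigmoid(W_r x + U_r h<t-1>) and update gate z = sigmoid(W_z x + U_z h<t-1>).

rng = np.random.default_rng(3)
input_size, hidden_size = 4, 6
W_r = rng.standard_normal((hidden_size, input_size)) * 0.1
U_r = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_z = rng.standard_normal((hidden_size, input_size)) * 0.1
U_z = rng.standard_normal((hidden_size, hidden_size)) * 0.1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.standard_normal(input_size)         # current input
h_prev = rng.standard_normal(hidden_size)   # previous hidden state h<t-1>

r = sigmoid(W_r @ x + U_r @ h_prev)         # reset gate, one value per hidden unit
z = sigmoid(W_z @ x + U_z @ h_prev)         # update gate, one value per hidden unit
print("r_j values:", r)
print("z_j values:", z)
```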
Finally, the activation of the hidden unit $h_j$ is:

$$h_j^{\langle t \rangle} = z_j\, h_j^{\langle t-1 \rangle} + \left(1 - z_j\right) \tilde{h}_j^{\langle t \rangle}$$

where $\tilde{h}_j^{\langle t \rangle}$ is the candidate state computed from the current input and the reset-gated previous hidden state.
In this formula, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and is reset with the current input only. This lets the hidden unit drop information that will be irrelevant later.
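Putting the gates together, here is a sketch of one full step of the gated hidden unit, including a candidate state that uses the reset gate. Weights and sizes are again illustrative assumptions:

```python
import numpy as np

# One full step of the gated hidden unit:
#   h_tilde = tanh(W x + U (r * h<t-1>))        # candidate: r ~ 0 ignores h<t-1>
#   h<t>    = z * h<t-1> + (1 - z) * h_tilde    # update gate interpolates old/new

rng = np.random.default_rng(4)
input_size, hidden_size = 4, 6
W   = rng.standard_normal((hidden_size, input_size)) * 0.1
U   = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_r = rng.standard_normal((hidden_size, input_size)) * 0.1
U_r = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_z = rng.standard_normal((hidden_size, input_size)) * 0.1
U_z = rng.standard_normal((hidden_size, hidden_size)) * 0.1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_unit_step(h_prev, x):
    r = sigmoid(W_r @ x + U_r @ h_prev)           # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev)           # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))   # candidate hidden state
    return z * h_prev + (1.0 - z) * h_tilde       # new hidden state h<t>

h = np.zeros(hidden_size)
for x in rng.standard_normal((3, input_size)):    # a short toy input sequence
    h = gated_unit_step(h, x)
print(h)
```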
The Attention Mechanism
When the original Seq2Seq model translates, the source sentence is compressed into a fixed-length vector, which makes it difficult for the neural network to handle long sentences. The attention mechanism was proposed to solve this problem effectively. The core idea of the attention mechanism is that each time the model generates a word during translation, it searches for the set of positions in the source sentence that are most relevant to that word. The model then predicts the new target word based on the context vector associated with these source positions and on all the previously generated target words.
Fig 4 Given the source sequence $(x_1, x_2, \ldots, x_T)$, try to generate the $t$-th target word $y_t$
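A minimal sketch of one attention step: score every source position against the current decoder state, normalise the scores into weights, and form the context vector as the weighted sum of encoder states. The dot-product score used here is one common choice and an assumption, not necessarily the exact alignment model of the figure:

```python
import numpy as np

# Attention for one decoding step: relevance scores -> softmax weights -> context vector.

rng = np.random.default_rng(5)
src_len, hidden_size = 5, 6
encoder_states = rng.standard_normal((src_len, hidden_size))  # one state per source x_i
decoder_state  = rng.standard_normal(hidden_size)             # state before emitting y_t

scores  = encoder_states @ decoder_state                       # relevance of each position
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()                              # attention weights, sum to 1
context = weights @ encoder_states                             # context vector used for y_t

print("attention weights:", np.round(weights, 3))
print("context vector:", np.round(context, 3))
```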