Python: Day 16 – Lesson 16


Let's get the core idea of the Seq2Seq model

The Seq2Seq model is mainly used to convert one sequence into another, for example in French-to-English translation. It consists of two deep neural networks, each of which can be a recurrent architecture such as an RNN (recurrent neural network) or an LSTM (long short-term memory network). One network maps the input sequence to a fixed-dimensional vector; this is the encoding process. Another network then maps this vector to the target sequence; this is the decoding process. The model structure of Seq2Seq is shown in Fig 1: the model reads the input sentence "ABC" and generates "WXYZ" as the output sentence.



Fig1 Model structure of Seq2Seq


Encoding and decoding

Encoding and decoding are the core of the Seq2Seq model. In what follows, an RNN is used as the example network to explain the principle of encoding and decoding. The structure of encoding and decoding is shown in Fig 2.


Fig 2 Structure of encoding and decoding


Encoding: The RNN reads each symbol of the input sequence X sequentially. As each symbol is read, the hidden state h^{<t>} of the RNN is updated according to the formula below. After the end of the sequence has been read, the hidden state of the RNN is the summary c of the entire input sequence.


h^{<t>} = f(h^{<t-1>}, x_t)


Decoding: Another RNN is trained to generate the output sequence by predicting the next symbol y_t. The hidden state h^{<t>} of the decoder at time t is computed as follows; both y_t and h^{<t>} are conditioned on y_{t-1} and on the summary c.

h^{<t>} = f(h^{<t-1>}, y_{t-1}, c)


The conditional distribution for the next symbol is


P(y_t | y_{t-1}, y_{t-2}, ..., y_1, c) = g(h^{<t>}, y_{t-1}, c)
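A matching sketch of one decoder step is shown below: it updates the decoder hidden state from h^{<t-1>}, the previous symbol y_{t-1} and the summary c, then turns it into a probability distribution over the next symbol with a softmax. The weight names and the use of a softmax output layer for g are assumptions for illustration; they reuse the toy sizes from the encoder sketch.

```python
import numpy as np

hidden_size, embed_size, vocab_size = 8, 6, 10       # toy sizes, assumed
rng = np.random.default_rng(1)

W_yh = rng.normal(size=(hidden_size, embed_size))    # previous output -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))   # hidden -> hidden
W_ch = rng.normal(size=(hidden_size, hidden_size))   # summary c -> hidden
W_hy = rng.normal(size=(vocab_size, hidden_size))    # hidden -> output scores
embed = rng.normal(size=(vocab_size, embed_size))    # target-side embeddings

def decoder_step(h_prev, y_prev, c):
    """h<t> = f(h<t-1>, y_{t-1}, c) and P(y_t | previous y's, c) = g(h<t>, y_{t-1}, c)."""
    h = np.tanh(W_yh @ embed[y_prev] + W_hh @ h_prev + W_ch @ c)
    scores = W_hy @ h
    p = np.exp(scores - scores.max())
    p /= p.sum()                                     # softmax over the vocabulary
    return h, p

c = rng.normal(size=hidden_size)                     # stand-in for the encoder summary
h, p = decoder_step(np.zeros(hidden_size), y_prev=0, c=c)
print(p.sum(), int(p.argmax()))                      # 1.0 and the most likely next symbol
```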

The encoder and the decoder are trained jointly by maximizing the conditional log-likelihood:


max_θ (1/N) Σ_{n=1}^{N} log p_θ(y_n | x_n)


where θ is the set of model parameters and each (x_n, y_n) is a pair consisting of an input sequence and an output sequence from the training set.
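In practice, maximizing this log-likelihood is usually implemented as minimizing the average negative log-probability (cross-entropy) of the reference output tokens. The sketch below only shows how the objective is computed from per-step predicted distributions; the probabilities themselves are made-up stand-ins, not the output of a trained model.

```python
import numpy as np

def sequence_log_likelihood(step_probs, target_tokens):
    """Sum of log P(y_t | previous y's, x) over one target sequence."""
    return sum(np.log(p[y]) for p, y in zip(step_probs, target_tokens))

# Toy example: 3 decoding steps over a vocabulary of size 5 (made-up numbers).
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(5)) for _ in range(3)]   # stand-ins for g(h<t>, ...)
targets = [2, 0, 4]                                     # reference output tokens

ll = sequence_log_likelihood(probs, targets)
loss = -ll / len(targets)        # average negative log-likelihood to minimize
print(ll, loss)
```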


The structure of the hidden unit of the RNN is shown in Fig 3; it includes an update gate z and a reset gate r. The update gate selects whether the hidden state is updated with the new candidate hidden state. The reset gate determines whether the previous hidden state is ignored.


Fig 3 Structure of hidden unit


The specific process of computing the activation of the j-th hidden unit in the RNN is as follows:


The reset gate r_j is computed as:


r_j = σ([W_r x]_j + [U_r h^{<t-1>}]_j)


where σ is the logistic sigmoid function and [·]_j denotes the j-th element of a vector. x and h^{<t-1>} are the input and the previous hidden state, respectively. W_r and U_r are learned weight matrices.


The update gate z_j is then computed in the same way:


z_j = σ([W_z x]_j + [U_z h^{<t-1>}]_j)


The activation of the hidden unit h_j is then computed as:

h_j^{<t>} = z_j h_j^{<t-1>} + (1 - z_j) h̃_j^{<t>}

where h̃_j^{<t>} is the candidate activation computed from the current input and the reset-gated previous state (⊙ denotes element-wise multiplication):

h̃_j^{<t>} = tanh([W x]_j + [U (r ⊙ h^{<t-1>})]_j)

In this formula, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and is reset with the current input only. This effectively allows the hidden unit to drop information that later turns out to be irrelevant.
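The hidden unit described above is essentially a GRU cell. The sketch below implements the reset gate, the update gate, the candidate activation, and the interpolation of the new hidden state for a whole hidden vector at once (so r, z and h are vectors rather than single components j). Weight names and sizes are illustrative assumptions.

```python
import numpy as np

hidden_size, input_size = 4, 3                       # toy sizes, assumed
rng = np.random.default_rng(0)

# One (W, U) weight pair per gate plus one for the candidate state.
W_r, U_r = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size))
W_z, U_z = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size))
W_h, U_h = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev):
    r = sigmoid(W_r @ x + U_r @ h_prev)              # reset gate r
    z = sigmoid(W_z @ x + U_z @ h_prev)              # update gate z
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h_prev))  # candidate; r near 0 ignores h_prev
    return z * h_prev + (1 - z) * h_tilde            # interpolate old and candidate state

h = gru_step(np.array([1.0, 0.0, -1.0]), np.zeros(hidden_size))
print(h)
```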


The Attention Mechanism

When the original Seq2Seq model performs translation, the source sentence is compressed into a single fixed-length vector, which makes it difficult for the network to handle long sentences. The attention mechanism was proposed to solve this problem effectively. Its core idea is: each time the model generates a word of the translation, it searches for the set of positions in the source sentence that are most relevant to that word. The model then predicts the new target word based on the context vector associated with these source positions, together with all previously generated target words.


Fig 4 Given the source sequence (x_1, x_2, ..., x_T), try to generate the t-th target word y_t
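A minimal sketch of this idea is shown below: given the encoder hidden states of the source positions and the current decoder state, it scores each position, normalizes the scores with a softmax, and forms the context vector as the weighted sum of the encoder states. The dot-product scoring function is one simple choice among several (the original attention paper uses a small feed-forward network to score positions); all names here are illustrative.

```python
import numpy as np

def attention_context(encoder_states, decoder_state):
    """Weight each source position by its relevance to the current decoder state."""
    scores = encoder_states @ decoder_state          # one relevance score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over source positions
    context = weights @ encoder_states               # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))                        # 5 source positions, hidden size 8 (toy)
dec = rng.normal(size=8)                             # current decoder hidden state
context, weights = attention_context(enc, dec)
print(weights.round(2), context.shape)               # attention weights and (8,) context
```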

