Python: Day 20 – Lesson 20


Model Architecture

Now that you have an example use-case in your head for how BERT can be used, let’s take a closer look at how it works.




The paper presents two model sizes for BERT:


BERT BASE – comparable in size to the OpenAI Transformer, in order to compare performance

BERT LARGE – a ridiculously huge model which achieved the state-of-the-art results reported in the paper


BERT is basically a trained Transformer Encoder stack.



Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer Blocks) – twelve for the Base version, and twenty-four for the Large version. They also have larger hidden sizes (768 and 1024 units respectively, with correspondingly larger feed-forward networks) and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the initial paper (6 encoder layers, 512 hidden units, and 8 attention heads).
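To make those configurations concrete, here is a small sketch of the two sizes. It assumes the Hugging Face transformers library, which the lesson itself does not depend on; the numbers are simply the ones quoted above.

from transformers import BertConfig

# BERT Base: 12 encoder layers, hidden size 768, 12 attention heads,
# feed-forward (intermediate) size 3072
bert_base = BertConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)

# BERT Large: 24 encoder layers, hidden size 1024, 16 attention heads,
# feed-forward (intermediate) size 4096
bert_large = BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
)

print(bert_base.num_hidden_layers, bert_large.num_hidden_layers)  # 12 24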


Model Inputs




The first token of every input sequence is a special [CLS] token, for reasons that will become apparent later on. CLS here stands for Classification.
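As a quick illustration, here is roughly what the tokenized input looks like, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both are choices made for this example, not requirements of BERT itself):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer.encode("hello world")            # special tokens are added automatically
print(tokenizer.convert_ids_to_tokens(ids))      # ['[CLS]', 'hello', 'world', '[SEP]']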


Just like the vanilla encoder of the Transformer, BERT takes a sequence of words as input, which keeps flowing up the stack. Each layer applies self-attention, passes its results through a feed-forward network, and then hands them off to the next encoder.
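If it helps to see that flow in code, here is a toy sketch of one encoder layer in PyTorch. It is not the real BERT implementation – residual connections, layer normalization, and dropout are all left out – but it shows the self-attention-then-feed-forward pattern and how each layer hands its output to the next one.

import torch
import torch.nn as nn

class TinyEncoderLayer(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x):
        attended, _ = self.attention(x, x, x)   # self-attention over all positions
        return self.feed_forward(attended)      # feed-forward, handed to the next layer

# Twelve of these stacked give the shape of BERT Base; twenty-four give Large.
layers = nn.ModuleList([TinyEncoderLayer() for _ in range(12)])
x = torch.randn(1, 10, 768)                     # (batch, sequence length, hidden size)
for layer in layers:
    x = layer(x)
print(x.shape)                                  # torch.Size([1, 10, 768])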



In terms of architecture, this is identical to the Transformer up until this point (aside from size, which is just a configuration we can set). It is at the output that we first start to see how things diverge.


Model Outputs

Each position outputs a vector of size hidden_size (768 in BERT Base). For the sentence classification example we’ve looked at above, we focus on the output of only the first position (that we passed the special [CLS] token to).
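Here is a short sketch of grabbing that first-position vector, again assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("this movie was great", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0, :]  # output at the [CLS] position
print(cls_vector.shape)                          # torch.Size([1, 768])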


That vector can now be used as the input for a classifier of our choosing. The paper achieves great results by just using a single-layer neural network as the classifier.



If you have more labels (for example if you’re an email service that tags emails with “spam”, “not spam”, “social”, and “promotion”), you just tweak the classifier network to have more output neurons that then pass through softmax.
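As a rough sketch of that idea (the four email labels here are purely illustrative), the classifier can be a single linear layer whose outputs pass through softmax:

import torch
import torch.nn as nn

labels = ["spam", "not spam", "social", "promotion"]
classifier = nn.Linear(768, len(labels))         # hidden_size -> number of labels

cls_vector = torch.randn(1, 768)                 # stand-in for the [CLS] output vector
probabilities = torch.softmax(classifier(cls_vector), dim=-1)
print(dict(zip(labels, probabilities[0].tolist())))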