BERT
The year 2018 has been an inflection point for machine learning models handling text (or, more accurately, Natural Language Processing, NLP for short). Our conceptual understanding of how to represent words and sentences in a way that captures their underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines.
One of the latest milestones in this development is the release of BERT, an event described as marking the beginning of a new era in NLP. BERT is a model that broke several records for how well models can handle language-based tasks. Soon after the release of the paper describing the model, the team also open-sourced its code and made available for download versions of the model already pre-trained on massive datasets. This is a momentous development, since it enables anyone building a machine learning model involving language processing to use this powerhouse as a readily available component, saving the time, energy, knowledge, and resources that would have gone into training a language-processing model from scratch.
Figure: The two steps of how BERT is developed. You can download the model pre-trained in step 1 (trained on un-annotated data), and only worry about fine-tuning it for step 2.
There are a number of concepts one needs to be aware of to properly wrap one’s head around what BERT is. So let’s start by looking at ways you can use BERT before looking at the concepts involved in the model itself.
Example: Sentence Classification
The most straightforward way to use BERT is to classify a single piece of text, for example an email spam classifier. Such a model would look like this:
To train such a model, you mainly have to train the classifier, with minimal changes happening to the BERT model during the training phase. This training process is called Fine-Tuning, and has roots in Semi-supervised Sequence Learning.
For people not versed in the topic: since we're talking about classifiers, we are in the supervised-learning domain of machine learning, which means we need a labeled dataset to train such a model. For this spam classifier example, the labeled dataset would be a list of email messages, each with a label ("spam" or "not spam").
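To make the idea concrete, here is a minimal sketch of that setup: a single logistic-regression "classifier" layer trained on top of frozen sentence vectors. In the real pipeline the 768-dimensional vector for each email would come from pre-trained BERT; the `sentence_vector` function below is a deliberately crude hashed bag-of-words stand-in so the sketch runs without downloading the model, and the tiny four-email dataset is invented for illustration.

```python
import hashlib
import numpy as np

def sentence_vector(text, dim=768):
    """Placeholder for BERT's sentence representation: a hashed
    bag-of-words vector of the same dimensionality (768)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

# Toy labeled dataset: 1 = spam, 0 = not spam.
emails = [
    ("win a free prize now", 1),
    ("claim your free money", 1),
    ("meeting moved to friday", 0),
    ("lunch at noon tomorrow", 0),
]

X = np.stack([sentence_vector(text) for text, _ in emails])
y = np.array([label for _, label in emails], dtype=float)

# The classifier layer: logistic regression trained by plain
# gradient descent on the cross-entropy loss, while the sentence
# vectors themselves stay fixed.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=X.shape[1])
b = 0.0
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted spam probability
    grad = p - y                            # gradient of cross-entropy
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds.tolist())
```

In actual fine-tuning, gradients would also flow into BERT's own weights (the "minimal changes" mentioned above), but the overall shape is the same: a pre-trained encoder producing a sentence vector, plus a small classifier head trained on your labeled data.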
Other examples for such a use-case include:
Sentiment analysis
Input: Movie/Product review. Output: is the review positive or negative? Example dataset: https://nlp.stanford.edu/sentiment/
Fact-checking
Input: sentence. Output: "Claim" or "Not Claim"
A more ambitious/futuristic example:
Input: claim sentence. Output: "True" or "False"
Full Fact is an organization building automatic fact-checking tools for the benefit of the public. Part of their pipeline is a classifier that reads news articles and detects claims (classifying text as either "claim" or "not claim"), which can later be fact-checked (by humans for now, hopefully by ML later).