Visualization is all you need

Transformers XAI - Understand the basics of the state-of-the-art

Transformers?

Ask most people what a Transformer is and they will describe a car that turns into a killer machine (or something like it). In the NLP field, however, a Transformer has nothing to do with cars, killers or anything of the sort.
In 2017, Google introduced Attention is all you need to the world, the paper that would not only change what the word Transformer means, but also revolutionize natural language tasks and improve on the best results obtained up to that moment.

Why?

State-of-the-art

The paper achieved results never seen before in the field, surpassing the best performance previously reached. It combined some of the best approaches available at the time: a seq-to-seq model, attention mechanisms and an Encoder-Decoder architecture.

Simplicity

Up until that moment, the best approaches relied on a great variety of complex methods and architectures to achieve good results, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) like LSTMs. The Transformer dispenses with recurrence and convolutions and relies on attention alone.

Attention

The attention mechanism allows the Neural Network to understand a sequence of words in a way similar to how humans do, letting it distinguish between the relevant and irrelevant words in the sequence.

Velocity

The Encoder-Decoder architecture, together with the Multi-Head Attention implementation, allows enough parallelization to train a quality model in about 12 hours on 8 P100 GPUs.

Adaptability

The structure of the data and the training methodology make it suitable for different tasks such as machine translation, language modelling, feature extraction and more! It is also possible to adopt pretrained weights and easily fine-tune them for your own purposes.

An eye on the past

The attention mechanism, the positional encoding and the skip connections let the model know where in the sentence it is and remember what it has already seen, without overloading the model or causing exploding gradients on long sentences.

How?

All in all, Transformers are very powerful tools for developing NLP projects. Even so, it is essential to understand the background and to have a good knowledge of how they work internally.
In this project we provide an Explainable AI solution for understanding the basics of Transformers, so that anyone interested in them can expand their knowledge through visualization and interaction.

All the results and visualizations shown have been extracted using a real Transformer. The task we will focus on is translation, specifically from English to French. However, the concepts explained also apply to other tasks.
The Transformer used is pretrained on the dataset of the ACL 2014 Ninth Workshop on Statistical Machine Translation (WMT14), and no fine-tuning has been applied. The details of the network and its implementation are described in Scaling Neural Machine Translation, and the source code is available as transformer.wmt14.en-fr in the fairseq GitHub repository.

Basic Notions

Before we begin, we will clarify some basic notions that will help us understand how the Transformer works and how it achieves its results. To do so, we will cover the most important mechanisms it applies to interpret natural language the way humans do. Throughout, we will illustrate each step with the sentence My mum eats a carrot.

Seq-to-Seq

Seq-to-Seq (also known as Seq2Seq) is short for sequence-to-sequence: a model that works with sequences of data (rather than single units) and transforms them into another sequence in some other domain. In a machine translation context, this means that the algorithm takes a sequence of language units in the source language (English in our case) and translates the whole sequence into the target language (French).

Seq-to-One vs. Seq-to-Seq

\(\to\) See the difference between the word embeddings of different sentences and the subtraction of the sentences' embeddings

First 150 values of real embeddings of dimension 1024

Embeddings

An embedding is the representation of some unit of the natural language in the form of a vector of real numbers.
For instance, an embedding could be a vector of 26 numbers in which the i-th number counts how many times the i-th letter of the alphabet appears in a certain word. However, embeddings tend to be more complex than this example, in which coin and icon would get the same representation. Recall that the vector represents whichever unit of the language we choose, not necessarily words: it could represent a letter, part of a word, a word or even a whole sentence. Most commonly, though, the unit is a word or part of one.
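
The letter-count example can be sketched in a few lines of Python. This is only the toy embedding described above, not what the Transformer uses; the function name `letter_count_embedding` is our own.

```python
from collections import Counter
import string

def letter_count_embedding(word: str) -> list[int]:
    """Toy embedding: entry i counts how often the i-th letter of the alphabet occurs."""
    counts = Counter(word.lower())
    return [counts[letter] for letter in string.ascii_lowercase]

coin = letter_count_embedding("coin")
icon = letter_count_embedding("icon")

# Both words contain the same letters, so this embedding cannot tell them apart.
print(coin == icon)
```

Because the vector ignores letter order, any two anagrams collapse to the same point, which is one reason real embeddings are learned rather than hand-crafted like this.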

\(\to\) Compare the differences between words and tokens with the same examples

First 150 values of real embeddings of dimension 1024

Tokens

In this Transformer architecture, each embedding represents a part of a word. The resulting tokenization depends on several aspects: the tokenization algorithm chosen, how frequent each piece is in the language and the encoding that is applied.
In addition, in this case, punctuation marks (commas, apostrophes, etc.) have their own tokens, and a special token (<eos>) marks the end of the sentence (omitted in the plot). However, we stress that this is specific to the chosen architecture and is therefore not a general or standard way of processing words.
One quickly sees that the relationship between tokens and words is not one-to-one, as the example shows.

Attention

The attention mechanism lets the network identify which tokens are most closely related to each other. In human terms, it is the intuition that tells you that what is eaten is the carrot and that the one eating it is my mum. Mathematically, this is more abstract and harder to explain, but it all rests on scaled dot-product attention and three specific vectors, K, Q and V (shown on the right). We will go into detail in Attention.
We will also see that the Transformer uses what is called multi-head attention, and which types of it appear.
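
As a rough illustration of scaled dot-product attention, the NumPy sketch below computes softmax(QK^T / sqrt(d_k)) V for a toy sequence. The sizes, the random vectors and the choice Q = K = V are assumptions made for brevity; in the real model, Q, K and V come from learned linear projections of the embeddings.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Return the attended values and the attention weights."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

# Toy stand-ins for the 5 tokens of "My mum eats a carrot" (random, not real embeddings).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))

output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(weights.shape)  # one row of weights per token, one column per token attended to
```

Row i of `weights` says how much token i draws from every other token when building its new representation.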

Scaled dot-product attention

Sinusoidal functions are usually used for positional encoding

Positional encoding

We have seen how the Neural Network identifies words and how it relates their meanings, but how does it determine their order? Obviously, 'My mum eats a carrot' is not the same as 'A carrot eats my mum'.
This is achieved through what we call positional encoding: a vector of the same dimension as the embeddings that carries information about the position of the token within the sentence. These vectors can be trained weights (optimized during training) or pre-defined functions that work well enough. In the Transformer paper, the authors tried both methods, using a mix of sinusoidal functions for the latter. They kept the sinusoidal version, as both techniques gave very similar results.

Residual connections

Also known as skip connections or skip-layer connections, these are steps in which a vector is copied, skips one or more layers, and is then combined with the output of a later layer. There, it can be concatenated or added.
They help avoid vanishing gradient problems and keep the network from 'forgetting' the past as easily, allowing the model to perform better. As we will see in the architecture, the Transformer has several of them, and they are important for the results it achieves. For instance, thanks to the residual connections, positional information does not get lost as it passes through the layers.
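
A minimal sketch of both ways of combining the skipped vector with a layer's output. The helper `with_residual` and the deliberately destructive `zero_layer` are hypothetical names used only for this illustration.

```python
import numpy as np

def with_residual(x: np.ndarray, layer, mode: str = "add") -> np.ndarray:
    """Combine the input with the layer output, either by addition or by concatenation."""
    out = layer(x)
    if mode == "add":
        return x + out
    if mode == "concat":
        return np.concatenate([x, out], axis=-1)
    raise ValueError(f"unknown mode: {mode}")

def zero_layer(v: np.ndarray) -> np.ndarray:
    # An extreme case: a layer that throws away everything it receives.
    return np.zeros_like(v)

x = np.array([[1.0, 2.0, 3.0]])
added = with_residual(x, zero_layer, mode="add")
concatenated = with_residual(x, zero_layer, mode="concat")
print(added, concatenated.shape)
```

Even though the layer erases its input, the skip path carries the original vector through unchanged, which is the sense in which the network does not 'forget'.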

Diagram of skip connections

Architecture

Encoder-Decoder

This type of architecture is a Neural Network that transforms the data using two modules: the Encoder and the Decoder.
The first one transforms the original data into an internal representation, which is sometimes referred to as latent data or bottleneck in the literature. The second one takes these representations and transforms them into the desired target.
For translation, the Encoder works with the source language and the Decoder outputs the translation in the target language. If the latent representations were generic enough, one could translate into any language simply by swapping in a Decoder trained for it.


Input set up

The input received by both parts, the Encoder and the Decoder, is a sequence of embeddings, each representing a token (recall, a part of a word). As this type of Neural Network is not recurrent (recurrent networks do have a notion of temporality), some specific weights are added to each embedding to encode the position of its token within the sentence: the positional encoding.
For example, we feed the English embeddings of the sentence into the Encoder. Into the Decoder, we feed the embeddings of the translation shifted one position to the right. In other words: the French tokens predicted so far.
To summarise, for each sentence we generate the tokens, add the positional information and, then, they are ready to enter the Encoder or the Decoder.
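
The right shift of the Decoder input can be sketched as follows. The word-level French tokens are an illustrative translation written by hand (not the model's output or its real tokenization), and the `<bos>` start symbol is an assumption; implementations differ in which symbol they place first.

```python
def shift_right(target_tokens: list[str], start_token: str = "<bos>") -> list[str]:
    """Prepend a start symbol and drop the last token, keeping the length unchanged."""
    return [start_token] + target_tokens[:-1]

# Hypothetical word-level tokens for the translation of "My mum eats a carrot".
target = ["Ma", "maman", "mange", "une", "carotte", "<eos>"]
decoder_input = shift_right(target)

# At step t the Decoder sees decoder_input[: t + 1] and has to predict target[t].
for t, expected in enumerate(target):
    print(decoder_input[: t + 1], "->", expected)
```

The shift guarantees that the token to be predicted at a given step is never part of the input for that same step.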

Positional Encoding

The weights of the positional encoding can be trained or determined by a function, typically a sinusoidal one. In the paper, said function is the following:

\(PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})\)
\(PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})\)

where \(pos\) is the position of the token in the sentence, \(d_{model}\) is the dimension of the embeddings and \(i\) indexes the pairs of dimensions within the embedding (dimension \(2i\) uses the sine and dimension \(2i+1\) the cosine).
In other words, in this particular case, we add to each embedding weights determined by alternating a sine and a cosine function (shown on the right), whose purpose is to state the position at which a certain token is found.
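
The two formulas translate almost directly into NumPy. This is a sketch that assumes an even \(d_{model}\); the function name is ours.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix; row pos is added to the embedding at pos."""
    positions = np.arange(num_positions)[:, None]   # pos, as a column
    pair_index = np.arange(d_model // 2)[None, :]   # i, as a row
    angles = positions / np.power(10000.0, 2 * pair_index / d_model)

    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i + 1
    return pe

pe = sinusoidal_positional_encoding(num_positions=6, d_model=1024)
print(pe.shape)
```

Every position gets a distinct pattern: low values of \(i\) oscillate quickly from one position to the next, while high values of \(i\) change very slowly.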

Encoder

Once the input embeddings are prepared, the source language embeddings enter the Encoder, which transforms them into a latent representation that will eventually be passed to the Decoder.
It is worth mentioning that, unlike Recurrent Neural Networks, the Encoder does not read the tokens one after another, so the computation itself carries no information about their order. This is why positional encoding is needed.
Both modules mainly contain two kinds of blocks: Feed Forward and Attention blocks. In addition, they also have residual connections and normalization steps. All of them are repeated in a stack of N layers (with N=6). That is, we have 6 times what we see in the diagrams.
The Encoder is the simpler of the two modules.

Feed forward blocks

As we mentioned, each module of the architecture contains two main blocks. The first one is the Feed Forward block: a fully-connected Feed Forward Network that is applied to each position separately and identically. It consists of two linear layers with a ReLU activation in between, and its operations are equivalent to two convolutions with kernel size 1.

\(\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\)
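
A small NumPy sketch of this formula, with toy dimensions and random weights chosen only for illustration. It also checks the "separately and identically" property: running the whole sequence at once gives the same result as running a single position on its own.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently to every row of x."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # first linear layer followed by ReLU
    return hidden @ w2 + b2                 # second linear layer

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32           # toy sizes, not the ones used by the real model
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

whole_sequence = feed_forward(x, w1, b1, w2, b2)
single_position = feed_forward(x[2:3], w1, b1, w2, b2)
print(np.allclose(whole_sequence[2], single_position[0]))
```

No position looks at any other position inside this block; mixing information across tokens is left entirely to the Attention blocks.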

After each block, both in the Encoder and the Decoder, a skip connection is added and a normalization layer is applied. This helps the network 'remember' the past and keeps the values inside the embeddings small.
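
The "add and normalize" step can be sketched as LayerNorm(x + Sublayer(x)). The version below is simplified: it omits the learned gain and bias that layer normalization usually has, and the `sublayer` passed in is just a placeholder function.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each embedding (row) to zero mean and unit variance. No learned parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x: np.ndarray, sublayer) -> np.ndarray:
    """Skip connection followed by normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8)) * 50.0           # deliberately large values
normalized = add_and_norm(x, lambda v: 2.0 * v)   # placeholder sublayer
print(normalized.mean(axis=-1).round(6), normalized.std(axis=-1).round(3))
```

Whatever the scale of the incoming values, each embedding leaves this step with roughly zero mean and unit variance.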

Attention blocks

The second main component inside the modules is the Attention block. In this paper, a simple attention mechanism (scaled dot-product) is combined with what are called heads. This combination of techniques results in what is known as Multi-Head Attention.
Each Attention block consists of a linear layer, a scaled dot-product attention layer, a concatenation step and a final linear layer. One of the blocks (the first one in the Decoder) is referred to as Masked Multi-Head Attention. When predicting a given position, the tokens that follow it are not yet available, so during training a mask hides them.
We will provide more detail in Attention.
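
The four steps of an Attention block (linear layer, scaled dot-product attention, concatenation, final linear layer) can be sketched in NumPy as below, with an optional mask that hides later positions. Sizes and weights are toy values, biases are left out, and the function and parameter names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, w_o, num_heads, causal=False):
    seq_q, d_model = x_q.shape
    seq_k = x_kv.shape[0]
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(x):
        # (seq, d_model) -> (num_heads, seq, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    # 1) Linear layers, then one slice of the result per head.
    q = split_heads(x_q @ w_q)
    k = split_heads(x_kv @ w_k)
    v = split_heads(x_kv @ w_v)

    # 2) Scaled dot-product attention inside every head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_q, seq_k)
    if causal:
        # Position i may only look at positions 0..i.
        allowed = np.tril(np.ones((seq_q, seq_k), dtype=bool))
        scores = np.where(allowed, scores, -1e9)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (num_heads, seq_q, d_head)

    # 3) Concatenate the heads and 4) apply the final linear layer.
    concat = heads.transpose(1, 0, 2).reshape(seq_q, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4  # toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

output, weights = multi_head_attention(x, x, w_q, w_k, w_v, w_o, num_heads, causal=True)
print(output.shape, weights.shape)
```

With `causal=True` the first token can only attend to itself, the second to the first two, and so on, which mirrors what the Decoder can actually see when it generates a translation token by token.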

Decoder

The Decoder architecture is quite similar to that of the Encoder. In fact, we could say that, after the first Attention block and its normalization, the architecture is basically the same.
The most noticeable differences are the inputs to its Attention blocks.
The first block (at the beginning of the Decoder) receives the target language embeddings shifted right (as already mentioned, the tokens predicted so far).
The second block (in the middle of the module) receives the Encoder output together with the Decoder representations computed up to that step.
Here we also have two kinds of Attention: the first is Masked Attention and the second is Encoder-Decoder Attention.
Notice that after the Decoder module there are two final layers, a linear layer and a softmax layer, which are in charge of choosing the most probable next token of the translation.
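
A sketch of these two final layers for a single decoding step. The six-entry vocabulary, the random weights and the greedy `argmax` choice are all assumptions made for illustration; a real model has a vocabulary of many thousands of tokens and may use a different search strategy.

```python
import numpy as np

def next_token_probabilities(decoder_output, w_vocab, b_vocab):
    """Linear layer to vocabulary size followed by a softmax."""
    logits = decoder_output @ w_vocab + b_vocab
    logits = logits - logits.max()          # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

vocab = ["Ma", "maman", "mange", "une", "carotte", "<eos>"]  # hypothetical toy vocabulary
rng = np.random.default_rng(0)
d_model = 8
decoder_output = rng.normal(size=d_model)   # stand-in for the Decoder output at one step
w_vocab = rng.normal(size=(d_model, len(vocab)))
b_vocab = np.zeros(len(vocab))

probs = next_token_probabilities(decoder_output, w_vocab, b_vocab)
predicted = vocab[int(np.argmax(probs))]    # greedy choice: the most probable token
print(predicted, probs.round(3))
```

The chosen token is then appended to the Decoder input and the process repeats until the end-of-sentence token is produced.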

Full schema