Transformers XAI - Understand the basics of the state-of-the-art

If you ask what is a Transformer, in general, people will easily define a car that becomes a killer machine (or something like this).
However, in the NLP field a Transformer is not even close to any car, or any killer, nor anything like this.
In 2017, Google introduced to the world Attention is all you need, the paper that would not only change the meaning of what is a Transformer, but also revolutionize
and improve the results of natural language tasks until the moment.
The paper achieved never seen before results in the field, improving all the previous performances previously reached. It combined some of the best approaches until then: a seq-to-seq model, attention mechanisms and Encoder-Decoder architectures.
Up until that moment, the best approaches used a great variety of complex methods and architectures to achieve good results. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) like LSTMs, etc.
The attention mechanism allows the Neural Network to understand the sequence of words in a similar way humans do, permitting distinction between relevant and irrelevant words inside of it.
The Encoder-Decoder architecture alongside with the Multi-Head Attention implementation, allows enough parallelization to have a quality model in 12h with P100 GPUs.
The structure of the data and training methodology makes it suitable for different tasks like machine translation, language modelling, feature extraction and more! It also allows the possibility of adopting pretrained weights to easily fine-tune for your own purposes.
The attention mechanism, the positional encoding and the skip connections allow to know where in the sentence we are and not forget to what we have seen without overflowing the model or leading to exploding gradients due to long sentences.
All in all, Transformers are very powerful tools to develop NLP projects. Despite this, it is essential to understand the background and have a good
knowledge on how this works and runs internally.
In this project we provide an Explainable AI solution in order to comprehend the basis of the Transformers, so anyone who is interested in them can expand their notions
via visualization and interaction.
All the results and visualizations shown have been extracted using a real Transformer. The task we will focus on is translation, concretely from English to French. However,
the concepts explained are also suitable for other tasks.
The Transformer used is pretrained with the ACL 2014 ninth workshop
on statistical machine translation dataset and no fine-tuning has been applied. The details on the network and the implementation are described in Scaling Neural Machine Translation
and the source code is available in the transformer.wmt14.en-fr fairseq Github repository.
Before we begin, we will clarify some basic notions that will help us to understand how the Transformer works and how it achieves the results it does. To do so, we will cover up the most important mechanisms it applies to interpret the natural language the way humans do. All along, we will follow the steps by using analogies with the sentence My mum eats a carrot.
Seq-to-Seq (also known as Seq2Seq) is the abbreviation of sequence-to-sequence, which is the characteristic that indicates that a model works with sequences of data (rather than single units) and transforms them into another sequence in some other domain. In a machine translation context, this means that the algorithm works with a sequence of language units from the source language (English in our case) and translates the whole sequence into the target language (French).
\(\to\) See the difference between the word embeddings of different sentences and the subtraction of the sentences' embeddings
An embedding is the representation of some unit of the natural language in the form of a vector of real numbers.
For instance, an embedding could be a vector of 26 numbers where each of them represented the i-th letter of the alphabet and the quantity of the times such letter was found in a certain word.
However, embeddings tend to be more complex than this last example in which coin and icon would provide the same representation.
Recall that the vector represents the structure of the language we choose, not necessarily words: they could represent a letter, part of a word, a word or a whole sentence itself.
Although, the most common is for them to be a word or part of it.
\(\to\) Compare the differences between words and tokens with the same examples
In this Transformer architecture, each embedding represents a part of the word. The tokenization found depends on several aspects:
the algorithm of tokenization chosen, the frequency it has on the language and the encoding that is applied.
In addition, in this case, punctuation signs haver their own token (commas, apostrophes, etc) and a special one (<eos>) is used to indicate end of sentence (omitted in the plot). However, we
insist that this is characteristic of the chosen architecture and, thus, not a generalized or standard processing of the words.
One rapidly sees that the relationship between tokens and words is not linear, as it can be appreciated on the example.
The attention mechanism permits the network to identify which tokens are more related between them. In human terms, it is the
intuition that tells you that what is eaten is the carrot and who eats it is my mum. Mathematically, this is more abstract and harder to explain but
the base of all is the scaled dot-product attention and three specific vectors K, Q and V (shown on the right). We will dig into detail in Attention.
We will also see that the Transformer uses what is called multi-head attention and its types.
We have seen how the Neural Network identifies a word and how it relates their meanings, but how does it determine the order of the words?
Because obviously, saying 'My mum eats a carrot' is not the same as 'A carrot eats my mum'.
This is achieved thanks to what we call positional encoding, a vector of the same dimension as the embeddings that somehow contains information about the position
of the token within a sentence. Said vectors can be trained weights (optimized during training) or pre-defined functions that work well enough. In the Transformer,
the autors tried both methods using a mix of sinusoidal functions for the latter one, which they kept, as the results were pretty similar with both techniques.
Also known as skip connections or skip-layer connections, are those steps in which a vector is copied and ommits one ore more layers and then it is used at the output of a posterior layer along with the output of such layer itself. There, it can be concatenated or added.
They are useful to avoid vanishing gradient problems and they act in such a way that the network does not 'forget' the past as easily, allowing the model to perform better.
As we will see in the architecture, the Transformer counts with several of them and are important to achieve the results it has. For instance,
positional information does not get lost or vanished due to the residual connections.
As we have introduced, attention is, in simple words, the mechanism that stablishes the relevance of a token in relationship with another.
To do so, it is necessary to
define the three components used: key (K), query (Q) and values (V). These are the vectors that we combine (map) in such a way that produce an output (also a vector) whose values define the mentioned relationships.
The output is computed as a weighted sum of the values (V), where the weight assigned to each of them is computed by a compatibility function of the query (Q) with the corresponding key (K).
In other words, K and Q are used to define some weights that will be used against V to define the importance between tokens.
We have three blocks in the Transformer architecture that perform this evaluation, but first let's describe better what an Attention block does.
The attention itself is calculated as follows:
So each of the attention blocks perform this computation using the respective K, Q and V; which come from different sources according to the block they are found.
After performing the softmax, the result is scaled by \(d_k\) (the dimension of the embedding) and after it, it is multiplied by V. This way it is ensured that the vectors
we deal with are normalized at each step, a fact that is really helpful to avoid overflows and numeric problems.
Having all this, one is capable to generate the Attention matrix. One may have a look at the architecure of the Transformer, from which we could
extract three attention matrices (as we have three blocks that compute it).
Self-Attention
Masked Self-Attention
Encoder-Decoder Attention
Here we introduce real examples of the Self-Attention and Encoder-Decoder Attention matrices.
Up to this point, you might be wondering what exactly are heads in terms of Attention and what do they have to do in here.
Well, conceptually heads are not so willing to explain with an example because they are more like a helping 'feature' rather than a palpable thing itself.
Heads is the name assigned to the number of partitions we do to the input vectors (K, Q and V) in order to compute Attention separately. This allows the segmentation of the Attention
calculations and therefore the parallelization of this part of the model. Not only do we focus in smaller parts but also we permit them to be computed in a faster way,
hence a faster training.
So it is as simple as dividing the input vectors K, Q, and V into head parts. Then, they follow the same procedure as previously: a linear layer and the scale dot-product.
After it, we just concatenate the 'pieces' and pass through a last linear layer.
In the gif it can be observed the case where the number of heads is 3.
It's time to talk a little bit about the software part of the network. We are going to review, in general terms, which is the loss function, how the problem is presented and some details that might be taken into account when building and training such Neural Network.
The problem introduced is presented as a classification task in which the chosen 'class' will be the most probable word to be predicted. The loss function that is minimized is known as the Cross Entropy Loss and is defined as follows:
\( \displaystyle {L(y,\hat{y}) = - \sum_{k} y^{(k)} log\;( \hat{y}^{(k)} )} \)
Being \(k\) the total number of classes, \(y\) the target probabilities and \(\hat{y}\) the predicted probabilities. As can be seen, it is generalized to any number of classes one may desire. It is also very useful and suitable for classification problems, as it penalizes more the wrong predictions and rewards more the correct ones in comparison with what other losses would do.
We have already talked about masks. A mask is a special token (<MASK>) that replaces another token in a way that the Neural Network "does not take it into account". By this we mean that, if a token is replaced by it, the Transformer will have no information about it, so it will work as if it was not there. Masks are used during training in the first Decoder Attention Block as we have seen in the Architecture. Similarly, masks are applied in many tasks of NLP, for instance, when training a next-word-prediction network where the future words hidden by the mask so that the network does not cheat.
If you wanted to train a Transformer architecture by yourself you would have two options:
Train from scratch with your own data and with your own coded architecture (or one imported from some library).
This is a tricky one as you would need to take into account: what dimensions to choose, how to perform the data processing (tokenization, deal with case sensitivity, etc),
choose hyperparameters, optimizers and more. Let's be honest, coding from scratch is not really common unless you have a special interest on doing all that by yourself. It is more common to take
the models available in different libraries and train with your own data, however if you don't have a big amount of data the best option is the next one.
You can take a pretrained network and fine-tune.
One of the advantages of the Transformers is that they are suitable for many tasks and contexts, typically for NLP, even though they can be used outside the language processing field.
Here, we have used the Transformer to translate from English to French. Needless to say, having a proper dataset in any other language would serve to train a translator to said language.
Besides, translation is not the only task that this type of architecture can perform.
Due to the outstanding results it provided, many different architectures and varieties arised since the publication of the original paper in 2017 and there are many open-source libraries that provide the code, data and
architecure to be able to fine-tune your own network.
In this section we provide further detail of some of the concepts already explained, more insight on some structures or elements that are specific for this network or any curiosity or feature that we consider worth mentioning.
We have learned what an Encoder-Decoder architecture is. In addition to this, it is interesting to have a look at the so-called Auto-Encoder.
This type of network is pretty famous and used as it is a simple but effective unsupervised method.
It consists of a network that is trained to predict exactly the same as it receives as input (weird right?). The trick of it is in the bottleneck (latent data) that is smaller than the initial and final dimension.
This way we train an Autoencoder that compresses the most relevant information and it remains condensed in a lowered-dimension than the original.
It is really helpful for compression, image segmentation, face recognition and others.
Special tokens are also a helpful feature to take into account. We have introduced the end-of-sentence tokens used in the paper. However, it is also usual to find other tokens that do not directly reffer to a word: the beginning of sentence token <bos> (not used in this project), punctuation signs tokens: <apos>, <,>, <.>, etc; and even a special token for unseen words <unk>.
Byte Pair Encoding (or digram coding) is a simple form of data compression.
It consists in replacing the most common pair of consecutive bytes of the data with a byte that does not occur within that data.
In NLP, it is a technique that can be applied as part of the preprocessing of the data and can help to reduce, memory-wise, the size
of the sets (training, validation and testing) considerably. This way, the training is speeded up while the quality is not seen affected.
But be aware, when predicting new data (using the network already trained) the BPE must be applied too!
Label Smoothing is a regularization technique that introduces noise for the labels / target. By doing this, we prevent the network to work with absolute zeroes,
hence saving it from computational issues. It is an easy, yet really helpful, technique to help the model generalize instead of overfitting by following really
strict patterns which, in the end, are not found in real life or when predicting never-seen-before targets.
As a matter of fact, it is only used in classification problems that use the cross-entropy function.
As we already know, Positional Encoding is what provides the network with the order information within the words or tokens. Quoting the authors:
"Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence,
we must inject some information about the relative or absolute position of the tokens in the sequence."
However, what do sinusoidal functions have to do with order?
It is all related with the way bits change as they grow. If we observe the evolution of the bits when augmenting the numbers from one to five in binary, we observe the behaviour found in the right.
Somehow, every time we increase a number, this behaves at some level as a sinusoidal movement, here the reason why this type of function is appropiate for this task.
Therefore in the paper it is chosen a sine and a cosine function (shown below) that are alternated to simulate the change between numbers in a continuous form.
As this project intends to understand the overall ideas and intuitions behind the Transformer rather than digging into the details, we have avoided as much as possible
mentioning dimensions and specific numbers. Furthermore, they have many different factors that can affect them (the type of architecture, the structure of it, the way one may want to train the model,
the size of the data, the resources available, etc). However, there is one dimension that is interesting and useful in this type of task and does not always appear explained: the beam dimension.
We could define the beam dimension \((Be)\) as the hypothesis dimension we allow the model to investigate. As we already mentioned, the input of the Decoder is the last output and it can drag error.
That is why the network takes the top \(Be\) most probable tokens to be the translated, predicts each of the \(Be\) most probable next tokens and keeps, from the \(Be\) first hypothesized words, the one that provides less error. This is repeated at each step.
An example, if we want to translate our already known base sentence: My mum eats a carrot. We start the process, our beam dimension is \(Be = 2\), for the first word we hypothesize the translation to be ma and mon. The network will translate the following word taking into account
both articles: ma mère and mon mère and will conclude that, based on the error, ma is a better prediction.
This does not magically convert the network in the perfect translation machine, however it does lower the possibility of making a wrong prediction.
We have mentioned that after the Decoder, as latest step of the forward pass, we apply the softmax function. It is defined as:
\( softmax(x_{i}) = \displaystyle{\frac{\exp(x_i)}{\sum_j \exp(x_j)}}\)
The softmax function tries to replicate the behaviour of the maximum function. Why? Because this latter function is not differentiable and it would not be possible for the network to propagate the gradients to be optimized (plainly, we couldn't train the network as it would not be able to update the weights). For this reason a differentiable function that simulates the maximum is good enough, In addition, it is called soft as it still preserves a part of the non-maximum labels.Activation functions determine if a neuron will be activated according to the value it has received. If this happens, the weights connected to the mentioned neuron will be updated during the backpropagation. In this architecture we have seen the ReLU activation, which is the abbreviation for Rectified Linear Unit. This function activates the neuron in case the input received is higher than \(0\), that is:
\( ReLU(x) = max(0, x) \)
However there are lots of different activation functions, each of them more suitable for certain contexts and tasks. Between the most common ones, we can find the tangent activation function, sigmoid, linear... and many combinations or versions of it: Leaky ReLU, ELU, maxout, etc. The softmax function could also be included in this list.
The convolution is a type of operation usually performed in signal processing. It works with two functions: the main signal and the kernel signal and it expresses how the shape of one is modified by the other. Mathematically, it is written as:
\( y(t) = x(t) * h(t) = \displaystyle{ (x*h)(t):=\int _{-\infty }^{\infty }x(\tau )h(t-\tau )\,d\tau }\)
Being \(y(t)\) the convolution result, \(x(t)\) the signal and \(h(t)\) the kernel.
It has many uses and applications, even in Deep Learning. Long story short, Convolutional Neural Networks (CNNs) use the signal \(h(t) \)
as weights to be trained. Nowadays, they are one of the best networks to use in image and video recognition, image classification, etc. And as we have already
pointed out, the linear layer block weights act as a convolution of kernel size 1.
The Artificial Intelligence universe is wide and it will soon be infinite, if you are interested in the topic, here you can find a bunch of resources that cover a variety of related work. More state-of-the-art architectures, visualizing tools to understand them and social research upon ethics and bias inside NLP.