unit 3.1 - Transformer Network

Here we describe the network architecture of the Transformer, as presented in the original paper

Building up on the attention module, the Transformer core block is the following.

This core block is composed by a Multi-headed Attention (MHA) plus normalization (Norm) layer. Note that th MHA module has 3 inputs: Query, Keys and Value, just like in the single-headed attention module. The normalization also takes a bypass version of the input, similarly to a residual network. This allows some inputs to bypass some of the Transformer processing. Next there is an identical module with a Multi-layer Perceptron (MLP, here 2 linear layers) with identical normalization and bypass layers.

This entire block is repeated multiple times in a Transformer. 6 times in the original paper, but 12 or 24 or more in large LLMs today (year 2024).

Here we show the Encoder Network, the first portion of the original Transformer network. Some LLMs today are called “encoder only” and this is the form they take.

The encoder is basically the core Transformer module we have seen above, but with appropriately conditioned inputs. Inputs in this kind of neural networks are sequences. As you know sequences imply time and the order of the sequence inputs matter. But the Transformer core has no way of keeping track of individual inputs,a s they get all mixed-up in the processing layers. As such the Transformer uses a particular input conditioning called a Positional Encoding (Pos Enc), where sinusoidal signal are added to the input sequence to help recall individual inputs positions. See details in the original paper and in the code implementation.

The Encoder network is able to take a sequence, say a sentence, and embed it into a “code”. This kind of networks can for example be used as sentence encoders for translation of language (like in the original paper), or sentiment analysis.

Below you will see the full Transformer network, both encoder and decoder. Remember the original network was designed for language translation, and as such it uses an Encoder to simultaneously compress and analyze the input sentence, and a decoder that generates each output translated word one at a time. The Encoder processes information in parallel on the entire input sequence. The Decoder is instead auto-regressive, it outputs one item at a time, and each output is input to each of processing steps.

Note how the Encoder output is fed to the Decoder. It is fed as Query and Key for the Decoder MHA module. The Output symbols, or words or token are encoded with the Positional Encoder in the same ways as the inputs. The decoder uses 2x MHA modules, one to process the Outputs, and one to add the input embeddings.

The Outputs of the Decoder are categorical (one of the words or tokens) and are used to feed back as inputs to the Decoder.

For example when translating a sentence of 4 tokens the Decoder “Outputs” fed to the Decoder (as inputs… I know confusing!) are:

[token_out_1, 0, 0, 0]

[token_out_1, token_out_2, 0, 0]

[token_out_1, token_out_2, token_out_3, 0]

etc.

For a complete example of how the Transformer processes and learns, see this tutorial.