Unit 3.2 - Transformer and LLM examples

Transformer implementation in PyTorch

For a well-developed and well-commented Transformer implementation, see this repo or this one.

In particular, this repo has a great example that you can train and run on your laptop very quickly.

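This example learns to add numbers written as a string of text. Below are results after training for less than 10 minutes on a 2023 Apple MacBook Pro. Early in training the model still makes mistakes ("gt" is the ground truth); by the end it scores 100% on both the train and test sets.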

GPT claims that 12 + 75 = 77 but gt is 87
GPT claims that 21 + 28 = 39 but gt is 49
GPT claims that 15 + 14 = 19 but gt is 29
GPT claims that 9 + 19 = 18 but gt is 28
GPT claims that 83 + 40 = 133 but gt is 123
test final score: 482/500 = 96.40% correct
...
iter_dt 12.52ms; iter 9490: train loss 0.05852
iter_dt 12.13ms; iter 9500: train loss 0.02480
train final score: 9500/9500 = 100.00% correct
test final score: 500/500 = 100.00% correct
...
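
To make "adding numbers as a string of text" concrete, here is a minimal sketch of how an addition problem might be turned into a sequence of digit tokens. The function names and the exact layout (zero-padded operands followed by a zero-padded sum) are assumptions for illustration; the encoding used in the repo may differ.

```python
def encode_addition(a, b, ndigit=2):
    """Render a + b = c as a flat list of digit tokens (0-9).

    Hypothetical layout: operands are zero-padded to ndigit digits,
    the sum to ndigit + 1 digits so there is room for a carry.
    """
    c = a + b
    text = f"{a:0{ndigit}d}{b:0{ndigit}d}{c:0{ndigit + 1}d}"
    return [int(ch) for ch in text]


def decode_sum(tokens, ndigit=2):
    """Read the sum back from the tokens that follow the two operands."""
    digits = tokens[2 * ndigit:]
    return int("".join(str(d) for d in digits))


tokens = encode_addition(12, 75)
print(tokens)              # [1, 2, 7, 5, 0, 8, 7]
print(decode_sum(tokens))  # 87
```

With this kind of layout the model sees the operand digits as context and has to predict the sum digits one token at a time, which is why a wrong answer such as 77 instead of 87 can differ in a single digit.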

Large Language Models - LLMs

What are Large Language Models, or LLMs? They are the core that powers ChatGPT, Gemini and many other modern AI tools (as of 2024).

As we have seen, a Transformer neural network is composed of an encoder and a decoder. The Transformer encoder is often used to encode entire sentences, which makes it useful for turning language into embeddings. A Transformer decoder, on the other hand, is capable of producing language, and is thus often referred to as a “language model”.
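
One practical difference between the two halves is which positions each token is allowed to attend to. The sketch below is plain Python, not PyTorch, and the helper names are made up for illustration. It prints the attention pattern for an encoder-style (bidirectional) block and a decoder-style (causal) block.

```python
def attention_mask(n_tokens, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n_tokens)]
            for i in range(n_tokens)]


def show(mask):
    """Draw the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print("encoder-style (bidirectional):")
print(show(attention_mask(4, causal=False)))
print("decoder-style (causal):")
print(show(attention_mask(4, causal=True)))
```

The causal pattern is lower-triangular: each token only sees itself and earlier tokens, which is what lets a decoder generate text one token at a time.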

Scaling the Transformer decoder to many more parameters than in the original Transformer paper gave rise to “GPT”, or Generative Pre-trained Transformer. These models, including GPT-2 and GPT-3, are decoder-only models pretrained on large-scale unlabeled text data. They are trained to predict the next word (token) given the preceding sequence of words (tokens).
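
The next-token objective can be illustrated without any neural network. In the sketch below, a simple bigram count table stands in for the model; this is an assumption made to keep the example tiny, since a real GPT conditions on all preceding tokens rather than only the last one. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def next_token_pairs(tokens):
    """Pair every token with the token that follows it (input, target)."""
    return list(zip(tokens[:-1], tokens[1:]))


def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in next_token_pairs(tokens):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, n_new):
    """Greedy generation: repeatedly append the most frequent follower."""
    out = [start]
    for _ in range(n_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this token
        out.append(followers.most_common(1)[0][0])
    return out


corpus = "the cat sat on the mat and the cat sat down".split()
model = fit_bigram(corpus)
print(next_token_pairs(corpus)[:3])  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on')]
print(generate(model, "the", 2))     # ['the', 'cat', 'sat']
```

Training a GPT follows the same pattern of shifted input/target pairs; the count table is replaced by a Transformer decoder that outputs a probability for every token in the vocabulary.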

These models were eventually scaled up to hundreds of billions of parameters (GPT-3 has 175 billion), and reportedly to the order of a trillion parameters for GPT-4 and beyond. They are the core that powers the LLM revolution of the last few years.
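
To get a feel for these sizes, a common back-of-the-envelope estimate counts about 12 * d_model^2 weights per Transformer block (roughly 4 * d_model^2 for attention and 8 * d_model^2 for a feed-forward layer with 4x expansion), plus the embedding tables. The sketch below applies that approximation, which ignores biases and layer norms, to the published GPT-2 small and GPT-3 configurations; the function name is made up for illustration.

```python
def estimate_params(n_layer, d_model, vocab_size, context_len):
    """Rough parameter count for a decoder-only Transformer.

    Approximation only: ignores biases and layer-norm weights and
    assumes a feed-forward hidden size of 4 * d_model.
    """
    per_block = 12 * d_model ** 2              # attention + feed-forward
    embeddings = (vocab_size + context_len) * d_model
    return n_layer * per_block + embeddings


gpt2_small = estimate_params(n_layer=12, d_model=768,
                             vocab_size=50257, context_len=1024)
gpt3 = estimate_params(n_layer=96, d_model=12288,
                       vocab_size=50257, context_len=2048)

print(f"GPT-2 small: ~{gpt2_small / 1e6:.0f}M parameters")
print(f"GPT-3:       ~{gpt3 / 1e9:.0f}B parameters")
```

The estimates land close to the commonly quoted figures of about 124 million and 175 billion parameters, and they show that almost all of the growth comes from the depth and width of the Transformer blocks rather than from the embeddings.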

LLM visualization

See this interesting LLM visualization.

Tokenization examples

Learn how LLMs encode sentences with this tokenizer tool.
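
As a rough intuition for what a tokenizer does, the sketch below maps text to a list of integer ids and back. It is a toy character-level tokenizer with hypothetical function names; real LLM tokenizers typically use subword schemes such as byte-pair encoding, so the ids shown by an actual tool will look different.

```python
def build_vocab(text):
    """Assign an integer id to every distinct character, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text, vocab):
    """Turn a string into a list of token ids."""
    return [vocab[ch] for ch in text]


def decode(ids, vocab):
    """Turn a list of token ids back into a string."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


text = "hello world"
vocab = build_vocab(text)
ids = encode(text, vocab)
print(ids)                 # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decode(ids, vocab))  # hello world
```

A subword tokenizer works the same way at the interface level (text in, ids out), but its vocabulary contains frequent character sequences, so common words usually become one or two tokens instead of one token per character.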