Back when GPT-3 came out, I wanted to understand how it works, so I read the papers and wrote this post:

https://dugas.ch/artificial_curiosity/GPT_architecture.html

I hoped it would be simple enough for anyone who knows a bit of math / algebra to understand. Note that it doesn't go into the difference between GPT-3 and ChatGPT (which adds an RL training objective, among other things).