GPT-3's family tree

Photo by Daniel K Cheung / Unsplash

Until GPT-3 came along, I took machine learning for granted. When I heard Elon Musk say that AI was an existential risk for humanity I didn't take it seriously. If you've played around with machine learning, you saw how bound it was to the training data. How could it get to beyond-human reasoning capabilities?

GPT-3 showed me that, even if the path to general AI is still unclear, you can't take AI for granted anymore. So I decided to try to get a bit more of an understanding of machine learning beyond "neural networks are inspired by brains" and "backpropagation is weird." In order to force myself to have a more thorough understanding of the area, I've decided to write weekly about what I'm learning about AI research.

The question I wanted to answer this week was: if neural networks have been the state of the art for a while, why didn't GPT-3 emerge in 2010 or 2015? Why did it take until 2020 for something like that to come into existence? It turns out, the answer isn't just that no one was investing enough in AI, or that people just never thought to do this. There are specific technological advances in neural networks over the past ten years that have made GPT-3 possible.

This is intended for a general audience, but is hopefully technical enough that someone with a CS background would learn something from it.

Here's what you need to know before reading this: Neural networks are the standard in AI. Neural networks are really powerful, but until recently they could only perform well on specific tasks they were trained for. However, recently, Google, OpenAI, and Microsoft have been working on giant natural language processing models that are really good and make people argue about whether they can really think. (In my opinion, even the fact that people are arguing about this is a big deal, even if it's all really a show.) They built them by throwing more data and more processing power at the problem than anyone ever had before.

I have a thin background in AI, so I may have made mistakes or mischaracterized things here. If I did, let me know and I'll fix it! Also, my goal was to make this accessible enough to be intelligible to a non-programmer.

The GPT family tree

The basics of neural networks (1958-1980s)

I'm going to introduce some basic concepts in neural networks and machine learning that will help you understand why GPT-3 was a big deal. Feel free to skip if you're familiar with it.

So, what are neural networks?

Neural networks are the technology on which the most advanced machine learning is built. As their name suggests, they were initially inspired by neurons in the brain, which form a network. Like a neuron, each node is connected to some input and some output, and it will send a signal to the output only if it gets enough signals from its inputs. The first neural network was called the Perceptron and was built in 1958. It took visual data as input to the neurons in the form of black and white pixels. If the pixel was black, it counted as no data coming into the neuron. If the pixel was white, it had data. Then, the strength of each connection, which determined when each neuron would fire based on its input data, was tuned using a simple learning rule. It could do simple vision tasks - does this paper have the number 1 or the number 2 on it? - but the big limitation was that it only had one layer of neurons. That meant that it could say "this is a one if this pixel or this pixel is highlighted" - but it could never say "this is a one if exactly one of these two pixels is highlighted, but not both." This, among other problems, meant that the Perceptron was never going to be able to do any complicated tasks.
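To make the one-layer limitation concrete, here's a minimal sketch of a single threshold neuron with two "pixel" inputs. It uses the textbook perceptron learning rule, which is my assumption for illustration and not a reconstruction of the 1958 machine; the function names and the learning rate are made up. The neuron learns "either pixel is on" (OR) perfectly, but no setting of its weights can get "exactly one pixel is on" (XOR) fully right.

```python
from itertools import product


def fire(weights, bias, inputs):
    """A threshold neuron: outputs 1 only if the weighted input sum is above zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=50, lr=1.0):
    """Textbook perceptron rule: after each mistake, nudge the weights toward the right answer."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - fire(weights, bias, inputs)  # -1, 0, or +1
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


def accuracy(examples, weights, bias):
    """Fraction of examples the neuron classifies correctly."""
    hits = sum(fire(weights, bias, x) == t for x, t in examples)
    return hits / len(examples)


# All four combinations of two pixels (0 = black, 1 = white).
pixels = list(product([0, 1], repeat=2))
OR_DATA = [(p, int(p[0] or p[1])) for p in pixels]
XOR_DATA = [(p, int(p[0] != p[1])) for p in pixels]

or_acc = accuracy(OR_DATA, *train_perceptron(OR_DATA))
xor_acc = accuracy(XOR_DATA, *train_perceptron(XOR_DATA))
print("OR accuracy: ", or_acc)
print("XOR accuracy:", xor_acc)
```

Adding a second layer of neurons between the pixels and the output is what removes this limit, which is why layers matter so much in what follows.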

Modern neural networks have many layers, and each layer detects something a bit more complicated than the last. The first layer might just detect colors, while the next detects lines, and the one after that detects simple shapes. If you add enough layers, the network can detect complex patterns or objects - eventually building up to something that can detect objects as well as humans can.

You're welcome for finding this on Wikipedia for you

Neural networks have two modes of operation: execution and training

Execution is when the neurons "fire" based on the input data to do the task you need it to do - for example, identifying whether an image is of a cat or a dog. Training is when you give it a bunch of examples of cats and dogs to learn how to identify the two.

Before training, a neural network is a grid of random numbers. If you try to execute it, it guesses randomly out of all the options available to it based on those numbers. When you train a neural network to recognize letters in an image and it guesses "This is an A" but the training data says "no, this is a B," the neural network will tune its own weights based on what it learned. It says, "well, it looks like I need to adjust my knobs away from the neurons that told me it was an A and toward the neurons that told me it was a B." Maybe a neuron that detects a straight line on the right side of the image will become more heavily weighted toward A, and a neuron that detects a line at the bottom of the image will become more heavily weighted toward B. Remember - to start out with, those guesses came from random weights. But after training, they start to give useful results.
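Here's a rough sketch of that knob-turning for a single output neuron that decides between A and B. The two input features ("line on the right" and "line at the bottom"), the learning rate, and the number of steps are all made up for illustration, and the update rule is plain gradient descent on a logistic neuron, which is one common choice and not necessarily what any particular system uses.

```python
import math
import random


def prob_b(weights, bias, features):
    """The neuron's confidence (0 to 1) that the letter is a B."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def random_knobs(seed=0):
    """Before training, the knobs are just random numbers."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(2)], rng.uniform(-1, 1)


def train(examples, steps=200, lr=0.5, seed=0):
    """Repeatedly compare the guess to the label and nudge each knob to shrink the error."""
    weights, bias = random_knobs(seed)
    for _ in range(steps):
        for features, is_b in examples:
            error = prob_b(weights, bias, features) - is_b  # positive = leaned too far toward B
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Hypothetical features: [line_on_right, line_at_bottom]; label 0 = A, 1 = B.
LETTER_A = [1.0, 0.0]
LETTER_B = [0.0, 1.0]
EXAMPLES = [(LETTER_A, 0), (LETTER_B, 1)]

untrained_weights, untrained_bias = random_knobs()
trained_weights, trained_bias = train(EXAMPLES)

print("untrained P(B | looks like A):", prob_b(untrained_weights, untrained_bias, LETTER_A))
print("trained   P(B | looks like A):", prob_b(trained_weights, trained_bias, LETTER_A))
print("trained   P(B | looks like B):", prob_b(trained_weights, trained_bias, LETTER_B))
```

After training, the weight on the "line on the right" feature has been pushed toward A and the weight on the "line at the bottom" feature toward B, exactly the kind of adjustment described above.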

This, times a hundred billion, is how GPT-3 works too.

Below, you can see a visualization of what several layers of a neural network might detect. On the left, it detects simple lines and colors, which can be turned into patterns, which eventually match more complex shapes like parts of wheels and windows.

An image from all over the Internet that I found at

Supervised vs unsupervised learning

There are two major modes of learning: supervised and unsupervised learning. Supervised learning usually means you give a model specific examples to learn from. For example, computer vision is usually supervised. You show the model a picture of an apple with the word "apple." You show the model a picture of a banana with the word "banana." Supervised learning is easier to understand and debug, but it takes a lot of work to get good training data.

Unsupervised learning doesn't require a label for every example. There are many types of learning that work without labels, but a common one is "reinforcement learning" (which researchers usually count as its own category rather than as a kind of unsupervised learning). This is where you set out rules about what is good and what is bad and let the computer learn from those rules. A common example is training computers to play games, like the famous AlphaZero model, the successor to the AlphaGo program that beat the world's best Go players. It was merely taught the rules of Go and then set out to play against itself. It learned what strategies were best based on its experience winning and losing games. This is also how you might train a self-driving car - instead of labeling each move it might make, you tell it "Get from point A to point B relatively fast. Don't hit anything. Good luck!" Then, it'll try to figure out the best way to accomplish those goals. (Obviously, the challenge for self-driving cars is that you don't want it to learn by trial and error the way AlphaZero can - one of the many reasons self-driving cars are hard to build.)
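As a toy version of "here are the rules, good luck," here's a sketch where the only feedback is a reward for reaching the last cell of a five-cell corridor. The corridor, the reward, and the discount factor are all invented for illustration. To keep it deterministic, it sweeps over every state and action instead of exploring by trial and error, so treat it as a simplified cousin of Q-learning, nothing like how AlphaZero is actually built.

```python
GOAL = 4            # index of the rewarding cell
N_CELLS = 5
GAMMA = 0.9         # how much a reward one step later is worth compared to one now
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """The 'rules of the game': move one cell, get a reward only on reaching the goal."""
    next_state = min(max(state + ACTIONS[action], 0), N_CELLS - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def learn_q(sweeps=20):
    """Estimate how good each action is in each cell, using nothing but the rewards."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_CELLS)}
    for _ in range(sweeps):
        for s in range(GOAL):  # the goal cell is terminal, so nothing to learn there
            for a in ACTIONS:
                nxt, reward = step(s, a)
                future = 0.0 if nxt == GOAL else max(q[nxt].values())
                q[s][a] = reward + GAMMA * future
    return q


q_table = learn_q()
# The learned policy: in each cell, pick the action with the highest estimated value.
policy = {s: max(q_table[s], key=q_table[s].get) for s in range(GOAL)}
print(policy)
```

Nobody labeled "right" as the correct move in any cell. The preference for "right" falls out of the reward alone.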

Neural networks make inroads in computer vision (1980's-2012)

Not important to understand this in any detail, but might be useful for visualization. Source:

From the 1980's to 2000's, neural networks were only used for specific applications because they were too processing-intensive. This is a recurring pattern in AI - neural networks were too expensive to run, so they weren't used. Nonetheless, a small number of researchers stayed dedicated, particularly in computer vision. Famously, neural networks were shown to be excellent at character recognition and banks began to use them to automatically process checks in the 1990's.

The big advance in this era was convolutional neural networks - CNNs. The basic idea of a CNN is that it splits an image into small overlapping chunks, runs the same small neural network on each chunk, then takes the outputs from chunks that are next to each other and feeds them into another neural network - until you get one big output for the whole image. To some degree, this makes sense. If you look at the whole image all at once, you're going to focus too much on big features and not enough on small features. For example, you can have two pictures of the same person that are totally different colors based on the lighting. But you can't just look at the small features in isolation - you need to take not just the eyes or the mouth into account to recognize a face, but the whole face together. And if you have an image with all the features in the wrong places, you shouldn't recognize a face either. You need to direct the neural networks' attention to the right features.
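The "run the same small network on every chunk" step can be sketched in a few lines. Here the "small network" is just a 2x2 grid of weights (a filter) that I picked by hand to respond to vertical edges; in a real CNN those weights are learned during training, and the image and filter below are made up.

```python
def convolve2d(image, kernel):
    """Slide a small square filter over the image and record its response at each position."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    output = []
    for row in range(out_height):
        out_row = []
        for col in range(out_width):
            # Weighted sum of the k x k chunk whose top-left corner is (row, col).
            total = sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        output.append(out_row)
    return output


# A 4x4 image: dark (0) on the left half, bright (1) on the right half.
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked filter that responds where brightness jumps from left to right.
VERTICAL_EDGE = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(IMAGE, VERTICAL_EDGE)
for row in feature_map:
    print(row)
```

The output is large only in the middle column, where the edge is. A CNN stacks many of these filters and feeds their outputs into further layers that combine edges into shapes.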

NN revolution (2012-2017)

Another major advance was using GPUs to train neural networks. GPUs - graphics processing units - are used for a lot more than graphics today. For example, Bitcoin mining was, for a time, most cost effective using a GPU. GPUs are good for Bitcoin mining and machine learning because both are highly parallel, high throughput applications. That is, both need to do many simple but fast operations at once rather than a long sequence of complex, indirect computation, which is what the CPU is for. (The CPU in your computer is probably made by Intel, while your GPU may be made by Nvidia.)

Once GPUs were used for neural network training, the state of the art in image recognition immediately made a big step forward. In 2012, University of Toronto grad student Alex Krizhevsky built AlexNet, a CNN trained on GPUs, to compete in an annual computer vision competition. It cut the error rate of the next best image recognition technique nearly in half (roughly 15% versus 26%). Since then, gains in image recognition have continued, and neural networks have been the dominant way to solve computer vision problems.

RNNs and encoder/decoder models

So far I've only been talking about the problem of identifying objects or writing in images - "computer vision." CV tends to make progress more quickly because its data is more abundant and easier to label.

But what's been going on with natural language processing during this time? After all, GPT-3 is an advance in NLP.

The major development for NLP in the mid-2010's was the widespread adoption of recurrent neural networks. Recurrent neural networks (RNNs), unlike the standard "feed-forward neural networks" we described above, can handle sequences of images or text, not just a fixed image. They do this by keeping track of memory between items in the sequence. This gives them a powerful property called Turing completeness. Any Turing complete machine can theoretically solve any computable problem - meaning RNNs can, in theory, solve any computable problem (actually training one to do so is another matter).
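Here's a bare-bones sketch of what "keeping track of memory between items" means. It's a one-neuron RNN with hand-picked weights (both the weights and the function name are made up, and real RNNs use vectors and learned weights), but it shows the key loop: each step mixes the new input with the memory left over from the previous step.

```python
import math


def rnn_run(sequence, w_input=1.0, w_hidden=0.5, h0=0.0):
    """Process a sequence one item at a time, carrying a hidden 'memory' value forward."""
    hidden = h0
    states = []
    for x in sequence:
        # The new memory depends on the current input AND the previous memory.
        hidden = math.tanh(w_input * x + w_hidden * hidden)
        states.append(hidden)
    return states


# Both sequences end with the same input (0.0), but the network "remembers"
# that the first one started with a 1.0.
after_signal = rnn_run([1.0, 0.0])
after_silence = rnn_run([0.0, 0.0])

print("memory after [1, 0]:", after_signal[-1])
print("memory after [0, 0]:", after_silence[-1])
```

A feed-forward network shown only the last item would give the same answer in both cases. Note also that the loop can't start step two until step one is finished, which matters later.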

RNNs were typically used for translation, and became the standard at Google Brain in the mid-2010s. These RNNs are encoder/decoder models, meaning they have two parts: they encode input data into an intermediary "latent" representation, and then they decode it into a new output format. This works well for translation - they convert, say, English text into a latent format that encodes the meaning in some secret language only the model can understand, and then it can convert it into any other language (including back into English, if you'd like).

(Side note: This is part of how deep fakes work too - the network encodes data from one face for the facial expression, and then uses a decoder to put that facial expression on a new face. Obviously, though, they work on images, not text.)

In the diagram below, you can see how the input is encoded to a code h, which is decoded by the output layer.

Other NLP research during this era I'm less familiar with.

Autoencoder diagram from Wikipedia

Transformers! (2017-present)

So, with GPUs and RNNs, we have some really powerful neural networks that can process text. Let's say we want to build a chatbot. Why can't we just make a really big RNN and give it a lot of examples of queries and good responses to those queries? Three major reasons:

  1. "Vanishing gradient problem." What does this mean? Well, in this case, "gradient" means the "push" that we described earlier that the neural network learns from. And "vanishing" means that the push is so small that it doesn't actually train the network. Basically, the neural network has so many layers of neurons that the push fades away before it reaches the neurons it needs to reach. For example, going back to the "A" vs "B" example - imagine that you have 10 layers between the "this shape has a pointy top" neuron and the "this is an A" neuron. During training, you see an A, and you try to turn all the knobs in between the pointy top neurons and the "A" neurons to be aligned. You change all the knobs in one layer, then a little less in the layer above that, and so on - but eventually the change is so small that no change actually happens. Why not just use fewer layers? Because the information you're trying to encode is too complex to represent in fewer layers. Simple computer vision can use fewer layers - AlexNet had 8 layers - but advanced NLP requires more layers - GPT-3 has 96. (A related problem is the "exploding gradient problem" - sometimes the push grows larger at every layer instead of smaller, and the resulting oversized adjustments wreck what the network had already learned.)
  2. Speed. Keep in mind that GPUs are good at doing lots of simple tasks at the same time. Unfortunately, RNNs actually need to run many of their tasks one after another rather than at the same time, because each step depends on the result of the step before it. (This kind of makes sense - depending on what the first word in a sentence is, then we need to think about the next word differently, and so on.) Each step has to wait for the previous step to finish, and that might result in a lot of waiting - meaning you're doing one long, complex task rather than many simple tasks. So RNNs run much more slowly than CNNs, which can be processed in parallel.
  3. Training data. We need data that says what the right answer is for AI to be trained. Language is extremely complex, since implicitly a lot of human thought is embedded in understanding and expressing language, so we need a lot of labeled data. Unlike images, which are relatively easy for humans to label, language is hard to get good training data for. And without a lot of data, there's no point in building a giant model like GPT-3.
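The vanishing gradient problem in the first item is easy to see with a little arithmetic. In this sketch (my simplification: one neuron per layer, the common "sigmoid" activation, and the same weight on every connection) the push gets multiplied by the weight times the neuron's slope at every layer it passes through. A sigmoid's slope is never more than 0.25, so with ordinary-sized weights the push shrinks fast.

```python
import math


def sigmoid(z):
    """A classic 'squashing' activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    """How much the sigmoid's output changes per unit change in its input (at most 0.25)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def push_reaching_layer(depth, z=0.0, weight=1.0):
    """Size of the training 'push' after travelling back through `depth` layers."""
    push = 1.0
    for _ in range(depth):
        push *= weight * sigmoid_slope(z)
    return push


for depth in (1, 5, 10):
    print(f"{depth:>2} layers back: push = {push_reaching_layer(depth):.10f}")

# With large weights the same multiplication blows up instead of fading out.
print("10 layers back, weight 8:", push_reaching_layer(10, weight=8.0))
```

Ten layers back, the push is under a millionth of its original size, which is why the knobs in the early layers barely move. The last line shows the opposite failure: with big enough weights, the same multiplication grows at every layer.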

How do we solve this?

Attention and transformers

In the mid-2010's, researchers began to invest more in adding "attention" to RNNs that would learn how to focus on different parts of the text when processing different words. Part of the model learns how different words in the text relate to one another, and then can focus on the most important words to share with the RNN. This helped solve vanishing gradient problems, since the network wasn't trying to learn from so many different signals at once.

This led to big improvements in how "smart" the neural networks were, but they were still slow. The revolution came in 2017 when Google researchers published the paper "Attention is all you need," showing that you could get rid of the RNN entirely and just use a simple feed-forward neural network to achieve even better results. They called these models "transformers." These transformer models actually read the whole sentence in parallel and use the attention to understand relationships between words. This means that they can use GPUs to their full potential and run much faster. (I honestly don't understand why this is, or why transformers are still Turing complete. Need to figure that out.)
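For the curious, here's a stripped-down sketch of the attention calculation at the heart of a transformer (the paper calls it "scaled dot-product attention"). Big caveats: the word vectors are invented, and I use the same vectors as queries, keys, and values, whereas a real transformer learns separate transformations for each. What I want to show is that every word's attention weights come from simple comparisons with every other word, and no word has to wait for another word's result.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, mix the values, weighted by how well the query matches each key."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:  # each query is independent, so these could all run in parallel
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up two-number vectors standing in for learned word representations.
WORDS = ["the", "cat", "sat"]
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

outputs, weights = attention(EMBEDDINGS, EMBEDDINGS, EMBEDDINGS)
for word, row in zip(WORDS, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row says how much that word "pays attention" to each word in the sentence. Because the rows are computed independently, a GPU can work on all of them at once, unlike the step-by-step loop of an RNN.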

What about the training data? Well, the secret here was to start from the encoder/decoder model, except to only keep one half of it around. Remember that the encoder/decoder model converts the text into "thoughts" before turning it into the other language. But GPT-style transformers are different. Instead of trying to convert English into French, they just take text and try to predict the text that comes next. This converts the problem from supervised to unsupervised (often called "self-supervised") - the "label" for every example is simply the next word in the text, so nobody has to annotate anything. And, because the output is the same kind of text as the input, they no longer need a separate encoder. GPT keeps only the decoder half, which reads the text so far and predicts the next word. (BERT makes the opposite choice and keeps only the encoder.)

And, by the way, it's now the late 2010's and there's plenty of text from the Internet that these models can be trained on. So, you just need to scrape text off the Internet and feed it into a transformer to get plenty of training data.
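To see why raw text is enough, here's a small sketch of how any sentence turns into training examples with no human labeling. The function name and the three-word context window are my own choices for illustration; real models split text into sub-word "tokens" and use far longer contexts.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, next word) pairs - the text labels itself."""
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]  # the few words just before position i
        examples.append((context, words[i]))
    return examples


pairs = next_word_examples("the cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A six-word sentence yields five training examples for free. Scale that up to a large chunk of the Internet and the training data problem mostly goes away.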

And there we have it: we have all the basic ingredients to make a Generative Pretrained Transformer.

Basics of how a transformer works, from

For a more detailed explanation of transformers, I recommend the following resources:

Conclusion: Transformers assemble!

After the 2017 paper "Attention is all you need," transformers took over NLP. In 2018, Google published BERT, their attempt to scale transformers up. They said they simply took the transformer model and gave it more neurons and more data and got excellent results. In 2019, OpenAI announced GPT-2, a model they said was so powerful it could be dangerous in the wrong hands. They bucked tradition and initially refused to release the model to the public. After slowly increasing access for several months, they decided that it was safe enough and released the whole model to the public.

In 2020, they finally launched GPT-3, which you probably already know about. (If you don't, check out the New York Times's article on the topic.)

Today, transformers keep on getting bigger - as many as 530 billion parameters - and are now being used for computer vision as well. They're changing AI research. Hopefully, I'll learn enough about that next to write something about it too.