Large Language Models

The development and popularity of Large Language Models have forced us to revisit concepts we thought we had fully understood. For instance, since Large Language Models are artificial intelligence products, we must reconsider what “intelligence” means. Initially, I defined intelligence as the ability to accomplish a goal. In this context, artificial intelligence is the ability of a non-biological agent to accomplish a goal. However, this definition quickly becomes intractable because it assumes that goals and objectives are already well-defined, which limits its usefulness.
I came across another definition that I find more compelling. It suggests that intelligence is the ability to make models. This aligns with Joscha Bach’s definition of general intelligence as “the ability to create and utilize models.” Another equivalent perspective (due to Francois Chollet) defines intelligence as “sensitivity to abstract analogies” or “the ability to adapt to new situations and solve novel problems.” Within this framework, if we have two agents, Agent A and Agent B, Agent A is more intelligent than Agent B if Agent A can successfully solve a larger share of novel problems across any set of tasks. By “novel problems,” I mean problems the agent has never encountered before but can solve using prior knowledge or abstractions. Thus, according to Francois Chollet, intelligence is the ability to generalize and adapt through abstraction.
As human beings, our cognitive apparatus provides us with a valuable model for general intelligence. If we examine the essential components of our nervous system, we see that neurons are the building blocks. A neuron consists of a cell body (soma), dendrites (branches that receive signals), and an axon that transmits signals. The axon branches into synapses that connect to the dendrites of other neurons, forming a network. Information is passed between neurons via electrical signals, and a neuron fires when the inputs it accumulates cross a certain threshold.
When mathematically modeling the human nervous system, we organize these neurons into layers. A “deep network” refers to having many layers, with each layer containing multiple neurons. Each connection between neurons in adjacent layers is quantified by a “weight,” while the threshold needed to activate a neuron is quantified by a “bias.” In this network model, the trainable parameters are therefore the weights (one per connection) and the biases (one per neuron).
To model this network mathematically, we use compositions of functions. A common approach involves applying an affine operator (matrix multiplication and addition of a constant vector) followed by an activation function, which introduces non-linearity. This sequence—affine operator followed by activation function—is repeated across the layers, from the input layer to the output layer. The activation function enables the network to handle complex, non-linear tasks, such as recognizing patterns in images or text.
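The composition described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a practical implementation: the weights and biases below are arbitrary made-up numbers, and tanh stands in for whatever activation function one might choose.

```python
import math

def affine(weights, biases, inputs):
    """Affine operator: matrix multiplication plus a constant vector (W x + b)."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def layer(weights, biases, inputs):
    """One layer: the affine operator followed by a non-linear activation (tanh)."""
    return [math.tanh(z) for z in affine(weights, biases, inputs)]

# A tiny two-layer network: 3 inputs -> 2 hidden neurons -> 1 output.
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.1, -0.1]
W2 = [[1.0, -1.0]]
b2 = [0.0]

x = [1.0, 2.0, 3.0]
hidden = layer(W1, b1, x)   # first layer
output = layer(W2, b2, hidden)  # second layer
```

Stacking more `layer` calls, each with its own weights and biases, gives a deeper network; without the activation function the whole composition would collapse into a single affine map, which is why the non-linearity matters.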
For example, consider a neural network tasked with recognizing handwritten digits. Given an image of a digit, the network should classify it accurately, such as identifying it as “5” or “4.” This process starts with a “training set,” a collection of input-output pairs where the input is an image, and the output is the correct label. This training set can be thought of as a sample from the “ground truth” function we aim to approximate—a function that maps inputs to their correct outputs.
Initially, the weights and biases in the network are randomly assigned. As a result, the network’s early predictions are essentially random guesses. However, because we know the correct output for each input, we can measure the error—the magnitude of the difference between the predicted and correct outputs. By averaging these errors over a batch of examples, we obtain a function (called the “loss function”) that measures how well the network performs.
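One common way to average these errors is the mean squared error. The sketch below uses a made-up three-example batch; the numbers are purely illustrative.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predicted and correct outputs."""
    assert len(predictions) == len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Toy batch: the network's current guesses vs. the known correct labels.
predicted = [0.9, 0.2, 0.4]
correct = [1.0, 0.0, 1.0]
loss = mean_squared_error(predicted, correct)  # (0.01 + 0.04 + 0.36) / 3
```

The squaring keeps every error positive and penalizes large mistakes more heavily; a perfect network would drive this value to zero.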
The loss function is a multivariable function, and our goal is to minimize it by adjusting the weights and biases. The global minimum of the loss function ideally corresponds to zero error. Using calculus, we know that the gradient of the loss function points in the direction of the steepest ascent, so moving in the opposite direction—gradient descent—leads us toward a minimum. This process of iteratively updating weights and biases to minimize error is called “learning.”
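The update rule can be shown on a toy one-parameter loss. This sketch uses a numerical gradient and a quadratic loss with its minimum at 3; the learning rate and step count are arbitrary choices for the example.

```python
def grad(f, w, eps=1e-6):
    """Numerical gradient of a one-parameter loss, via central differences."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def gradient_descent(f, w0, lr=0.1, steps=200):
    """Repeatedly step opposite the gradient, reducing the loss each iteration."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(f, w)  # move against the direction of steepest ascent
    return w

# A toy loss whose minimum sits at w = 3; the parameter starts at 0.
loss = lambda w: (w - 3.0) ** 2
w_final = gradient_descent(loss, 0.0)
```

Real networks apply the same idea to millions of weights and biases at once, using backpropagation to compute the gradient exactly instead of numerically.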
For Large Language Models such as ChatGPT, the architecture is based on a specialized neural network called a “transformer.” Transformers are designed to handle sequential data efficiently and are particularly suited for tasks like language processing. Their attention mechanism allows them to model long-range dependencies and capture context effectively, making them the backbone of many state-of-the-art AI systems today.