by Anil Ananthaswamy for Quanta Magazine

Self-supervised learning allows a neural network to figure out for itself what matters. The process might be what makes our own brains so successful.
An image might be labeled "tabby cat" or "tiger cat," for example, to "train" an artificial neural network to correctly distinguish a tabby from a tiger.
The strategy has been
both spectacularly successful and woefully deficient.
For example, a neural network might use the presence of grass to recognize a photo of a cow, because cows are typically photographed in fields.
For researchers interested in the intersection of animal and machine intelligence, moreover, this "supervised learning" might be limited in what it can reveal about biological brains.
Now some computational neuroscientists have begun to explore neural networks that have been trained with little or no human-labeled data.
These "self-supervised learning" algorithms have proved enormously successful at modeling human language and, more recently, image recognition.
In recent work, computational models of the mammalian visual and auditory systems built using self-supervised learning have shown a closer correspondence to brain function than their supervised-learning counterparts.
To some neuroscientists, it seems as if the artificial networks are beginning to reveal some of the actual methods our brains use to learn.
Flawed Supervision
Brain models inspired by artificial neural networks came of age about 10 years ago, around the same time that a neural network named AlexNet revolutionized the task of classifying unknown images.
That network, like all neural networks, was made of layers of artificial neurons, computational units that form connections to one another that can vary in strength, or "weight."
If a neural network fails to classify an image correctly, the learning algorithm updates the weights of the connections between the neurons to make that misclassification less likely in the next round of training.
The algorithm repeats this process many times with all the training images, tweaking weights, until the network's error rate is acceptably low.
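In code, that loop is compact. Here is a minimal sketch of such a supervised training loop in PyTorch-style Python; the tiny network, the synthetic stand-in "images," and the hyperparameters are illustrative placeholders, not anything from AlexNet itself.

```python
import torch
from torch import nn

# Toy stand-ins for labeled training data: 64 "images" flattened to
# 3*32*32 values, each paired with a human-assigned class label (0-9).
images = torch.randn(64, 3 * 32 * 32)
labels = torch.randint(0, 10, (64,))

# A small feed-forward network: layers of artificial neurons whose
# connections carry adjustable weights.
model = nn.Sequential(
    nn.Linear(3 * 32 * 32, 128),
    nn.ReLU(),
    nn.Linear(128, 10),          # one output per class label
)

loss_fn = nn.CrossEntropyLoss()   # penalizes misclassifications
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(20):           # repeat over the training set
    optimizer.zero_grad()
    predictions = model(images)
    loss = loss_fn(predictions, labels)
    loss.backward()               # how should each weight change?
    optimizer.step()              # nudge the weights to make the error less likely
```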
Alexei Efros, a computer scientist at the University of California, Berkeley, thinks that most modern AI systems are too reliant on human-created labels.
"They don't really learn the material," he said.
Around the same time, neuroscientists developed the first computational models of the primate visual system, using neural networks like AlexNet and its successors.
The union looked promising.
But as the field progressed, researchers realized the limitations of supervised training.
For instance, in 2017, Leon Gatys, a computer scientist then at the University of Tübingen in Germany, and his colleagues took an image of a Ford Model T, then overlaid a leopard skin pattern across the photo, generating a bizarre but easily recognizable image.
A leading artificial neural network correctly classified the original image as a Model T, but considered the modified image a leopard. It had fixated on the texture and had no understanding of the shape of a car (or a leopard, for that matter).
Self-supervised learning strategies are designed to avoid such problems. In this approach, humans don't label the data.
Rather, "the labels come from the data itself," said Friedemann Zenke, a computational neuroscientist at the Friedrich Miescher Institute for Biomedical Research in Basel, Switzerland.
Self-supervised algorithms essentially create gaps in the data and ask the neural network to fill in the blanks.
In a so-called large language model, for instance, the training algorithm will show the neural network the first few words of a sentence and ask it to predict the next word.
When trained with a massive corpus of text gleaned from the internet, the model appears to learn the syntactic structure of the language, demonstrating impressive linguistic ability - all without external labels or supervision.
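The crucial point is that the targets are just the words that actually come next in the text. Below is a toy sketch of that idea, with an invented six-word corpus and a deliberately tiny predictor standing in for a real large language model.

```python
import torch
from torch import nn

# A toy corpus; a real model would train on billions of words from the internet.
text = "the cat sat on the mat".split()
vocab = {w: i for i, w in enumerate(sorted(set(text)))}
tokens = torch.tensor([vocab[w] for w in text])

# The training pairs come from the data itself: each position's target
# is simply the word that actually follows it. No human labeling.
inputs, targets = tokens[:-1], tokens[1:]

# A deliberately tiny next-word predictor: an embedding plus a linear layer.
embed = nn.Embedding(len(vocab), 16)
head = nn.Linear(16, len(vocab))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(head.parameters()), lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    logits = head(embed(inputs))      # a score for every word in the vocabulary
    loss = loss_fn(logits, targets)   # compare with the word that really came next
    loss.backward()
    optimizer.step()
```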
A similar effort is underway in computer vision.
In late 2021, Kaiming He and colleagues revealed their "masked auto-encoder," which builds on a technique pioneered by Efros' team in 2016.
The self-supervised learning algorithm randomly masks images, obscuring almost three-quarters of each one.
The masked auto-encoder turns the unmasked portions into latent representations - compressed mathematical descriptions that contain important information about an object.
(In the case of an image, the latent representation might be a mathematical description that captures, among other things, the shape of an object in the image.)
A decoder then converts those representations back into full images.
The self-supervised learning algorithm trains the encoder-decoder combination to turn masked images into their full versions. Any differences between the real images and the reconstructed ones get fed back into the system to help it learn.
This process repeats for a set of training images until the system's error rate is suitably low.
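Here is a heavily simplified sketch of that encoder-decoder loop, assuming images have already been cut into flat patches; He's actual masked auto-encoder is built on vision transformers, so the single linear layers and toy sizes below are placeholders for the real architecture.

```python
import torch
from torch import nn

patch_dim, n_patches, latent_dim = 48, 16, 32    # toy sizes
images = torch.randn(8, n_patches, patch_dim)    # 8 images, each as 16 flat patches

encoder = nn.Linear(patch_dim, latent_dim)       # visible patches -> latent representation
decoder = nn.Linear(latent_dim, patch_dim)       # latent representation -> reconstructed patches
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for step in range(200):
    # Randomly hide roughly three-quarters of each image's patches.
    mask = torch.rand(images.shape[:2]) < 0.75    # True where a patch is obscured
    visible = images * (~mask).unsqueeze(-1)

    latent = encoder(visible)                     # compress what remains
    reconstruction = decoder(latent)              # try to restore the full image

    # The error on the hidden patches is the only training signal.
    loss = ((reconstruction - images)[mask] ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```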
In one example, when a trained masked auto-encoder was shown a previously unseen image of a bus with almost 80% of it obscured, the system successfully reconstructed the structure of the bus.
The latent representations created in a system such as this appear to contain substantially deeper information than previous strategies could capture.
The system might learn the shape of a car, for example - or a leopard - and not just their patterns.
Self-Supervised Brains
In systems such as this, some neuroscientists see echoes of how we learn.
Biological brains are thought to be continually predicting, say, an object's future location as it moves, or the next word in a sentence, just as a self-supervised learning algorithm attempts to predict the gap in an image or a segment of text.
And brains learn from their mistakes on their own, too - only a small part of our brain's feedback comes from an external source saying, essentially, "wrong answer."
The computational neuroscientist Blake Richards has helped create an AI
that mimics visual networks in living brains.
For example, consider the visual systems of humans and other primates.
These are the best studied of all animal sensory systems, but neuroscientists have struggled to explain why they include two separate pathways: the ventral visual stream, which is responsible for recognizing objects and faces, and the dorsal visual stream, which processes movement (the "what" and "where" pathways, respectively).
Richards and his team created a self-supervised model that hints at an answer.
They trained an AI that combined two different neural networks: the first, a ResNet architecture, was designed for processing images; the second, a recurrent network, could keep track of a sequence of prior inputs and make predictions about the next input.
To train the combined AI, the team started with a sequence of, say, 10 frames from a video and let the ResNet process them one by one.
The recurrent network then predicted the latent representation of the 11th frame, rather than simply matching it to the first 10 frames.
The self-supervised learning algorithm compared the prediction to the actual value and instructed the neural networks to update their weights to make the prediction better.
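A rough sketch of that training setup follows, with a small convolutional encoder standing in for the ResNet and a GRU standing in for the recurrent network; the clip sizes, layer widths and loss are invented for illustration and are not the team's actual code.

```python
import torch
from torch import nn

# A toy clip: a batch of 4 videos, 11 frames each, 3x32x32 pixels per frame.
frames = torch.randn(4, 11, 3, 32, 32)

# Stand-in for the ResNet: turns each frame into a latent representation.
encoder = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 16 * 16, 64),
)
# Stand-in for the recurrent network: watches a sequence of latents
# and predicts the latent representation of the next frame.
rnn = nn.GRU(input_size=64, hidden_size=64, batch_first=True)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(rnn.parameters()), lr=1e-3)

for step in range(100):
    # Encode all 11 frames, one by one, into latent representations.
    latents = torch.stack(
        [encoder(frames[:, t]) for t in range(11)], dim=1)   # (4, 11, 64)

    # The recurrent network sees the first 10 latents...
    out, _ = rnn(latents[:, :10])
    predicted_11th = out[:, -1]             # ...and predicts the 11th

    # Compare the prediction with the actual 11th frame's latent representation.
    loss = ((predicted_11th - latents[:, 10].detach()) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```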
Richards' team found that an AI trained with a single ResNet was good at object recognition, but not at categorizing movement.
But when they split the ResNet into two, creating two pathways, the AI developed representations for objects in one pathway and for movement in the other.
To test the AI further, the team showed it a set of videos that researchers at the Allen Institute for Brain Science in Seattle had previously shown to mice.
The Allen researchers recorded the neural activity in the mouse visual cortex as the animals watched the videos.
Here too, Richards' team found similarities in the way the AI and the living brains reacted to the videos.
During training, one of the pathways in the artificial neural network became more similar to the ventral, object-detecting regions of the mouse's brain, and the other pathway became similar to the movement-focused dorsal regions.
The results suggest that our visual system has two specialized pathways because they help predict the visual future, said Richards; a single pathway isn't good enough.
Models of the human auditory system tell a similar story.
In June, a team led by Jean-Rémi King, a research scientist at Meta AI, trained an AI called Wav2Vec 2.0, which uses a neural network to transform audio into latent representations.
The researchers mask some of these representations, which then feed into another component neural network called a transformer.
During training, the transformer predicts the masked information. In the process the entire AI learns to turn sounds into latent representations - again, no labels needed.
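A loose sketch of that scheme is below; the real Wav2Vec 2.0 predicts quantized latents with a contrastive objective, so the plain reconstruction loss and toy layer sizes here are simplifications for illustration.

```python
import torch
from torch import nn

waveforms = torch.randn(2, 16000)             # two one-second clips of toy "audio"

# Convolutional encoder: raw audio -> a sequence of latent representations.
conv = nn.Conv1d(1, 64, kernel_size=400, stride=320)
# Transformer that must infer the hidden latents from the surrounding context.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)

optimizer = torch.optim.Adam(
    list(conv.parameters()) + list(transformer.parameters()), lr=1e-3)

for step in range(50):
    latents = conv(waveforms.unsqueeze(1)).transpose(1, 2)   # (batch, time, 64)

    # Hide a random half of the time steps from the transformer.
    mask = torch.rand(latents.shape[:2]) < 0.5
    masked_input = latents.masked_fill(mask.unsqueeze(-1), 0.0)

    predicted = transformer(masked_input)

    # Train the whole stack to recover the hidden latents - no labels needed.
    loss = ((predicted - latents.detach())[mask] ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```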
The team used about 600 hours of speech data to train the network.
Once the system was trained, the researchers played it sections of audiobooks in English, French and Mandarin.
The researchers then compared the AI's performance against data from 412 people - a mix of native speakers of the three languages who had listened to the same stretches of audio while having their brains imaged in an fMRI scanner.
King said that, despite the noisy and low-resolution fMRI images, his neural network and the human brains responded to the same stretches of audio in similar ways.
Uncured Pathologies
Not everyone is convinced.
Josh McDermott, a computational neuroscientist at the Massachusetts Institute of Technology, has worked on models of vision and auditory perception using both supervised and self-supervised learning.
His lab has designed what he calls "metamers," synthesized audio and visual signals that, to a person, are just unintelligible noise.
To an artificial neural network, however, metamers appear indistinguishable from real signals.
This suggests that the representations that form in the neural network's deeper layers, even with self-supervised learning, don't match the representations in our brains.
These self-supervised learning approaches are an improvement, McDermott suggested, but they carry over many of the pathologies of supervised models.
The algorithms themselves also need more work.
For example, in Meta AI's Wav2Vec 2.0, the AI only predicts latent representations for a few tens of milliseconds' worth of sound - less time than it takes to utter a perceptually distinct noise, let alone a word.
Truly understanding brain function is going to require more than self-supervised learning.
For one thing, the brain is full of feedback connections, while current models have few such connections, if any.
An obvious next step would be to use self-supervised learning to train highly recurrent networks - a difficult process - and see how the activity in such networks compares to real brain activity.
The other crucial step would be to match the activity of artificial neurons in self-supervised learning models to the activity of individual biological neurons.
If the observed similarities between brains and self-supervised learning models hold for other sensory tasks, it'll be an even stronger indication that whatever magic our brains are capable of requires self-supervised learning in some form.