A Revolution in How Robots Learn
Brave New World Dept.
A future generation of robots will not be programmed to complete specific tasks. Instead, they will use A.I. to teach themselves.
By James Somers
November 25, 2024
“This is the year that people really realized that you can build general-purpose robots,” Carolina Parada, the leader of the robotics team at Google DeepMind, said recently.
Illustration by Tameem Sankari
In the first days of my son’s life, during the fall of 2023, he spent much of the time when he wasn’t sleeping or eating engaged in what some cognitive scientists call “motor babbling.” His arms and legs wiggled; his eyes wandered and darted, almost mechanically. One night, as he was drifting off to sleep, he smiled for the first time. As I admired him, wondering what he might be thinking about, his expression suddenly went blank—and then, in quick succession, he looked upset, then surprised, and then happy again. It was as if the equipment were being calibrated. That is apparently the purpose of motor babbling: random movements help the brain get acquainted with the body it’s in.
Our intelligence is physical long before it is anything else. Most of our brain mass exists to coördinate the activity of our bodies. (Neuroscientists have found that even when you navigate an abstract space—contemplating, say, your company’s org chart—you use the same neural machinery you’d use to navigate a real space.) A disproportionate amount of the primary motor cortex, a region of the brain that controls movement, is devoted to body parts that move in more complicated ways. An especially large portion controls the face and lips; a similarly large portion controls the hands.
A human hand is capable of moving in twenty-seven separate ways, more by far than any other body part: our wrists rotate, our knuckles move independently of one another, our fingers can spread or contract. The sensors in the skin of the hand are among the densest in the body, and are part of a network of nerves that run along the spinal cord. “People think of the spinal column as just wires,” Arthur Petron, a roboticist who earned his Ph.D. in biomechatronics at M.I.T., said. “No. It’s also brain tissue.” The hand, in particular, is so exquisitely sensitive that “it’s a vision sensor,” he said. “If you touch something in the dark, you can basically draw it.”
I remember the week that my son’s hands came online. We had a spherical toy with a rattle inside, and for weeks he limply ignored it. One day, though, as if by accident, he managed to paw at it. The next day, he could hold on. Within a week, he grasped for it with some intention, and after two weeks he was turning it over in his hand. The remarkable thing about this progression is its extreme rapidity. How can one learn to use so complex a piece of equipment in just two weeks? My son himself seemed impressed. He would look at his palm and flex his fingers, as though wondering, What is this thing, and what else can it do?
In the nineteen-eighties, Hans Moravec, a Canadian roboticist, described a paradox: the tasks that are easiest for humans to perform, such as using our hands to grasp things, are often the hardest for computers to do. This is still true even now that many refined tasks, such as writing prose or computer code, have practically been conquered already. In my job as a programmer, I use an A.I. to quickly solve coding tasks that once would have taken me an afternoon; this A.I. couldn’t type at my keyboard. It is all mind and no body. As a result, the most “A.I.-proof” professions may actually be old ones: plumbing, carpentry, child care, cooking. Steve Wozniak, the co-founder of Apple, once proposed a simple test that has yet to be passed: Can a robot go into your house and make you a cup of coffee?
Until a few years ago, robotics seemed to be developing far more slowly than A.I. On YouTube, humanoid forms developed by Boston Dynamics, an industrial-robotics company, danced or leaped over obstacles, doing a sort of mechanical parkour. But these movements were scripted—the same robots couldn’t make you a cup of coffee. To fetch a coffee filter, a robot might need to navigate around a kitchen island, recognize a cupboard, and open the cupboard door without ripping it off its hinges. Simply peeling apart the sides of a coffee filter was long considered a feat of unfathomable difficulty. A hopelessness hung over the whole enterprise.
Then some of A.I.’s achievements began spilling into robotics. Tony Zhao, a robotics researcher who started his academic career in A.I., at U.C. Berkeley, remembers reading about GPT-3, a large language model that OpenAI introduced in 2020, and feeling that he was a witness to history. “I’d seen language models before, but that was the first one that felt kind of alive,” he told me. Petron, the M.I.T. researcher, was working on another project at OpenAI—a robotic hand that could delicately spin the faces of a Rubik’s Cube. In August, 2022, researchers from Google showed that L.L.M.-powered robots had a surprising amount of common sense about physical tasks. When they asked a robot for a snack and a drink, it found a banana and a water bottle in the kitchen and brought them over.
Roboticists increasingly believe that their field is approaching its ChatGPT moment. Zhao told me that when he ran one of his latest creations he immediately thought of GPT-3. “It feels like something that I’ve never seen before,” he said. In the top labs, devices that once seemed crude and mechanical—robotic—are moving in a way that suggests intelligence. A.I.’s hands are coming online. “The last two years have been a dramatically steeper progress curve,” Carolina Parada, who runs the robotics team at Google DeepMind, told me. Parada’s group has been behind many of the most impressive recent robotics breakthroughs, particularly in dexterity. “This is the year that people really realized that you can build general-purpose robots,” she said. What is striking about these achievements is that they involve very little explicit programming. The robots’ behavior is learned.
On a cool morning this summer, I visited a former shopping mall in Mountain View, California, that is now a Google office building. On my way inside, I passed a small museum of the company’s past “moonshots,” including Waymo’s first self-driving cars. Upstairs, Jonathan Tompson and Danny Driess, research scientists in Google DeepMind’s robotics division, stood in the center of what looked like a factory floor, with wires everywhere.
At a couple of dozen stations, operators leaned over tabletops, engaged in various kinds of handicraft. They were not using their own hands—instead, they were puppeteering pairs of metallic robotic arms. The setup, known as ALOHA, “a low-cost open-source hardware system for bimanual teleoperation,” was once Zhao’s Ph.D. project at Stanford. At the end of each arm was a claw that rotated on a wrist joint; it moved like the head of a velociraptor, with a slightly stiff grace. One woman was using her robotic arms to carefully lower a necklace into the open drawer of a jewelry case. Behind her, another woman prized apart the seal on a ziplock bag, and nearby a young man swooped his hands forward as his robotic arms folded a child’s shirt. It was close, careful work, and the room was quiet except for the wheeze of mechanical joints opening and closing. “It’s quite surprising what you can and can’t do with parallel jaw grippers,” Tompson said, as he offered me a seat at an empty station. “I’ll show you how to get started.”
I wrapped my fingers around two handles. When I pushed or pulled with one of my hands, its robot-claw counterpart followed suit. Tompson put some toys and a highlighter on the table. With my right hand, I pawed weakly at a small plastic diamond, hoping to push it through a diamond-shaped hole in a block. “This is kind of tough,” I said. My brain had decided with impressive speed that these claws were my new hands, but hadn’t yet wired them up correctly. The diamond would not do what I wanted. I felt for my son, who’d had the same trouble with one of his first toys.
“Passing it back and forth between hands makes reorienting things much easier,” Tompson suggested.
I had forgotten I even had a left hand. I practiced opening and closing the left claw and found that I could easily pass the diamond back and forth. Driess chimed in: “You see it has no force feedback, but you realize that it doesn’t matter at all.” When I closed the grippers around the diamond, I couldn’t feel anything—but I finally managed to fit the shape through the hole.
Gaining confidence, I grabbed a highlighter in my left claw and pulled the cap off with my right. Tompson said that they’d given a similar task to their operators. Near my feet were two pedals, one labelled “Success,” the other “Failure.” You might cap and uncap highlighters for hours, tapping the right pedal if you got it and the left if you fumbled. Then the A.I., using a technique called imitation learning, could try to mimic the successful runs without anyone behind the claws. If you’ve ever seen a tennis instructor guide a student’s arm through a proper backhand, that’s imitation learning.
I eyed a computer underneath the table. Driess explained that there are four cameras that gather data, along with sensors that track the orientation of the robot in space. The data are distilled by a series of neural networks into what’s called a policy—essentially a computer program that tells a robot what to do. An assembly-line robot arm might have a very simple policy: rotate ten degrees clockwise, pick up an item, drop it, rotate back, repeat. The policies being trained here were far more complex—a summation of all the operators’ successful runs.
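To make the idea of a policy concrete, here is a minimal Python sketch; it is not DeepMind's code. It contrasts a fixed assembly-line routine with a policy built from demonstrations: runs marked as failures are thrown away, and the policy copies whatever the operator did in the most similar successful moment. The data layout, the function names, and the nearest-neighbor lookup are all assumptions made for illustration; the system described here distills camera and sensor data through neural networks instead.

```python
import math

# Hypothetical demonstration log. Each run holds (observation, action) pairs
# plus the operator's pedal press. Observations and actions are toy 2-D tuples;
# a real system would record camera images and joint positions.
demonstrations = [
    {"success": True, "steps": [((0.0, 0.0), (1.0, 0.0)), ((1.0, 0.0), (0.0, 1.0))]},
    {"success": False, "steps": [((0.0, 0.0), (-1.0, 0.0))]},
    {"success": True, "steps": [((1.0, 1.0), (0.0, -1.0))]},
]


def scripted_policy(step_index):
    """A fixed assembly-line routine: the same four actions, repeated forever."""
    routine = ["rotate_clockwise_10_degrees", "pick_up", "drop", "rotate_back"]
    return routine[step_index % len(routine)]


def build_imitation_policy(runs):
    """Keep only the successful runs and return a nearest-neighbor policy."""
    examples = [step for run in runs if run["success"] for step in run["steps"]]

    def policy(observation):
        # Copy the action taken in the most similar recorded situation.
        _, action = min(examples, key=lambda ex: math.dist(ex[0], observation))
        return action

    return policy


policy = build_imitation_policy(demonstrations)
print(scripted_policy(5))   # step 5 wraps around to index 1 of the routine
print(policy((0.1, 0.0)))   # nearest successful observation is (0.0, 0.0)
```

The failed run never influences the result, which is the role the “Failure” pedal plays in this toy version: only what worked gets imitated.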
Driess began typing at a console nearby. He wanted to show me a policy that put shirts on clothes hangers. “This policy was trained on how many demonstrations?” Tompson asked.
“Eight thousand,” Driess answered.
I imagined an operator hanging a shirt eight thousand times. Behind us, someone arrived for a new shift and shook his wrists out. “They never operate for more than an hour at a time without an hour break in between,” Tompson said.
When the policy was ready, Tompson laid a child’s polo shirt on the table and Driess hit Enter. Suddenly, the ALOHA I’d been driving began driving itself. The hands came alive and moved with purpose toward the shirt, like the magic broomsticks in “Fantasia.”
The right claw grabbed one corner of the shirt, motor whirring, and lifted it toward a little plastic coatrack with a hanger on it. The other claw grabbed the hanger. The next steps were to thread the hanger into one shoulder, secure that side, and do the same with the other shoulder. The robot halted a moment, then recovered. Finally, it placed the shirt and hanger on the rack.
“I’ll call that a success,” Tompson said, tapping the right pedal. I could see the intricacy of the task: your eyes help your hands make tiny adjustments as you go. ALOHA is one of the simplest and cheapest sets of robotic arms out there, yet operators have pushed the boundaries of robotic dexterity with it. “You can peel eggs,” Tompson said. Zhao had managed to fish a contact lens out of its case and place it on a toy frog’s eye. (Other precise tasks, such as sewing, remain difficult.)
In the early days of Google Books, roomfuls of contractors turned millions of pages by hand in order to unlock the knowledge inside. The roomful of ALOHAs was unlocking the subtle physical details of everyday life, arguably one of the last expanses of unrecorded human activity. The data they generate will help to train what roboticists have taken to calling “large behavioral models.”
I asked Tompson and Driess to show me the policy that their robot had become famous for. “There is a professor, a very good professor, who said that he will retire as soon as a robot policy can tie shoes,” Driess said. Tompson plunked a shoe down on the table.
When the claws came alive, they grabbed the two ends of the shoelace, formed them into loops, and wove them through each other. As the claws came apart, we cheered: the robot had tied a shoelace.
“So did he retire?” I asked. Apparently not. One of the ultimate dreams of A.I. is generalization: how does your policy do when pushed beyond its training data? They’d trained the policy on only two or three shoes.
“If I gave it my shoe,” I ventured, “would it just totally fail?”
“We could try,” Tompson said. I removed my right sneaker, with apologies to anyone forced to handle it. Tompson gamely placed it on the table, while Driess reloaded the policy.
“To set expectations,” Driess said, “this is a task that is thought of as being impossible.”
Tompson eyed his new experimental subject with some trepidation. “Very short shoelaces,” he said.
The policy booted up, and the claws set to work. This time, they poked at the shoelace without getting a grip. “Do you give consent for your shoe to be destroyed?” Driess joked, as the hands grabbed at the tongue. Tompson let them try for a few more seconds before hitting the Failure pedal.
Experts in child development like to say that at around nine months old babies develop the pincer grip, or the ability to hold something small between their thumb and forefinger. That frames the problem in terms of the hand. Equally important, though, is the knowledge the maneuver requires. Children have to learn how hard you can squeeze a piece of avocado before it slides out of your fingers, or a Cheerio before it crumbles.
From the moment my son was born, he’s been engaged in what A.I. researchers call “next-token prediction.” As he reaches for a piece of banana, his brain is predicting what it will feel like on his fingertips. When it slips, he learns. This is more or less the method that L.L.M.s like ChatGPT use in their training. As an L.L.M. hoovers up prose from the Internet, it hides from itself the next chunk of text, or token, in a sentence. It guesses the hidden token on the basis of the ones that came before, then unhides the token to see how close its guess was, learning from any discrepancies. The beauty of this method is that it requires very little human intervention. You can just feed the model raw knowledge, in the form of an Internet’s worth of tokens.
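The hide, guess, and reveal loop can be sketched in a few lines of Python. This is a toy stand-in, offered under loud assumptions: it looks only at the single previous word and learns by counting, whereas a real L.L.M. conditions on long stretches of text and adjusts billions of numerical weights. The function names and the tiny corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def train_next_token_model(text):
    """Count-based next-token predictor trained by hide, guess, reveal."""
    tokens = text.split()
    counts = defaultdict(Counter)   # counts[previous][next] = times seen
    wrong_guesses = 0
    for previous, hidden in zip(tokens, tokens[1:]):
        # Guess the hidden token from what has been seen so far.
        seen = counts[previous]
        guess = seen.most_common(1)[0][0] if seen else None
        # Reveal the token and note whether the guess missed.
        if guess != hidden:
            wrong_guesses += 1
        # Learn from the revealed token so that later guesses improve.
        counts[previous][hidden] += 1
    return counts, wrong_guesses


def predict(counts, previous):
    """Return the most frequently observed follower of `previous`, if any."""
    seen = counts.get(previous)
    return seen.most_common(1)[0][0] if seen else None


corpus = "the robot ties the shoe and the robot folds the shirt"
model, misses = train_next_token_model(corpus)
print(predict(model, "the"))      # "robot" follows "the" more often than anything else
print(predict(model, "unicorn"))  # None: this word never appeared in the corpus
```

No one labels anything here. The text supplies its own answers, which is why the approach scales to an Internet's worth of tokens.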
As grownups, we have an indescribably rich model of the physical world, the result of a lifetime of tokens. Try this: Look at any object or surface around you and imagine what it would taste like. You are probably right, and that has something to do with the years you spent crawling around, putting everything in your mouth. Like all adults, I practice dexterity without even meaning to: when I manage to put a duvet into its cover; when I open a sealed bag of dog treats with one hand. The difference between me and my son is that most of my predictions are accurate. I don’t reach for a stream of water thinking I might be able to hold on to it. The exceptions stand out. Not long ago, at a restaurant, a friend told me to poke a sculpture that appeared to be made of glass, and it wobbled, almost like rubber. Model updated.
We can tie shoelaces better than ALOHA not because it has primitive, unsensing claws but because every shoe—every arrangement of laces, the way they bend and fall each time you lift them—is different. There is no Internet-size archive of the ways in which physical objects interact. Instead, researchers have come up with several competing methods of teaching robots.
One camp is betting on simulation. Nvidia, the giant A.I. chipmaker, has made software for creating “digital twins” of industrial processes, which allows computers to practice motions before robots actually try them. OpenAI used simulated data when training its robotic hand to spin a Rubik’s Cube; copies of the hand, practicing in parallel, carried out simulations that would have taken a real robot ten thousand years to perform. The appeal is obvious: all you need to generate more data is more computing power, and robots can learn like Neo in “The Matrix,” when he downloaded kung fu. Robot hands and Rubik’s cubes can’t be simulated perfectly, however. Even a paper towel becomes unpredictable when you crumple or rip it. Last year, Nvidia published a paper showing that researchers could teach a simulated robot hand to spin a pen in its fingers, the way a bored student might, an action that essentially requires the pen to be in flight most of the time. But the paper makes no mention of whether an actual robot could perform the trick.
For this reason, imitation learning seems to have an edge over simulation. Figure, an American startup, has raised more than six hundred million dollars to build an elaborate “humanoid” robot with a head, a torso, arms, legs, and five-fingered hands. Its most impressive feat of dexterity so far is “singulating a pepperoni,” according to Brett Adcock, Figure’s founder: it can peel one slice from the rest of the sausage. “If you want to do what humans can do,” Adcock told me, “then you need a robot that can interact the same way humans can with that environment.” (Tesla, 1X, Agility, and dozens of Chinese competitors have built humanoids.) Geordie Rose, a co-founder of Sanctuary AI, a robotics and artificial-intelligence startup based in Vancouver, argued that it’s easier to collect data for robots that move like us. “If I asked you to pick up a cup with, say, an octopus robot with eight suction-cup tentacles, you’d have no idea what to do, right?” he said. “But, if it’s a hand, you just do it.” Sanctuary’s sleek humanoid, called Phoenix, learns partly by being tele-operated by humans. The “pilot” dons haptic gloves, an exosuit that covers the upper body, and a virtual-reality headset that shows what the robot “sees.” Every movement, down to the slightest bend of the pilot’s pinkie, is replicated on the robot. Phoenix learns in much the same way as ALOHA, but it’s far more expressive.
Of course, if robots have to be taught every skill by hand, it’s going to take a long time, and a lot of exosuits, for them to become useful. When I want to bake bread, I don’t ask Paul and Prue from “The Great British Bake Off” to come over and pilot my arms; I just watch an episode of the show. “It’s the holy grail, right?” Tompson, from the ALOHA project, said. “You can imagine a model watching YouTube videos to learn basically whatever you want it to do.” But a YouTube video doesn’t tell you the precise angle of a baker’s elbow or the amount of force in her fingers as she kneads. To take advantage of demonstrations at a distance, a robot would need to be able to map its hands onto a person’s. That requires a foundation: a mental model of the physical world and of the body in it, and a repertoire of simple skills.
Early in our lives, humans learn how to learn. A few months ago, my son was sitting on a rocking horse, disappointed that it wasn’t moving. He looked over his shoulder and saw a girl on her own horse kicking her legs to make it rock. Monkey see, monkey do. After a few tries, the horse started going, and a smile broke out on his face. Practitioners of A.I. like to talk about “the flywheel,” an analogy to a disk that, once it gets going, is hard to stop. When the flywheel is really spinning, robots explore the world more efficiently, and they start to improve more quickly. That is how a robot might leap from one regime, like needing to be puppeteered, to another, like learning on its own.
One of the older buildings on the Google campus contains a Ping-Pong table with a big industrial robot arm on one side—the kind you’d see in a car factory, but in this case holding a paddle. On the afternoon of my visit, Saminda Abeyruwan, a research engineer, was sitting at a computer console on the other side of the net, and Pannag Sanketi, a software engineer, told him to “turn on the binary.” The arm whirred to attention.
Videos of this robot from 2022 had not made me excited to play against it. In the lingo of my middle-school tennis team, the robot appeared to be a “pusher”—cheaply returning the ball, with no ambition, and barely challenging beginners. But apparently the system had improved in the past two years. Fei Xia, another researcher, warned me, “Be careful with the forehand.”
Abeyruwan hit a practice serve to the machine. The whole apparatus—the arm was mounted on a gantry—moved like a printer head, loudly, and faster than seemed plausible. The paddle swung through the air with a lovely, rising stroke, shooting the ball back across the net. Abeyruwan, quick on his feet, rallied, but on the third shot the arm ripped a forehand at an angle. 0–1.
“I don’t want to play it too much,” Abeyruwan said. “It’s going to adapt to my weaknesses.” He offered me the paddle.
One downside of not being a robot is that you can’t just load a policy into your memory. It usually takes me about fifteen minutes to find my rhythm at a Ping-Pong table. I lobbed a ball to my opponent, hoping to warm up. Back came a deep, fast crosscourt shot, which sailed just past the end of the table.
“It’s pretty savage, this thing,” I said. It seemed to be trying to paint the corners.
“We changed it to make it more competitive,” Sanketi said. “In the process, what happened is that it’s more aggressive.”
A lot of its balls were going long. I took some pace off my own shots, and suddenly it found its range. Now that it was getting into rallies, it went for steeper and steeper angles. More balls were going to my backhand side. “You can feel it adapting to you,” I said.
As it exploited my weakness, I tried to exploit back, putting some cut on the ball. It blew its return into the net. “Spin it doesn’t like,” I said. The team had tried to use a motion-tracking system to estimate the tilt of paddles as players struck the ball, but it wasn’t sensitive enough.
There were other limitations. “It’s very risky to get close to the table,” Sanketi said, so the robot always hovers at least two inches above the table, which reduces the amount of topspin it can put on its returns. A lot of my balls were coming in fast and low, thank you very much, and the robot had a hard time getting under them. Sanketi suspected that this explained many of the long misses. But there was also just the fact that it had never played me before. In the lingo, my playing style was “out of distribution,” like shoes with unusually short laces.
“The way that we would fix this is, we have all the balls that it missed,” Sanketi went on. “We put it in the flywheel and train again. Next time you come, it will play better.” In the course of four weeks this summer, with data from only a couple of dozen players, the robot had progressed from dopey beginner to high intermediate. “Is the goal to get it to superhuman performance?” I asked.
“Yeah,” Sanketi said. Behind him, there was another Ping-Pong table with a similar setup, except that there was a robot on each side. I could see where this was going.
DeepMind, which was founded as a London-based A.I. research laboratory in 2010, is best known for a model called AlphaGo, which beat the world champion in the ancient board game Go. AlphaGo was originally fed a database of matches so that it could imitate human experts. Later, a newer version trained solely via “self-play,” sparring with a copy of itself. The model became an astonishingly efficient learner—the crowning example of a technique known as “reinforcement learning,” in which an A.I. teaches itself not by imitating humans but by trial and error. Whenever the model chanced onto a good move, the decisions that led it there were reinforced, and it got better. After just thirty hours of this training, it had become one of the best players on the planet.
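A small Python sketch shows what “reinforced” means in practice. It uses tabular Q-learning, a textbook method swapped in here for illustration, on a made-up five-square corridor; it is not how AlphaGo works, which pairs neural networks with game-tree search. The environment, the constants, and the function names are all assumptions.

```python
import random

N_STATES = 5          # positions 0..4 on a line; reaching position 4 wins
ACTIONS = (-1, +1)    # step left or step right


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error on the toy corridor."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    goal = N_STATES - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Trial: usually pick the best-known move, sometimes explore.
            if rng.random() < epsilon or q[(state, -1)] == q[(state, +1)]:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state = min(max(state + action, 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Error: compare the old estimate with what actually happened,
            # then nudge the estimate toward the observed outcome.
            best_next = 0.0 if next_state == goal else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """For every non-goal square, pick the move with the highest learned value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}


q_values = train()
learned = greedy_policy(q_values)
print(learned)
```

Nothing tells the learner that stepping right is correct. The first win happens by chance, and each later win pushes credit one square farther back along the path that led to it.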
Collecting data in the physical world, however, is much harder than doing so inside a computer. Google DeepMind’s best Go model can play a virtual game in seconds, but physics limits how fast a ball can ping and pong. The company’s Ping-Pong robots take up an entire room, and there are only three; the researchers had to invent a Rube Goldberg contraption using fans, funnels, and hoppers to feed loose balls back into robot-vs.-robot games. Right now, Sanketi explained, the robots are better at offense than defense, which ends games prematurely. “There’s nothing to keep the rally going,” he said. That’s why the team had to keep training their robots against people.
A Ping-Pong robot that could beat all comers sounded like classic DeepMind: a singularly impressive, whimsical, legible achievement. It would also be useful—imagine a tireless playing partner that adjusts as you improve. But Parada, the robotics lead, told me that the project might actually be winding down. Google, which acquired DeepMind in 2014 and merged it with an in-house A.I. division, Google Brain, in 2023, is not known for daring A.I. products. (It has a reputation for producing stellar and somewhat esoteric research that gets watered down before it reaches the market.) What the Ping-Pong bot has shown, Parada told me, is that a robot can “think” fast enough to compete in sport and, by interacting with humans, can get better and better at a physical skill. Together with the surprising capabilities of the ALOHAs, these findings suggested a path to human levels of dexterity.
Robots that teach themselves, by way of reinforcement learning, were long thought to be a dead end in robotics. A basic problem is what’s called curriculum design: how do you encourage learners to stretch their abilities without utterly failing? In a simulated game of Go, there are a finite number of moves and specific conditions for victory; an algorithm can be rewarded for any move that leads there. But in the physical world there are an uncountable number of moves. When a robot attempts to spin a pen, where there are so many more ways to fail than to succeed, how does it even determine that it’s making progress? The Rubik’s Cube researchers had to manually engineer rewards into their system, as if laying bread crumbs for the robot to follow: by fiat, the robot won points for maneuvers that humans know to be useful, such as twisting a face exactly ninety degrees.
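The difference between a bare win condition and hand-laid bread crumbs can be written down directly. The Python sketch below is hypothetical and is not OpenAI's reward function: the bonus size, the tolerance, and the function names are invented to show the general shape of the idea.

```python
def sparse_reward(solved):
    """Reward only the final outcome, as in a game with a clear win condition."""
    return 1.0 if solved else 0.0


def shaped_reward(solved, face_turn_degrees, tolerance_degrees=1.0, bonus=0.1):
    """Add a small hand-picked bonus when a twist is close to a quarter turn."""
    quarter_turn = abs(abs(face_turn_degrees) - 90.0) <= tolerance_degrees
    return sparse_reward(solved) + (bonus if quarter_turn else 0.0)


# With the sparse reward, every unsolved attempt looks equally bad.
print(sparse_reward(False))
# With shaping, a clean quarter turn earns partial credit; a sloppy one does not.
print(shaped_reward(False, 90.0))
print(shaped_reward(False, 45.0))
```

The bonus encodes a human judgment about which intermediate moves are useful, which is exactly the knowledge the robot could not be counted on to discover for itself.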
What’s mysterious about humans is that we intrinsically want to learn new things. We come up with our own rewards. My son wanted to master the use of his hands because he was determined to taste everything in sight. That motivated him to practice other new abilities, like crawling or reaching behind his back. In short, he designed the curriculum himself. By the time he attempts something complicated, he has already developed a vocabulary of basic moves, which helps him avoid many obviously doomed strategies, like twitching wildly—the kind of thing that an untrained robot will do. A robot with no clear curriculum and no clear rewards accomplishes little more than hurting itself.
The robots of our imagination—RoboCop, the Terminator—are much sturdier than humans, but most real robots are delicate. “If you use a robot arm to knock a table or push something, it is likely to break,” Rich Walker, whose company, Shadow Robot, made the hand that OpenAI used in its Rubik’s Cube experiments, told me. “Long-running reinforcement-learning experiments are abusing to robots. Untrained policies are torture.” This turns out to profoundly limit how much they can learn. A breakable robot can’t explore the physical world like a baby can. (Babies are surprisingly tough, and parents usually intervene before they can swallow toys or launch themselves off the bed.)
For the past several years, Shadow Robot has been developing what looks like a medieval gauntlet with three fingers, all of which are opposable, like thumbs. A layer of gel under the “skin” of the fingertips is decorated with tiny dots that are filmed by an embedded camera; the pattern deforms under pressure. This helps the robot’s “brain” sense when a finger touches something, and how firmly. Shadow’s original hand needed to be re-started or serviced every few hours, but this one has been run for hundreds of hours at a time. Walker showed me a video of the fingers surviving blows from a mallet.
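One rough way to picture how a deforming dot pattern becomes a sense of touch: compare where the dots sit at rest with where they sit now, and treat the average shift as a stand-in for pressure. This is a toy sketch, not Shadow Robot's software; the function names, the pixel coordinates, and the contact threshold are assumptions, and it presumes the dots have already been located and matched between frames.

```python
import math


def mean_displacement(rest_dots, current_dots):
    """Average distance, in pixels, that each dot has moved from its resting spot."""
    total = sum(math.dist(r, c) for r, c in zip(rest_dots, current_dots))
    return total / len(rest_dots)


def read_touch(rest_dots, current_dots, contact_threshold_px=0.5):
    """Return (in_contact, shift): whether the gel is pressed, and roughly how hard.

    The threshold is an illustrative guess; a real sensor would need calibration.
    """
    shift = mean_displacement(rest_dots, current_dots)
    return shift > contact_threshold_px, shift


rest = [(0.0, 0.0), (10.0, 0.0)]
pressed = [(3.0, 4.0), (10.0, 0.0)]  # the first dot has moved five pixels
print(read_touch(rest, pressed))
```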
On a recent video call, I saw a few of the new Shadow hands in one of Google DeepMind’s labs in London, hanging inside enclosures like caged squid. The fingers were in constant motion, fast enough that they almost blurred. I watched one of the hands pick up a Lego-like yellow block and attempt to slot it into a matching socket. For a person, the task is trivial, but a single three-fingered robotic hand struggles to reposition the block without dropping it. “It’s a very unstable task by construction,” Francesco Nori, the engineering lead of DeepMind’s robotics division, explained. With just three digits, you frequently need to break contact with the block and reëstablish it again, as if tossing it between your fingers. Subtle changes in how tightly you grip the block affect its stability. To demonstrate, Nori put his phone between his thumb and forefinger, and as he loosened his grip it spun without falling. “You need to squeeze enough on the object, but not too much, because you need to reorient the object in your hand,” he said.
At first, the researchers asked operators to don three-fingered gloves and train their policy with imitation learning, ALOHA style. But the operators got tired after thirty minutes, and there was something un-ergonomic about operating a hand that was only sort of like your own. Different operators solved the task in different ways; the policy they trained had only a two-per-cent success rate. The range of possible moves was too large. The robot didn’t know what to imitate.
The team turned instead to reinforcement learning. They taught the robot to mine successful simulations in a clever way—by slicing each demonstration into a series of sub-tasks. The robot then practiced the sub-tasks, moving from those that were easier to those that were harder. In effect, the robot followed its own curriculum. Trained this way, the robot learned more from less data; sixty-four per cent of the time, it fit the block into the socket.
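The idea of slicing a demonstration into sub-tasks and practicing them in order of difficulty can be sketched as follows. This is a guess at the general shape of such a curriculum, not DeepMind's method. In particular, it assumes that slices nearer the end of a demonstration are easier because they start closer to the goal, and the `attempt` callback, the success threshold, and the trial counts are hypothetical.

```python
from typing import Callable, List, Sequence


def slice_demo(demo: Sequence[int], n_subtasks: int) -> List[Sequence[int]]:
    """Cut one demonstration (a sequence of states) into contiguous chunks."""
    size = -(-len(demo) // n_subtasks)  # ceiling division
    return [demo[i:i + size] for i in range(0, len(demo), size)]


def run_curriculum(subtasks: List[Sequence[int]],
                   attempt: Callable[[int], bool],
                   threshold: float = 0.8,
                   trials: int = 20,
                   max_rounds: int = 50) -> List[int]:
    """Practice from each chunk's start state, easiest first; return mastered starts.

    attempt(start_state) stands in for running the policy from that state
    and reporting whether it reached the goal.
    """
    mastered = []
    # Assumption: later chunks begin nearer the goal, so they are tried first.
    for subtask in reversed(subtasks):
        start = subtask[0]
        for _ in range(max_rounds):
            wins = sum(attempt(start) for _ in range(trials))
            if wins / trials >= threshold:
                mastered.append(start)
                break
        else:
            break  # stuck on this sub-task; do not move on to harder ones
    return mastered


demo = list(range(10))            # ten states from one successful run
chunks = slice_demo(demo, 3)      # three sub-tasks
print(run_curriculum(chunks, attempt=lambda s: True))
```

In a real system the policy would be updated between rounds, so that a sub-task failed at first could eventually be mastered; here `attempt` is a fixed stand-in to keep the sketch short.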
When the team first started running their policy, the block was bright yellow. But the task has been performed so many times that dust and metal from the robot’s fingers have blackened the edges. “This data is really valuable,” Maria Bauza, a research scientist on the project, said. The data would refine their simulation, which would improve the real-life policy, which would refine the simulation even more. Humans wouldn’t have to be anywhere in the loop.
At Google, as at many of the leading academic and industrial research labs, you can start to feel as if you’re in a droid repair shop in “Star Wars.” In Mountain View, while I was watching one of the ALOHAs in action, a friendly-looking little wheeled bot, reminiscent of something from “WALL-E,” stood by. Around the corner was a gigantic pair of arms, which a researcher on the project described as capable of breaking bones “without too much difficulty.” (The robot has safeguards to prevent it from doing so.) It was stacking blocks—a sort of super-ALOHA. The London lab is home to a team of twenty-inch-high humanoid soccer bots. Historically, every make and model of robot was an island: the code you used to control one couldn’t control another. But researchers are now dreaming of a day when a single artificial intelligence can control any type of robot.
Computer scientists used to develop different models to translate between, say, English and French or French and Spanish. Eventually, these converged into models that could translate between any pair of languages. Still, translation was considered a different problem than something like speech transcription or image recognition. Each had its own research teams or companies devoted to it. Then large language models came along. Shockingly, they could not only translate languages but also pass a bar exam, write computer code, and more besides. The hodgepodge melted into a single A.I., and the learning accelerated. The latest version of ChatGPT can talk to you aloud in dozens of languages, on any topic, and sing to you, and even gauge your tone. Anything it can do, it can do better than stand-alone models once dedicated to that individual task.
The same thing is happening in robotics. For most of the history of the field, you could write an entire dissertation about a narrow subfield such as vision, planning, locomotion, or the really hard one, dexterity. But “foundation models” like GPT-4 have largely subsumed models that help robots with planning and vision, and locomotion and dexterity will probably soon be subsumed, too. This is even becoming true across different “embodiments.” Recently, a large consortium of researchers showed that data can be shared successfully from one kind of machine to another. In “Transformers,” the same brain controls Optimus Prime whether he’s a humanoid or a truck. Now imagine that it can also control an industrial arm, a fleet of drones, or a four-legged cargo robot.
The human brain is plastic when it comes to the machinery it can command: even if you have never used a prosthetic limb, you have probably felt a wrench or a tennis racquet become like an extension of your body. Drive past a double-parked car and you know, intuitively, whether your passenger-side mirror is likely to get clipped. There’s every reason to believe that a future generation of A.I. will acquire the motor plasticity of a real brain. “Ultimately, what we will see is like one intelligence,” Keerthana Gopalakrishnan, a research scientist who works on robots at Google DeepMind, told me. To this end, Figure, the humanoid startup, has partnered with OpenAI to give large language models corporeal form; OpenAI has begun hiring a robotics team after a years-long hiatus.
Chelsea Finn, a Stanford robotics professor who contributed to the early development of ALOHA, worked at Google for several years. But she left the company not long ago to co-found a startup called Physical Intelligence, which aims to build software that can control any robot. (Driess, who’d shown me the ALOHAs, joined her.) About a month ago, Physical Intelligence announced its first “generalist robot policy.” In a video, a two-armed robot empties a clothes dryer into a basket, wheels the basket over to a table, and then folds shirts and shorts and places them in a stack. “The first time I saw the robot fold five items in a row from a laundry basket, it was probably the most excited I’ve been about a research result,” Finn told me. The A.I. driving this remarkable display, called π₀, can reportedly control half a dozen different embodiments, and can with one policy solve multiple tasks that might challenge an ALOHA: bagging groceries, assembling a box, clearing a dinner table. It works by combining a ChatGPT-esque model, which has broad knowledge of the world and can understand images, with imitation learning. “It’s definitely just the start,” Finn said.
When we think about a future with robots, we tend to imagine Rosie, from “The Jetsons”: a humanoid doing chores. But the robot revolution isn’t going to end with people-shaped machines folding shirts. I live in New York City, and almost everything I can see was made by human hands. Central Park looks natural, but it was once a mostly featureless swamp. Thousands of laborers spent years creating the reservoir, the lake, the rolling hills. Their hands pushed shovels into the ground to build hillsides, lit fuses to blast rock away, and nestled saplings into the soil.
A few years ago, at a recycling center near the airport in Zurich, Switzerland, a very large hand was at work. It was an autonomous excavator, developed by researchers at ETH Zürich, and it was building a retaining wall. With a hydraulic gripper on the end of its arm, it picked up a boulder, turning it as if contemplating a piece of fruit. The excavator motored toward a growing pile—the wall-to-be, which followed a plan laid out in software—and an algorithm predicted how the new stone would settle onto the others. The excavator, loosening its grip, placed the stone just so, then lumbered back to pick up another. When the sixty-five-metre wall was finished, it contained almost a thousand boulders and pieces of reclaimed concrete. It formed the edge of a new park. The robot worked about as quickly as an experienced laborer with an excavator.
Ryan Luke Johns, a lead researcher on the project, runs a company called Gravis Robotics, whose motto is “Tap your finger, move a mountain.” He foresees that “adaptive reuse” of materials could displace concrete, and that construction will become cheaper and more charming. Robots could make new Central Parks. It’s easy to see the appeal—and to imagine the risks of loosing so much strength upon the world. Already we have found A.I. difficult to control. For safety reasons, chatbots are restricted from producing certain kinds of content—misinformation, pornography, instructions for building bioweapons—but they are routinely “jailbroken” by amateurs with simple prompts. If an A.I. that talks about weapons is dangerous, picture an A.I. that is a weapon: a humanoid soldier, a sniper drone, a bomb that can think. If robotics models turn out to be embodiment-agnostic, the same kind of policy that today beats people at Ping-Pong might someday shoot somebody. “The drone manufacturers are dealing with this now,” one M.I.T. scientist told me. “They can say, ‘We will only sell to certain folks, and we will never sell a drone with a weapon.’ That doesn’t really stop somebody from, you know . . .” In the war in Ukraine, consumer drones designed for aerial photography have been turned into remote-controlled explosives. If such drones become autonomous, militaries could claim that they did not order this or that attack—their robots did. “You cannot punish an inanimate object,” Noel Sharkey, an emeritus professor of computer science at the University of Sheffield, in England, has written. “Being able to allocate responsibility is essential to the laws of war.” It is estimated that more than ninety countries have military robot programs, mostly involving drones. Several of the world’s leading military powers have not agreed to a U.N. resolution that could constrain their use of these robots.
Peaceful robots could unsettle our lives, too. I spoke to the founder of a small startup that is developing a semi-autonomous humanoid housekeeping robot. The idea is that, when you’re at work, the robot could wheel out of your closet and tidy up; if anything goes wrong, an operator in India or the Philippines could take over. This approach could save a lot of time and money. On the other hand, it could take jobs away from people. When I asked what would become of housekeepers who make a living doing such work locally, the founder said that they could apply to receive dividends. “There is an incentive within capitalism to replace labor by capital, to replace people by machines,” Mark Coeckelbergh, a philosophy professor at the University of Vienna, who specializes in the ethics of A.I., told me. He pointed out that the word “robot” comes from the Czech robota, for “forced labor.” “But not all tasks should be taken over by robots. We have it in our hands. It’s kind of an exercise to think, What kinds of jobs do we want humans to do?”
Speculating on the future of A.I.-powered robots is like trying to imagine the Industrial Revolution from the perspective of a nineteenth-century hatmaker. We are just too used to physical know-how being confined in one body. I remember where I was when I first learned to spin a pen: in an empty classroom in Mason Hall, on the University of Michigan campus. I had seen a friend do it, then practiced. It took a few hours. If other people want to learn the same trick, they also have to practice. But, if roboticists lift physical know-how into the virtual plane, they will be able to distribute it as easily as a new smartphone app. Once one robot has learned how to tie shoes, all of them can do it. Imagine copying and pasting not just a recipe for an omelette but the very act of making it.
Early in my son’s life, he had a blood test come up wonky, and we had to take him to a series of blood draws. It is not easy to draw blood from the arm of an eight-week-old. In the midst of one fairly horrific episode, we protested so much that one phlebotomist said to another, “Should we get Marsha?,” speaking of a nurse who was particularly good at finding a vein. In Marsha came, and she found the vein without further fuss. She should have her hands insured.
One day, an A.I. will guide a whirring hand made of metal, perhaps with gel in its fingertips, toward a newborn’s arm to draw blood. It’s hard to know whether to celebrate or fear that day. I may never have to reckon with it. I suspect my son will, though. When the thought occurs to me, I put his little hand in mine, and squeeze. ♦