
AI edges closer to understanding 3D space the way we do

If I show you a single picture of a room, you can tell me right away that there’s a table with a chair in front of it, they’re probably about the same size, about this far from each other, with the walls this far away: enough to draw a rough map of the room. Computer vision systems don’t have this intuitive understanding of space, but the latest research from DeepMind brings them closer than ever before.

The new paper from the Google-owned research outfit was published today in the journal Science (complete with news item). It details a system whereby a neural network, knowing practically nothing, can look at one or two static 2D images of a scene and reconstruct a reasonably accurate 3D representation of it. We’re not talking about going from snapshots to full 3D images (Facebook’s working on that) but rather replicating the intuitive and space-conscious way that all humans view and analyze the world.

When I say it knows practically nothing, I don’t mean it’s just some standard machine learning system. Most computer vision algorithms work via what’s called supervised learning, in which they ingest a great deal of data that’s been labeled by humans with the correct answers, for example images with everything in them outlined and named.

This new system, on the other hand, has no such knowledge to draw on. It works entirely independently of any ideas of how to see the world as we do, like how objects’ colors change towards their edges, how they get bigger and smaller as their distance changes, and so on.

It works, roughly speaking, like this. One half of the system is its “representation” part, which can observe a given 3D scene from some angle, encoding it in a complex mathematical form called a vector. Then there’s the “generative” part, which, based only on the vectors created earlier, predicts what a different part of the scene would look like.
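
To make that two-part flow a little more concrete, here is a minimal Python sketch. It is not DeepMind’s code. The image size, the three-number viewpoint, the vector length, the single-layer networks and the random (untrained) weights are all assumptions made for illustration, as is the choice to combine the per-observation vectors by adding them together. The predicted image is therefore meaningless; the point is only to show what goes in and what comes out of each half.

```python
import math
import random

IMAGE_SIZE = 16   # hypothetical: a flattened 4x4 grayscale image
VIEW_SIZE = 3     # hypothetical: camera x, y and heading
VECTOR_SIZE = 8   # hypothetical length of the scene vector


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights; a real system would train these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def apply_layer(weights, inputs):
    """One dense layer followed by a tanh nonlinearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def represent(enc_weights, image, viewpoint):
    """Representation part: encode one observation (image + viewpoint) as a vector."""
    return apply_layer(enc_weights, image + viewpoint)


def aggregate(vectors):
    """Combine per-observation vectors into one scene vector (element-wise sum)."""
    return [sum(values) for values in zip(*vectors)]


def generate(gen_weights, scene_vector, query_viewpoint):
    """Generative part: predict the view from a new viewpoint using only the scene vector."""
    return apply_layer(gen_weights, scene_vector + query_viewpoint)


rng = random.Random(0)
enc_weights = random_matrix(VECTOR_SIZE, IMAGE_SIZE + VIEW_SIZE, rng)
gen_weights = random_matrix(IMAGE_SIZE, VECTOR_SIZE + VIEW_SIZE, rng)

# Two made-up observations of the same scene from different camera poses.
observations = [
    ([rng.random() for _ in range(IMAGE_SIZE)], [0.0, 0.0, 0.0]),
    ([rng.random() for _ in range(IMAGE_SIZE)], [1.0, 0.5, 1.57]),
]

scene_vector = aggregate(
    [represent(enc_weights, image, view) for image, view in observations]
)
predicted_image = generate(gen_weights, scene_vector, [0.5, 1.0, 3.14])
```

Note that `generate` never sees the original images, only the scene vector and the viewpoint it is asked about. That matches the description above, where the generative part works “based only on the vectors created earlier.”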

(A video showing a bit more of how this works is available here.)

Think of it like someone handing you a couple of pictures of a room, then asking you to draw what you’d see if you were standing in a specific spot in it. Again, this is simple enough for us, but computers have no natural ability to do it; their sense of sight, if we can call it that, is extremely rudimentary and literal, and of course machines lack imagination.

Yet there are few better words that describe the ability to say what’s behind something when you can’t see it.

“It was not at all clear that a neural network could ever learn to create images in such a precise and controlled manner,” said lead author of the paper, Ali Eslami, in a release accompanying the paper. “However we found that sufficiently deep networks can learn about perspective, occlusion and lighting, without any human engineering. This was a super surprising finding.”

It also allows the system to accurately recreate a 3D object from a single viewpoint, such as the blocks shown here:

I’m not sure I could do that.

Obviously there’s nothing in any single observation to rule out some part of the blocks extending forever away from the camera. But the system creates a plausible version of the block structure regardless, one that is accurate in every way. Adding one or two more observations requires the system to reconcile multiple views, but results in an even better representation.

This kind of ability is especially critical for robots, because they have to navigate the real world by sensing it and reacting to what they see. With limited information, such as some important clue that’s temporarily hidden from view, they can freeze up or make illogical choices. But with something like this in their robotic brains, they could make reasonable assumptions about, say, the layout of a room without having to ground-truth every inch.

“Although we need more data and faster hardware before we can deploy this new type of system in the real world,” Eslami said, “it takes us one step closer to understanding how we may build agents that learn by themselves.”
