
Seeing the whole from some of the parts

By drawing on past experience, humans can often perceive depth in photographs that are, themselves, perfectly flat. Getting computers to do the same thing, however, has proved quite challenging.

The problem is difficult for several reasons, one being that information is inevitably lost when a scene that takes place in three dimensions is reduced to a two-dimensional (2D) representation. There are some well-established strategies for recovering 3D information from multiple 2D images, but they each have some limitations. A new approach called "virtual correspondence," which was developed by researchers at MIT and other institutions, can get around some of these shortcomings and succeed in cases where conventional methodology falters.

The standard approach, called "structure from motion," is modeled on a key aspect of human vision. Because our eyes are separated from each other, they each offer slightly different views of an object. A triangle can be formed whose sides consist of the line segment connecting the two eyes, plus the line segments connecting each eye to a common point on the object in question. Knowing the angles in the triangle and the distance between the eyes, it's possible to determine the distance to that point using elementary geometry (although the human visual system, of course, can make rough judgments about distance without having to go through arduous trigonometric calculations). This same basic idea of triangulation, or parallax, has been exploited by astronomers for centuries to calculate the distance to faraway stars.
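The geometry described above can be sketched in a few lines of code. This is only a minimal illustration, not the researchers' software: given the baseline between two viewpoints and the angle each sight line makes with that baseline, elementary trigonometry recovers the perpendicular distance to the common point.

```python
import math

def triangulate_distance(baseline, angle_left, angle_right):
    """Perpendicular distance from the baseline to a target point.

    Place the left viewpoint at the origin and the right viewpoint at
    (baseline, 0). If the point sits at (x, d), then
    tan(angle_left) = d / x and tan(angle_right) = d / (baseline - x);
    eliminating x gives the closed form returned below.
    Angles are in radians, measured between each sight line and the baseline.
    """
    tl, tr = math.tan(angle_left), math.tan(angle_right)
    return baseline * tl * tr / (tl + tr)
```

With both angles at 45 degrees and a baseline of 1 unit, the function returns 0.5, matching the symmetric triangle one can draw by hand.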

Triangulation is a key element of structure from motion. Suppose you have two pictures of an object (a sculpted figure of a rabbit, for instance), one taken from the left side of the figure and the other from the right. The first step would be to find points or pixels on the rabbit's surface that both images share. A researcher could go from there to determine the "poses" of the two cameras: the positions where the photos were taken from and the direction each camera was facing. Knowing the distance between the cameras and the way they were oriented, one could then triangulate to work out the distance to a selected point on the rabbit. And if enough common points are identified, it might be possible to obtain a detailed sense of the object's (here, the rabbit's) overall shape.
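Once the two camera poses are known, the triangulation step reduces to intersecting two rays in 3D. The sketch below is an illustrative midpoint method, not the team's actual pipeline: it takes each camera's center and its viewing ray toward the matched pixel, and returns the midpoint of the shortest segment between the two rays, since in practice noisy rays rarely intersect exactly.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate the 3D point seen by two cameras.

    Each ray is c + t*d (camera center c, viewing direction d). We solve
    for the parameters t1, t2 that minimize the distance between the two
    ray points, then return the midpoint of that shortest segment.
    """
    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    D, E = dot(d1, w), dot(d2, w)
    denom = a * c - b * b          # zero only if the rays are parallel
    t1 = (b * E - c * D) / denom
    t2 = (a * E - b * D) / denom
    p1 = [ci + t1 * di for ci, di in zip(c1, d1)]
    p2 = [ci + t2 * di for ci, di in zip(c2, d2)]
    return [(x + y) / 2 for x, y in zip(p1, p2)]
```

When the two rays do intersect, the midpoint collapses to the intersection point itself; repeating this for every matched pixel pair yields the cloud of 3D points that describes the object's shape.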

Considerable progress has been made with this technique, comments Wei-Chiu Ma, a PhD student in MIT's Department of Electrical Engineering and Computer Science (EECS), "and people are now matching pixels with greater and greater accuracy. So long as we can observe the same point, or points, across different images, we can use existing algorithms to determine the relative positions between cameras." But the approach only works if the two images have a large overlap. If the input images have very different viewpoints, and hence contain few, if any, points in common, he adds, "the system may fail."

During summer 2020, Ma came up with a novel way of doing things that could greatly expand the reach of structure from motion. MIT was closed at the time due to the pandemic, and Ma was home in Taiwan, relaxing on the couch. While looking at the palm of his hand, and his fingertips in particular, he realized that he could clearly picture his fingernails, even though they were not visible to him.

That was the inspiration for the notion of virtual correspondence, which Ma has subsequently pursued with his advisor, Antonio Torralba, an EECS professor and investigator at the Computer Science and Artificial Intelligence Laboratory, along with Anqi Joyce Yang and Raquel Urtasun of the University of Toronto and Shenlong Wang of the University of Illinois. "We want to incorporate human knowledge and reasoning into our existing 3D algorithms," Ma says, the same reasoning that enabled him to look at his fingertips and conjure up fingernails on the other side, the side he could not see.

Structure from motion works when two images have points in common, because that means a triangle can always be drawn connecting the cameras to the common point, and depth information can thereby be gleaned from that. Virtual correspondence offers a way to carry things further. Suppose, once again, that one photo is taken from the left side of a rabbit and another photo is taken from the right side. The first photo might reveal a spot on the rabbit's left leg. Because light travels in a straight line, one could use general knowledge of the rabbit's anatomy to work out where a light ray going from the camera to the leg would emerge on the rabbit's other side. That point may be visible in the other image (taken from the right-hand side) and, if so, it could be used via triangulation to compute distances in the third dimension.
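As a toy illustration of that reasoning (the actual system encodes its shape priors in neural networks rather than closed-form geometry, and a sphere stands in here for the rabbit), one can extend the camera's ray through an assumed spherical shape and compute where it would exit the far side:

```python
import math

def exit_point_on_sphere(camera, surface_pt, center, radius):
    """Extend the ray camera -> surface_pt through a spherical shape prior
    and return the point where it exits the far side of the sphere.

    A perfect sphere is a stand-in for whatever shape knowledge is
    available; the real method would use a learned prior instead.
    """
    # Unit direction of the ray from the camera through the visible point.
    d = [s - c for s, c in zip(surface_pt, camera)]
    n = math.sqrt(sum(x * x for x in d))
    d = [x / n for x in d]
    # Solve |camera + t*d - center|^2 = radius^2; the larger root t is
    # the far intersection, i.e. where the ray comes out the other end.
    oc = [a - b for a, b in zip(camera, center)]
    b = 2 * sum(x * y for x, y in zip(d, oc))
    c = sum(x * x for x in oc) - radius * radius
    t = (-b + math.sqrt(b * b - 4 * c)) / 2
    return [a + t * x for a, x in zip(camera, d)]
```

The returned "virtual" point on the unseen far side is what gets matched against the second image, supplying the common point that triangulation needs.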

Virtual correspondence, in other words, allows one to take a point from the first image on the rabbit's left flank and connect it with a point on the rabbit's unseen right flank. "The advantage here is that you don't need overlapping images to proceed," Ma notes. "By looking through the object and coming out the other end, this technique provides points in common to work with that weren't initially available." And in that way, the constraints imposed on the conventional method can be circumvented.

One might ask how much prior knowledge is needed for this to work, because if you had to know the shape of everything in the image from the outset, no calculations would be required. The trick that Ma and his colleagues employ is to use certain familiar objects in an image, such as the human form, to serve as a kind of "anchor," and they've devised methods for using our knowledge of the human shape to help pin down the camera poses and, in some cases, infer depth within the image. In addition, Ma explains, "the prior knowledge and common sense that is built into our algorithms is first captured and encoded by neural networks."

The team's ultimate goal is far more ambitious, Ma says. "We want to make computers that can understand the three-dimensional world just like humans do." That objective is still far from realization, he acknowledges. "But to go beyond where we are today, and build a system that acts like humans, we need a more challenging setting. In other words, we need to develop computers that can not only interpret still images but can also understand short video clips and eventually full-length movies."

A scene in the film "Good Will Hunting" demonstrates what he has in mind. The audience sees Matt Damon and Robin Williams from behind, sitting on a bench that overlooks a pond in Boston's Public Garden. The next shot, taken from the opposite side, offers frontal (though fully clothed) views of Damon and Williams with an entirely different background. Everyone watching the movie immediately knows they're watching the same two people, even though the two shots have nothing in common. Computers can't make that conceptual leap yet, but Ma and his colleagues are working hard to make these machines more adept and, at least when it comes to vision, more like us.

The teamโ€™s work will be presented next week at the Conference on Computer Vision and Pattern Recognition.
