Machine Learning Boils Down Stories that Wearable Cameras Tell

Computers will someday soon automatically provide short video digests of a day in your life, your family vacation or an eight-hour police patrol, say computer scientists at The University of Texas at Austin.

The researchers are working to develop tools to help make sense of the vast quantities of video that are going to be produced by wearable camera technology such as Google Glass and Looxcie.

“The amount of what we call ‘egocentric’ video, which is video that is shot from the perspective of a person who is moving around, is about to explode,” said Kristen Grauman, associate professor of computer science in the College of Natural Sciences. “We’re going to need better methods for summarizing and sifting through this data.”

Grauman and her colleagues developed a technique that uses machine learning to automatically analyze recorded videos and assemble a better short “story” of the footage than existing methods can.

Better video summarization should prove important in helping military commanders manage data coming in from soldiers’ cameras, investigators sift through cellphone video in the wake of disasters like the Boston Marathon bombing, and senior citizens use video summaries of their days to compensate for memory loss, said Grauman.

“There’s research showing that if people suffering from memory loss wear a camera that takes a snapshot once a minute, and then they review those images at the end of the day, it can help their recall,” said Grauman. “That’s pretty inspiring. What if instead of images that were selected just because they were a minute apart, they had a video or photographic summary that was selected because it told a good story? Maybe that would help even more. That’s the kind of thing we’re hoping to achieve.”

Grauman, her postdoc Lu Zheng and doctoral student Yong Jae Lee presented their method, which they call “story-driven” video summarization, at the IEEE Conference on Computer Vision and Pattern Recognition this summer.

Their findings are based on video amassed by volunteers wearing commercially available Looxcie cameras, which cost about $200, record five hours of video at a stretch, connect to smartphones and fit in the ear much as a large Bluetooth earpiece does.

“The task is to take a very long video and automatically condense it into very short video clips, or a series of stills, that convey the essence of the story,” said Grauman. “To do that, though, we first have to ask: What makes a good visual story? Our answer is that beyond displaying important persons, objects and scenes, it must also convey how one thing leads to the next.”

To tackle the challenge, Grauman and her colleagues took a two-step approach. The first step involved using machine learning techniques to teach their system to “score” the significance of objects in view based on egocentric factors, such as how often the objects appear in the center of the frame, which is a good proxy for the camera wearer’s gaze, and whether they are touched by the wearer’s hands.

“If you give us a region in the video, then we will give back an importance level, based on all those properties that we have extracted and learned how to combine,” said Grauman. “So at that point you can select frames that will maximize the importance.”
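To make the scoring step concrete, here is a minimal Python sketch. It is an illustration only, not the researchers' implementation: the `Region` fields, the fixed `WEIGHTS` and the linear combination are all assumptions made for this example, whereas the system described above learns how to combine its cues from data.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One candidate object region in one frame (hypothetical fields)."""
    frame: int             # index of the frame the region was found in
    center_offset: float   # 0.0 = dead center of the frame, 1.0 = at the edge
    touched_by_hand: bool  # whether the wearer's hand overlaps the region
    frequency: float       # fraction of nearby frames where the object appears


# Hypothetical hand-set weights; a learned system would fit these to labeled data.
WEIGHTS = {"centrality": 0.5, "hand": 0.3, "frequency": 0.2}


def importance(region):
    """Combine the egocentric cues into a single importance score."""
    centrality = 1.0 - region.center_offset
    hand = 1.0 if region.touched_by_hand else 0.0
    return (WEIGHTS["centrality"] * centrality
            + WEIGHTS["hand"] * hand
            + WEIGHTS["frequency"] * region.frequency)


def top_frames(regions, k):
    """Return the k frames with the most important region, in temporal order."""
    best = {}
    for r in regions:
        best[r.frame] = max(best.get(r.frame, 0.0), importance(r))
    ranked = sorted(best, key=lambda f: best[f], reverse=True)[:k]
    return sorted(ranked)


regions = [
    Region(frame=0, center_offset=0.9, touched_by_hand=False, frequency=0.1),
    Region(frame=1, center_offset=0.1, touched_by_hand=True, frequency=0.8),
    Region(frame=2, center_offset=0.5, touched_by_hand=False, frequency=0.5),
    Region(frame=3, center_offset=0.2, touched_by_hand=True, frequency=0.2),
]
print(top_frames(regions, k=2))
```

In this toy data, the frames whose objects are both near the center and touched by the wearer rank highest, which is the intuition behind using gaze and hand contact as cues.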

The next step was to take those important frames from throughout the video and look for early ones that influence later ones. To do that, they adapted a method developed by researchers at Carnegie Mellon University that could predict how one news article leads to another, assembling a series of articles to transition from a starting point to a known end point.

For the text work, researchers used word frequencies and correlations across articles to quantify influence. For the video work, Grauman and Lu used their significant objects and frames to do the same. Then they were able to identify a chain of video clips that efficiently filled in the story from beginning to end.
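The idea of chaining clips can be sketched in a few lines of Python. This is a simplified stand-in rather than the published method: it assumes each clip is described by a dictionary of object importance scores, defines a hypothetical `influence` function as the overlap in those scores, and rates a chain by its weakest link, so that no step in the story is an abrupt jump.

```python
def influence(a, b):
    """Hypothetical influence of clip a on clip b: overlap in object importance."""
    return sum(min(w, b[obj]) for obj, w in a.items() if obj in b)


def best_chain(clips, k):
    """Pick k clips in temporal order, from the first clip to the last,
    so that the weakest link between consecutive picks is as strong as possible."""
    n = len(clips)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of clips")
    neg = float("-inf")
    # score[j][m]: best weakest-link value of a chain of m + 1 clips
    # that starts at clip 0 and ends at clip j.
    score = [[neg] * k for _ in range(n)]
    prev = [[None] * k for _ in range(n)]
    score[0][0] = float("inf")
    for j in range(1, n):
        for m in range(1, k):
            for i in range(j):
                if score[i][m - 1] == neg:
                    continue  # no chain of that length ends at clip i
                cand = min(score[i][m - 1], influence(clips[i], clips[j]))
                if cand > score[j][m]:
                    score[j][m] = cand
                    prev[j][m] = i
    # Walk the back-pointers from the last clip to recover the chain.
    chain, j = [], n - 1
    for m in range(k - 1, -1, -1):
        chain.append(j)
        j = prev[j][m]
    return chain[::-1], score[n - 1][k - 1]


clips = [
    {"keys": 0.9, "door": 0.5},
    {"door": 0.6, "car": 0.8},
    {"tv": 0.7},
    {"car": 0.9, "store": 0.4},
    {"store": 0.8, "milk": 0.9},
]
chain, weakest = best_chain(clips, k=4)
print(chain, weakest)
```

In this toy example the clip that shows only a television shares no objects with its neighbors, so the chain skips it in favor of clips linked by the door, the car and the store.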

“We ran human ‘taste tests’ comparing our method to previous methods,” said Grauman, “and between 75 and 90 percent of people evaluating the summaries, depending on the datasets and method being compared, found that our system is superior.”

Grauman said that as video summarization techniques continue to improve, they will become invaluable aids not just to people with very specialized needs, like police investigators and those suffering from memory loss, but to everyday Web surfers as well.

“My hope is that we’ll be able to get video browsing much closer to what we experience with image browsing,” she said. “Consider browsing 50 images on a webpage. It’s manageable, since you can scroll down and see them all in one pass. Now imagine trying to browse 50 videos online. It’s simply not efficient. We need summarization algorithms in order to improve video search considerably.”
