Can Computers See Better than Humans?
Vision is one of the most important ways we learn about the world. Of the five human senses – sight, hearing, touch, taste and smell – vision is arguably the most complex. From the moment we're born and first open our eyes, we take in millions of images every single day.
As computers have become more advanced, there's been a drive to make them more capable of learning, so that certain tasks can be automated more efficiently. This field is known as machine learning, a branch of artificial intelligence (AI). Computer vision is an important aspect of machine learning, because enabling computers to "see" and analyze images quickly also enables them to understand the information contained in those images.
In every image, there are numerous pieces of data that make up our understanding of what that image is. Consider the image below. As humans, we look at this image and see a cat playing with a toy mouse.
We know this because we have seen thousands of images of cats during our lives, and we've probably seen cats in real life – maybe we even have a cat living in our own house. We also know the cat is playing with a toy mouse because we've seen that object and behavior before. We recognize everything in the photo, and it makes sense to us.
But a computer must be trained to understand what’s going on in the photo. It has to be shown thousands (if not millions) of images of cats. Even once it can recognize cats, how will it understand what the cat is playing with? It must also be shown pictures of actual mice and mouse toys to understand that the cat is playing with a toy and not preying on an actual mouse. And to recognize the stone fireplace or rug or hardwood floor in the background, the computer must learn what those things look like, too.
Data Is Key
All of this requires lots of machine learning. Each of these pieces of information in the image can be coded as metadata – information that describes or categorizes the components of the image.
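As a simple illustration (the field names and confidence values here are invented, not from any particular vision system), the metadata for the cat photo might be represented as a set of structured tags that downstream systems can filter and search:

```python
# Hypothetical metadata record for the cat photo described above.
# Each entry labels one component of the image, with a confidence
# score a vision model might assign (values are illustrative).
image_metadata = {
    "objects": [
        {"label": "cat", "confidence": 0.97},
        {"label": "toy mouse", "confidence": 0.88},
        {"label": "stone fireplace", "confidence": 0.75},
    ],
    "scene": "indoor",
    "activity": "playing",
}

# Keep only the high-confidence labels a search index might store.
search_tags = [
    obj["label"]
    for obj in image_metadata["objects"]
    if obj["confidence"] >= 0.8
]
```

Once the image's components are expressed as data like this, cataloguing, sorting and searching become ordinary database operations.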
As ubiquitous as cat photos are on the internet, the implications of computer vision extend much further, particularly as we start to think about how it can be used in streaming media. Metadata embedded in photos and videos can be catalogued, sorted, tagged and organized into searchable libraries that algorithms can use to make all sorts of decisions automatically.
Some early versions of this kind of vision have been in place for several years. Take security cameras: for at least ten years, we’ve been able to program a security camera to recognize when someone crosses a boundary, which triggers a security alert.
Once alerted, the person with access to the camera feed can determine if the person is doing something acceptable or unacceptable. In the future, these applications may not need human monitoring, because a computer will be able to determine an appropriate course of action based on what it sees in the live video feed. A real-time analysis will be powered by a computer vision algorithm trained on millions of images and scenarios.
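The core of that boundary alert is simple to sketch. This is an assumed, minimal version of the logic (not a real camera API): an alert fires when a tracked object's position moves from one side of a virtual line to the other between frames.

```python
def crossed_boundary(prev_y: float, curr_y: float, boundary_y: float) -> bool:
    """Return True if an object moved across the horizontal boundary
    line between two consecutive frames."""
    return (prev_y < boundary_y) != (curr_y < boundary_y)

# Tracked y-coordinate of one object over five frames (made-up data).
positions = [120, 140, 160, 210, 230]
boundary = 200

# Frame indices at which an alert would be raised.
alerts = [
    i for i in range(1, len(positions))
    if crossed_boundary(positions[i - 1], positions[i], boundary)
]
```

Real systems layer object detection and tracking on top of this, but the trigger itself is just a comparison like the one above.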
Computer vision requires lots of data. It must repeatedly analyze images until it understands the content and context in such a way that it can recognize them. Once a model is built, however, that model can be replicated and reused on that computer or another. Computers don’t have to learn to recognize cats every time they are presented with an image of one.
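The point about reuse can be made concrete: a trained model is ultimately just data, so it can be serialized on one machine and loaded on another without retraining. This sketch uses a toy dictionary as a stand-in for real learned weights:

```python
import io
import pickle

# Toy stand-in for a trained model: in practice this would be a
# neural network's learned weights; here it is just a dict, to show
# that a trained model is data that can be saved and shared.
trained_model = {"classes": ["cat", "mouse"], "weights": [0.1, -0.3, 0.7]}

# Serialize the model once...
buffer = io.BytesIO()
pickle.dump(trained_model, buffer)

# ...and any other machine can deserialize and reuse it,
# skipping the expensive training step entirely.
buffer.seek(0)
reloaded = pickle.load(buffer)
```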
In the not-too-distant future, computers will be able to see just as well as humans, and even better in some cases. The advent of cloud computing has made the computing horsepower required for these complex computations far more accessible. You can train your models using vast cloud resources, and combine them with existing models to train faster and at lower cost.
Other Use Cases for Computer Vision
Obviously, cat photos are only one small example. Computer vision will unlock numerous use cases over the coming years. Autonomous vehicles, for instance, will use data from computer vision in conjunction with data from radar, sonar and LiDAR to aid navigation and safe operation in traffic. These cameras and sensors will work together in real time to identify other vehicles, pedestrians and infrastructure in order to help avoid collisions and improve traffic flow.
In the world of media and entertainment, computer vision can be trained to help identify, classify and differentiate user-generated content. For instance, a computer will be able to tell the difference between a video featuring travel tips for New York City, and another by the same creator that features a piano lesson. This type of automated categorization will dramatically improve search results and has obvious benefits for advertisers as well.
In the context of a large content provider like Netflix, with a known database of existing content, computer vision can help with categorization and search. For example, you may be interested in films set in the 1880s. Metadata capturing historical landmarks, costumes and similar details would allow the service to offer a list of films set in that period, based on an advanced analysis of the entire library. This can result in much better personalization and recommendations than content providers offer today.
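To illustrate (with entirely invented titles and metadata fields), once each title carries period metadata, a query like "films set in the 1880s" becomes a simple filter over the library:

```python
# Hypothetical library records with period metadata attached by a
# computer vision pipeline (all data invented for illustration).
library = [
    {"title": "Film A", "period_start": 1880, "period_end": 1889},
    {"title": "Film B", "period_start": 1920, "period_end": 1925},
    {"title": "Film C", "period_start": 1875, "period_end": 1885},
]

def set_in_decade(film: dict, decade_start: int) -> bool:
    """True if the film's period overlaps the given decade."""
    decade_end = decade_start + 9
    return (film["period_start"] <= decade_end
            and film["period_end"] >= decade_start)

matches = [f["title"] for f in library if set_in_decade(f, 1880)]
```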
Over time, this will have impacts on content production as well, because content producers and advertisers will be able to create richer content from the start, based on a knowledge of which computer vision models will be running on their database.
Content Adaptive Encoding
Today, most video is encoded using standard formats such as H.266/VVC, H.265/HEVC or H.264/AVC. These encoding formats specify the compression that will be used to convert the video from a raw format to a compressed format that allows it to be transmitted more quickly (and cheaply) over the internet.
All of these traditional formats rely on block-based compression techniques that treat video in largely the same way, regardless of its content. A newer approach is now being pioneered, known as Content Adaptive Encoding.
In this method, video can be encoded differently based on what’s in it. For example, video of a lecture with a college professor sitting still in front of a simple background can be encoded differently than a complex music video filmed from six different cameras and featuring twenty-five performers and a dynamic light show.
As computer vision becomes more mature, it will recognize and interpret all the various aspects of an image, including the people, locations, and other things living and inanimate in the frame. From these, and their relationships to each other, context and meaning can be inferred and put to use in various ways.
One way video is compressed is by analyzing changes from frame to frame. In our lecture example, the picture barely changes: the professor's mouth moves as she talks, she blinks and glances at notes on the lectern, but the background stays almost entirely still.
But in a music video, there is near-constant motion: multiple camera angles, dancers and musicians moving constantly and lights changing color and intensity. We can encode these videos very differently because of their different content, and gain efficiencies.
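The frame-to-frame idea can be sketched in a few lines. This is an illustrative toy, not a real codec: each "frame" is a short list of grayscale pixel values, and we measure how much one frame differs from the previous one. Low-difference frames can be stored as small deltas instead of full pictures.

```python
def mean_abs_diff(frame_a: list, frame_b: list) -> float:
    """Average per-pixel change between two grayscale frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

# Tiny made-up "videos": a nearly static lecture vs. constant motion.
lecture = [[10, 10, 10, 10], [10, 11, 10, 10], [10, 11, 11, 10]]
music = [[0, 255, 0, 255], [255, 0, 255, 0], [40, 200, 10, 90]]

lecture_motion = mean_abs_diff(lecture[1], lecture[0])
music_motion = mean_abs_diff(music[1], music[0])
```

The lecture's tiny per-frame differences are exactly what lets an encoder spend very few bits on it, while the music video's large differences demand far more.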
When encoding an entire feature-length movie (which is a huge video file, even once compressed), you can encode quieter, more still scenes differently than those with lots of action and movement. A content-adaptive approach to encoding will use AI and computer vision to determine the best and most efficient type of encoding for each scene so the file achieves the optimal balance of high quality and efficient streaming.
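A minimal sketch of that per-scene decision might look like the following. The thresholds and bitrates are invented for illustration; real content-adaptive encoders use far richer per-scene analysis than a single motion score.

```python
def bitrate_for_scene(motion_score: float) -> int:
    """Map a scene's motion score (0.0-1.0) to a target bitrate
    in kbps. Thresholds and values are illustrative only."""
    if motion_score < 0.2:
        return 1500   # quiet, static scene: compress aggressively
    if motion_score < 0.6:
        return 3000   # moderate motion: middle ground
    return 6000       # high-action scene: spend more bits on quality

# Hypothetical motion scores for three scenes of a film:
# a lecture-like scene, a dialogue scene, an action sequence.
scenes = [0.05, 0.4, 0.9]
encoding_plan = [bitrate_for_scene(s) for s in scenes]
```

The result is an encoding plan where each scene gets the bits it actually needs, rather than one setting for the whole file.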
Essentially, computer vision is about generating richer metadata about video content. That metadata unlocks opportunities for personalization, recommendation and monetization, in addition to better engagement and retention.