The Golden Age of Computer Vision

It is an exciting yet overwhelming time to be a computer vision researcher. Last Tuesday, I had the privilege to give the opening remarks at CVPR 2019 to an audience of 9277 attendees. As one of four program chairs, my job was to manage paper decision process with 132 area chairs, 2887 reviewers, and 14104 authors submitting 5160 papers, and to organize the program of 1296 posters and 288 talks. This was the largest computer vision conference in history, but there's another conference in just four months. So much is happening - who can keep up?
Number of submitted (blue) and accepted (green) papers in CVPR by year.

Computer vision is no longer just an academic pursuit. Billions of dollars are being spent on applications from smarter cameras to autonomous driving. Most professors are spending at least half their time in industry, and even freshly minted PhD students can reap salaries well into the six figures. But is it a bubble? How do we separate the breakthroughs ripe for commercialization from the hyped up proofs of concept?

First, let's very briefly review how we got here:

  • 1963: Robert's classic "Blocks World" paper constructs 3D objects from images, using carefully engineered features and rules.
  • 1981: Lucas and Kanade propose, in just six pages, effective algorithms for motion tracking and stereo vision. Advances in geometric vision and image processing follow.
  • 1996: Rowley, Baluja, and Kanade describe the first modern object detection, a neural network trained to detect faces. Digital images proliferate, and data replaces rules.
  • 2012: Millions of labeled images and GPU processing provide the fuel for Krizhevsky, Sutskever, and Hinton to prove the power of deep learning, achieving half the error of competing approaches. Data replaces hand-crafted features.
  • 2019: Face recognition, body tracking, and detection of common objects work like magic. Depth predictions from single images look great. But only noobs try to solve a problem with less than 100,000 labeled images. A data annotation industry is born.

So, here is the dirty open secret of computer vision's success: it's memorization, not intelligence. Let's look at single view depth prediction as an example. I cut my teeth on this problem in 2005, proposing the first method to automatically create 3D models from outdoor images. The key was learning to "recognize" geometry by labeling pixels into ground, vertical, and support, and using perspective geometry rules to construct a simple model of scene geometry. It worked about 30% of the time.

An early approach to single-view 3D reconstruction: a little data, hand-crafted features, and some math.

3D from single view is a popular topic now, with ~35 papers at CVPR 2019 alone. There are methods to produce scene layout from panoramas, object meshes from an image, and depth maps from one view. However, as pointed out by our group and UCI in 2018, and researchers at Freiburg and Intel at CVPR 2019, many methods that seem to interpret geometry are actually just memorizing during learning and retrieving examples similar to the input to make predictions. Predicted 3D models may look good, but the methods don't generalize to novel shapes or scenes.

So let's consider two problems of great interest to construction:

  1. Depth from image. Wouldn't it be great if you could just snap a picture in the field and send it to the office for 3D measurement and QA/QC? Goodbye expensive laser scanners and cumbersome photogrammetry. Nice dream, now open your eyes. Matterport recently publicized depth prediction from 360 panoramas, an impressive feat of data collection and machine learning. The relative depths are pretty good, and the edges are in the right places. The requirement to put a Ricoh Theta on a tripod at a known height takes out some of the variation due to unknown camera parameters and pose. But it's still not quantitatively accurate enough for use, and the encoder-decoder strategy is a form of memorization, so prediction in highly varied construction scenes is likely to be error-prone for a long time to come. At Reconstruct, we recently rolled out 3D reconstruction from 360 video, which works reliably due to good old-fashioned correspondence and optimization. For now, 3D is best left to the drone and video capture and scanners. That said, I am very excited about the potential of combining deep methods for single view recognition and segmentation with multiview methods of producing precise geometry.
  2. Automated construction progress monitoring. At Reconstruct, we align point clouds and images to BIM, so it should be easy to automatically to compare built to plan and assess progress, right? We've got patents and papers on some basic methods, but it's not as easy as it looks, and not ready for prime time. The big challenges are the sheer variety of building elements and tasks, incomplete observations, and the need to assess both geometry and material properties(e.g. sheetrock vs. painted wall), combined with the challenge of obtaining labeled data. Some have claimed to have automatic progress monitoring, but lacking the data and expertise, I don't find those claims credible for broad application. But, with the right data and recent advances in semantic segmentation, this could be achievable in the next year or two, at least for coarsely measuring work put in place.

In summary, if someone claims to have newly solved a hard recognition or prediction problem, ask yourself this question: Do they have enough data, just like the type that I care about, that their method could have memorized all the answers? This requires that (1) they have TONS of data; (2) they have spent millions on annotation, or have an automated way to get supervision (e.g. Matterport depth scanners); (3) the prediction problem is simple enough and your domain limited enough that it's likely to be covered by their data and labels. There's a reason for the multi-billion dollar image annotation industry, and, so far, there's no substitute for data.