Three Fundamental Dimensions for Thinking About Machine Learning Systems

April 4, 2015, 7:08 am

≫ Next: Deep Learning vs Probabilistic Graphical Models vs Logic

≪ Previous: Venture Pitch Contest at CVPR 2015 in Boston, MA

Today, let's set cutting-edge machine learning and computer vision techniques aside. You probably already know that computer vision (or "machine vision") is the branch of computer science / artificial intelligence concerned with recognizing objects like cars, faces, and hand gestures in images. And you also probably know that Machine Learning algorithms are used to drive state-of-the-art computer vision systems. But what's missing is a birds-eye view of how to think about designing new learning-based systems. So instead of focusing on today's trendiest machine learning techniques, let's go all the way back to day 1 and build ourselves a strong foundation for thinking about machine learning and computer vision systems.

Allow me to introduce three fundamental dimensions which you can follow to obtain computer vision masterdom. The first dimension is mathematical, the second is verbal, and the third is intuitive.

On a personal level, most of my daily computer vision activities directly map onto these dimensions. When I'm at a coffee shop, I prefer the mathematical - pen and paper are my weapons of choice. When it's time to get ideas out of my head, there's nothing like a solid founder-founder face-to-face meeting, an occasional MIT visit to brainstorm with my scientist colleagues, or simply rubberducking (rubber duck debugging) with developers. And when it comes to engineering, interacting with a live learning system can help develop the intuition necessary to make a system more powerful, more efficient, and ultimately much more robust.

Mathematical: Learn to love the linear classifier

At the core of machine learning is mathematics, so you shouldn't be surprised that I include mathematical as one of the three fundamental dimensions of thinking about computer vision.

The single most important concept in all of machine learning which you should master is the idea of the classifier. For some of you, classification is a well-understood problem; however, too many students prematurely jump into more complex algorithms line randomized decision forests and multi-layer neural networks, without first grokking the power of the linear classifier. Plenty of data scientists will agree that the linear classifier is the most fundamental machine learning algorithm. In fact, when Peter Norvig, Director of Research at Google, was asked "Which AI field has surpassed your expectations and surprised you the most?" in his 2010 interview, he answered with "machine learning by linear separators."

The illustration below depicts a linear classifier. In two dimensions, a linear classifier is a line which separates the positive examples from the negative examples. You should first master the 2D linear classifier, even though in most applications you'll need to explore a higher-dimensional feature space. My personal favorite learning algorithm is the linear support vector machine, or linear SVM. In a SVM, overly-confident data points do not influence the decision boundary. Or put in another way, learning with these confident points is like they aren't even there! This is a very useful property for large-scale learning problems where you can't fit all data into memory. You're going to want to master the linear SVM (and how it relates to Linear Discriminant Analysis, Linear Regression, and Logistic Regression) if you're going to pass one of my whiteboard data-science interviews.

Linear Support Vector Machine from the SVM Wikipedia page

An intimate understanding of the linear classifier is necessary to understand how deep learning systems work. The neurons inside a multi-layer neural network are little linear classifiers, and while the final decision boundary is non-linear, you should understand the underlying primitives very well. Loosely speaking, you can think of the linear classifier as a simple spring system and a more complex classifiers as a higher-order assembly of springs.

Also, there are going to be scenarios in your life as a data-scientist where a linear classifier should be the first machine learning algorithm you try. So don't be afraid to use some pen and paper, get into that hinge loss, and master the fundamentals.

Further reading: Google's Research Director talks about Machine Learning. Peter Norvig's Reddit AMA on YouTube from 2010.
Further reading: A demo for playing with linear classifiers in the browser. Linear classifier Javascript demo from Stanford's CS231n: Convolutional Neural Networks for Visual Recognition.
Further reading: My blog post: Deep Learning vs Machine Learning vs Pattern Recognition

Verbal: Talk about you vision (and join a community)

As you start acquiring knowledge of machine learning concepts, the best way forward is to speak up. Learn something, then teach a friend. As counterintuitive as it sounds, when it comes down to machine learning mastery, human-human interaction is key. This is why getting a ML-heavy Masters or PhD degree is ultimately the best bet for those adamant about becoming pioneers in the field. Daily conversations are necessary to strengthen your ideas. See Raphael's "The School of Athens" for a depiction of what I think of as the ideal learning environment. I'm sure half of those guys were thinking about computer vision.

The School of Athens by Raphael

An ideal ecosystem for collaboration and learning about computer vision

If you're not ready for a full-time graduate-level commitment to the field, consider a.) taking an advanced undergraduate course in vision/learning from your university, b.) a machine learning MOOC, or c.) taking part in a practical and application-focused online community/course focusing on computer vision.

During my 12-year academic stint, I made the observation that talking to your peers about computer vision and machine learning is more important that listening to teachers/supervisors/mentors. Of course, there's much value in having a great teacher, but don't be surprised if you get 100x more face-to-face time with your friends compared to student-teacher interactions. So if you take an online course like Coursera's Machine Learning MOOC, make sure to take it with friends. Pause the video and discuss. Go to dinner and discuss. Write some code and discuss. Rinse, lather, repeat.

Coursera's Machine Learning MOOC taught by Andrew Ng

Another great opportunity is to follow Adrian Rosebrock's pyimagesearch.com blog, where he focuses on python and computer vision applications.

Further reading: Old blog post: Why your vision lab needs a reading group

Homework assignment: First somebody on the street and teach them about machine learning.

Intuitive: Play with a real-time machine learning system

The third and final dimension is centered around intuition. Intuition is the ability to understand something immediately, without the need for conscious reasoning. The following guidelines are directed towards real-time object detection systems, but can also transfer over to other applications like learning-based attribution models for advertisements, high-frequency trading, as well as numerous tasks in robotics.

To gain some true insights about object detection, you should experience a real-time object detection system. There's something unique about seeing a machine learning system run in real-time, right in front of you. And when you get to control the input to the system, such as when using a webcam, you can learn a lot about how the algorithms work. For example, seeing the classification score go down as you occlude the object of interest, and seeing the detection box go away when the object goes out of view is fundamental to building intuition about what works and what elements of a system need to improve.

I see countless students tweaking an algorithm, applying it to a static large-scale dataset, and then waiting for the precision-recall curve to be generated. I understand that this is the hard and scientific way of doing things, but unless you've already spent a few years making friends with every pixel, you're unlikely to make a lasting contribution this way. And it's not very exciting -- you'll probably fall asleep at your desk.

Using a real-time feedback loop (see illustration below), you can learn about the patterns which are intrinsically difficult to classify, as well what environmental variations (lights, clutter, motion) affect your system the most. This is something which really cannot be done with a static dataset. So go ahead, mine some intuition and play.

Visual Debugging: Designing the vision.ai real-time gesture-based controller in Fall 2013

Visual feedback is where our work at vision.ai truly stands out. Take a look at the following video, where we show a live example of training and playing with a detector based on vision.ai's VMX object recognition system.

NOTE: There a handful of other image recognition systems out there which you can turn into real-time vision systems, but be warned that optimization for real-time applications requires some non-trivial software engineering experience. We've put a lot of care into our system so that the detection scores are analogous to a linear SVM scoring strategy. Making the output of a non-trivial learning algorithm backwards-compatible with a linear SVM isn't always easy, but in my opinion, well-worth the effort.

Extra Credit: See comments below for some free VMX by vision.ai beta software licenses so you can train some detectors using our visual feedback interface and gain your own machine vision intuition.

Conclusion

The three dimensions, namely mathematical, verbal, and intuitive provide different ways for advancing your knowledge of machine learning and computer vision systems. So remember to love the linear classifier, talk to your friends, and use a real-time feedback loop when designing your machine learning system.

↧

Deep Learning vs Probabilistic Graphical Models vs Logic

April 8, 2015, 12:05 pm

≫ Next: Making Visual Data a First-Class Citizen

≪ Previous: Three Fundamental Dimensions for Thinking About Machine Learning Systems

Today, let's take a look at three paradigms that have shaped the field of Artificial Intelligence in the last 50 years: Logic, Probabilistic Methods, and Deep Learning. The empirical, "data-driven", or big-data / deep-learning ideology triumphs today, but that wasn't always the case. Some of the earliest approaches to AI were based on Logic, and the transition from logic to data-driven methods has been heavily influenced by probabilistic thinking, something we will be investigating in this blog post.

Let's take a look back Logic and Probabilistic Graphical Models and make some predictions on where the field of AI and Machine Learning is likely to go in the near future. We will proceed in chronological order.

Image from Coursera's Probabilistic Graphical Models course

1. Logic and Algorithms (Common-sense "Thinking" Machines)

A lot of early work on Artificial Intelligence was concerned with Logic, Automated Theorem Proving, and manipulating symbols. It should not be a surprise that John McCarthy's seminal 1959 paper on AI had the title "Programs with common sense."

If we peek inside one of most popular AI textbooks, namely "Artificial Intelligence: A Modern Approach," we immediately notice that the beginning of the book is devoted to search, constraint satisfaction problems, first order logic, and planning. The third edition's cover (pictured below) looks like a big chess board (because being good at chess used to be a sign of human intelligence), features a picture of Alan Turing (the father of computing theory) as well as a picture of Aristotle (one of the greatest classical philosophers which had quite a lot to say about intelligence).

The cover of AIMA, the canonical AI text for undergraduate CS students

Unfortunately, logic-based AI brushes the perception problem under the rug, and I've argued quite some time ago that understanding how perception works is really the key to unlocking the secrets of intelligence. Perception is one of those things which is easy for humans and immensely difficult for machines. (To read more see my 2011 blog post, Computer Vision is Artificial Intelligence). Logic is pure and traditional chess-playing bots are very algorithmic and search-y, but the real world is ugly, dirty, and ridden with uncertainty.

I think most contemporary AI researchers agree that Logic-based AI is dead. The kind of world where everything can be perfectly observed, a world with no measurement error, is not the world of robotics and big-data. We live in the era of machine learning, and numerical techniques triumph over first-order logic. As of 2015, I pity the fool who prefers Modus Ponens over Gradient Descent.

Logic is great for the classroom and I suspect that once enough perception problems become "essentially solved" that we will see a resurgence in Logic. And while there will be plenty of open perception problems in the future, there will be scenarios where the community can stop worrying about perception and start revisiting these classical ideas. Perhaps in 2020.

Further reading: Logic and Artificial Intelligence from the Stanford Encyclopedia of Philosophy

2. Probability, Statistics, and Graphical Models ("Measuring" Machines)

Probabilistic methods in Artificial Intelligence came out of the need to deal with uncertainty. The middle part of the Artificial Intelligence a Modern Approach textbook is called "Uncertain Knowledge and Reasoning" and is a great introduction to these methods. If you're picking up AIMA for the first time, I recommend you start with this section. And if you're a student starting out with AI, do yourself a favor and don't skimp on the math.

Intro to PDFs from Penn State's Probability Theory and Mathematical Statistics course

When most people think about probabilistic methods they think of counting. In laymen's terms it's fair to think of probabilistic methods as fancy counting methods. Let's briefly take a look at what used to be the two competing methods for thinking probabilistically.

Frequentist methods are very empirical -- these methods are data-driven and make inferences purely from data. Bayesian methods are more sophisticated and combine data-driven likelihoods with magical priors. These priors often come from first principles or "intuitions" and the Bayesian approach is great for combining heuristics with data to make cleverer algorithms -- a nice mix of the rationalist and empiricist world views.

What is perhaps more exciting that then Frequentist vs. Bayesian flamewar, is something known as Probabilistic Graphical Models. This class of techniques comes from computer science, and even though Machine Learning is now a strong component of a CS and a Statistics degree, the true power of statistics only comes when it is married with computation.

Probabilistic Graphical Models are a marriage of Graph Theory with Probabilistic Methods and they were all the rage among Machine Learning researchers in the mid 2000s. Variational methods, Gibbs Sampling, and Belief Propagation were being pounded into the brains of CMU graduate students when I was in graduate school (2005-2011) and provided us with a superb mental framework for thinking about machine learning problems. I learned most of what I know about Graphical Models from Carlos Guestrin and Jonathan Huang. Carlos Guestrin is now the CEO of GraphLab, Inc (now known as Dato) which is a company that builds large scale products for machine learning on graphs and Jonathan Huang is a senior research scientist at Google.

The video below is a high level overview of GraphLab, but it serves a very nice overview of "graphical thinking" and how it fits into the modern data scientist's tool-belt. Carlos is an excellent lecturer and his presentation is less about the company's product and more about ways for thinking about next generation machine learning systems.

A Computational Introduction to Probabilistic Graphical Modelsby GraphLab, Inc CEO Prof. Carlos Guestrin
If you think that deep learning is going to solve all of your machine learning problems, you should really take a look at the above video. If you're building recommender systems, an analytics platform for healthcare data, designing a new trading algorithm, or building the next generation search engine, Graphical Models are perfect place to start.

Further reading:
Belief Propagation Algorithm Wikipedia Page
An Introduction to Variational Methods for Graphical Models by Michael Jordan et al.
Michael Jordan's webpage (one of the titans of inference and graphical models)

3. Deep Learning and Machine Learning (Data-Driven Machines)

Machine Learning is about learning from examples and today's state-of-the-art recognition techniques require a lot of training data, a deep neural network, and patience. Deep Learning emphasizes the network architecture of today's most successful machine learning approaches. These methods are based on "deep" multi-layer neural networks with many hidden layers. NOTE: I'd like to emphasize that using deep architectures (as of 2015) is not new. Just check out the following "deep" architecture from 1998.

LeNet-5 Figure From Yann LeCun's seminal "Gradient-based learning

applied to document recognition" paper.

When you take a look at modern guide about LeNet, it comes with the following disclaimer:

"To run this example on a GPU, you need a good GPU. It needs at least 1GB of GPU RAM. More may be required if your monitor is connected to the GPU.

When the GPU is connected to the monitor, there is a limit of a few seconds for each GPU function call. This is needed as current GPUs can’t be used for the monitor while doing computation. Without this limit, the screen would freeze for too long and make it look as if the computer froze. This example hits this limit with medium-quality GPUs. When the GPU isn’t connected to a monitor, there is no time limit. You can lower the batch size to fix the time out problem."

It really makes me wonder how Yann was able to get anything out of his deep model back in 1998. Perhaps it's not surprising that it took another decade for the rest of us to get the memo.

UPDATE: Yann pointed out (via a Facebook comment) that the ConvNet work dates back to 1989. "It had about 400K connections and took about 3 weeks to train on the USPS dataset (8000 training examples) on a SUN4 machine." -- LeCun

A Deep Network from Yann's work at Bell Labs from 1989

NOTE: At roughly the same time (~1998) two crazy guys in California were trying to cache the entire internet inside the computers in their garage (they started some funny-sounding company which starts with a G). I don't know how they did it, but I guess sometimes to win big you have to do things that don't scale. Eventually the world will catch up.

Further reading:
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, Winter 1989

Deep Learning code: Modern LeNet implementation in Theano and docs.

Conclusion

I don't see traditional first-order logic making a comeback anytime soon. And while there is a lot of hype behind deep learning, distributed systems and "graphical thinking" is likely to make a much more profound impact on data science than heavily optimized CNNs. There is no reason why deep learning can't be combined with a GraphLab-style architecture, and some of the new exciting machine learning work in the next decade is likely to be a marriage of these two philosophies.

You can also check out a relevant post from last month:
Deep Learning vs Machine Learning vs Pattern Recognition

Discuss on Hacker News

↧

Making Visual Data a First-Class Citizen

April 24, 2015, 3:13 pm

≫ Next: Deep Learning vs Big Data: Who owns what?

≪ Previous: Deep Learning vs Probabilistic Graphical Models vs Logic

“Above all, don't lie to yourself. The man who lies to himself and listens to his own lie comes to a point that he cannot distinguish the truth within him, or around him, and so loses all respect for himself and for others. And having no respect he ceases to love.” ― Fyodor Dostoyevsky, The Brothers Karamazov

City Forensics: Using Visual Elements to Predict Non-Visual City Attributes

To respect the power and beauty of machine learning algorithms, especially when they are applied to the visual world, let's take a look at three recent applications of learning-based "computer vision" to computer graphics. Researchers in computer graphics are known for producing truly captivating illustrations of their results, so this post is going to be very visual. Now is your chance to sit back and let the pictures do the talking.

Can you predict things simply by looking at street-view images?

Let's say you're going to visit an old-friend in a foreign country for the first time. You've never visited this country before and have no idea what kind of city/neighborhood your friend lives in. So you decide to get a sneak peak -- you enter your friend's address into Google Street View.

Most people can look at Google Street View images in a given location and estimate attributes such as "sketchy,""rural,""slum-like,""noisy" for the given neighborhood. TLDR; A person is a pretty good visual recommendation engine.

Can you predict if this looks like a safe location?

(Screenshot of Street view for Manizales, Colombia on Google Earth)

Can a computer program predict things by looking at images? If so, then these kinds of computer programs could be used to automatically generate semantic map layovers (see the crime prediction overlay from the first figure), help organize fast-growing cities (computer vision meets urban planning?), and ultimately bring about a new generation of match-making "visual recommendation engines" (a whole suite of new startups).

Before I discuss the research paper behind this idea, here are two cool things you could do (in theory) with a non-visual data prediction algorithm. There are plenty of great product ideas in this space -- just be creative.

Startup Idea #1: Avoiding sketchy areas when traveling abroad
A Personalized location recommendation engine could be used to find locations in a city that I might find interesting (techie coffee shop for entrepreneurs, a park good for frisbee) subject to my constraints (near my current location, in a low-danger area, low traffic). Below is the kind of place you want to avoid if you're looking for a coffee and a place to open up your laptop to do some work.

Google Street Maps, Morumbi São Paulo: slum housing (image from geographyfieldwork.com)

Startup Idea #2: Apartment Pricing and Marketing from Images
Visual recommendation engines could be used to predict the best images to represent an apartment for an Airbnb listing. It would be great if Airbnb had a filter that would let you upload videos of your apartment, and it would predict that set of static images that best depict your apartment to maximize earning potential. I'm sure that Airbnb users would pay extra for this feature if it was available for a small extra charge. The same computer vision prediction idea can be applied to home pricing on Zillow, Craigslist, and anywhere else that pictures of for-sale items are shared.

Google image search result for "Good looking apartment". Can computer vision be used to automatically select pictures that will make your apartment listing successful on Airbnb?

Part I. City Forensics: Using Visual Elements to Predict Non-Visual City Attributes

The Berkeley Computer Graphics Group has been working on predicting non-visual attributes from images, so before I describe their approach, let me discuss how Berkeley's Visual Elements relate to Deep Learning.

Predicting Chicago Thefts from San Francisco data. Predicting Philadelphia Housing Prices from Boston data. From City Forensics paper.

Deep Learning vs Mid-level Patch Discovery (Technical Discussion)
You might think that non-visual data prediction from images (if even possible) will require a deep understanding of the image and thus these approaches must be based on a recent ConvNet deep learning method. Obviously, knowing the locations and categories associated with each object in a scene could benefit any computer vision algorithm. The problem is that such general purpose CNN recognition systems aren't powerful enough to parse Google Street View images, at least not yet.

Another extreme is to train classifiers on entire images. This was initially done when researchers were using GIST, but there are just too many nuisance pixels inside a typical image, so it is better to focus your machine learning a subset of the image. But how do you choose the subset of the image to focus on?

There exist computer vision algorithms that can mine a large dataset of images and automatically extract meaningful, repeatable, and detectable mid-level visual patterns. These methods are not label-based and work really well when there is an underlying theme tying together a collection of images. The set of all Google Street View Images from Paris satisfies this criterion. Large collections of random images from the internet must be labeled before they can be used to produce the kind of stellar results we all expect out of deep learning. The Berkeley Group uses visual elements automatically mined from images as the core representation. Mid-level visual patterns are simply chunks of the image which correspond to repeatable configurations -- they sometimes contain entire objects, parts of objects, and popular multiple object configurations. (See Figure below) The mid-level visual patterns form a visual dictionary which can be used to represent the set of images. Different sets of images (e.g., images from two different US cities) will have different mid-level dictionaries. These dictionaries are similar to "Visual Words" but their creation uses more SVM-like machinery.

The patch mining algorithm is known as mid-level patch discovery. You can think of mid-level patch discovery as a visually intelligent K-means clustering algorithm, but for really really large datasets. Here's a figure from the ECCV 2012 paper which introduced mid-level discriminative patches.

Unsupervised Discovery of Mid-Level Discriminative Patches

Unsupervised Discovery of Mid-Level Discriminative Patches. Saurabh Singh, Abhinav Gupta and Alexei A. Efros. In European Conference on Computer Vision (2012).

I should also point out that non-final layers in a pre-trained CNN could also be used for representing images, without the need to use a descriptor such as HOG. I would expect the performance to improve, so the questions is perhaps: How long until somebody publishes an awesome unsupervised CNN-based patch discovery algorithm? I'm a handful of researchers are already working on it. :-)

Related Blog Post: From feature descriptors to deep learning: 20 years of computer vision

The City Forensics paper from Berkeley tries to map the visual appearance of cities (as obtained from Google Street View Images) to non-visual data like crime statistics, housing prices and population density. The basic idea is to 1.) mine discriminative patches from images and 2.) train a predictor which can map these visual primitives to non-visual data. While the underlying technique is that of mid-level patch discovery combined with Support Vector Regression (SVR), the result is an attribute-specific distribution over GPS coordinates. Such a distribution should be appreciated for its own aesthetic value. I personally love custom data overlays.

City Forensics: Using Visual Elements to Predict Non-Visual City Attributes. Sean Arietta, Alexei A. Efros, Ravi Ramamoorthi, Maneesh Agrawala. In IEEE Transactions on Visualization and Computer Graphics (TVCG), 2014.

Part II. The Selfie 2.0: Computer Vision as a Sidekick

Sometimes you just want the algorithm to be your sidekick. Let's talk about a new and improved method for using vision algorithms and the wisdom of the crowds to select better pictures of your face. While you might think of an improved selfie as a silly application, you do want to look "professional" in your professional photos, sexy in your "selfies" and "friendly" in your family pictures. An algorithm that helps you get the desired picture is an algorithm the whole world can get behind.

Attractiveness versus Time. From MirrorMirror Paper.

The basic idea is to collect a large video of a single person which spans different emotions, times of day, different days, or whatever condition you would like to vary. Given this video, you can use crowdsourcing to label frames based on a property like attractiveness or seriousness. Given these labeled frames, you can then train a standard HOG detector and predict one of these attributes on new data. Below if a figure which shows the 10 best shots of the child (lots of smiling and eye contact) and the worst 10 shots (bad lighting, blur, red-eye, no eye contact).

10 good shots, 10 worst shots. From MirrorMirror Paper.

You can also collect a video of yourself as you go through a sequence of different emotions, get people to label frames, and build a system which can predict an attribute such as "seriousness".

Faces ranked from Most serious to least serious. From MirrorMirror Paper.

In this work, labeling was necessary for taking better selfies. But if half of the world is taking pictures, while the other half is voting pictures up and down (or Tinder-style swiping left and right), then I think the data collection and data labeling effort won't be a big issue in years to come. Nevertheless, this is a cool way of scoring your photos. Regarding consumer applications, this is something that Google, Snapchat, and Facebook will probably integrate into their products very soon.

Mirror Mirror: Crowdsourcing Better Portraits. Jun-Yan Zhu, Aseem Agarwala, Alexei A. Efros, Eli Shechtman and Jue Wang. In ACM Transactions on Graphics (SIGGRAPH Asia), 2014.

Part III. What does it all mean? I'm ready for the cat pictures.

This final section revisits an old, simple, and powerful trick in computer vision and graphics. If you know how to compute the average of a sequence of numbers, then you'll have no problem understanding what an average image (or "mean image") is all about. And if you're read this far, don't worry, the cat picture is coming soon.

Computing average images (or "mean" images) is one of those tricks that I was introduced to very soon after I started working at CMU. Antonio Torralba, who has always had "a few more visualization tricks" up his sleeve, started computing average images (in the early 2000s) to analyze scenes as well as datasets collected as part of the LabelMe project at MIT. There's really nothing more to the basic idea beyond simply averaging a bunch of pictures.

Teaser Image from AverageExplorer paper.

Usually this kind of averaging is done informally in research, to make some throwaway graphic, or make cool web-ready renderings. It's great seeing an entire paper dedicated to a system which explores the concept of averaging even further. It took about 15 years of use until somebody was bold enough to write a paper about it. When you perform a little bit of alignment, the mean pictures look really awesome. Check out these cats!

Aligned cat images from the AverageExplorer paper.

I want one! (Both the algorithm and a Platonic cat)

The AverageExplorer paper extends simple image average with some new tricks which make the operations much more effective. I won't say much about the paper (the link is below), just take at a peek at some of the coolest mean cats I've ever seen (visualized above) or a jaw-dropping way to look at community collected landmark photos (Oxford bridge mean image visualized below).

Aligned bridges from AverageExplorer paper.

I wish Google would make all of Street View look like this.

AverageExplorer: Interactive Exploration and Alignment of Visual Data Collections. Jun-Yan Zhu, Yong Jae Lee, and Alexei A. Efros. In SIGGRAPH 2014.

Averaging images is a really powerful idea. Want to know what your magical classifier is tuned to detect? Compute the top detections and average them. Soon enough you'll have a good idea of what's going on behind the scenes.

Conclusion

Allow me to mention the mastermind that helped bring most of these vision+graphics+learning applications to life. There's an inimitable charm present in all of the works of Prof. Alyosha Efros -- a certain aesthetic that is missing from 2015's overly empirical zeitgeist. He used to be at CMU, but recently moved back to Berkeley.

Being able to summarize several of years worth of research into a single computer generated graphic can go a long way to making your work memorable and inspirational. And maybe our lives don't need that much automation. Maybe general purpose object recognition is too much? Maybe all we need is a little art? I want to leave you with a YouTube video from a recent 2015 lecture by Professor A.A. Efros titled "Making Visual Data a First-Class Citizen." If you want to hear the story in the master's own words, grab a drink and enjoy the lecture.

"Visual data is the biggest Big Data there is (Cisco projects that it will soon account for over 90% of internet traffic), but currently, the main way we can access it is via associated keywords. I will talk about some efforts towards indexing, retrieving, and mining visual data directly, without the use of keywords."― A.A. Efros, Making Visual Data a First-Class Citizen

↧

Deep Learning vs Big Data: Who owns what?

May 5, 2015, 12:11 pm

≫ Next: Dyson 360 Eye and Baidu Deep Learning at the Embedded Vision Summit in Santa Clara

≪ Previous: Making Visual Data a First-Class Citizen

In order to learn anything useful, large-scale multi-layer deep neural networks (aka Deep Learning systems) require a large amount of labeled data. There is clearly a need for big data, but only a few places where big visual data is available. Today we'll take a look at one of the most popular sources of big visual data, peek inside a trained neural network, and ask ourselves some data/model ownership questions. The fundamental question to keep in mind is the following, "Are the learned weights of a neural network derivate works of the input images?" In other words, when deep learning touches your data, who owns what?

Background: The Deep Learning "Computer Vision Recipe"
One of today's most successful machine learning techniques is called Deep Learning. The broad interest in Deep Learning is backed by some remarkable results on real-world data interpretation tasks dealing with speech[1], text[2], and images[3]. Deep learning and object recognition techniques have been pioneered by academia (University of Toronto, NYU, Stanford, Berkeley, MIT, CMU, etc), picked up by industry (Google, Facebook, Snapchat, etc), and are now fueling a new generation of startups ready to bring visual intelligence to the masses (Clarifai.com, Metamind.io, Vision.ai, etc). And while it's still not clear where Artificial Intelligence is going, Deep Learning will be a key player.

Related blog post: Deep Learning vs Machine Learning vs Pattern Recognition
Related blog post: Deep Learning vs Probabilistic Graphical Models vs Logic

For visual object recognition tasks, the most popular models are Convolutional Neural Networks (also known as ConvNets or CNNs). They can be trained end-to-end without manual feature engineering, but this requires a large set of training images (sometimes called big data, or big visual data). These large neural networks start out as a Tabula Rasa (or "blank slate") and the full system is trained in an end-to-end fashion using a heavily optimized implementation of Backpropagation (informally called "backprop"). Backprop is nothing but the chain rule you learned in Calculus 101 and today's deep neural networks are trained in almost the same way they were trained in the 1980s. But today's highly-optimized implementations of backprop are GPU-based and can process orders of magnitude more data than was approachable in the pre-internet pre-cloud pre-GPU golden years of Neural Networks. The output of the deep learning training procedure is a set of learned weights for the different layers defined in the model architecture -- millions of floating point numbers representing what was learned from the images. So what's so interesting about the weights? It's the relationship between the weights and the original big data, that will be under scrutiny today.

"Are weights of a trained network based on ImageNet a derived work, a cesspool of millions of copyright claims? What about networks trained to approximate another ImageNet network?"
[This question was asked on HackerNews by kastnerkyle in the comments of A Revolutionary Technique That Changed Machine Vision.]

In the context of computer vision, this question truly piqued my interest, and as we start seeing robots and AI-powered devices enter our homes I expect much more serious versions of this question to arise in the upcoming decade. Let's see how some of these questions are being addressed in 2015.

1. ImageNet: Non-commercial Big Visual Data

Let's first take a look at the most common data source for Deep Learning systems designed to recognize a large number of different objects, namely ImageNet[4]. ImageNet is the de-facto source of big visual data for computer vision researchers working on large scale object recognition and detection. The dataset debuted in a 2009 CVPR paper by Fei-Fei Li's research group and was put in place to replace both PASCAL datasets (which lacked size and variety) and LabelMe datasets (which lacked standardization). ImageNet grew out of Caltech101 (a 2004 dataset focusing on image categorization, also pioneered by Fei-Fei Li) so personally I still think of ImageNet as something like "Stanford10^N". ImageNet has been a key player in organizing the scale of data that was required to push object recognition to its new frontier, the deep learning phase.

ImageNet has over 15 million images in its database as of May 1st, 2015.

Problem: Lots of extremely large datasets are mined from internet images, but these images often come with their own copyright. This prevents collecting and selling such images, and from a commercial point of view, when creating such a dataset, some care has to be taken. For research to keep pushing the state-of-the-art on real-world recognition problems, we have to use standard big datasets (representative of what is found in the ~~real-world~~ internet), foster a strong sense of community centered around sharing results, and maintain the copyrights of the original sources.

Solution: ImageNet decided to publicly provide links to the dataset images so that they can be downloaded without having to be hosted on an University-owned server. The ImageNet website only serves the image thumbnails and provides a copyright infringement clause together with instructions where to file a DMCA takedown notice. The dataset organizers provide the entire dataset only after signing a terms of access, prohibiting commercial use. See the ImageNet clause below (taken on May 5th, 2015).

"ImageNet does not own the copyright of the images. ImageNet only provides thumbnails and URLs of images, in a way similar to what image search engines do. In other words, ImageNet compiles an accurate list of web images for each synset of WordNet. For researchers and educators who wish to use the images for non-commercial research and/or educational purposes, we can provide access through our site under certain conditions and terms."

2. Caffe: Unrestricted Use Deep Learning Models

Now that we have a good idea where to download big visual data and an understanding of the terms that apply, let's take a look at the the other end of the spectrum: the output of the Deep Learning training procedure. We'll take a look at Caffe, one of the most popular Deep Learning libraries, which was engineered to handle ImageNet-like data. Caffe provides an ecosystem for sharing models (the Model Zoo), and is becoming an indispensable tool for today's computer vision researcher. Caffe is developed at the Berkeley Vision and Learning Center (BVLC) and by community contributors -- it is open source.

Slide from DIY Deep Learning for Vision with Caffe

Problem: As a project that started at a University, Caffe's goal is to be the de-facto standard for creating, training, and sharing Deep Learning models. The shared models were initially licensed for non-commercial use, but the problem is that a new wave of startups is using these techniques, so there must be a licensing agreement which allows Universities, large companies, and startups to explore the same set of pretrained models.

Solution: The current model licensing for Caffe is unrestricted use. This is really great for a broad range of hackers, scientists, and engineers. The models used to be shared with a non-commercial clause. Below is the entire model licensing agreement from the Model License section of Caffe (taken on May 5th, 2015).

"The Caffe models bundled by the BVLC are released for unrestricted use.

These models are trained on data from the ImageNet project and training data includes internet photos that may be subject to copyright.

Our present understanding as researchers is that there is no restriction placed on the open release of these learned model weights, since none of the original images are distributed in whole or in part. To the extent that the interpretation arises that weights are derivative works of the original copyright holder and they assert such a copyright, UC Berkeley makes no representations as to what use is allowed other than to consider our present release in the spirit of fair use in the academic mission of the university to disseminate knowledge and tools as broadly as possible without restriction."

3. Vision.ai: Dataset generation and training in your home

Deep Learning learns a summary of the input data, but what happens if a different kind of model memorizes bits and pieces of the training data? And more importantly what if there are things inside the memorized bits which you might not want shared with outsiders? For this case study, we'll look at Vision.ai, and their real-time computer vision server which is designed to simultaneously create a dataset and learn about an object's appearance. Vision.ai software can be applied to real-time training from videos as well as live webcam streams.

Instead of starting with big visual data collected from internet images (like ImageNet), the vision.ai training procedure is based on a person waving an object of interest in front of the webcam. The user bootstraps the learning procedure with an initial bounding box, and the algorithm continues learning hands-free. As the algorithm learns, it is stores a partial history of what it previously saw, effectively creating its own dataset on the fly. Because the vision.ai convolutional neural networks are designed for detection (where an object only occupies a small portion of the image), there is a large amount of background data presented inside the collected dataset. At the end of the training procedure you get both the Caffe-esque bit (the learned weights) and the ImageNet bit (the collected images). So what happens when it's time to share the model?

A user training a cup detector using vision.ai's real-time detector training interface

Problem: Training in your home means that potentially private and sensitive information is contained inside the backgrounds of the collected images. If you train in your home and make the resulting object model public, think twice about what you're sharing. Sharing can also be problematic if you have trained an object detector from a copyrighted video/images and want to share/sell the resulting model.

Solution: When you save a vision.ai model to disk, you get both a compiled model and the full model. The compiled model is the full model sans the images (thus much smaller). This allows you to maintain fully editable models on your local computer, and share the compiled model (essentially only the learned weights), without the chance of anybody else peeking into your living room. Vision.ai's computer vision server called VMX can run both compiled and uncompiled models; however, only uncompiled models can be edited and extended. In addition, vision.ai provides their vision server as a standalone install, so that all of the training images and computations can reside on your local computer. In brief, vision.ai's solution is to allow you to choose whether you want to run the computations in the cloud or locally, and whether you want to distribute full models (with background images) or the compiled models (solely what is required for detection). When it comes to sharing the trained models and/or created datasets, you are free to choose your own licensing agreement.

4. Open Problems for Licensing Memory-based Machine Learning Models

Deep Learning methods aren't the only techniques applicable to object recognition. What if our model was a Nearest-Neighbor classifier using raw RGB pixels? A Nearest Neighbor Classifier is a memory based classifier which memorizes all of the training data -- the model is the training data. It would be contradictory to license the same set of data differently if one day it was viewed as training data and another day as the output of a learning algorithm. I wonder if there is a way to reconcile the kind of restrictive non-commercial licensing behind ImageNet with the unrestricted licensing use strategy of Caffe Deep Learning Models. Is it possible to have one hacker-friendly data/model license agreement to rule them all?

Conclusion

Don't be surprised if neural network upgrades come as part of your future operating system. As we transition from a data economy (sharing images) to a knowledge economy (sharing neural networks), legal/ownership issues will pop up. I hope that the three scenarios I covered today (big visual data, sharing deep learning models, and training in your home) will help you think about the future legal issues that might come up when sharing visual knowledge. When AI starts generating its own art (maybe by re-synthesizing old pictures), legal issues will pop up. And when your competitor starts selling your models and/or data, legal issues will resurface. Don't be surprised if the MIT license vs. GPL license vs. Apache License debate resurges in the context of pre-trained deep learning models. Who knows, maybe AI Law will become the next big thing.

References
[1] Deep Speech: Accurate Speech Recognition with GPU-Accelerated Deep Learning: NVIDIA dev blog post about Baidu's work on speech recognition using Deep Learning. Andrew Ng is working with Baidu on Deep Learning.

[2] Text Understanding from Scratch: Arxiv paper from Facebook about end-to-end training of text understanding systems using ConvNets. Yann Lecun is working with Facebook on Deep Learning.

[3] ImageNet Classification with Deep Convolutional Neural Networks. Seminal 2012 paper from the Neural Information and Processing Systems (NIPS) conference which showed breakthrough performance from a deep neural network. Paper came out of University of Toronto, but now most of these guys are now at Google. Geoff Hinton is working with Google on Deep Learning.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. IEEE Computer Vision and Pattern Recognition (CVPR), 2009.

Jia Deng is now assistant professor at Michigan University and he is growing his research group. If you're interested in starting a PhD in deep learning and vision, check out his call for prospective students. This might be a younger version of Andrew Ng.

Richard Socher is the CTO and Co-Founder of MetaMind, and new startup in the Deep Learning space. They are VC-backed and have plenty of room to grow.

Jia Li is now Head of Research at Snapchat, Inc. I can't say much, but take a look at the recent VentureBeat article: Snapchat is quietly building a research team to do deep learning on images, videos. Jia and I overlapped at Google Research back in 2008.

Fei-Fei Li is currently the Director of the Stanford Artificial Intelligence Lab and the Stanford Vision Lab. See the article on Wired: If we want our machines to think, we need to teach them to see. Yann, you have some competition.

Yangqing Jia created the Caffe project during his PhD at UC Berkeley. He is now a research scientist at Google.

Tomasz Malisiewicz is the Co-Founder of Vision.ai, which focuses on real-time training of vision systems -- something which is missing in today's Deep Learning systems. Come say hi at CVPR.

↧

Dyson 360 Eye and Baidu Deep Learning at the Embedded Vision Summit in Santa Clara

May 6, 2015, 10:16 am

≫ Next: Deep down the rabbit hole: CVPR 2015 and beyond

≪ Previous: Deep Learning vs Big Data: Who owns what?

Bringing Computer Vision to the Consumer

Mike Aldred

Electronics Lead, Dyson Ltd

While vision has been a research priority for decades, the results have often remained out of reach of the consumer. Huge strides have been made, but the final, and perhaps toughest, hurdle is how to integrate vision into real world products. It’s a long road from concept to finished machine, and to succeed, companies need clear objectives, a robust test plan, and the ability to adapt when those fail.

Image from ExtremeTech: Dyson 360 Eye: Dyson’s ‘truly intelligent’ robotic vacuum cleaner is finally here

The Dyson 360 Eye robot vacuum cleaner uses computer vision as its primary localization technology. 10 years in the making, it was taken from bleeding edge academic research to a robust, reliable and manufacturable solution by Mike Aldred and his team at Dyson.

Mike Aldred’s keynote at next week's Embedded Vision Summit (May 12th in Santa Clara) will chart some of the high and lows of the project, the challenges of bridging between academia and business, and how to use a diverse team to take an idea from the lab into real homes.

Enabling Ubiquitous Visual Intelligence Through Deep Learning

Ren Wu

Distinguished Scientist, Baidu Institute of Deep Learning

Deep learning techniques have been making headlines lately in computer vision research. Using techniques inspired by the human brain, deep learning employs massive replication of simple algorithms which learn to distinguish objects through training on vast numbers of examples. Neural networks trained in this way are gaining the ability to recognize objects as accurately as humans. Some experts believe that deep learning will transform the field of vision, enabling the widespread deployment of visual intelligence in many types of systems and applications. But there are many practical problems to be solved before this goal can be reached. For example, how can we create the massive sets of real-world images required to train neural networks? And given their massive computational requirements, how can we deploy neural networks into applications like mobile and wearable devices with tight cost and power consumption constraints?

Ren Wu’s morning keynote at next week's Embedded Vision Summit (May 12th in Santa Clara) will share an insider’s perspective on these and other critical questions related to the practical use of neural networks for vision, based on the pioneering work being conducted by his team at Baidu.

Vision-as-a-Service: Democratization of Vision for Consumers and Businesses

Herman Yau

Co-founder and CEO, Tend

Hundreds of millions of video cameras are installed around the world—in businesses, homes, and public spaces—but most of them provide limited insights. Installing new, more intelligent cameras requires massive deployments with long time-to-market cycles. Computer vision enables us to extract meaning from video streams generated by existing cameras, creating value for consumers, businesses, and communities in the form of improved safety, quality, security, and health. But how can we bring computer vision to millions of deployed cameras? The answer is through “Vision-as-a-Service” (VaaS), a new business model that leverages the cloud to apply state-of-the-art computer vision techniques to video streams captured by inexpensive cameras. Centralizing vision processing in the cloud offers some compelling advantages, such as the ability to quickly deploy sophisticated new features without requiring upgrades of installed camera hardware. It also brings some tough challenges, such as scaling to bring intelligence to millions of cameras.

Image From Distributed Computing: Three Best-Use Cases

Herman Yau's talk at next week's Embedded Vision Summit (May 12th in Santa Clara) will explain the architecture and business model behind VaaS, show how it is being deployed in a wide range of real-world use cases, and highlight some of the key challenges and how they can be overcome.

Embedded Vision Summit on May 12th, 2015

There will be many more great presentations at the upcoming Embedded Vision Summit. From the range of topics, it looks like any startup with interest in computer vision will be able to benefit from attending. The entire day is filled with talks by great presenters (Gary Bradski will talk about the latest developments in OpenCV). You can see the list of speakers: Embedded Vision Summit 2015 List of speakers or the day's agenda Embedded Vision Summit 2015 Agenda.

Embedded Vision Summit 2015 Registration (249$ for the one day event + food)

Demos during lunch: The Technology Showcase at the Embedded Vision Summit will highlight demonstrations of technology for computer vision-based applications and systems from the following companies.

The vision topics covered will be: Deep Learning, CNNs, Business, Markets, Libraries, Standards, APIs, 3D Vision, and Processors. I will be there with my vision.ai team, together with some computer vision guys from KnitHealth, Inc, a new SF-based Health Vision Company. If you're interested in meeting with us, let's chat at the Vision Summit.

What kind of startups and companies should attend? Definitely robotics. Definitely vision sensors. Definitely those interested in deep learning hardware implementations. Seems like even half of the software engineers at Google could benefit from learning about their favorite deep learning algorithms being optimized for hardware.

↧

Deep down the rabbit hole: CVPR 2015 and beyond

June 26, 2015, 7:45 pm

≫ Next: The Deep Learning Gold Rush of 2015

≪ Previous: Dyson 360 Eye and Baidu Deep Learning at the Embedded Vision Summit in Santa Clara

CVPR is the premier Computer Vision conference, and it's fair to think of it as the Olympics of Computer Vision Research. This year it was held in my own back yard -- less than a mile away from lovely Cambridge, MA! Plenty of my MIT colleagues attended, but I wouldn't be surprised if Google had the largest showing at CVPR 2015. I have been going to CVPR almost every year since 2004, so let's take a brief tour at what's new in the exciting world of computer vision research.

Down the rabbit hole art by frostyshadows

A lot has changed. Nothing has changed. Academics used to be on top, defending their Universities and the awesomeness happening inside their non-industrial research labs. Academics are still on top, but now defending their Google, Facebook, Amazon, and Company X affiliations. And with the hiring budget to acquire the best and a heavy publishing-oriented culture, don't be surprised if the massive academia exodus continues for years to come. It's only been two weeks since CVPR, and Google has since then been busy making ConvNet art, showing the world that if you want to do the best Deep Learning research, they are King.

An army of PhD students and Postdocs simply cannot defeat an army of Software Engineers and Research Scientists. Back in the day, students used to typically depart after a Computer Vision PhD (there used to be few vision research jobs and Wall Street jobs were tempting). Now the former PhD students run research labs at big companies which have been feverishly getting into vision. It seems there aren't enough deep experts to fill the deep demand.

Datasets used to be the big thing -- please download my data! Datasets are still the big thing -- but we regret to inform you that your university’s computational resources won’t make the cut (but at Company X we’re always hiring, so come join us, and help push the frontier of research together).

Related Article:Under LeCun's Leadership, Facebook's AI Research Lab is beefing up their research presence

If you want to check out the individual papers, I recommend Andrej Karpathy's online navigation tool for CVPR 2015 papers or take a look at the vanilla listing of CVPR 2015 papers on the CV foundation website. Zoya Bylinskii, an MIT PhD Candidate, also put together a list of interesting CVPR 2015 papers.

The ConvNet Revolution: There's a pre-trained network for that

Machine Learning used to be the Queen. Machine Learning is now the King. Machine Learning used to be shallow, but today's learning approaches are so deep that the diagrams barely fit on a single slide. Grad students used to pass around jokes about Yann LeCun and his insistence that machine learning will one day do the work of the feature engineering stage. Now it seems that the entire vision community gets to ignore you when you insist that “manual feature engineering” is going to save the day. Yann LeCun gave a keynote presentation with the intriguing title "What's wrong with Deep Learning" and it seems that Convolutional Neural Networks (also called CNNs or ConvNets) are everywhere at CVPR.

Figure from Karpathy's Convolutional Neural Network tutorial

It used to be hard to publish ConvNet research papers at CVPR, it's now hard to get a CVPR paper if you didn't at least compare against a ConvNet baseline. Got a new cool problem? Oooh, you didn’t try a ConvNet-based baseline? Well, that explains why nobody cares.

But it's not like the machines are taking over the job of the vision scientist. Today's vision scientist is much more of an applied machine learning hacker than anything else, and because of the strong CNN theme, it is much easier to understand and re-implement today's vision systems. What we're seeing at CVPR is essentially a revisiting of the classic problems like segmentation and motion, using this new machinery. As Samson Timoner phrased it at the local Boston Vision Meetup, when Mutual Information was popular, the community jumped on that bandwagon -- it's ConvNets this time around. But it's not just a trend, the non-CNN competition is getting crushed.

Figure from Bharath Hariharan's Hypercolumns CVPR 2015 paper on segmentation using CNNs

There's still plenty to be done by a vision scientist, and a solid formal education in mathematics is more important than ever. We used to train via gradient descent. We still train via gradient descent. We used to drink Coffee, now we all drink Caffe. But behind the scenes, it is still mathematics.

Related Page:Caffe Model Zoo where you can download lots of pretrained ConvNets

Deep down the rabbit hole

CVPR 2015 reminds of the pre-Newtonian days of physics. A lot of smart scientists were able to predict the motions of objects using mathematics once the ingenious Descartes taught us how to embed our physical thinking into a coordinate system. And it's pretty clear that by casting your computer vision problem in the language of ConvNets, you are going to beat just about anybody doing computer vision by hand. I think of Yann LeCun (one of the fathers of Deep Learning) as a modern day Descartes, only because I think the ground-breaking work is right around the corner. His mental framework of ConvNets is like a much needed coordinate system -- we might not know what the destination looks like, but we now know how to build a map.

Deep Networks are performing better every month, but I’m still waiting for Isaac to come in and make our lives even easier. I want a simplification. But I'm not being pessimistic -- there is a flurry of activity in the ConvNet space for a very good reason (in case you didn't get to attend CVPR 2015), so I'll just be blunt: ConvNets fuckin' work! I just want the F=ma of deep learning.

Open Source Deep Learning for Computer Vision: Torch vs Caffe

CVPR 2015 started off with some excellent software tutorials on day one. There is some great non-alpha deep learning software out there and it has been making everybody's life easier. At CVPR, we had both a Torch tutorial and a Caffe tutorial. I attended the DIY Deep Learning Caffe tutorial and it was a full house -- standing room only for slackers like me who join the party only 5 minutes before it starts. Caffe is much more popular that Torch, but when talking to some power users of Deep Learning (like +Andrej Karpathy and other DeepMind scientists), a certain group of experts seems to be migrating from Caffe to Torch.

Caffe is developed at Berkeley, has a vibrant community, Python bindings, and seems to be quite popular among University students. Prof. Trevor Darrell at Berkeley is even looking for a Postdoc to help the Caffe effort. If I was a couple of years younger and a fresh PhD, I would definitely apply.

Instead of following the Python trend, Torch is Lua-based. There is no need for an interpreter like Matlab or Python -- Lua gives you the magic console. Torch is heavily used by Facebook AI Research Labs and Google's DeepMind Lab in London. For those afraid of new languages like Lua, don't worry -- Lua is going to feel "easy" if you've dabbled in Python, Javascript, or Matlab. And if you don't like editing protocol buffer files by hand, definitely check out Torch.

It's starting to become clear that the future power of deep learning is going to come with its own self-contained software package like Caffe or Torch, and not from a dying breed of all-around tool-belts like OpenCV or Matlab. When you share creations made in OpenCV, you end up sharing source files, but with the Deep Learning toolkits, you end up sharing your pre-trained networks. No longer do you have to think about a combination of 20 "little" algorithms for your computer vision pipeline -- you just think about which popular network architecture you want, and then the dataset. If you have the GPUs and ample data, you can do full end-to-end training. And if your dataset is small/medium, you can fine-tune the last few layers. You can even train a linear classifier on top of the final layer, if you're afraid of getting your hands dirty -- just doing that will beat the SIFTs, the HOGs, the GISTs, and all that was celebrated in the past two decades of computer vision.

Related Article:Torch vs Theano on fastml.com
Related Code:Andrea Vedaldi's MatConvNet Deep Learning Library for MATLAB users

The way in which ConvNets are being used at CVPR 2015 makes me feel like we're close to something big. But before we strike gold, ConvNets still feel like a Calculus of Shadows, merely "hoping" to get at something bigger, something deeper, and something more meaningful. I think the flurry of research which investigates visualization algorithms for ConvNets suggests that even the network architects aren't completely sure what is happening behind the scenes.

The Video Game Engine Inside Your Head: A different path towards Machine Intelligence

Josh Tenenbaum gave an invited talk titled The Video Game Engine Inside Your Head at the Scene Understanding Workshop on the last day of the CVPR 2015 conference. You can read a summary of his ideas in a short Scientific American article. While his talk might appear to be unconventional by CVPR standards, it is classic Tenenbaum. In his world, there is no benchmark to beat, no curves to fit to shadows, and if you allow my LeCun-Descartes analogy, then in some sense Prof. Tenenbaum might be the modern day Aristotle. As Prof. Jianxiong Xiao introduced Josh with a grand intro, he was probably right -- this is one of the most intelligent speakers you can find. He speaks 100 words a second, you can't help but feel your brain enlarge as you listen.

One of Josh's main research themes is going beyond the shadows of image-based recognition. Josh's work is all about building mental models of the world, and his work can really be thought of as analysis-by-synthesis. Inside his models is something like a video game engine, and he showed lots of compelling examples of inferences that are easy for people, but nearly impossible for the data-driven ConvNets of today. It's not surprising that his student is working at Google's DeepMind this summer.

A couple of years ago, Probabilistic Graphical Models (the marriage of Graph Theory and Probabilistic Methods) used to be all the rage. Josh gave us a taste of Probabilistic Programming, and while we're not yet seeing these new methods dominate the world of computer vision research, keep your eyes open. He mentioned a recent Nature paper (citation below) from another well respected machine intelligence research, which should keep the trendsetters excited for quite some time. Just take a look at the bad-ass looking Julia code below:

Probabilistic machine learning and artificial intelligence. Zoubin Ghahramani. Nature 521, 452–459 (28 May 2015) doi:10.1038/nature14541

To see some of Prof. Tenenbaum's ideas in action, take a look at the following CVPR 2015 paper, titled Picture: A Probabilistic Programming Language for Scene Perception. Congrats to Tejas D. Kulkarni, the first author, an MIT student, who got the Best Paper Honorable Mention prize for this exciting new work. Google DeepMind, you're going to have one fun summer.

Picture: A Probabilistic Programming Language for Scene Perception

Object Detectors Emerge in Deep Scene CNNs

There were lots of great presentation as the Scene Understanding Workshop, and another talk that truly stood out was about a new large-scale dataset (MIT Places) and a thorough investigation of what happens when you train with scenes vs. objects.

Antonio Torralba from MIT gave the talk about the Places Database and an in-depth analysis of what is learned when you train on object-centric databases like ImageNet vs. Scene-scentric databases like MIT Places. You can check out "Object Detectors Emerge" slides or their ArXiv paper to learn more. Great work by an upcoming researcher, Bolei Zhou!

Overheard at CVPR: ArXiv Publishing Frenzy & Baidu Fiasco

In the long run, the recent trend of rapidly pushing preprints to ArXiv.org is great for academic and industry research alike. When you have a large collection of experts exploring ideas at very fast rates, waiting 6 months until the next conference deadline just doesn't make sense. The only downside is that it makes new CVPR papers feel old. It seems like everybody has already perused the good stuff the day it went up on ArXiv. But you get your "idea claim" without worrying that a naughty reviewer will be influenced by your submission. Double blind reviewing, get ready for a serious revamp. We now know who's doing what, significantly before publication time. Students, publish-or-perish just got a new name. Whether the ArXiv frenzy is a good or a bad thing, is up to you, and probably more a function of your seniority than anything else. But the CV buzz is definitely getting louder and will continue to do so.

The Baidu cheating scandal might appear to be big news for outsiders just reading the Artificial Intelligence headlines, but overfitting to the testing set is nothing new in Computer Vision. Papers get retracted, grad students often evaluate their algorithms on test sets too many times, and the truth is that nobody's perfect. When it's important to be #1, don't be surprised that your competition is being naughty. But it's important to realize the difference between ground-breaking research and petty percentage chasing. We all make mistakes, and under heavy pressure, we're all likely to show our weaknesses. So let's laugh about it. Let's hire the best of the best, encourage truly great research, and stop chasing percentages. The truth is that a lot of the top performing methods are more similar than different.

Conclusion
CVPR has been constantly growing in attendance. We now have Phd Students, startups, Professors, recruiters, big companies, and even undergraduates coming to the show. Will CVPR become the new SIGGRAPH?

CVPR attendance plot from Changbo Hu.

ConvNets are here to stay, but if we want ConvNets to be more than than a mere calculus of shadows, there's still ample work do be done. Geoff Hinton's capsules keep popping up during midnight discussions. "I want to replace unstructured layers with groups of neurons that I call 'capsules' that are a lot more like cortical columns" -- Geoff Hinton during his Reddit AMA. A lot of people (like Prof. Abhinav Gupta from CMU) are also talking about unsupervised CNN training, and my prediction is that learning large ConvNets from videos without annotations is going to be big at next year's CVPR.

Most importantly, when the titans of Deep Learning get to mention what's wrong with their favorite methods, I only expect the best research to follow. Happy computing and remember, never stop learning.

↧

The Deep Learning Gold Rush of 2015

November 7, 2015, 12:35 am

≫ Next: ICCV 2015: Twenty one hottest research papers

≪ Previous: Deep down the rabbit hole: CVPR 2015 and beyond

In the last few decades, we have witnessed major technological innovations such as personal computers and the internet finally reach the mainstream. And with mobile devices and social networks on the rise, we're now more connected than ever. So what's next? When is it coming? And how will it change our lives? Today I'll tell you that the next big advance is well underway and it's being fueled by a recent technique in the field of Artificial Intelligenceknown as Deep Learning.

The California Gold Rush of 2015 is all about Deep Learning.

It's everywhere, you just don't know how to look.

All of today's excitement in Artificial Intelligence and Machine Learning stems from ground-breaking results in speech and visual object recognition using Deep Learning[1]. These algorithms are being applied to all sorts of data, and the learned deep neural networks outperform traditional expert systems carefully designed by scientists and engineers. End-to-end learning of deep representations from raw data is now possible due to a handful of well-performing deep learning recipes (ConvNets, Dropout, ReLUs, LSTM, DQN, ImageNet). But if there's one final takeaway that we can extract from decades of machine learning research, is that for many problems going deep isn't a choice, it's often a requirement.

Most of the apps and services you're already using (AirBnB, Snapchat, Twitch.tv, Uber, Yelp, LinkedIn, etc) are quite data-hungry and before you know it, they're all going to go mega-deep. So whether you need to revitalize your data science team with deep learning or you're starting an AI-from-day-one operation, it's pretty clear that everybody is rushing to get some of this Silicon Valley Gold.

From Titans to Gold Miners: Your atypical Gold Rush

Like all great gold rushes, this movement is led by new faces, which are pouring into Silicon Valley like droves. But these aren't your typical unskilled immigrants willing to pick up a hammer, nor your fresh computer science grads with some app-writing skills. The key deep learning players of today (known as the Titans of Deep Learning) are computer science professors and researchers (seldom born in the USA) leaving their academic posts and bringing their students and ideas straight into Silicon Valley.

"Turn on, Tune in, Dropout" -- Timothy Leary

Recently, Google and Facebook announced that their operations are now being powered by Deep Learning [2,3]. And with most Deep Learning Titans representing the tech giants (Yann LeCun at Facebook Research, Geoffrey Hinton at Google, Andrew Ng at Baidu), Deep Learning is likely to become one of the most sought after tech skills. With Toyota to invest in $1 Billion in Robotics and Artificial Intelligence Research (November 6, 2015), the announcement of YC Research (October 7, 2015), and the new Google Brain Residency Program "Pre-doc" AI jobs (October 26, 2015), Silicon Valley just got a whole lot more interesting.

Silicon Valley re-defines itself, yet again

To understand why it took so long for Deep Learning to take-off, let's take a brief look at the key technologies which defined Silicon Valley over the last 50 years. The following timeline gives an overview of where Silicon Valley has been and where it's going.

Figure Adapted from Steve Blank's Secret History of Silicon Valley

1970s: Semiconductors
The story of the digital-era starts with semiconductors. "Silicon" in "Silicon Valley" originally referred to the silicon chip or integrated circuit innovations as well as the location (close to Stanford) of much tech-related activity. The dominant firm from that time period was Fairchild Semiconductor International and it eventually gave rise to more recognizable companies like Intel. For a more detailed discussion of this birthing era, take a look at Steve Blank's Secret History of Silicon Valley[4].

Read more about Fairchild at TechCrunch's First Trillion-Dollar Startup

1980s: Personal Computers
Initially computers were quite large and used solely by research labs, government, and big businesses. But it was the personal computer which turned computer programming from a hobby into a vital skill. You no longer needed to be an MIT student to program on one of these badboys. While both Microsoft and Apple were founded in 1975 and 1976, respectively, they persevered due to their pioneering work in graphical user interfaces. This was the birth of the modern user-friendly Operating System. IBM approached Microsoft in 1980, regarding its upcoming personal computer, and from then on Microsoft would be King for a very long time.

See Mac-history's article on Microsoft's relationship with Apple

1990s: Internet
While the nerds at Universities were posting ascii messages on newsgroups in the 90s, service providers in the 1990s like AOL helped make the internet accessible to everyone. Remember getting all those AOL disks in the mail? Buying a chunk of digital real state (your own domain name) became possible and anybody with a dial up connection and some primitive text/HTML skills could start posting online content. With a mission statement like "organize the world's information", it was eventually Google that got the most of out the late 90s dot-com bubble, and remains a very strong player in all things tech.

2000s: Mobile and Social
While the dot-com bubble was about creating an online presence for startups and established companies, the way we use the internet has dramatically changed since 2001. A ton of new social communities have emerged, and due to Facebook we're now stars in our own reality show. Social and advertising have essentially turned the modern internet into a mainstream TV-like experience. The internet is no longer only for the nerds. The kings of this era (Google and Facebook) are also the biggest players in the Deep Learning space, because they have the largest user bases and in-house apps which can benefit most from machine learning.

2010-2015: Deep Learning comes to the party
Spend more than a day in Silicon Valley and you'll hear the popular expression, "Software is eating the world." Rampant spreading of software was only possible once the internet (1990s) AND mobile devices (2000s) became essential parts of our lives. No longer do we physically mail floppy disks, and social media fuels any app that goes viral. What traditional software is missing (or has been missing up until now) is the ability to improve over time from everyday use. If that same software is able to connect to a large Deep Learning system and start improving, then we have a game-changer on our hands. This is already happening with online advertising, digital assistants like Siri, and smart auto-responders like Google's new email auto-reply feature.

The hierarchical award-winning "AlexNet" Deep Learning architecture

Visualized using MIT's Toolbox for Deep Learning Neuron Visualization

Massive hiring of deep learning experts by the leading tech companies has only begun, but we also should be on the lookout for new ventures built on top of Deep Learning, not just a revitalization of last decade's successes. On this front, keep a close look at the following Deep Learning Cloud Service upstarts: Richard Socher from MetaMind, Matthew Zeiler from Clarifai, and Carlos Guestrin from Dato.

2015-2020: Deep Learning Revitalizes Robotics
Recently it has been shown that Deep Learning can be used to help robots learn tasks involving movement, object manipulation, and decision making[6,7,8,9]. Before Deep Learning, lots of different pieces of robotic software and hardware would have to be developed independently and then hacked together for demo day. Today, you can use one of a handful of "Deep Learning for Robotics recipes" and start watching your robot learn the task you care about.

Robots Learns to Grasp using Deep Learning at Carnegie Mellon University.

Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours [6]

With their 2013 acquisition of Boston Dynamics (a hardware play), 2014 acquisition of DeepMind (a software play), and a serious autonomous car play, Google is definitely early to the Robotics party. But the noteworthy bits are happening at the intersection of deep learning and robotics. I suggest taking a closer look at the Robotics research of Pieter Abbeel of Berkeley, Abhinav Gupta of Carnegie Mellon, and Ashutosh Saxena of Stanford -- all likely stars in the next Deep Learning for Robotics race. As long as Rodney Brooks keeps creating innovative Robotics platforms like Baxter, my expectations for Robotics are off the charts.

Conclusion

Unlike in 1849, the Deep Learning Gold Rush of 2015 is not going to bring some 300,000 gold-seekers in boats to California's mainland. This isn't a bring-your-own-hammer kind of game -- the Titans have already descended from their Ivory Towers and handed us ample mining tools. But it won't hurt to gain some experience with traditional "shallow" machine learning techniques so you can appreciate the power of Deep Learning.

I hope you enjoyed today's read and have a better sense of how Silicon Valley is undergoing a transformation. And remember, today's wave of Deep Learning upstart CEOs have PhDs, but once Deep Learning software becomes more user-friendly (TensorFlow?), maybe you won't have to wait so long to dropout.

References

[1] Krizhevsky, A., Sutskever, I. and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS 2012.
[2] D'Onfro, J. Google is 're-thinking' all of its products to include machine learning. Business Insider. October 22, 2015.
[3] D'Onfro, J. How Facebook will use artificial intelligence to organize insane amounts of data into the perfect News Feed and a personal assistant with superpowers. Business Insider. November 3, 2015.
[4] Blank, S. Secret History of Silicon Valley. 2008.
[5] Donglai Wei, Bolei Zhou, Antonio Torralba William T. Freeman. mNeuron: A Matlab Plugin to Visualize Neurons from Deep Models. 2015.
[6] Lerrel Pinto, Abhinav Gupta. Supersizing Self-supervision: Learning to Graspfrom 50K Tries and 700 Robot Hours. arXiv. 2015.
[7] Sergey Levine, Chelsea Finn, Trevor Darrell, Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. In RSS 2015.
[8] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[9] Ian Lenz, Ross Knepper, and Ashutosh Saxena. DeepMPC: Learning Deep Latent Features for Model Predictive Control. In Robotics Science and Systems (RSS), 2015

↧

ICCV 2015: Twenty one hottest research papers

December 8, 2015, 8:57 pm

≫ Next: Why SLAM Matters, The Future of Real-Time SLAM, and Deep Learning vs SLAM

≪ Previous: The Deep Learning Gold Rush of 2015

"Geometry vs Recognition" becomes ConvNet-for-X

Computer Vision used to be cleanly separated into two schools: geometry and recognition. Geometric methods like structure from motion and optical flow usually focus on measuring objective real-world quantities like 3D "real-world" distances directly from images and recognition techniques like support vector machines and probabilistic graphical models traditionally focus on perceiving high-level semantic information (i.e., is this a dog or a table) directly from images.

The world of computer vision ~~is changing fast~~ has changed. We now have powerful convolutional neural networks that are able to extract just about anything directly from images. So if your input is an image (or set of images), then there's probably a ConvNet for your problem. While you do need a large labeled dataset, believe me when I say that collecting a large dataset is much easier than manually tweaking knobs inside your 100K-line codebase. As we're about to see, the separation between geometric methods and learning-based methods is no longer easily discernible.

By 2016 just about everybody in the computer vision community will have tasted the power of ConvNets, so let's take a look at some of the hottest new research directions in computer vision.

ICCV 2015's Twenty One Hottest Research Papers

This December in Santiago, Chile, the International Conference of Computer Vision 2015 is going to bring together the world's leading researchers in Computer Vision, Machine Learning, and Computer Graphics.

To no surprise, this year's ICCV is filled with lots of ConvNets, but this time the applications of these Deep Learning tools are being applied to much much more creative tasks. Let's take a look at the following twenty one ICCV 2015 research papers, which will hopefully give you a taste of where the field is going.

1. Ask Your Neurons: A Neural-Based Approach to Answering Questions About Images Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

"We propose a novel approach based on recurrent neural networks for the challenging task of answering of questions about images. It combines a CNN with a LSTM into an end-to-end architecture that predict answers conditioning on a question and an image."

2. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

"To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book."

3. Learning to See by Moving Pulkit Agrawal, Joao Carreira, Jitendra Malik

"We show that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt using class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching."

4. Local Convolutional Features With Unsupervised Training for Image Retrieval Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronin, Cordelia Schmid

"We introduce a deep convolutional architecture that yields patch-level descriptors, as an alternative to the popular SIFT descriptor for image retrieval."

5. Deep Networks for Image Super-Resolution With Sparse Prior Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, Thomas Huang

"We show that a sparse coding model particularly designed for super-resolution can be incarnated as a neural network, and trained in a cascaded structure from end to end."

6. High-for-Low and Low-for-High: Efficient Boundary Detection From Deep Object Features and its Applications to High-Level Vision Gedas Bertasius, Jianbo Shi, Lorenzo Torresani

"In this work we show how to predict boundaries by exploiting object level features from a pretrained object-classification network."

7. A Deep Visual Correspondence Embedding Model for Stereo Matching Costs Zhuoyuan Chen, Xun Sun, Liang Wang, Yinan Yu, Chang Huang

"A novel deep visual correspondence embedding model is trained via Convolutional Neural Network on a large set of stereo images with ground truth disparities. This deep embedding model leverages appearance data to learn visual similarity relationships between corresponding image patches, and explicitly maps intensity values into an embedding feature space to measure pixel dissimilarities."

8. Im2Calories: Towards an Automated Mobile Vision Food Diary Austin Meyers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, Kevin P. Murphy

"We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories."

9. Unsupervised Visual Representation Learning by Context Prediction Carl Doersch, Abhinav Gupta, Alexei A. Efros

"How can one write an objective function to encourage a representation to capture, for example, objects, if none of the objects are labeled?"

10. Deep Neural Decision Forests Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, Samuel Rota Bulò

"We introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network."

11. Conditional Random Fields as Recurrent Neural Networks Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, Philip H. S. Torr

"We formulate mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks."

12. Flowing ConvNets for Human Pose Estimation in Videos Tomas Pfister, James Charles, Andrew Zisserman

"We investigate a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow."

13. Dense Optical Flow Prediction From a Static Image Jacob Walker, Abhinav Gupta, Martial Hebert

"Given a static image, P-CNN predicts the future motion of each and every pixel in the image in terms of optical flow. Our P-CNN model leverages the data in tens of thousands of realistic videos to train our model. Our method relies on absolutely no human labeling and is able to predict motion based on the context of the scene."

14. DeepBox: Learning Objectness With Convolutional Networks Weicheng Kuo, Bharath Hariharan, Jitendra Malik

"Our framework, which we call DeepBox, uses convolutional neural networks (CNNs) to rerank proposals from a bottom-up method."

15. Active Object Localization With Deep Reinforcement Learning Juan C. Caicedo, Svetlana Lazebnik

"This agent learns to deform a bounding box using simple transformation actions, with the goal of determining the most specific location of target objects following top-down reasoning."

16. Predicting Depth, Surface Normals and Semantic Labels With a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus

"We address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling."

17. HD-CNN: Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, Yizhou Yu

"We introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers."

18. FlowNet: Learning Optical Flow With Convolutional Networks Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox

"We construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task."

19. Understanding Deep Features With Computer-Generated Imagery Mathieu Aubry, Bryan C. Russell

"Rendered images are presented to a trained CNN and responses for different layers are studied with respect to the input scene factors."

20. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization Alex Kendall, Matthew Grimes, Roberto Cipolla

"Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation."

21. Visual Tracking With Fully Convolutional Networks Lijun Wang, Wanli Ouyang, Xiaogang Wang, Huchuan Lu

"A new approach for general object tracking with fully convolutional neural network."

Conclusion

While some can argue that the great convergence upon ConvNets is making the field less diverse, it is actually making the techniques easier to comprehend. It is easier to "borrow breakthrough thinking" from one research direction when the core computations are cast in the language of ConvNets. Using ConvNets, properly trained (and motivated!) 21 year old graduate student are actually able to compete on benchmarks, where previously it would take an entire 6-year PhD cycle to compete on a non-trivial benchmark.

See you next week in Chile!

↧

Why SLAM Matters, The Future of Real-Time SLAM, and Deep Learning vs SLAM

January 13, 2016, 1:20 am

≫ Next: Deep Learning Trends @ ICLR 2016

≪ Previous: ICCV 2015: Twenty one hottest research papers

Last month's International Conference of Computer Vision (ICCV) was full of Deep Learning techniques, but before we declare an all-out ConvNet victory, let's see how the other "non-learning" geometric side of computer vision is doing. Simultaneous Localization and Mapping, or SLAM, is arguably one of the most important algorithms in Robotics, with pioneering work done by both computer vision and robotics research communities. Today I'll be summarizing my key points from ICCV's Future of Real-Time SLAM Workshop, which was held on the last day of the conference (December 18th, 2015).

Today's post contains a brief introduction to SLAM, a detailed description of what happened at the workshop (with summaries of all 7 talks), and some take-home messages from the Deep Learning-focused panel discussion at the end of the session.

SLAM visualizations. Can you identify any of these SLAM algorithms?

Part I: Why SLAM Matters

Visual SLAM algorithms are able to simultaneously build 3D maps of the world while tracking the location and orientation of the camera (hand-held or head-mounted for AR or mounted on a robot). SLAM algorithms are complementary to ConvNets and Deep Learning: SLAM focuses on geometric problems and Deep Learning is the master of perception (recognition) problems. If you want a robot to go towards your refrigerator without hitting a wall, use SLAM. If you want the robot to identify the items inside your fridge, use ConvNets.

Basics of SfM/SLAM: From point observation and intrinsic camera parameters, the 3D structure of a scene is computed from the estimated motion of the camera. For details, see openMVG website.

SLAM is a real-time version of Structure from Motion (SfM). Visual SLAM or vision-based SLAM is a camera-only variant of SLAM which forgoes expensive laser sensors and inertial measurement units (IMUs). Monocular SLAM uses a single camera while non-monocular SLAM typically uses a pre-calibrated fixed-baseline stereo camera rig. SLAM is prime example of a what is called a "Geometric Method" in Computer Vision. In fact, CMU's Robotics Institute splits the graduate level computer vision curriculum into a Learning-based Methods in Vision course and a separate Geometry-Based Methods in Vision course.

Structure from Motion vs Visual SLAM

Structure from Motion (SfM) and SLAM are solving a very similar problem, but while SfM is traditionally performed in an offline fashion, SLAM has been slowly moving towards the low-power / real-time / single RGB camera mode of operation. Many of the today’s top experts in Structure from Motion work for some of the world’s biggest tech companies, helping make maps better. Successful mapping products like Google Maps could not have been built without intimate knowledge of multiple-view geometry, SfM, and SLAM. A typical SfM problem is the following: given a large collection of photos of a single outdoor structure (like the Colliseum), construct a 3D model of the structure and determine the camera's poses. The image collection is processed in an offline setting, and large reconstructions can take anywhere between hours and days.

SfM Software: Bundler is one of the most successful SfM open source libraries

Here are some popular SfM-related software libraries:

Bundler, an open-source Structure from Motion toolkit
Libceres, a non-linear least squares minimizer (useful for bundle adjustment problems)
Andrew Zisserman's Multiple-View Geometry MATLAB Functions

Visual SLAM vs Autonomous Driving

While self-driving cars are one of the most important applications of SLAM, according to Andrew Davison, one of the workshop organizers, SLAM for Autonomous Vehicles deserves its own research track. (And as we'll see, none of the workshop presenters talked about self-driving cars). For many years to come it will make sense to continue studying SLAM from a research perspective, independent of any single Holy-Grail application. While there are just too many system-level details and tricks involved with autonomous vehicles, research-grade SLAM systems require very little more than a webcam, knowledge of algorithms, and elbow grease. As a research topic, Visual SLAM is much friendlier to thousands of early-stage PhD students who’ll first need years of in-lab experience with SLAM before even starting to think about expensive robotic platforms such as self-driving cars.

Google's Self-Driving Car's perception system. From IEEE Spectrum's "How Google's Self-Driving Car Works"

Related: March 2015 blog post, Mobileye's quest to put Deep Learning inside every new car.

Part II: The Future of Real-time SLAM

Now it's time to officially summarize and comment on the presentations from The Future of Real-time SLAM workshop. Andrew Davison started the day with an excellent historical overview of SLAM called 15 years of vision-based SLAM, and his slides have good content for an introductory robotics course.

For those of you who don’t know Andy, he is the one and only Professor Andrew Davison of Imperial College London. Most known for his 2003 MonoSLAM system, he was one of the first to show how to build SLAM systems from a single “monocular” camera at a time when just everybody thought you needed a stereo “binocular” camera rig. More recently, his work has influenced the trajectory of companies such as Dyson and the capabilities of their robotic systems (e.g., the brand new Dyson360).

I remember Professor Davidson from the Visual SLAM tutorial he gave at the BMVC Conference back in 2007. Surprisingly very little has changed in SLAM compared to the rest of the machine-learning heavy work being done at the main vision conferences. In the past 8 years, object recognition has undergone 2-3 mini revolutions, while today's SLAM systems don't look much different than they did 8 years ago. The best way to see the progress of SLAM is to take a look at the most successful and memorable systems. In Davison’s workshop introduction talk, he discussed some of these exemplary systems which were produced by the research community over the last 10-15 years:

MonoSLAM
PTAM
FAB-MAP
DTAM
KinectFusion

Davison vs Horn: The next chapter in Robot Vision
Davison also mentioned that he is working on a new Robot Vision book, which should be an exciting treat for researchers in computer vision, robotics, and artificial intelligence. The last Robot Vision book was written by B.K. Horn (1986), and it’s about time for an updated take on Robot Vision.

A new robot vision book?

While I’ll gladly read a tome that focuses on the philosophy of robot vision, personally I would like the book to focus on practical algorithms for robot vision, like the excellent Multiple View Geometry book by Hartley and Zissermann or Probabilistic Robotics by Thrun, Burgard, and Fox. A "cookbook" of visual SLAM problems would be a welcome addition to any serious vision researcher's collection.

Related: Davison's 15-years of vision-based SLAM slides

Talk 1: Christian Kerl on Continuous Trajectories in SLAM

The first talk, by Christian Kerl, presented a dense tracking method to estimate a continuous-time trajectory. The key observation is that most SLAM systems estimate camera poses at a discrete number of time steps (either they key frames which are spaced several seconds apart, or the individual frames which are spaced approximately 1/25s apart).

Continuous Trajectories vs Discrete Time Points. SLAM/SfM usually uses discrete time points, but why not go continuous?

Much of Kerl’s talk was focused on undoing the damage of rolling shutter cameras, and the system demo’ed by Kerl paid meticulous attention to modeling and removing these adverse rolling shutter effects.

Undoing the damage of rolling shutter in Visual SLAM.

Related: Kerl's Dense continous-time tracking and mapping slides.
Related: Dense Continuous-Time Tracking and Mapping with Rolling Shutter RGB-D Cameras (C. Kerl, J. Stueckler, D. Cremers), In IEEE International Conference on Computer Vision (ICCV), 2015. [pdf]

Talk 2: Semi-Dense Direct SLAM by Jakob Engel

LSD-SLAM came out at ECCV 2014 and is one of my favorite SLAM systems today! Jakob Engel was there to present his system and show the crowd some of the coolest SLAM visualizations in town. LSD-SLAM is an acronym for Large-Scale Direct Monocular SLAM. LSD-SLAM is an important system for SLAM researchers because it does not use corners or any other local features. Direct tracking is performed by image-to-image alignment using a coarse-to-fine algorithm with a robust Huber loss. This is quite different than the feature-based systems out there. Depth estimation uses an inverse depth parametrization (like many other SLAM systems) and uses a large number or relatively small baseline image pairs. Rather than relying on image features, the algorithms is effectively performing “texture tracking”. Global mapping is performed by creating and solving a pose graph "bundle adjustment" optimization problem, and all of this works in real-time. The method is semi-dense because it only estimates depth at pixels solely near image boundaries. LSD-SLAM output is denser than traditional features, but not fully dense like Kinect-style RGBD SLAM.

LSD-SLAM in Action: LSD-SLAM generates both a camera trajectory and a semi-dense 3D scene reconstruction. This approach works in real-time, does not use feature points as primitives, and performs direct image-to-image alignment.

Engel gave us an overview of the original LSD-SLAM system as well as a handful of new results, extending their initial system to more creative applications and to more interesting deployments. (See paper citations below)

Related: LSD-SLAM Open-Source Code on github LSD-SLAM project webpage
Related: LSD-SLAM: Large-Scale Direct Monocular SLAM (J. Engel, T. Schöps, D. Cremers), In European Conference on Computer Vision (ECCV), 2014. [pdf] [youtube video]

An extension to LSD-SLAM, Omni LSD-SLAM was created by the observation that the pinhole model does not allow for a large field of view. This work was presented at IROS 2015 (Caruso is first author) and allows a large field of view (ideally more than 180 degrees). From Engel’s presentation it was pretty clear that you can perform ballerina-like motions (extreme rotations) while walking around your office and holding the camera. This is one of those worst-case scenarios for narrow field of view SLAM, yet works quite well in Omni LSD-SLAM.

Omnidirectional LSD-SLAM Model. See Engel's Semi-Dense Direct SLAM presentation slides.

Related: Large-Scale Direct SLAM for Omnidirectional Cameras (D. Caruso, J. Engel, D. Cremers), In International Conference on Intelligent Robots and Systems (IROS), 2015. [pdf] [youtube video]

Stereo LSD-SLAM is an extension of LSD-SLAM to a binocular camera rig. This helps in getting the absolute scale, initialization is instantaneous, and there are no issues with strong rotation. While monocular SLAM is very exciting from an academic point of view, if your robot is a 30,000$ car or 10,000$ drone prototype, you should have a good reason to not use a two+ camera rig. Stereo LSD-SLAM performs quite competitively on SLAM benchmarks.

Stereo LSD-SLAM. Excellent results on KITTI vehicle-SLAM dataset.

Stereo LSD-SLAM is quite practical, optimizes a pose graph in SE(3), and includes a correction for auto exposure. The goal of auto-exposure correcting is to make the error function invariant to affine lighting changes. The underlying parameters of the color-space affine transform are estimated during matching, but thrown away to estimate the image-to-image error. From Engel's talk, outliers (often caused by over-exposed image pixels) tend to be a problem, and much care needs to be taken to care of their effects.

Related: Large-Scale Direct SLAM with Stereo Cameras (J. Engel, J. Stueckler, D. Cremers), In International Conference on Intelligent Robots and Systems (IROS), 2015. [pdf] [youtube video]

Later in his presentation, Engel gave us a sneak peak on new research about integrating both stereo and inertial sensors. For details, you’ll have to keep hitting refresh on Arxiv or talk to Usenko/Engel in person. On the applications side, Engel's presentation included updated videos of an Autonomous Quadrotor driven by LSD-SLAM. The flight starts with an up-down motion to get the scale estimate and a free-space octomap is used to estimate the free-space so that the quadrotor can navigate space on its own. Stay tuned for an official publication...

Quadrotor running Stereo LSD-SLAM.

See Engel's quadrotor youtube video from 2012.

The story of LSD-SLAM is also the story of feature-based vs direct-methods and Engel gave both sides of the debate a fair treatment. Feature-based methods are engineered to work on top of Harris-like corners, while direct methods use the entire image for alignment. Feature-based methods are faster (as of 2015), but direct methods are good for parallelism. Outliers can be retroactively removed from feature-based systems, while direct methods are less flexible w.r.t. outliners. Rolling shutter is a bigger problem for direct methods and it makes sense to use a global shutter or a rolling shutter model (see Kerl’s work). Feature-based methods require making decisions using incomplete information, but direct methods can use much more information. Feature-based methods have no need for good initialization and direct-based methods need some clever tricks for initialization. There is only about 4 years of research on direct methods and 20+ on sparse methods. Engel is optimistic that direct methods will one day rise to the top, and so am I.

Feature-based vs direct methods of building SLAM systems. Slide from Engel's talk.

At the end of Engel's presentation, Davison asked about semantic segmentation and Engel wondered whether semantic segmentation can be performed directly on semi-dense "near-image-boundary" data. However, my personal opinion is that there are better ways to apply semantic segmentation to LSD-like SLAM systems. Semi-dense SLAM can focus on geometric information near boundaries, while object recognition can focus on reliable semantics away from the same boundaries, potentially creating a hybrid geometric/semantic interpretation of the image.

Related: Engel's Semi-Dense Direct SLAM presentation slides

Talk 3: Sattler on The challenges of Large-Scale Localization and Mapping

Torsten Sattler gave a talk on large-scale localization and mapping. The motivation for this work is to perform 6-dof localization inside an existing map, especially for mobile localization. One of the key points in the talk was that when you are using traditional feature-based methods, storing your descriptors soon becomes very costly. Techniques such as visual vocabularies (remember product quantization?) can significantly reduce memory overhead, and with clever optimization at some point storing descriptors no longer becomes the memory bottleneck.

Another important take-home message from Sattler’s talk is that the number of inliers is not actually a good confidence measure for camera pose estimation. When the feature point are all concentrated in a single part of the image, camera localization can be kilometers away! A better measure of confidence is the “effective inlier count” which looks at the area spanned by the inliers as a fraction of total image area. What you really want is feature matches from all over the image — if the information is spread out across the image you get a much better pose estimate.

Sattler’s take on the future of real-time slam is the following: we should focus on compact map representations, we should get better at understanding camera pose estimate confidences (like down-weighing features from trees), we should work on more challenging scenes (such as worlds with planar structures and nighttime localization against daytime maps).

Mobile Localisation: Sattler's key problem is localizing yourself inside a large city with a single smartphone picture

Related: Scalable 6-DOF Localization on Mobile Devices. Sven Middelberg, Torsten Sattler, Ole Untzelmann, Leif Kobbelt. In ECCV 2014. [pdf]
Related: Torsten Sattler 's The challenges of large-scale localisation and mapping slides

Talk 4: Artal on Feature-based vs Direct-Methods

Raúl Artal, the creator of ORB-SLAM, dedicated his entire presentation to the Feature-based vs Direct-method debate in SLAM and he's definitely on the feature-based side. ORB-SLAM is available as an open-source SLAM package and it is hard to beat. During his evaluation of ORB-SLAM vs PTAM it seems that PTAM actually fails quite often (at least on the TUM RGB-D benchmark). LSD-SLAM errors are also much higher on the TUM RGB-D benchmark than expected.

Feature-Based SLAM vs Direct SLAM. See Artal's Should we still do sparse feature based SLAM? presentation slides

Related: Artal's Should we still do sparse-feature based SLAM? slides
Related: Monocular ORB-SLAM R. Mur-Artal, J. M. M. Montiel and J. D. Tardos. A versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics. 2015 [pdf]
Related: ORB-SLAM Open-source code on github, Project Website

Talk 5: Project Tango and Visual loop-closure for image-2-image constraints

Simply put, Google's Project Tango is the world' first attempt at commercializing SLAM. Simon Lynen from Google Zurich (formerly ETH Zurich) came to the workshop with a Tango live demo (on a tablet) and a presentation on what's new in the world of Tango. In case you don't already know, Google wants to put SLAM capabilities into the next generation of Android Devices.

Google's Project Tango needs no introduction.

The Project Tango presentation discussed a new way of doing loop closure by finding certain patters in the image-to-image matching matrix. This comes from the “Placeless Place Recognition” work. They also do online bundle adjustment w/ vision-based loop closure.

Loop Closure inside a Project Tango? Lynen et al's Placeless Place Recognition. The image-to-image matrix reveals a new way to look for loop-closure. See the algorithm in action in this youtube video.

The Project Tango folks are also working on combing multiple crowd-sourced maps at Google, where the goals to combine multiple mini-maps created by different people using Tango-equipped devices.

Simon showed a video of mountain bike trail tracking which is actually quite difficult in practice. The idea is to go down a mountain bike trail using a Tango device and create a map, then the follow-up goal is to have a separate person go down the trail. This currently “semi-works” when there are a few hours between the map building and the tracking step, but won’t work across weeks/months/etc.

During the Tango-related discussion, Richard Newcombe pointed out that the “features” used by Project Tango are quite primitive w.r.t. getting a deeper understanding of the environment, and it appears that Project Tango-like methods won't work on outdoor scenes where the world is plagued by non-rigidity, massive illumination changes, etc. So are we to expect different systems being designed for outdoor systems or will Project Tango be an indoor mapping device?

Related: Placeless Place Recognition. Lynen, S. ; Bosse, M. ; Furgale, P. ; Siegwart, R. In 3DV 2014.

Talk 6: ElasticFusion is DenseSLAM without a pose-graph

ElasticFusion is a dense SLAM technique which requires a RGBD sensor like the Kinect. 2-3 minutes to obtain a high-quality 3D scan of a single room is pretty cool. A pose-graph is used behind the scenes of many (if not most) SLAM systems, and this technique has a different (map-centric) approach. The approach focuses on building a map, but the trick is that the map is deformable, hence the name ElasticFusion. The “Fusion” part of the algorithm is in homage to KinectFusion which was one of the first high quality kinect-based reconstruction pipelines. Also surfels are used as the underlying primitives.

Image from Kintinuous, an early version of Whelan's Elastic Fusion.

Recovering light sources: we were given a sneak peak at new unpublished work from Imperial College London / dyson Robotics Lab. The idea is that detecting the light source direction and detecting specularities, you can improve 3D reconstruction results. Cool videos of recovering light source locations which work for up to 4 separate lights.

Related: Map-centric SLAM with ElasticFusion presentation slides
Related: ElasticFusion: Dense SLAM Without A Pose Graph. Whelan, Thomas and Leutenegger, Stefan and Salas-Moreno, Renato F and Glocker, Ben and Davison, Andrew J. In RSS 2015.

Talk 7: Richard Newcombe’s DynamicFusion
Richard Newcombe's (whose recently formed company was acquired by Oculus), was the last presenter. It's really cool to see the person behind DTAM, KinectFusion, and DynamicFusion now working in the VR space.

Newcombe's Dynamic Fusion algorithm. The technique won the prestigious CVPR 2015 best paper award, and to see it in action just take a look at the authors'DynamicFusion Youtube video.

Related: DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time, Richard A. Newcombe, Dieter Fox, Steven M. Seitz. In CVPR 2015. [pdf] [Best-Paper winner]

Related: SLAM++: Simultaneous Localisation and Mapping at the Level of Objects Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H. J. Kelly and Andrew J. Davison (CVPR 2013)
Related: KinectFusion: Real-Time Dense Surface Mapping and Tracking Richard A. Newcombe Shahram Izadi,Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Andrew Fitzgibbon (ISMAR 2011, Best paper award!)

Workshop Demos

During the demo sessions (held in the middle of the workshop), many of the presenter showed off their SLAM systems in action. Many of these systems are available as open-source (free for non-commercial use?) packages, so if you’re interested in real-time SLAM, downloading the code is worth a shot. However, the one demo which stood out was Andrew Davison’s showcase of his MonoSLAM system from 2004. Andy had to revive his 15-year old laptop (which was running Redhat Linux) to show off his original system, running on the original hardware. If the computer vision community is going to oneway decide on a “retro-vision” demo session, I’m just going to go ahead and nominate Andy for the best-paper prize, right now.

Andry's Retro-Vision SLAM Setup (Pictured on December 18th, 2015)

It was interesting to watch the SLAM system experts wave their USB cameras around, showing their systems build 3D maps of the desk-sized area around their laptops. If you carefully look at the way these experts move the camera around (i.e., smooth circular motions), you can almost tell how long a person has been working with SLAM. When the non-experts hold the camera, probability of tracking failure is significantly higher.

I had the pleasure of speaking with Andy during the demo session, and I was curious which line of work (in the past 15 years) surprised him the most. His reply was that PTAM, which showed how to perform real-time bundle adjustment, surprised him the most. The PTAM system was essentially a MonoSLAM++ system, but the significantly improved tracking results were due to taking a heavyweight algorithm (bundle adjustment) and making it real-time — something which Andy did not believe was possible in the early 2000s.

Part III: Deep Learning vs SLAM

The SLAM panel discussion was a lot of fun. Before we jump to the important Deep Learning vs SLAM discussion, I should mention that each of the workshop presenters agreed that semantics are necessary to build bigger and better SLAM systems. There were lots of interesting mini-conversations about future directions. During the debates, Marc Pollefeys (a well-known researcher in SfM and Multiple-View Geometry) reminded everybody that Robotics is the killer application of SLAM and suggested we keep an eye on the prize. This is quite surprising since SLAM was traditionally applied to Robotics problems, but the lack of Robotics success in the last few decades (Google Robotics?) has shifted the focus of SLAM away from Robots and towards large-scale map building (ala GoogleMaps) and Augmented Reality. Nobody at this workshop talked about Robots.

Integrating semantic information into SLAM

There was a lot of interest in incorporating semantics into today’s top-performing SLAM systems. When it comes to semantics, the SLAM community is unfortunately stuck in the world of bags-of-visual-words, and doesn't have new ideas on how to integrate semantic information into their systems. On the other end, we’re now seeing real-time semantic segmentation demos (based on ConvNets) popping up at CVPR/ICCV/ECCV, and in my opinion SLAM needs Deep Learning as much as the other way around.

Integrating semantics into SLAM is often talk about, but it is easier said than done.

Figure 6.9 (page 142) from Moreno's PhD thesis: Dense Semantic SLAM

"Will end-to-end learning dominate SLAM?"

Towards the end of the SLAM workshop panel, Dr. Zeeshan Zia asked a question which startled the entire room and led to a memorable, energy-filled discussion. You should have seen the look on the panel’s faces. It was a bunch of geometers being thrown a fireball of deep learning. Their facial expressions suggest both bewilderment, anger, and disgust. "How dare you question us?"they were thinking. And it is only during these fleeting moments that we can truly appreciate the conference experience. Zia's question was essentially: Will end-to-end learning soon replace the mostly manual labor involved in building today’s SLAM systems?.

Zia's question is very important because end-to-end trainable systems have been slowly creeping up on many advanced computer science problems, and there's no reason to believe SLAM will be an exception. A handful of the presenters pointed out that current SLAM systems rely on too much geometry for a pure deep-learning based SLAM system to make sense -- we should use learning to make the point descriptors better, but leave the geometry alone. Just because you can use deep learning to make a calculator, it doesn't mean you should.

Learning Stereo Similarity Functions via ConvNets, by Yan LeCun and collaborators.

While many of the panel speakers responded with a somewhat affirmative "no", it was Newcombe which surprisingly championed what the marriage of Deep Learning and SLAM might look like.

Newcombe's Proposal: Use SLAM to fuel Deep Learning
Although Newcombe didn’t provide much evidence or ideas on how Deep Learning might help SLAM, he provided a clear path on how SLAM might help Deep Learning. Think of all those maps that we've built using large-scale SLAM and all those correspondences that these systems provide — isn’t that a clear path for building terascale image-image "association" datasets which should be able to help deep learning? The basic idea is that today's SLAM systems are large-scale "correspondence engines" which can be used to generate large-scale datasets, precisely what needs to be fed into a deep ConvNet.

Concluding Remarks

There is quite a large disconnect between the kind of work done at the mainstream ICCV conference (heavy on machine learning) and the kind of work presented at the real-time SLAM workshop (heavy on geometric methods like bundle adjustment). The mainstream Computer Vision community has witnessed several min-revolutions within the past decade (e.g., Dalal-Triggs, DPM, ImageNet, ConvNets, R-CNN) while the SLAM systems of today don’t look very different than they did 8 years ago. The Kinect sensor has probably been the single largest game changer in SLAM, but the fundamental algorithms remain intact.

Integrating semantic information: The next frontier in Visual SLAM.

Brain image from Arwen Wallington's blog post.

Today’s SLAM systems help machines geometrically understand the immediate world (i.e., build associations in a local coordinate system) while today’s Deep Learning systems help machines reason categorically (i.e., build associations across distinct object instances). In conclusion, I share Newcombe and Davison excitement in Visual SLAM, as vision-based algorithms are going to turn Augmented and Virtual Reality into billion dollar markets. However, we should not forget to keep our eyes on the "trillion-dollar" market, the one that's going to redefine what it means to "work" -- namely Robotics. The day of Robot SLAM will come soon.

↧

Deep Learning Trends @ ICLR 2016

June 1, 2016, 2:21 am

≫ Next: Making Deep Networks Probabilistic via Test-time Dropout

≪ Previous: Why SLAM Matters, The Future of Real-Time SLAM, and Deep Learning vs SLAM

Started by the youngest members of the Deep Learning Mafia [1], namely Yann LeCun and Yoshua Bengio, the ICLR conference is quickly becoming a strong contender for the single most important venue in the Deep Learning space. More intimate than NIPS and less benchmark-driven than CVPR, the world of ICLR is arXiv-based and moves fast.

Today's post is all about ICLR 2016. I’ll highlight new strategies for building deeper and more powerful neural networks, ideas for compressing big networks into smaller ones, as well as techniques for building “deep learning calculators.” A host of new artificial intelligence problems is being hit hard with the newest wave of deep learning techniques, and from a computer vision point of view, there's no doubt that deep convolutional neural networks are today's "master algorithm" for dealing with perceptual data.

Deep Powwow in Paradise?ICLR 2016 was held in Puerto Rico.

Whether you're working in Robotics, Augmented Reality, or dealing with a computer vision-related problem, the following summary of ICLR research trends will give you a taste of what's possible on top of today's Deep Learning stack. Consider today's blog post a reading group conversation-starter.

Part I: ICLR vs CVPR
Part II: ICLR 2016 Deep Learning Trends
Part III: Quo Vadis Deep Learning?

Part I: ICLR vs CVPR

Last month's International Conference of Learning Representations, known briefly as ICLR 2016, and commonly pronounced as “eye-clear,” could more appropriately be called the International Conference on Deep Learning. The ICLR 2016 conference was held May 2nd-4th 2016 in lovely Puerto Rico. This year was the 4th installment of the conference -- the first was in 2013 and it was initially so small that it had to be co-located with another conference. Because it was started by none other than the Deep Learning Mafia, it should be no surprise that just about everybody at the conference was studying and/or applying Deep Learning Methods. Convolutional Neural Networks (which dominate image recognition tasks) were all over the place, with LSTMs and other Recurrent Neural Networks (used to model sequences and build "deep learning calculators") in second place. Most of my own research conference experiences come from CVPR (Computer Vision and Pattern Recognition), and I've been a regular CVPR attendee since 2004. Compared to ICLR, CVPR has a somewhat colder, more-emprical feel. To describe the difference between ICLR and CVPR, Yan LeCun, quoting Raquel Urtasun (who got the original saying from Sanja Fidler), put it best on Facebook.

CVPR: What can Deep Nets do for me?

ICLR: What can I do for Deep Nets?

The ICLR 2016 conference was my first official powwow that truly felt like a close-knit "let's share knowledge" event. 3 days of the main conference, plenty of evening networking events, and no workshops. With a total attendance of about 500, ICLR is about 1/4 the size of CVPR. In fact, CVPR 2004 in D.C. was my first conference ever, and CVPRs are infamous for their packed poster sessions, multiple sessions, and enough workshops/tutorials to make CVPRs last an entire week. At the end of CVPR, you'll have a research hangover and will need a few days to recuperate. I prefer the size and length of ICLR.

CVPR and NIPS, like many other top-tier conferences heavily utilizing machine learning techniques, have grown to gargantuan sizes, and paper acceptance rates at these mega conferences are close to 20%. It not necessarily true that the research papers at ICLR were any more half-baked than some CVPR papers, but the amount of experimental validation for an ICLR paper makes it a different kind of beast than CVPR. CVPR’s main focus is to produce papers that are ‘state-of-the-art’ and this essentially means you have to run your algorithm on a benchmark and beat last season’s leading technique. ICLR’s main focus it to highlight new and promising techniques in the analysis and design of deep convolutional neural networks, initialization schemes for such models, and the training algorithms to learn such models from raw data.

Deep Learning is Learning Representations
Yann LeCun and Yoshua Bengio started this conference in 2013 because there was a need to a new, small, high-quality venue with an explicit focus on deep methods. Why is the conference called “Learning Representations?” Because the typical deep neural networks that are trained in an end-to-end fashion actually learn such intermediate representations. Traditional shallow methods are based on manually-engineered features on top of a trainable classifier, but deep methods learn a network of layers which learns those highly-desired features as well as the classifier. So what do you get when you blur the line between features and classifiers? You get representation learning. And this is what Deep Learning is all about.

ICLR Publishing Model: arXiv or bust
At ICLR, papers get posted on arXiv directly. And if you had any doubts that arXiv is just about the single awesomest thing to hit the research publication model since the Gutenberg press, let the success of ICLR be one more data point towards enlightenment. ICLR has essentially bypassed the old-fashioned publishing model where some third party like Elsevier says “you can publish with us and we’ll put our logo on your papers and then charge regular people $30 for each paper they want to read.” Sorry Elsevier, research doesn’t work that way. Most research papers aren’t good enough to be worth $30 for a copy. It is the entire body of academic research that provides true value, for which a single paper just a mere door. You see, Elsevier, if you actually gave the world an exceptional research paper search engine, together with the ability to have 10-20 papers printed on decent quality paper for a $30/month subscription, then you would make a killing on researchers and I would endorse such a subscription. So ICLR, rightfully so, just said fuck it, we’ll use arXiv as the method for disseminating our ideas. All future research conferences should use arXiv to disseminate papers. Anybody can download the papers, see when newer versions with corrections are posted, and they can print their own physical copies. But be warned: Deep Learning moves so fast, that you’ve gotta be hitting refresh or arXiv on a weekly basis or you’ll be schooled by some grad students in Canada.

Attendees of ICLR
Google DeepMind and Facebook’s FAIR constituted a large portion of the attendees. A lot of startups, researchers from the Googleplex, Twitter, NVIDIA, and startups such as Clarifai and Magic Leap. Overall a very young and vibrant crowd, and a very solid representation by super-smart 28-35 year olds.

Part II: Deep Learning Themes @ ICLR 2016

Incorporating Structure into Deep Learning
Raquel Urtasun from the University of Toronto gave a talk about Incorporating Structure in Deep Learning. See Raquel's Keynote video here. Many ideas from structure learning and graphical models were presented in her keynote. Raquel’s computer vision focus makes her work stand out, and she additionally showed some recent research snapshots from her upcoming CVPR 2016 work.

Raquel gave a wonderful 3D Indoor Understanding Tutorial at last year's CVPR 2015.

One of Raquel's strengths is her strong command of geometry, and her work covers both learning-based methods as well as multiple-view geometry. I strongly recommend keeping a close look at her upcoming research ideas. Below are two bleeding edge papers from Raquel's group -- the first one focuses on soccer field localization from a broadcast of such a game using branch and bound inference in a MRF.

Raquel's new work. Soccer Field Localization from Single Image. Homayounfar et al, 2016.

Soccer Field Localization from a Single Image. Namdar Homayounfar, Sanja Fidler, Raquel Urtasun. in arXiv:1604.02715.

The second upcoming paper from Raquel's group is on using Deep Learning for Dense Optical Flow, in the spirit of FlowNet, which I discussed in my ICCV 2015 hottest papers blog post. The technique is built on the observation that the scene is typically composed of a static background, as well as a relatively small number of traffic participants which move rigidly in 3D. The dense optical flow technique is applied to autonomous driving.

Deep Semantic Matching for Optical Flow. Min Bai, Wenjie Luo, Kaustav Kundu, Raquel Urtasun. In arXiv:1604.01827.

Reinforcement Learning
Sergey Levine gave an excellent Keynote on deep reinforcement learning and its application to Robotics[3]. See Sergey's Keynote video here. This kind of work is still the future, and there was very little robotics-related research in the main conference. It might not be surprising, because having an assembly of robotic arms is not cheap, and such gear is simply not present in most grad student research labs. Most ICLR work is pure software and some math theory, so a single GPU is all that is needed to start with a typical Deep Learning pipeline.

An army of robot arms jointly learning to grasp somewhere inside Google.

Take a look at the following interesting work which shows what Alex Krizhevsky, the author of the legendary 2012 AlexNet paper which rocked the world of object recognition, is currently doing. And it has to do with Deep Learning for Robotics, currently at Google.

Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen. In arXiv:1603.02199.

For those of you who want to learn more about Reinforcement Learning, perhaps it is time to check out Andrej Karpathy's Deep Reinforcement Learning: Pong From Pixels tutorial. One thing is for sure: when it comes to deep reinforcement learning, OpenAI is all-in.

Compressing Networks

Model Compression: The WinZip of
Neural Nets?

While NVIDIA might be today’s king of Deep Learning Hardware, I can’t help the feeling that there is a new player lurking in the shadows. You see, GPU-based mining of bitcoin didn’t last very long once people realized the economic value of owning bitcoins. Bitcoin very quickly transitioned into specialized FPGA hardware for running the underlying bitcoin computations, and the FPGAs of Deep Learning are right around the corner. Will NVIDIA remain the King? I see a fork in NVIDIA's future. You can continue producing hardware which pleases both gamers and machine learning researchers, or you can specialize. There is a plethora of interesting companies like Nervana Systems, Movidius, and most importantly Google, that don’t want to rely on power-hungry heatboxes known as GPUs, especially when it comes to scaling already trained deep learning models. Just take a look at Fathom by Movidius or the Google TPU.

But the world has already seen the economic value of Deep Nets, and the “software” side of deep nets isn't waiting for the FPGAs of neural nets. The software version of compressing neural networks is a very trendy topic. You basically want to take a beefy neural network and compress it down into smaller, more efficient model. Binarizing the weights is one such strategy. Student-Teacher networks where a smaller network is trained to mimic the larger network are already here. And don’t be surprised if within the next year we’ll see 1MB sized networks performing at the level of Oxford’s VGGNet on the ImageNet 1000-way classification task.

Summary from ICLR 2016's Deep Compression paper by Han et al.

This year's ICLR brought a slew of Compression papers, the three which stood out are listed below.

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. Song Han, Huizi Mao, and Bill Dally. In ICLR 2016. This paper won the Best Paper Award. See Han give the Deep Compression talk.

Neural Networks with Few Multiplications. Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, Yoshua Bengio. In ICLR 2016.

8-Bit Approximations for Parallelism in Deep Learning. Tim Dettmers. In ICLR 2016.

Unsupervised Learning
Philip Isola presented a very Efrosian paper on using Siamese Networks defined on patches to learn a patch similarity function in an unsupervised way. This patch-patch similarity function was used to create a local similarity graph defined over an image which can be used to discover the extent of objects. This reminds me of the Object Discovery line of research started by Alyosha Efros and the MIT group, where the basic idea is to abstain from using class labels in learning a similarity function.

Isola et al: A Siamese network has shared weights and can be used for learning embeddings or "similarity functions."

Learning visual groups from co-occurrences in space and time Phillip Isola, Daniel Zoran, Dilip Krishnan, Edward H. Adelson. In ICLR 2016.

Isola et al: Visual groupings applied to image patches, frames of a video, and a large scene dataset.

Initializing Networks: And why BatchNorm matters
Getting a neural network up and running is more difficult than it seems. Several papers in ICLR 2016 suggested new ways of initializing networks. But practically speaking, deep net initialization is “essentially solved.” Initialization seems to be an area of research that truly became more of a “science” than an “art” once researchers introduced BatchNorm into their neural networks. BatchNorm is the butter of Deep Learning -- add it to everything and everything will taste better. But this wasn’t always the case!

In the early days, researchers had lots of problems with constructing an initial set of weights of a deep neural network such that the back propagation could learn anything. In fact, one of the reasons why the Neural Networks of the 90s died as a research program, is precisely because it was well-known that a handful of top researchers knew how to tune their networks so that they could start automatically learning from data, but the other research didn’t know all of the right initialization tricks. It was as if the “black magic” inside the 90s NNs was just too intense. At some point, convex methods and kernel SVMs because the tools of choice — with no need to initialize in a convex optimization setting, for almost a decade (1995 to 2005) researchers just ran away from deep methods. Once 2006 hit, Deep Architectures were working again with Hinton’s magical deep Boltzmann Machines and unsupervised pretraining. Unsupervised pretaining didn’t last long, as researchers got GPUs and found that once your data set is large enough (think ~2 million images in ImageNet), that simple discriminative back-propagation does work. Random weight initialization strategies and cleverly tuned learning rates were quickly shared amongst researchers once 100s of them jumped on the ImageNet dataset. People started sharing code, and wonderful things happened!

But designing new neural networks for new problems was still problematic -- one wouldn't know exactly the best way to set multiple learning rates and random initialization magnitudes. But researchers got to work, and a handful of solid hackers from Google found out that the key problem was that poorly initialized networks were having a hard time flowing information through the networks. It’s as if layer N was producing activations in one range and the subsequent layers were expecting information to be of another order of magnitude. So Szegedy and Ioffe from Google proposed a simple “trick” to whiten the flow of data as it passes through the network. Their trick, called “BatchNorm” involves using a normalization layer after each convolutional and/or fully-connected layer in a deep network. This normalization layer whitens the data by subtracting a mean and dividing by a standard deviation, thus producing roughly gaussian numbers as information flows through the network. So simple, yet so sweet. The idea of whitening data is so prevalent in all of machine learning, that it’s silly that it took deep learning researchers so long to re-discover the trick in the context of deep nets.

Data-dependent Initializations of Convolutional Neural Networks Philipp Krähenbühl, Carl Doersch, Jeff Donahue, Trevor Darrell. In ICLR 2016. Carl Doersch, a fellow CMU PhD, is going to DeepMind, so there goes another point for DeepMind.

Backprop Tricks
Injecting noise into the gradient seems to work. And this reminds me of the common grad student dilemma where you fix a bug in your gradient calculation, and your learning algorithm does worse. You see, when you were computing the derivative on the white board, you probably made a silly mistake like messing up a coefficient that balances two terms or forgetting an additive / multiplicative term somewhere. However, with a high probability, your “buggy gradient” was actually correlated with the true “gradient”. And in many scenarios, a quantity correlated with the true gradient is better than the true gradient. It is a certain form of regularization that hasn’t been adequately addressed in the research community. What kinds of “buggy gradients” are actually good for learning? And is there a space of “buggy gradients” that are cheaper to compute than “true gradients”? These “FastGrad” methods could speed up training deep networks, at least for the first several epochs. Maybe by ICLR 2017 somebody will decide to pursue this research track.

Adding Gradient Noise Improves Learning for Very Deep Networks. Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, James Martens. In ICLR 2016.

Robust Convolutional Neural Networks under Adversarial Noise Jonghoon Jin, Aysegul Dundar, Eugenio Culurciello. In ICLR 2016.

Attention: Focusing Computations
Attention-based methods are all about treating different "interesting" areas with more care than the "boring" areas. Not all pixels are equal, and people are able to quickly focus on the interesting bits of a static picture. ICLR 2016's most interesting "attention" paper was the Dynamic Capacity Networks paper from Aaron Courville's group at the University of Montreal. Hugo Larochelle, another key researcher with strong ties to the Deep Learning mafia, is now a Research Scientist at Twitter.

Dynamic Capacity Networks Amjad Almahairi, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, Aaron Courville. In ICLR 2016.

The “ResNet trick”: Going Mega Deep because it's Mega Fun
We saw some new papers on the new “ResNet” trick which emerged within the last few months in the Deep Learning Community. The ResNet trick is the “Residual Net” trick that gives us a rule for creating a deep stack of layers. Because each residual layer essentially learns to either pass the raw data through or mix in some combination of a non-linear transformation, the flow of information is much smoother. This “control of flow” that comes with residual blocks, lets you build VGG-style networks that are quite deep.

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke. In ICLR 2016.

Resnet in Resnet: Generalizing Residual Architectures Sasha Targ, Diogo Almeida, Kevin Lyman. In ICLR 2016.

Deep Metric Learning and Learning Subcategories
A great paper, presented by Manohar Paluri of Facebook, focused on a new way to think about deep metric learning. The paper is “Metric Learning with Adaptive Density Discrimination” and reminds me of my own research from CMU. Their key idea can be distilled to the “anti-category” argument. Basically, you build into your algorithm the intuition that not all elements of a category C1 should collapse into a single unique representation. Due to the visual variety within a category, you only make the assumption that an element X of category C is going to be similar to a subset of other Cs, and not all of them. In their paper, they make the assumption that all members of category C belong to a set of latent subcategories, and EM-like learning alternates between finding subcategory assignments and updating the distance metric. During my PhD, we took this idea even further and build Exemplar-SVMs which were the smallest possible subcategories with a single positive “exemplar” member.

Manohar started his research as a member of the FAIR team, which focuses more on R&D work, but metric learning ideas are very product-focused, and the paper is a great example of a technology that seems to be "product-ready." I envision dozens of Facebook products that can benefit from such data-derived adaptive deep distance metrics.

Metric Learning with Adaptive Density Discrimination. Oren Rippel, Manohar Paluri, Piotr Dollar, Lubomir Bourdev. In ICLR 2016.

Deep Learning Calculators
LSTMs, Deep Neural Turing Machines, and what I call “Deep Learning Calculators” were big at the conference. Some people say, “Just because you can use deep learning to build a calculator, it doesn’t mean you should." And for some people, Deep Learning is the Holy-Grail-Titan-Power-Hammer, and everything that can be described with words should be built using deep learning components. Nevertheless, it's an exciting time for Deep Turing Machines.

The winner of the Best Paper Award was the paper, Neural Programmer-Interpreters by Scott Reed and Nando de Freitas. An interesting way to blend deep learning with the theory of computation. If you’re wondering what it would look like to use Deep Learning to learn quicksort, then check out their paper. And it seems like Scott Reed is going to Google DeepMind, so you can tell where they’re placing their bets.

Neural Programmer-Interpreters. Scott Reed, Nando de Freitas. In ICLR 2016.

Another interesting paper by some OpenAI guys is “Neural Random-Access Machines” which is going to be another fan favorite for those who love Deep Learning Calculators.

Neural Random-Access Machines. Karol Kurach, Marcin Andrychowicz, Ilya Sutskever. In ICLR 2016.

Computer Vision Applications
Boundary detection is a common computer vision task, where the goal is to predict boundaries between objects. CV folks have been using image pyramids, or multi-level processing, for quite some time. Check out the following Deep Boundary paper which aggregates information across multiple spatial resolutions.

Pushing the Boundaries of Boundary Detection using Deep Learning Iasonas Kokkinos, In ICLR 2016.

A great application for RNNs is to "unfold" an image into multiple layers. In the context of object detection, the goal is to decompose an image into its parts. The following figure explains it best, but if you've been wondering where to use RNNs in your computer vision pipeline, check out their paper.

Learning to decompose for object detection and instance segmentation Eunbyung Park, Alexander C. Berg. In ICLR 2016.

Dilated convolutions are a "trick" which allows you to increase your network's receptive field size and scene segmentation is one of the best application domains for such dilations.

Multi-Scale Context Aggregation by Dilated Convolutions Fisher Yu, Vladlen Koltun. In ICLR 2016.

Visualizing Networks
Two of the best “visualization” papers were “Do Neural Networks Learn the same thing?” by
Jason Yosinski (now going to Geometric Intelligence, Inc.) and “Visualizing and Understanding Recurrent Networks” presented by Andrej Karpathy (now going to OpenAI). Yosinski presented his work on studying what happens when you learn two different networks using different initializations. Do the nets learn the same thing? I remember a great conversation with Jason about figuring out if the neurons in network A can be represented as linear combinations of network B, and his visualizations helped make the case. Andrej’s visualizations of recurrent networks are best consumed in presentation/blog form[2]. For those of you that haven’t yet seen Andrej’s analysis of Recurrent Nets on Hacker News, check it out here.

Convergent Learning: Do different neural networks learn the same representations? Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, John Hopcroft. In ICLR 2016. See Yosinski's video here.

Visualizing and Understanding Recurrent Networks Andrej Karpathy, Justin Johnson, Li Fei-Fei. In ICLR 2016.

Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)?

Figure from Do Nets have to be Deep?

This was the key question asked in the paper presented by Rich Caruana. (Dr. Caruana is now at Microsoft, but I remember meeting him at Cornell eleven years ago) Their papers' two key results which are quite meaningful if you sit back and think about them. First, there is something truly special about convolutional layers that when applied to images, they are significantly better than using solely fully connected layers -- there’s something about the 2D structure of images and the 2D structures of filters that makes convolutional layers get a lot of value out of their parameters. Secondly, we now have teacher-student training algorithms which you can use to have a shallower network “mimic” the teacher’s responses on a large dataset. These shallower networks are able to learn much better using a teacher and in fact, such shallow networks produce inferior results when the are trained on the teacher’s training set. So it seems you get go [Data to MegaDeep], and [MegaDeep to MiniDeep], but you cannot directly go from [Data to MiniDeep].

Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)? Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, Matt Richardson. In ICLR 2016.

Another interesting idea on the [MegaDeep to MiniDeep] and [MiniDeep to MegaDeep] front,

Net2Net: Accelerating Learning via Knowledge Transfer Tianqi Chen, Ian Goodfellow, Jonathon Shlens. In ICLR 2016.

Language Modeling with LSTMs
There was also considerable focus on methods that deal with large bodies of text. Chris Dyer (who is supposedly also going to DeepMind), gave a keynote asking the question “Should Model Architecture Reflect Linguistic Structure?” See Chris Dyer's Keynote video here. Some of his key take-aways from comparing word-level embedding vs character-level embeddings is that for different languages, different methods work better. For languages which have a rich syntax, character-level encodings outperform word-level encodings.

Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs Miguel Ballesteros, Chris Dyer, Noah A. Smith. In Proceedings of EMNLP 2015.

An interesting approach, with a great presentation by Ivan Vendrov, was “Order-Embeddings of Images and Language" by Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun which showed a great intuitive coordinate-system-y way for thinking about concepts. I really love these coordinate system analogies and I’m all for new ways of thinking about classical problems.

Order-Embeddings of Images and Language Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun. In ICLR 2016. See Video here.

Training-Free Methods: Brain-dead applications of CNNs to Image Matching

These techniques use the activation maps of deep neural networks trained on an ImageNet classification task for other important computer vision tasks. These techniques employ clever ways of matching image regions and from the following ICLR paper, are applied to smart image retrieval.

Particular object retrieval with integral max-pooling of CNN activations. Giorgos Tolias, Ronan Sicre, Hervé Jégou. In ICLR 2016.

This reminds me of the RSS 2015 paper which uses ConvNets to match landmarks for a relocalization-like SLAM task.

Place Recognition with ConvNet Landmarks: Viewpoint-Robust, Condition-Robust, Training-Free. Niko Sunderhauf, Sareh Shirazi, Adam Jacobson, Feras Dayoub, Edward Pepperell, Ben Upcroft, and Michael Milford. In RSS 2015.

Gaussian Processes and Auto Encoders
Gaussian Processes used to be quite popular at NIPS, sometimes used for vision problems, but mostly “forgotten” in the era of Deep Learning. VAEs or Variational Auto Encoders used to be much more popular when pertaining was the only way to train deep neural nets. However, with new techniques like adversarial networks, people keep revisiting Auto Encoders, because we still “hope” that something as simple as an encoder / decoder network should give us the unsupervised learning power we all seek, deep down inside. VAEs got quite a lot of action but didn't make the cut for today's blog post.

Geometric Methods
Overall, very little content pertaining to the SfM / SLAM side of the vision problem was present at ICLR 2016. This kind of work is very common at CVPR, and it's a bit of a surprise that there wasn't a lot of Robotics work at ICLR. It should be noted that the techniques used in SfM/SLAM are more based on multiple-view geometry and linear algebra than the data-driven deep learning of today.

Perhaps a better venue for Robotics and Deep Learning will be the June 2016 workshop titled Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics. This workshop is being held at RSS 2016, one of the world's leading Robotics conferences.

Part III: Quo Vadis Deep Learning?

Neural Net Compression is going to be big -- real-world applications demand it. The algos guys aren't going to wait for TPU and VPUs to become mainstream. Deep Nets which can look at a picture and tell you what’s going on are going to be inside every single device which has a camera. In fact, I don’t see any reason why all cameras by 2020 won’t be able to produce a high-quality RGB image as well as a neural network response vector. New image formats will even have such “deep interpretation vectors” directly saved alongside the image. And it's all going to be a neural net, in one shape or another.

OpenAI had a strong presence at ICLR 2016, and I feel like every week a new PhD joins OpenAI. Google DeepMind and Facebook FAIR had a large number of papers. Google demoed a real-time version of deep-learning based style transfer using TensorFlow. Microsoft is no longer King of research. Startups were giving out little toys -- Clarifai even gave out free sandals. Graduates with well-tuned Deep Learning skills will continue being in high-demand, but once the next generation of AI-driven startups emerge, it is only those willing to transfer their academic skills into a product world-facing focus, aka the upcoming wave of deep entrepreneurs, that will make serious $$$.

Research-wise, arXiv is a big productivity booster. Hopefully, now you know where to place your future deep learning research bets, have enough new insights to breath some inspiration into your favorite research problem, and you've gotten a taste of where the top researchers are heading. I encourage you to turn off your computer and have a white-board conversation with your colleagues about deep learning. Grab a friend, teach him some tricks.

I'll see you all at CVPR 2016. Until then, keep learning.

Related computervisionblog.com Blog Posts

Why your lab needs a reading group. May 2012
ICCV 2015: 21 Hottest Research Papers December 2015
Deep Down the Rabbit Hole: CVPR 2015 and Beyond June 2015
The Deep Learning Gold Rush of 2015 November 2015
Deep Learning vs Machine Learning vs Pattern Recognition March 2015
Deep Learning vs Probabilistic Graphical Models April 2015
Future of Real-time SLAM and "Deep Learning vs SLAM" January 2016

Relevant Outside Links

[1] Welcome to the AI Conspiracy: The 'Canadian Mafia' Behind Tech's Latest Craze @ <re/code>
[2] The Unreasonable Effectiveness of Recurrent Neural Networks @ Andrej Karpathy's Blog
[3] Deep Learning for Robots: Learning from Large-Scale Interaction. @ Google Research Blog

↧

Making Deep Networks Probabilistic via Test-time Dropout

June 17, 2016, 4:24 am

≫ Next: Nuts and Bolts of Building Deep Learning Applications: Ng @ NIPS2016

≪ Previous: Deep Learning Trends @ ICLR 2016

In Quantum Mechanics, Heisenberg's Uncertainty Principle states that there is a fundamental limit to how well one can measure a particle's position and momentum. In the context of machine learning systems, a similar principle has emerged, but relating interpretability and performance. By using a manually wired or shallow machine learning model, you'll have no problem understanding the moving pieces, but you will seldom be happy with the results. Or you can use a black-box deep neural network and enjoy the model's exceptional performance. Today we'll see one simple and effective trick to make our deep black boxes a bit more intelligible. The trick allows us to convert neural network outputs into probabilities, with no cost to performance, and minimal computational overhead.

Interpretability vs Performance: Deep Neural Networks perform well on most computer vision tasks, yet they are notoriously difficult to interpret.

The desire to understand deep neural networks has triggered a flurry of research into Neural Network Visualization, but in practice we are often forced to treat deep learning systems as black-boxes. (See my recent Deep Learning Trends @ ICLR 2016 post for an overview of recent neural network visualization techniques.) But just because we can't grok the inner-workings of our favorite deep models, it doesn't mean we can't ask more out of our deep learning systems.

There exists a simple trick for upgrading black-box neural network outputs into probability distributions.

The probabilistic approach provides confidences, or "uncertainty" measures, alongside predictions and can make almost any deep learning systems into a smarter one. For robotic applications or any kind of software that must make decisions based on the output of a deep learning system, being able to provide meaningful uncertainties is a true game-changer.

Applying Dropout to your Deep Neural Network is like occasionally zapping your brain

The key ingredient is dropout, an anti-overfitting deep learning trick handed down from Hinton himself (Krizhevsky's pioneering 2012 paper). Dropout sets some of the weights to zero during training, reducing feature co-adaptation, thus improving generalization.

Without dropout, it is too easy to make a moderately deep network attain 100% accuracy on the training set.

The accepted knowledge is that an un-regularized network (one without dropout) is too good at memorizing the training set. For a great introductory machine learning video lecture on dropout, I highly recommend you watch Hugo Larochelle's lecture on Dropout for Deep learning.

Geoff Hinton's dropout lecture, also a great introduction, focuses on interpreting dropout as an ensemble method. If you're looking for new research ideas in the dropout space, a thorough understanding of Hinton's interpretation is a must.

But while dropout is typically used at training-time, today we'll highlight the keen observation that dropout used at test-time is one of the simplest ways to turn raw neural network outputs into probability distributions. Not only does this probabilistic "free upgrade" often improve classification results, it provides a meaningful notion of uncertainty, something typically missing in Deep Learning systems.

The idea is quite simple: to estimate the predictive mean and predictive uncertainty, simply collect the results of stochastic forward passes through the model using dropout.

How to use dropout: 2016 edition

Start with a moderately sized network
Increase your network size with dropout turned off until you perfectly fit your data
Then, train with dropout turned on
At test-time, turn on dropout and run the network T times to get T samples
The mean of the samples is your output and the variance is your measure of uncertainty

Remember that drawing more samples will increase computation time during testing unless you're clever about re-using partial computations in the network. Please note that if you're only using dropout near the end of your network, you can reuse most of the computations. If you're not happy with the uncertainty estimates, consider adding more layers of dropout at test-time. Since you'll already have a pre-trained network, experimenting with test-time dropout layers is easy.

Bayesian Convolutional Neural Networks

To be truly Bayesian about a deep network's parameters, we wouldn't learn a single set of parameters w, we would infer a distribution over weights given the data, p(w|X,Y). Training is already quite expensive, requiring large datasets and expensive GPUs.

Bayesian learning algorithms can in theory provide much better parameter estimates for ConvNets and I'm sure some of our friends at Google are working on this already.

But today we aren't going to talk about such full Bayesian Deep Learning systems, only systems that "upgrade" the model prediction y to p(y|x,w). In other words, only the network outputs gain a probabilistic interpretation.

An excellent deep learning computer vision system which uses test-time dropout comes from a recent University of Cambridge technique called SegNet. The SegNet approach introduced an Encoder-Decoder framework for dense semantic segmentation. More recently, SegNet includes a Bayesian extension that uses dropout at test-time for providing uncertainty estimates. Because the system provides a dense per-pixel labeling, the confidences can be visualized as per-pixel heatmaps. Segmentation system is not performing well? Just look at the confidence heatmaps!

Bayesian SegNet. A fully convolutional neural network architecture which provides

per-pixel class uncertainty estimates using dropout.

The Bayesian SegNet authors tested different strategies for dropout placement and determined that a handful of dropout layers near the encoder-decoder bottleneck is better than simply using dropout near the output layer. Interestingly, Bayesian SegNet improves the accuracy over vanilla SegNet. Their confidence maps shown high uncertainty near object boundaries, but different test-time dropout schemes could provide a more diverse set of uncertainty estimates.

Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding Alex Kendall, Vijay Badrinarayanan, Roberto Cipolla, in arXiv:1511.02680, November 2015. [project page with videos]

Confidences are quite useful for evaluation purposes, because instead of providing a single average result across all pixels in all images, we can sort the pixels and/or images by the overall confidence in prediction. When evaluation the top 10% most confident pixels, we should expect significantly higher performance. For example, the Bayesian SegNet approach achieves 75.4% global accuracy on the SUN RGBD dataset, and an astonishing 97.6% on most confident 10% of the test-set [personal communication with Bayesian SegNet authors]. This kind of sort-by-confidence evaluation was popularized by the PASCAL VOC Object Detection Challenge, where precision/recall curves were the norm. Unfortunately, as the research community moved towards large-scale classification, the notion of confidence was pushed aside. Until now.

Theoretical Bayesian Deep Learning

Deep networks that model uncertainty are truly meaningful machine learning systems. It ends up that we don't really have to understand how a deep network's neurons process image features to trust the system to make decisions. As long as the model provides uncertainty estimates, we'll know when the model is struggling. This is particularly important when your network is given inputs that are far from the training data.

The Gaussian Process: A machine learning approach with built-in uncertainty modeling

In a recent ICML 2016 paper, Yarin Gal and Zoubin Ghahramani develop a new theoretical framework casting dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Gal's paper gives a complete theoretical treatment of the link between Gaussian processes and dropout, and develops the tools necessary to represent uncertainty in deep learning. They show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process. I have yet to see researchers use dropout between every layer, so the discrepancy between theory and practice suggests that more research is necessary.

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning Yarin Gal, Zoubin Ghahramani, in ICML. June 2016. [Appendix with relationship to Gaussian Processes]
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks Yarin Gal, in arXiv:1512.05287. May 2016.

What My Deep Model Doesn't Know. Yarin Gal. Blog Post. July 2015

Homoscedastic and Heteroscedastic Regression with Dropout Uncertainty. Yarin Gal. Blog Post. February 2016.

Test-time dropout is used to provide uncertainty estimates for deep learning systems.

In conclusion, maybe we can never get both interpretability and performance when it comes to deep learning systems. But, we can all agree that providing confidences, or uncertainty estimates, alongside predictions is always a good idea. Dropout, the very single regularization trick used to battle overfitting in deep models, shows up, yet again. Sometimes all you need is to add some random variations to your input, and average the results over many trials. Dropout lets you not only wiggle the network inputs but the entire architecture.

I do wonder what Yann LeCun thinks about Bayesian ConvNets... Last I heard, he was allergic to sampling.

Related Posts
Deep Learning vs Probabilistic Graphical Models vs Logic April 2015
Deep Learning Trends @ ICLR 2016 June 2016

↧

Nuts and Bolts of Building Deep Learning Applications: Ng @ NIPS2016

December 15, 2016, 9:13 pm

≫ Next: DeepFakes: AI-powered deception machines

≪ Previous: Making Deep Networks Probabilistic via Test-time Dropout

You might go to a cutting-edge machine learning research conference like NIPS hoping to find some mathematical insight that will help you take your deep learning system's performance to the next level. Unfortunately, as Andrew Ng reiterated to a live crowd of 1,000+ attendees this past Monday, there is no secret AI equation that will let you escape your machine learning woes. All you need is some rigor, and much of what Ng covered is his remarkable NIPS 2016 presentation titled "The Nuts and Bolts of Building Applications using Deep Learning" is not rocket science. Today we'll dissect the lecture and Ng's key takeaways. Let's begin.

Figure 1. Andrew Ng delivers a powerful message at NIPS 2016.

Andrew Ng and the Lecture

Andrew Ng's lecture at NIPS 2016 in Barcelona was phenomenal -- truly one of the best presentations I have seen in a long time. In a juxtaposition of two influential presentation styles, the CEO-style and the Professor-style, Andrew Ng mesmerized the audience for two hours. Andrew Ng's wisdom from managing large scale AI projects at Baidu, Google, and Stanford really shows. In his talk, Ng spoke to the audience and discussed one of they key challenges facing most of the NIPS audience -- how do you make your deep learning systems better? Rather than showing off new research findings from his cutting-edge projects, Andrew Ng presented a simple recipe for analyzing and debugging today's large scale systems. With no need for equations, a handful of diagrams, and several checklists, Andrew Ng delivered a two-whiteboards-in-front-of-a-video-camera lecture, something you would expect at a group research meeting. However, Ng made sure to not delve into Research-y areas, likely to make your brain fire on all cylinders, but making you and your company very little dollars in the foreseeable future.

Money-making deep learning vs Idea-generating deep learning

Andrew Ng highlighted the fact that while NIPS is a research conference, many of the newly generated ideas are simply ideas, not yet battle-tested vehicles for converting mathematical acumen into dollars. The bread and butter of money-making deep learning is supervised learning with recurrent neural networks such as LSTMs in second place. Research areas such as Generative Adversarial Networks (GANs), Deep Reinforcement Learning (Deep RL), and just about anything branding itself as unsupervised learning, are simply Research, with a capital R. These ideas are likely to influence the next 10 years of Deep Learning research, so it is wise to focus on publishing and tinkering if you really love such open-ended Research endeavours. Applied deep learning research is much more about taming your problem (understanding the inputs and outputs), casting the problem as a supervised learning problem, and hammering it with ample data and ample experiments.

"It takes surprisingly long time to grok bias and variance deeply, but people that understand bias and variance deeply are often able to drive very rapid progress."

--Andrew Ng

The 5-step method of building better systems

Most issues in applied deep learning come from a training-data / testing-data mismatch. In some scenarios this issue just doesn't come up, but you'd be surprised how often applied machine learning projects use training data (which is easy to collect and annotate) that is different from the target application. Andrew Ng's discussion is centered around the basic idea of bias-variance tradeoff. You want a classifier with a good ability to fit the data (low bias is good) that also generalizes to unseen examples (low variance is good). Too often, applied machine learning projects running as scale forget this critical dichotomy. Here are the four numbers you should always report:

Training set error
Testing set error
Dev (aka Validation) set error
Train-Dev (aka Train-Val) set error

Andrew Ng suggests following the following recipe:

Figure 2. Andrew Ng's "Applied Bias-Variance for Deep Learning Flowchart"

for building better deep learning systems.

Take all of your data, split it into 60% for training and 40% for testing. Use half of the test set for evaluation purposes only, and the other half for development (aka validation). Now take the training set, leave out a little chunk, and call it the training-dev data. This 4-way split isn't always necessary, but consider the worse case where you start with two separate sets of data, and not just one: a large set of training data and a smaller set of test data. You'll still want to split the testing into validation and testing, but also consider leaving out a small chunk of the training data for the training-validation. By reporting the data on the training set vs the training-validation set, you measure the "variance."

Figure 3. Human-level vs Training vs Training-dev vs Dev vs Test.

Taken from Andrew Ng's 2016 talk.

In addition to these four accuracies, you might want to report the human-level accuracy, for a total of 5 quantities to report. The difference between human-level and training set performance is the Bias. The difference between the training set and the training-dev set is the Variance. The difference between the training-dev and dev sets is the train-test mismatch, which is much more common in real-world applications that you'd think. And finally, the difference between the dev and test sets measures how overfitting.

Nowhere in Andrew Ng's presentation does he mention how to use unsupervised learning, but he does include a brief discussion about "Synthesis." Such synthesis ideas are all about blending pre-existing data or using a rendering engine to augment your training set.

Conclusion
If you want to lose weight, gain muscle, and improve your overall physical appearance, there is no magical protein shake and no magical bicep-building exercise. The fundamentals such as reduced caloric intake, getting adequate sleep, cardiovascular exercise, and core strength exercises like squats and bench presses will get you there. In this sense, fitness is just like machine learning -- there is no secret sauce. I guess that makes Andrew Ng the Arnold Schwarzenegger of Machine Learning.

What you are most likely missing in your life is the rigor of reporting a handful of useful numbers such as performance on the 4 main data splits (see Figure 3). Analyzing these numbers will let you know if you need more data or better models, and will ultimately let you hone in your expertise on the conceptual bottleneck in your system (see Figure 2).

With a prolific research track record that never ceases to amaze, we all know Andrew Ng as one hell of an applied machine learning researcher. But the new Andrew Ng is not just another data-nerd. His personality is bigger than ever -- more confident, more entertaining, and his experience with a large number of academic and industrial projects makes him much wiser. With enlightening lectures as "The Nuts and Bolts of Building Applications with Deep Learning" Andrew Ng is likely to be an individual whose future keynotes you might not want to miss.

Appendix
You can watch a September 27th, 2016 version of the Andrew Ng Nuts and Bolts of Applying Deep Learning Lecture on YouTube, which he delivered at the Deep Learning School. If you are working on machine learning problems in a startup, then definitely give the video a watch. I will update the video link once/if the newer NIPS 2016 version shows up online.

You can also check out Kevin Zakka's blog post for ample illustrations and writeup corresponding to Andrew Ng's entire talk.

↧

DeepFakes: AI-powered deception machines

May 16, 2018, 12:22 pm

≫ Next: Computer Vision and Visual SLAM vs. AI Agents

≪ Previous: Nuts and Bolts of Building Deep Learning Applications: Ng @ NIPS2016

Driven by computer vision and deep learning techniques, a new wave of imaging attacks has recently emerged which allows anyone to easily create highly realistic "fake" videos. These false videos are known as Deep Fakes. While highly entertaining at times, DeepFakes can be used to perturb society and some would argue that the pre-shock has already begun. A rogue DeepFake which goes viral can spread misinformation across the internet like wildfire.

"The ability to effortlessly create visually plausible editing of faces in videos has the potential to severely undermine trust in any form of digital communication. "

--Rössler et al. FaceForensics [3]

Because DeepFakes contain a unique combination of realism and novelty, they are more difficult to detect on social networks as compared to traditional "bad" content like pornography and copyrighted movies. Video hashing might work for finding duplicates or copyright-infringing content, but not good enough for DeepFakes. To fight face-manipulating DeepFake AI, one needs an even stronger AI.

As today's DeepFakes are based on Deep Learning, and Deep Learning tools like TensorFlow and PyTorch are accessible to anybody with a modern GPU, such face manipulation tools are particularly disruptive. The democratization of Artificial Intelligence has brought us near infinite use-cases. From the DeepDream phenomenon of 2015 to the Deep Style Transfer Art apps of 2016, 2018 is the year of the DeepFake. Today's computer vision technology allows a hobbyist to create a Deep Fake video of just about any person they want performing any action they want, in a matter of hours, using commodity computer hardware.

Fig 1. DeepFakes generate "false impressions" which are attacks on the human mind.

What is a Deep Fake?
A deep fake is a video generated from a modern computer vision puppeteering face-swap algorithm which can be used to generate a video of target person X performing target action A, usually given a video of another person Y performing action A. The underlying system learns two face models, one of target person X, and of for person Y, the person in the original video. It then learns a mapping between the two faces, which can be used to create the resulting "fake" video. Techniques for facial reenactment have been pioneered by movie studios for driving character animations from real actors' faces, but these techniques are now emerging as deep learning-based software packages, letting the deep convolutional neural networks do most of the work during model training.

Consider the following collage of faces. Can you guess which ones are real and which ones are DeepFakes?

Fig 2. Can you tell which faces are real and which ones are fake?

Figure from Face Forensics[3]

It is not so easy to tell which image is modified and which one is unadulterated. And if you do a little bit of searching for DeepFakes (warning, unless you are careful, you will encounter lots of pornographic content) you notice that the faces in those videos look very realistic.

How are Deep Fakes made?
While there are conceptually many different ways to make Deep Fakes, today we'll focus on two key underlying techniques: face detection from videos, and deep learning for creating frame alignments between source face X and target face Y.

A lot of this research started with the Face2face work [1] presented at CVPR 2016. This paper was a modernization of the group's earlier SIGGRAPH paper and focused a lot more on the computer vision details. At this time the tools were good enough to create SIGGRAPH-quality videos, but it took a lot of work to put together a facial reenactment rig. In addition, the underlying algorithms did not use any deep learning, so a lot of domain-knowledge (i.e., face modeling expertise) went into making these algorithms work robustly. The TUM/Stanford guys filed their Real-time facial reenactment patent in 2016 [4], and have more recently worked on FaceForensics[3] to detect such manipulated imagery.

Fig3. Face2Face technique from 2016. It is 2018 now, so just imagine how much better this works now!

In addition to the Face2face guys (who have now a handful of similarly themed papers), it is interesting to note that a lot of key early ideas in face puppeteering were pioneered by Ira Kemelmacher-Shlizerman who is now a computer vision and graphics assistant professor at University of Washington. She worked on early face puppeteering technology for the 2010 paper Being John Malkovich, continued with the Photobios work, and later founded Dreambit (based on a SIGGRAPH 2016 paper), which was acquired by Facebook. :-)

Fig 4. Ira's early work on face swapping in 2010. See the Being John Malkovich paper[2].

Take a look at Ira's Dreambit video, which shows some high-quality "entertainment" value out of rapidly produced non-malicious DeepFakes!

Fig 5. Ira's Dreambit system. Lets her imagine herself in different eras, with different hairstyles, etc.

The origin of Ira's Dreambit system is the Transfiguring Portraits SIGGRAPH 2016 paper[6]. What's important to note is that this is 2016 and we're starting to see some use of Deep Learning. The transfiguring portraits work used a big mix of features, using some CNN features computed from early Caffe networks. It is not an entirely easy-to-use system at this point, but good enough to make SIGGRAPH videos, take a one minute to generate other cool outputs, and definitely cool enough for Facebook to acquire.

Fig 6. Transfiguring Portraits. The system used lots of features, but Deep Learning-based CNN features are starting to show up.

Fighting against DeepFakes
There are now published algorithms which try to battle DeepFakes by determining if faces/videos are fake or not. FaceForensics[3] introduces a large DeepFake dataset based on their earlier Face2face work. This dataset contains both real and "fake" Face2face output videos. More importantly, the new dataset is big enough to train a deep learning system to determine if an image is counterfeit. In addition, they are able to both 1.) determine which pixels have likely been manipulated, and 2.) perform a deep cleanup stage to make even better DeepFakes.

Fig 7. The "fakeness" masks in FaceForensics[3] are based on XceptionNet

Another fake detection approach, this time from a Berkeley AI Research group called Image Splice Detection, focuses on detecting where an image was spliced to create a fake composite image. This allows them to determine which part of the image was likely "photoshopped" and the technique is not specific to faces. And because this is a 2018 paper, it should not be a surprise that this kind of work is all based on deep learning techniques.

Fig 8. Fighting Fake News: Image Splice Detection[5]. Response maps are aggregated to determine the combined probability mask.[5]

From the Fighting Fake News paper,

"As new advances in computer vision and image-editing emerge, there is an increasingly urgent need for effective visual forensics methods. We see our approach, which successfully detects manipulations without seeing examples of manipulated images, as being an initial step toward building general-purpose forensics tools."

Concluding Remarks
The early DeepFake tools were pioneered in the early 2010s and were producing SIGGRAPH-quality results by 2015. It was only a matter of years until DeepFake generators became publicly available. 2018's DeepFake generators, being written on top of open-source Deep Learning libraries, are much easier to use than the researchy systems from only a few years back. Today, just about any hobbyist with minimal computer programming knowledge and a GPU can build their own DeepFakes.

Just as Deep Fakes are getting better, Generative Adversarial Networks are showing more promise for photorealistic image generation. It is likely that we will soon see lots of exciting new work on both the generative side (deep fake generation) and the discriminative side (deep fake detection and image forensics) which incorporate more and more ideas from the machine learning community.

References

[1] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. "Face2face: Real-time face capture and reenactment of rgb videos." In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 2387-2395. IEEE, 2016.

[2] Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steven M. Seitz. "Being john malkovich." In European Conference on Computer Vision, pp. 341-353. Springer, 2010.

[3] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. "FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces." arXiv preprint arXiv:1803.09179, 2018.

[4] Christian Theobalt, Michael Zollhöfer, Marc Stamminger, Justus Thies, Matthias Nießner. Real-time Expression Transfer for Facial Reenactment Invention. 2018/3/8. Application Number 15256710

[5] Minyoung Huh, Andrew Liu, Andrew Owens, Alexei A. Efros, "Fighting Fake News: Image Splice Detection via Learned Self-Consistency." arXiv preprint arXiv:1805.04096, 2018

[6] Ira Kemelmacher-Shlizerman, "Transfiguring portraits." ACM Transactions on Graphics (TOG), 35(4), p.94. 2016

↧

Computer Vision and Visual SLAM vs. AI Agents

November 19, 2019, 2:18 am

≪ Previous: DeepFakes: AI-powered deception machines

With all the recent advancements in end-to-end deep learning, it is now possible to train AI agents to perform many different tasks (some in simulation and some in the real-world). End-to-end learning allows one to replace a multi-component, hand-engineered system with a single learning network that can process raw sensor data and output actions for the AI to take in the physical world. I will discuss the implications of these ideas while highlighting some new research trends regarding Deep Learning for Visual SLAM and conclude with some predictions regarding the kinds of spatial reasoning algorithms that we will need in the future.

In today's article, we will go over three ideas:

I.) Does Computer Vision Matter for Action?

II.) Visual SLAM for AI agents

III.) Quō vādis Visual SLAM? Trends and research forecast

I. Does Computer Vision Matter for Action?

At last month's International Conference of Computer Vision (ICCV 2019), I heard the following thought-provoking question,

"What do Artificial Intelligence Agents need (if anything) from the field of Computer Vision?"

The question was posed by Vladlen Koltun (from Intel Research) during his talk at the Deep Learning for Visual SLAM Workshop at ICCV 2019 in Seoul. He spoke about building AI agents with and without the aid of computer vision to guide representation learning. While Koltun has worked on classical Visual SLAM (see his Direct Sparse Odometry (DSO) system [2]), at this workshop, he decided to not speak about his older work on geometry, alignment, or 3D point cloud processing. His talk included numerous ideas spanning several of his team's research papers, some humor (see video), and plenty of Koltun's philosophical views towards general artificial intelligence.

Recent techniques show that it is possible to learn actions (the output quantities that we really want) from pixels (raw inputs) directly without any intermediate computer vision processing like object recognition, depth estimation, and segmentation. But just because it is possible to solve some AI tasks without intermediate representations (i.e., the computer vision stuff), does that mean that we should abandon computer vision research and let end-to-end learning take care of everything? Probably not.

From a very practical standpoint, let's ask the following question:

"Is an agent who is aware of computer vision stuff more robust than an agent trained without intermediate representations?"

Recent research from Koltun's lab [3] indicates that the answer is yes: training with intermediate representations, as done by supervision from per-frame computer vision tasks, gives rise to more robust agents that learn faster and are more robust in a variety of performance tasks! The next natural question is: which computer vision tasks matter most for agent robustness? Koltun's research suggests that depth estimation is one particular task that works well as an auxiliary task when training agents that have to move through space (i.e., most video games). A depth estimation network should help an AI agent navigate an unknown environment as depth estimation is one key component in many of today's RGBD Visual SLAM systems. The best way to learn about Koltun's paper, titled Does Computer Vision Matter for Action?, is to see the video on YouTube.

Video describing Koltun's Does Computer Vision Matter for Action? [3]

Let's imagine that you want to deploy a robot into the world sometimes from now until 2025 based on your large-scale AI agent training, and you're debating whether you should avoid intermediate representations or not.

Intermediate representations facilitate explainability, debuggability, and testing. Explainability is a key to success when systems require spatial reasoning capabilities in the real-world. If your agents are misbehaving, take a look at their intermediate representations. If you want to improve your AI, you can analyze the computer vision systems to prioritize better your data collection effort. Visualization should be a first-order citizen in your deep learning toolbox.

But today's computer vision ecosystem offers more than algorithms that process individual images. Visual SLAM systems rapidly process images while updating the camera's trajectory and updating the 3D map of the world. Visual SLAM, or VSLAM, algorithms are the real-time variants of Structure-from-Motion (SfM), which has been around for a while. SfM uses bundle adjustment -- a minimization of reprojection error, usually solved with Levenberg Marquardt. If there any kind of robot you see moving around today (2019), it is likely that it is running some variant of SLAM (localization and mapping) and not an end-to-end trained network -- at least not today. So what does Visual SLAM mean for AI agents?

II. Visual SLAM for AI Agents

While no single per-frame computer vision algorithm is close to sufficient to enable robust action in an environment, there is a class of real-time computer vision systems like Visual SLAM that can be used to guide agents through space. The Workshop on Deep Learning for Visual SLAM at ICCV 2019 showcased a variety of different Visual SLAM approaches and included a discussion panel. The workshop featured talks on Visual SLAM on mobile platforms (Victor Prisacariu from 6d.ai), autonomous cars (Daniel Cremers from TUM and ArtiSense.ai), high-detail indoor modeling (Angela Dai from TUM), AI Agents (Vladlen Koltun from Intel Research) and mixed-reality (Tomasz Malisiewicz from Magic Leap).

Teaser Image for the 2nd Workshop on Deep Learning for Visual SLAM from Ronnie Clark.
See info at http://visualslam.ai

When it comes to spatial perception capabilities, Koltun's talk made it clear that we, as computer vision researchers, could think bolder. There is a spectrum of spatial perception capabilities that AI agents need that only somewhat overlaps with traditional Visual SLAM (whether deep learning-based or not).

Koltun's work is in favor of using intermediate representations based on computer vision to produce more robust AI agents. However, Koltun is not convinced that 6dof Visual SLAM, as is currently defined, needs to be solved for AI agents. Let's consider ordinary human tasks like walking, washing your hands, and flossing your teeth -- each one requires a different amount of spatial reasoning abilities. It is reasonable to assume that AI agents would need varying degrees of spatial localization and mapping capabilities to perform such tasks.

Visual SLAM techniques, like the ones used inside Augmented Reality systems, build metric 3D maps of the environment for the task of high-precision placement of digital content -- but such high-precision systems might never be used directly inside AI agents. When the camera is hand-held (augmented reality) or head-mounted (mixed reality), a human decides where to move. AI agents have to make their own movement decisions, and this requires more than feature correspondences and bundle adjustment -- more than what is inside the scope of computer vision.

Inside a head-mounted display, you might look at digital content 30 feet away from you, and for everything to look correct geometrically, you must have a decent 3D map of the world (spanning at least 30 feet) and a reasonable estimate of your pose. But for many tasks that AI agents need to perform, metric-level representations of far-away geometry are un-necessary. It is as if proper action requires local, high-quality metric maps and something coarser like topological maps for large-range maps. Visual SLAM systems (stereo-based and depth-sensor based) are likely to find numerous applications in industry such as mixed reality and some branches of robotics, where millimeter precision matters.

More general end-to-end learning for AI agents will show us new kinds of spatial intelligence, automatically learned from data. There is a lot of exciting research to be done to answer questions like the following: What kind of tasks can we train Visual AI Agents for such that map-building and localization capabilities arise? Or What type of core spatial reasoning capabilities can we pre-build to enable further self-supervised learning from the 3D world?

III. Quō vādis Visual SLAM? Trends and research forecast

At the Deep Learning Workshop for Visual SLAM, an interesting question that came up in the panel focused on the convergence of methods in Visual SLAM. Or alternatively,

"Will a single Visual SLAM framework rule them all?"

The world of applied research is moving towards more deep learning -- by 2019, many of the critical tasks inside computer vision exist as some form of a (convolutional/graph) neural network. I don't believe that we will see a single SLAM framework/paradigm dominate all others -- I think we will see a plurality of Visual SLAM systems based on inter-changeable deep learning components. This new generation of deep learning-based components will allow more creative applications of end-to-end learning and be typically useful as modules within other real-world systems. We should create tools that will enable others to make better tools.

PyTorch is making it easy to build multiple-view geometry tools like Kornia -- such that the right parts of computer vision are brought directly into today's deep learning ecosystem as first-order citizens. And PyTorch is winning over the world of research. A dramatic increase in usage happened from 2017 to 2019, with PyTorch now the recommended framework amongst most of my fellow researchers.

To take a look at what the end goal in terms of end-to-end deep learning for visual SLAM might look like, take a look at gradSLAM from Krishna Murthy, a Ph.D. student in MILA, and collaborators at CMU. Their paper offers a new way of thinking of SLAM as made up of differentiable blocks. From the article, "This amalgamation of dense SLAM with computational graphs enables us to backprop from 3D maps to 2D pixels, opening up new possibilities in gradient-based learning for SLAM."

Key Figure from the gradSLAM paper on end-to-end learning for SLAM. [5]

Another key trend that seems to be on the rise inside the context of Deep Visual SLAM is self-supervised learning. We are seeing more and more practical successes of self-supervised learning for multi-view problems where geometry enables us to get away from strong supervision. Even the ConvNet-based point detector SuperPoint [7], which my team and I developed at Magic Leap, uses self-supervision to train more robust interest point detectors. In our case, it was impossible to get ground truth interest points on images, and self-labeling was the only way out. One of my favorite researchers working on self-supervised techniques is Adrien Gaidon from TRI, who studies how such methods can be used to make smarter cars. Adrien gave some great talks at other ICCV 2019 Workshops related to autonomous vehicles, and his work is closely related to Visual SLAM and useful for anybody working on similar problems.

Adrien Gaidon's talk from October 11th, 2019 on Self-Supervised Learning in the context of Autonomous Cars

Another excellent presentation about this topic from Alyosha Efros. He does a great job convincing you why you should love self-supervision.

A presentation about self-supervision from Alyosha Efros on May 25th, 2018

Conclusion

As more and more spatial reasoning skills get baked into deep networks, we must face two opposing forces. On the one hand, specifying internal representations makes it difficult to scale to new tasks -- it is easier to trick the deep nets into doing all the hard work for you. On the other hand, we want interpretability and some amount of safety when we deploy AI agents into the real world, so some intermediate tasks like object recognition are likely to be involved in today's spatial perception recipe. Lots of exciting work is happening with multi-agents from OpenAI [6], but full end-to-end learning will not give real-world robots such as autonomous cars anytime soon.

Video from OpenAI showing Multi-Agent Hide and Seek. [6]

More practical Visual SLAM research will focus on differentiable high-level blocks. As more deep learning happens in Visual SLAM, it will create a renaissance in Visual SLAM as sharing entire SLAM systems will be as easy as sharing CNNs today. I cannot wait until the following is possible:

pip install DeepSLAM

I hope you enjoyed learning about the different approaches to Visual SLAM, and that you have found my blog post insightful and educational. Until next time!

References:

[1]. Vladlen Koltun. Chief Scientist for Intelligent Systems at Intel. http://vladlen.info/

[2]. Direct Sparse Odometry. Jakob Engel, Vladlen Koltun, and Daniel Cremers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3), 2018. http://vladlen.info/publications/direct-sparse-odometry/

[3]. Does Computer Vision Matter for Action? Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. Science Robotics, 4(30), 2019. http://vladlen.info/publications/computer-vision-matter-action/

[4]. Kornia: an Open Source Differentiable Computer Vision Library for PyTorch. Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Winter Conference on Applications of Computer Vision, 2019. https://kornia.github.io/

[5]. gradSLAM: Dense SLAM meets Automatic Differentiation. Krishna Murthy J., Ganesh Iyer, and Liam Paull. In arXiv, 2019. http://montrealrobotics.ca/gradSLAM/

[6] Emergent tool use from multi-agent autocurricula. Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. In arXiv 2019. https://openai.com/blog/emergent-tool-use/
[7] SuperPoint: Self-supervised interest point detection and description. Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018. https://arxiv.org/abs/1712.07629

↧