AI development today has become fixated on singular monolithic models, trained end-to-end without human assistance, that bundle together an almost general-intelligence-like variety of tasks. The resulting models have struggled, in areas like content moderation, to sufficiently abstract beyond their limited training data. Yet, as Waymo reminds us, the most successful complex AI systems combine multiple deep learning models with traditional hand-coded algorithms to address one of the greatest challenges confronting today’s deep learning systems: their inability to abstract from correlation to causation.
Waymo put it best this past December when the company noted
that “deep learning identifies correlations in the training data, but it arguably cannot build causal models by purely observing correlations … knowing why an expert driver behaved the way they did and what they were reacting to is critical to building a causal model of driving. For this reason, simply having a large number of expert demonstrations to imitate is not enough.”
In Waymo’s case, the company addresses this through two primary means.
The first is hand-coding some rules, such as simply telling the vehicle to stop at red stoplights rather than forcing it to learn this rule from observation. This mimics the human driving experience, in which new drivers learn through a combination of memorizing the written rules of the road, observation, and experiential driving.
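This rule-plus-learning split can be sketched in a few lines. This is a minimal illustration, not Waymo's actual architecture: the observation format and the stub policy below are invented for the example.

```python
# Hybrid controller sketch: a hand-coded safety rule overrides a learned
# policy, so the rule never has to be inferred from training data.
# (Illustrative names only -- not any real driving stack.)

def learned_policy(observation):
    """Stand-in for a neural-network driving policy."""
    return "proceed"

def drive(observation):
    # Hand-coded rule: always stop at a red light -- no learning required,
    # and no risk of the network failing to infer it from observation.
    if observation.get("light") == "red":
        return "stop"
    # Everything else is delegated to the learned component.
    return learned_policy(observation)

print(drive({"light": "red"}))    # 'stop' -- the rule fires unconditionally
print(drive({"light": "green"}))  # 'proceed' -- the learned policy takes over
```

The design choice mirrors the article's point: encoding the rule by hand guarantees the behavior, where a purely learned system could only make it statistically likely.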
The second is the use of simulators to offer the algorithms a wide array of unusual experiences not ordinarily found in their training data. As the company puts it, “for obvious reasons, we don’t want our expert drivers to get into near-collisions or climb curbs just to show a neural network how to recover in these cases,” so it relies on simulated driving to teach its algorithms about these experiences. Once again, this is how humans are typically taught to recover from exceptional circumstances, whether in flight simulators or in simulators designed to mimic new vehicle interiors and experiences, so the approach brings the AI learning process closer to the one humans use.
Yet these simulators also serve a second purpose beyond merely helping the algorithms experience rare driving events: they build up the diversity of the algorithms’ statistical inferences, helping to compensate for their inability to abstract upwards from training data towards causal understanding.
The inability to reason at a causal level is perhaps the greatest challenge confronting the future of deep learning.
No matter how hauntingly accurate their results, today’s deep learning algorithms are still at the end of the day nothing more than statistical inferencing engines. They reduce data to minute building blocks, but rather than assemble those building blocks into higher order semantic relationships and causal models like humans, today’s algorithms merely record correlative probabilities between collections of input building blocks and output outcomes.
This means that a deep learning image recognition algorithm fed a library of dog photographs taken exclusively in dog parks and cat photographs taken exclusively in living rooms will naturally gravitate towards the most statistically distinguishing feature: the grassy outdoors versus the furnished indoors.
Learning algorithms optimize for accuracy, and in such a training scenario the location of the photograph is the distinguishing feature that guarantees near-perfect accuracy.
In reality, the algorithm isn’t even learning the concept of “indoors” versus “outdoors.” Under the hood, a typical convolutional neural network merely reduces the image to a set of patterns and colors and assesses the statistical correlation of each of those building blocks with the label “dog” or “cat.”
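The shortcut can be made concrete with a toy single-feature classifier. This is an illustrative sketch, not a real vision system: the feature names and the tiny dataset are invented, and a "feature" here stands in for the low-level patterns a convolutional network would extract.

```python
# Toy illustration of a spurious correlation: every training "dog" photo
# is outdoors (grass=1) and every "cat" photo is indoors (grass=0).
# Feature names and data are invented for illustration only.

training_data = [
    ({"grass": 1, "floppy_ears": 1}, "dog"),
    ({"grass": 1, "floppy_ears": 0}, "dog"),   # not every dog has floppy ears
    ({"grass": 0, "floppy_ears": 0}, "cat"),
    ({"grass": 0, "floppy_ears": 1}, "cat"),   # an oddly photographed cat
]

def best_single_feature(data):
    """Pick the feature whose presence best predicts 'dog' -- a stand-in
    for an optimizer latching onto the strongest correlation."""
    features = data[0][0].keys()
    def accuracy(f):
        correct = sum((x[f] == 1) == (label == "dog") for x, label in data)
        return correct / len(data)
    return max(features, key=accuracy)

feature = best_single_feature(training_data)
print(feature)  # 'grass' -- it alone gives 100% training accuracy

# A dog photographed indoors breaks the shortcut:
indoor_dog = {"grass": 0, "floppy_ears": 1}
prediction = "dog" if indoor_dog[feature] == 1 else "cat"
print(prediction)  # 'cat' -- the model learned the background, not the animal
```

Nothing in the training signal penalizes the model for choosing "grass"; the failure is only visible once the correlation between background and label is broken.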
Humans recompose visual scenes upwards to understand them as objects. A person sees a two-dimensional photograph as a collection of higher-order semantic representations of the objects the scene depicts. An outdoor photograph of a dog is separated into its logical components of trees, grass, dogs, leashes, collars and the like, allowing the dog to be seen as separate from even the most highly correlated components in the training data. A person asked to describe their mental image of a dog can describe it from any angle and in any context.
In stark contrast, machines decompose scenes downwards into primitive building blocks like the presence of specific colors and textures and their orientations with respect to one another. A photograph of a dog in a park is seen as merely a collection of colors and textures at a particular density and orientation. Only through sufficiently diverse input data can the algorithm even learn that dogs can be different colors and sizes, have different amounts of fur and appear in contexts other than grassy fields. A machine asked to describe its statistical model of a dog will render it according to its training data and describe it in terms of the building blocks into which it has decomposed it.
In the context of driverless cars, an algorithm can infer through large amounts of training data that it should stop momentarily when it spots a large red octagonal object with white lettering. Yet, small strategic modifications
to such a sign can render it entirely invisible or even change its meaning
to that of a completely different sign, such as a speed limit notification. This is possible because deep learning represents signage as statistical and spatial relationships between colors and textures, rather than as the higher-order semantic object a human perceives.
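That brittleness can be caricatured with a toy nearest-centroid classifier over crude color statistics. The prototypes and numbers below are invented, and real adversarial attacks perturb the pixels fed to a deep network far more subtly, but the underlying weakness is the same: the sign exists only as a point in feature space, not as a semantic object.

```python
import math

# Toy "sign classifier": nearest centroid over crude color proportions
# [red, white, black]. Prototype values are invented for illustration.
prototypes = {
    "stop":        [0.70, 0.25, 0.05],
    "speed_limit": [0.05, 0.80, 0.15],
}

def classify(color_proportions):
    """Return the prototype closest to the observed color statistics."""
    return min(prototypes,
               key=lambda name: math.dist(prototypes[name], color_proportions))

clean_stop = [0.68, 0.27, 0.05]
print(classify(clean_stop))       # 'stop'

# A few strategically placed white stickers shift the color statistics.
# To a human the sign's meaning is obviously unchanged, but to the model
# the feature vector has drifted closer to a different prototype.
stickered_stop = [0.30, 0.62, 0.08]
print(classify(stickered_stop))   # 'speed_limit'
```

The classifier has no concept of "an octagon that commands stopping"; it can only measure distances between bundles of low-level statistics, so moving those statistics moves the verdict.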
Most importantly, the simplistic statistical relationships learned by deep learning algorithms struggle with predicting the future events that navigating a busy street requires.
Neural networks today are limited to learning correlations, associating present actions with future outcomes only through observing large amounts of training data exhibiting that specific behavior. In the context of driving, this means exposing driving algorithms to huge amounts of simulated and recorded driving behavior to infer all of the likely indicators that a cyclist might suddenly turn in front of a moving vehicle.
Humans are far more adept at predicting outcomes when driving due to their ability to abstract. They see cyclists not as blobs of related colors and textures but as dynamic human individuals. Driverless car systems have begun to abstract critical objects like humans by inferring their skeletal structure to understand position and orientation, but here again prediction is still limited to encoding statistical patterns from training data. A human understands that a child cyclist is more likely to perform unsafe and unpredictable maneuvers than a trained adult, and that a group of children cycling to the right of a car, seeing other children on the left-hand sidewalk shouting to them, is more likely to swerve suddenly in front of the car to cross to the other side than a professional cyclist who has been steadfastly using all of the proper hand signals.
A machine can learn about gravity by watching large numbers of videos of objects falling. Yet what it will actually learn is simply that certain colors and textures fall faster than others and that they will fall until reaching other colors and textures. Humans begin with such observational experimentation, but then abstract from what they observe towards higher-order concepts that both explain what they are seeing and, most importantly, allow them to develop mental models of why and where a given correlation will break down.
A stock market forecasting algorithm that merely learns correlations may achieve phenomenal accuracy for a time, but without an understanding of why a correlation exists, it cannot estimate when its predictions might fail and bankrupt the company.
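The gravity and forecasting examples share one failure mode: a correlation fit on a narrow slice of data extrapolates badly outside it. A minimal sketch, using falling objects because the causal law is known exactly (all numbers here are illustrative):

```python
import math

G = 9.8  # gravitational acceleration, m/s^2

def fall_time(h):
    """Causal model: t = sqrt(2h/g), valid for any drop height (in a vacuum)."""
    return math.sqrt(2 * h / G)

# "Training data": drop heights observed only over a narrow range.
heights = [1.0, 2.0, 3.0, 4.0, 5.0]
times = [fall_time(h) for h in heights]

# Purely correlative model: an ordinary least-squares line t = a*h + b,
# fit by hand -- it records the trend without knowing why it exists.
n = len(heights)
mean_h = sum(heights) / n
mean_t = sum(times) / n
a = sum((h - mean_h) * (t - mean_t) for h, t in zip(heights, times)) \
    / sum((h - mean_h) ** 2 for h in heights)
b = mean_t - a * mean_h

# Inside the training range, the correlation looks excellent...
print(abs((a * 3.0 + b) - fall_time(3.0)))   # tiny error

# ...but with no notion of *why* time grows with height (square root,
# not linear), it extrapolates badly far outside what it has seen.
print(a * 100.0 + b)      # linear guess, several times too large
print(fall_time(100.0))   # causal answer
```

The causal model carries its own domain of validity; the fitted line has no way to signal that its predictions are about to fail, which is exactly the forecasting algorithm's problem.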
Putting this all together: as AI matures beyond simplistic tasks like tagging images and translating passages of text across languages, towards applications requiring seamless interaction with humans and more concrete reasoning about the real world, it will have to evolve beyond its current reliance on simplistic correlative induction over primitive building blocks and towards higher-order semantic abstractions that allow it to understand objects, their relationships and their motivations.
In the end, for AI to evolve beyond the simplistic and constrained tasks of today towards the complex generalized intelligences of tomorrow will require a fundamental reimagining of how deep learning works, from correlation towards causation.