Step right up! One and all! Welcome to the highest-stakes game of Three Card Monte the world has ever seen.
Deep learning faces the data problem: the demand for labeled data is nearly unlimited, and the lack of labeled data in the enterprise is arguably the biggest bottleneck to progress.
Let’s find the answer.
First, we get to choose from the astonishing number of techniques that have emerged in recent years to tackle the data problem at the heart of artificial intelligence. The cards are all laid out in front of us, and under one of them surely lies the secret to the next batch of unicorns and decacorns.
Unsupervised learning, foundation models, weak supervision, transfer learning, ontologies, representation learning, semi-supervised learning, self-supervised learning, synthetic data, knowledge graphs, physics simulations, symbol manipulation, active learning, zero-shot learning, and generative models.
Just to name a few.
The concepts wiggle and weave, connecting and disconnecting in bizarre and unpredictable ways. There is not a single term in this long list that has a universally accepted definition. Powerful tools and over-the-top promises overlap, and the dizzying array of techniques and tools is enough to throw even the savviest of clients and investors off balance.
So which do you choose?
All data, no information
The problem, of course, is that we shouldn't have been watching the cards at all. It was never a question of which magic buzzword would solve the data problem, because it was never really about data in the first place. At least, not exactly.
Data by itself is useless. With less than a hundred keystrokes, I can set my computer generating enough random noise to keep a modern neural network training until the heat death of the universe. With a little more effort and a single image from a 10-megapixel phone, I could black out every combination of three pixels and create more data than exists on the internet today.
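To put rough numbers on that claim, here is a quick back-of-the-envelope sketch; the pixel count and the internet-size figure are ballpark assumptions, not measurements:

```python
# Rough arithmetic only: a 10-megapixel image has ~1e7 pixels.
# Blacking out every combination of three pixels yields C(1e7, 3) distinct images.
import math

pixels = 10_000_000
variants = math.comb(pixels, 3)           # ~1.7e20 distinct images
bytes_per_image = pixels * 3              # ~30 MB of uncompressed RGB per image
total_bytes = variants * bytes_per_image  # ~5e27 bytes

# Common estimates put the entire internet in the zettabyte range (~1e23 bytes),
# so this trivially constructed "dataset" dwarfs it -- while adding no information.
print(f"{variants:.2e} variants, {total_bytes:.2e} bytes")
```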
Data is just the vehicle; information is what it carries. It is important not to confuse the two.
In the examples above, there is a lot of data but almost no information. In extremely complex, information-rich systems like loan approvals, industrial supply chains, and even social media analytics, the problem is reversed: streams of thought and galaxies of human expression are boiled down into reductive binaries. It's like trying to mine a mountain with a pickaxe.
This is the heart of the data problem. It's an unfathomable wealth of information, a billion cars on the road, that is somehow both tangible and inaccessible. It's thousands of people and billions of dollars spent hauling scant loads of spoil and gravel back and forth in captcha tests and labeling tasks.
That's where the tsunami of buzzwords comes in. Despite the hundreds of papers and the complexity of the methods themselves, the motivations and basic principles are simple. The best and simplest framing is one I credit to Google's Underspecification paper.
Shaping neural networks
Think of the space of every possible neural network as a massive, fuzzy cloud. It could do almost anything, but naively it does nothing.
We want this neural network to do something, but we're not sure what yet. It's like unformed clay with infinite possibilities. It's an unconstrained mess, brimming with Shannon entropy: a mathematical formalization of possibility, the amount of freedom left in a system. That entropy corresponds to the amount of information and work we would need to add to the system to eliminate those possibilities.
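For concreteness, here is a minimal sketch of what Shannon entropy measures; this is my own illustration in Python, not anything taken from the Underspecification paper:

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin leaves one full bit of freedom; a loaded coin leaves far less.
print(shannon_entropy([0.5, 0.5]))    # 1.0 bit   -- maximum uncertainty
print(shannon_entropy([0.99, 0.01]))  # ~0.08 bits -- nearly pinned down

# Every bit of supervision we add is, in effect, a bit of entropy removed
# from the space of networks we might end up with.
```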
Today we are primarily interested in imitating humans, so this information and this work have to come from people.
So to move forward, people have to make decisions. This vast space needs to be winnowed down: a reduction in Shannon entropy. It's like finding the perfect drop of water in a sea of possibilities, and it's every bit as impractical as it sounds. More practically, it's like finding the right stretch of sea. This is the equivalence set: an infinite subset of the infinitely large ocean in which every option is equally optimal.
As far as you can tell.
Supervision, information captured in data, is how we scour that ocean. It is how we say: "Out of everything you could do, this is what you should do." That is the key, and the clarity needed to cut through the noise. There's no free lunch here, and amid the blizzard of techniques and math that pours over you, you need to focus on the flow of information.
Where does new information come into the system?
Nvidia's Omniverse Replicator is a wonderful example. It is a synthetic data platform, but in truth that label tells you very little. It describes the data; the information comes from the physics simulations. That makes it fundamentally different from other synthetic data platforms like statice.ai, which focus on using generative models to transform the information locked in personal data into unidentifiable synthetic data that carries the same information.
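As a toy illustration of that second flavor (this is not Statice's actual method; the columns and the choice of model are my own assumptions), a simple generative model can be fit to sensitive tabular data and sampled to produce synthetic rows that preserve the aggregate statistics without reproducing any real record:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "personal" data: age and income for 1,000 real individuals.
real = np.column_stack([
    rng.normal(40, 12, 1_000),        # age
    rng.lognormal(10.5, 0.6, 1_000),  # income
])

# Fit the simplest possible generative model: a multivariate Gaussian.
# (Real platforms use far richer models; this is only to show the principle.)
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)

# Sample synthetic rows: none corresponds to a real person, but the
# statistical information (means, variances, correlations) is preserved.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

print("real mean:     ", real.mean(axis=0).round(1))
print("synthetic mean:", synthetic.mean(axis=0).round(1))
```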
Another case study is Tesla's unique approach to active learning. In traditional active learning, the primary source of information is the data scientist: by devising an active learning strategy well suited to the task, they ensure each new training example shrinks the equivalence set further than it otherwise would. In one of Andrej Karpathy's recent talks on the topic, he explains how Tesla takes this technique significantly further. Instead of having data scientists craft a single optimal active learning strategy, they run several noisy strategies together and rely on further human selection to identify the most effective examples, as sketched below.
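Here is a rough sketch of that pooled-strategy idea as I understand it; the scoring functions and numbers are my own simplification, not Tesla's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for model outputs: predicted class probabilities over an unlabeled pool.
probs = rng.dirichlet(np.ones(5), size=10_000)

# Several noisy acquisition strategies, each scoring "how useful would labeling this be".
def least_confidence(p):   # low top-class probability
    return 1.0 - p.max(axis=1)

def margin(p):             # small gap between the top two classes
    part = np.sort(p, axis=1)
    return 1.0 - (part[:, -1] - part[:, -2])

def entropy(p):            # high predictive entropy
    return -(p * np.log(p + 1e-12)).sum(axis=1)

strategies = [least_confidence, margin, entropy]

# Each strategy nominates its own top candidates; we pool them rather than
# trusting any single heuristic to be the "optimal" one.
candidates = set()
for score in strategies:
    candidates.update(np.argsort(score(probs))[-200:].tolist())

# The pooled shortlist goes to human reviewers, who decide which examples
# actually get labeled -- extra human intervention, but a wider pipe of
# information flowing into the system.
shortlist_for_human_review = sorted(candidates)
print(len(shortlist_for_human_review), "examples queued for review")
```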
Counterintuitively, they improve overall system performance by adding more human intervention. Traditionally, this would be viewed as a step backwards: more intervention means less automation, which the conventional lens counts as a loss. Seen through the lens of information, however, the approach makes perfect sense. They have dramatically increased the bandwidth of information flowing into the system, so the rate of improvement accelerates.
This is the name of the game. The explosion of buzzwords is frustrating, and no doubt many of those who have embraced these buzzwords have misunderstood the promise they hold. Nonetheless, the buzzwords point to real progress. There are no magic bullets, and we've explored these fields long enough to know that. However, each of these areas has yielded benefits in its own right, and research continues to show that there are still significant gains to be made by combining and unifying these supervision paradigms.
It's an era of incredible opportunity. Our ability to utilize information from previously untapped sources is accelerating. The greatest problems we face now are an embarrassment of riches and a confusion of noise. If it all seems like too much and you're having trouble separating fact from fiction, just remember:
Follow the information.
Slater Victoroff is the founder and CTO of Indico Data.