
Analysts estimate that by 2025, 30% of all data generated will be real-time data. That’s 52 zettabytes (ZB) of real-time data per year, roughly the amount of total data produced in 2020. Because data has grown so rapidly, 52 ZB is also three times the total data produced in 2015. With this exponential growth, it is clear that conquering real-time data is the future of data science.
Over the past decade, technologies such as Materialize, Deephaven, Kafka, and Redpanda have been developed to work with these streams of real-time data. They can transform, transfer, and persist data streams on the fly, providing the basic building blocks needed to construct applications for the new real-time reality. But to make such enormous amounts of data truly usable, artificial intelligence (AI) must be applied.
Businesses need insightful technology that can create knowledge and understanding with minimal human intervention to keep up with the tidal wave of real-time data. However, applying AI algorithms to real-time data is still in its infancy: specialized hedge funds and big-name AI players such as Google and Facebook use real-time AI, but few others have ventured into these waters.
To make real-time AI ubiquitous, supporting software must be developed. This software must provide:
- A simple way to move from static to dynamic data
- A simple way to clean static and dynamic data
- A simple path from model creation and validation to production
- An easy way to manage software as needs—and the outside world—change
A simple way to move from static to dynamic data
Developers and data scientists want to spend their time thinking about important AI problems, not on time-consuming data maintenance. A data scientist shouldn’t care whether the data is a static table from Pandas or a dynamic table from Kafka. Both are tables and should be treated the same way. Unfortunately, most current-generation systems treat static and dynamic data differently: the data is obtained, queried, and used in different ways. This makes transitions from research to production expensive and labor-intensive.
To truly derive value from real-time AI, developers and data scientists must be able to seamlessly switch between using static and dynamic data within the same software environment. This requires common APIs and a framework that can handle both static and real-time data in a UX-consistent manner.
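To see what that might look like, consider the minimal Python sketch below. Every name in it (StaticTable, StreamTable, where, update) is a hypothetical stand-in rather than any vendor’s actual API; the point is that a single piece of query logic can serve both static and dynamic data.

```python
# Hypothetical sketch: one query API over both static and streaming tables.
# None of these names is a real product's API; they only illustrate the idea.

from typing import Callable, Dict, Iterable, List


class StaticTable:
    """A fully materialized batch of rows, as in research."""

    def __init__(self, rows: List[Dict]):
        self.rows = rows

    def where(self, pred: Callable[[Dict], bool]) -> "StaticTable":
        return StaticTable([r for r in self.rows if pred(r)])

    def update(self, col: str, fn: Callable[[Dict], object]) -> "StaticTable":
        return StaticTable([{**r, col: fn(r)} for r in self.rows])


class StreamTable:
    """Rows arriving over time, as in production; same operations, applied lazily."""

    def __init__(self, rows: Iterable[Dict]):
        self.rows = rows

    def where(self, pred: Callable[[Dict], bool]) -> "StreamTable":
        return StreamTable(r for r in self.rows if pred(r))

    def update(self, col: str, fn: Callable[[Dict], object]) -> "StreamTable":
        return StreamTable({**r, col: fn(r)} for r in self.rows)


def enrich(table):
    """One definition of the business logic, used for both table types."""
    return table.where(lambda r: r["price"] > 0).update(
        "notional", lambda r: r["price"] * r["qty"]
    )


# Research: a static table.
print(enrich(StaticTable([{"price": 10.0, "qty": 3}, {"price": -1.0, "qty": 5}])).rows)

# Production: the identical call on a stream (here, a stand-in iterator).
for row in enrich(StreamTable(iter([{"price": 4.0, "qty": 2}]))).rows:
    print(row)
```

Because enrich is written once against a common interface, moving it from a research notebook to a production stream is a change of input, not a rewrite.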
A simple way to clean static and dynamic data
The most exciting work for AI engineers and data scientists is creating new models. Unfortunately, most of an AI engineer’s or data scientist’s time is spent being a data steward. Records are inevitably dirty and need to be cleaned and put into the right form. This is thankless and time-consuming work. With an exponentially growing flood of real-time data, this entire process needs to require less human labor and work with both static and streaming data.
In practice, simple data cleaning is achieved through a concise, powerful, and expressive way of performing common cleaning operations that works on both static and dynamic data. These operations include removing bad data, filling in missing values, merging multiple data sources, and transforming data formats.
A few technologies currently allow users to implement data cleaning and manipulation logic just once and use it on both static and real-time data. Materialize and ksqlDB both allow SQL queries against Kafka streams; these are good choices for use cases with relatively simple logic or for SQL developers. Deephaven features a table-oriented query language that supports Kafka, Parquet, CSV, and other popular data formats; this type of query language is suitable for more complex and mathematical logic, or for Python developers.
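As a concrete, hedged illustration of write-once cleaning logic, here is one way it can look in Python with pandas. The column names and the commented micro-batch loop are assumptions for the example, not any particular product’s API: the same function cleans a full static table in research and each arriving micro-batch in production.

```python
import pandas as pd


def clean(batch: pd.DataFrame) -> pd.DataFrame:
    out = batch.dropna(subset=["symbol"])            # drop rows missing a key field
    out = out[out["price"] > 0].copy()               # remove bad data
    out["size"] = out["size"].fillna(0)              # fill missing values
    out["ts"] = pd.to_datetime(out["ts"], utc=True)  # normalize the time format
    return out


# Research: clean one big static table (inline sample data for illustration).
static = pd.DataFrame(
    {
        "symbol": ["AAPL", None, "MSFT"],
        "price": [151.0, 3.0, -1.0],
        "size": [100, None, 50],
        "ts": ["2022-01-03 09:30:00", "2022-01-03 09:30:01", "2022-01-03 09:30:02"],
    }
)
print(clean(static))

# Production: apply the same function to each arriving micro-batch, e.g.
# for batch in micro_batches_from_kafka("trades"):  # hypothetical stream source
#     sink.write(clean(batch))
```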
A simple path from model creation and validation to production
Many—possibly even most—new AI models never make it from research to production. This is because research and production are typically implemented with very different software environments. Research environments are designed for working with large static data sets, model calibration, and model validation. On the other hand, production environments make predictions about new events as they happen. In order to increase the proportion of AI models impacting the world, the steps to go from research to production must be extremely simple.
Imagine an ideal scenario: First, static and real-time data would be retrieved and manipulated through the same API, providing a consistent platform for building applications on static and/or real-time data. Second, the data cleaning and manipulation logic would be implemented once and used for both static research and dynamic production cases; duplicating this logic is expensive and increases the likelihood that research and production will diverge in unexpected and inconsistent ways. Third, AI models would be easy to serialize and deserialize, allowing production models to be swapped simply by changing a file path or URL. Finally, the system would make it easy to monitor, in real time, how well production AI models are performing in the wild.
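The third step can be made concrete with a short Python sketch, assuming a scikit-learn-style model and pickle serialization; the file names and toy data are illustrative only.

```python
import pickle

from sklearn.linear_model import LogisticRegression

# Research: calibrate and validate on static data, then serialize the model.
X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)
with open("model_v2.pkl", "wb") as f:  # illustrative artifact name
    pickle.dump(model, f)

# Production: load whatever path (or URL) is configured and score new events
# as they arrive. Promoting a new model is just a configuration change.
MODEL_PATH = "model_v2.pkl"
with open(MODEL_PATH, "rb") as f:
    live_model = pickle.load(f)

print(live_model.predict([[1.5]]))  # score a newly arrived event
```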
An easy way to manage software as needs—and the outside world—change
Change is inevitable, especially when working with dynamic data. In data systems, change can come from input data sources, requirements, team members, and more. No matter how carefully a project is planned, it will need to adapt over time. Too often, those adaptations never happen: accumulated technical debt and knowledge lost through staff turnover frustrate the effort.
To cope with a changing world, real-time AI infrastructure must make every phase of a project (from training to validation to production) understandable and modifiable by a very small team, and not just the team that originally built it: new people taking over existing production applications must be able to understand and modify them as well.
As the tidal wave of real-time data hits, we will see significant innovations in real-time AI. Real-time AI will move beyond the Googles and Facebooks of the world and into the toolkit of all AI engineers. We will get better answers, faster and with less effort. Engineers and data scientists will be able to spend more time focusing on interesting and important real-time solutions. And businesses will get higher-quality, timely answers from fewer people, reducing the challenges of hiring AI talent.
If we have software tools that meet these four requirements, we will finally be able to get real-time AI right.
Chip Kent is the senior data scientist at Deephaven Data Labs.