All machine learning models are tied to one critical factor: the quality of the data on which the model is trained.
The challenge of curating data to improve the quality of machine learning and AI models is well known. A 2021 MIT research study found systemic issues in the labeling of training data, leading to inaccurate results in AI systems. A study in the journal Quantitative Science Studies, which analyzed 141 previous studies on data labeling, found that 41% of the models used datasets that had been labeled by humans.
Among the vendors trying to tackle the challenge of optimizing data curation for AI is a Swiss startup, Lightly. Founded in 2019, the company announced this week that it had raised $3 million in a seed funding round. However, Lightly does not intend to become a data labeling provider. Instead, the company wants to help curate data using a self-supervised machine learning model that could one day reduce the need for data labeling operations altogether.
“I’m constantly amazed at how much of the work in machine learning is manual, very tedious, and not automated at all,” Matthias Heller, co-founder of Lightly, told VentureBeat. “People always think everything is so advanced with machine learning, but machine learning and deep learning in particular is such a young technology and a lot of the tools and infrastructure are only now being made available.”
A growing market for data curation and data labeling
There is no shortage of money or vendors in the market to optimize data for machine learning, be it data curation or data labeling.
For example, Defined.ai, known as DefinedCrowd before its 2021 rebrand, has raised $78 million to date to further its vision of data curation.
And Grand View Research has forecast that the data labeling market will reach $8.2 billion by 2028, with a projected compound annual growth rate (CAGR) of 24.6% between 2021 and 2028. Providers in the market include Appen's Figure Eight, Amazon SageMaker Ground Truth, SuperAnnotate, Dataloop and Darwin from V7.
Other popular providers are Labelbox and the open source Label Studio, both of which can be integrated with Lightly's technology. In general, Lightly plans an open approach, allowing users to use the company's technology with any labeling vendor.
How the self-supervised model works
Three years ago, Heller and his co-founder Igor Susmelj were working on a machine learning project that required them to label their data.
“We’ve always wondered if the data we’re labeling actually helps improve the model,” Heller said.
That led to Lightly, which comprises a number of open source projects. The main project is the Lightly library, which provides a self-supervised approach to image machine learning.
There are multiple approaches to training machine learning models on data, Heller explained. In a supervised approach, for example in computer vision, there is an image and an associated label that are used in combination to teach a model, with a human doing the labeling.
Unsupervised learning, on the other hand, is the opposite: no human interaction is required. The self-supervised model that Lightly enables falls somewhere in the middle and requires minimal human interaction.
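To make the distinction concrete, here is a minimal sketch of one common self-supervised training signal, a contrastive (InfoNCE-style) loss. It is an illustration under stated assumptions, not a description of Lightly's implementation: the function names, the temperature value and the toy 2-D embeddings are all hypothetical. The point is that the "label" is only which two views came from the same image, so no human annotation is involved.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(views_a, views_b, temperature=0.5):
    """InfoNCE-style loss over paired embeddings.

    views_a[i] and views_b[i] are assumed to be embeddings of two
    augmentations of the same image. The only supervision is that pairing.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Softmax cross-entropy with the matching view as the positive.
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        total += log_denominator - logits[i]
    return total / len(views_a)


# Hypothetical 2-D embeddings for three images, two views each.
aligned_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# The same second views, shuffled so that the pairs no longer match.
shuffled_b = [aligned_b[1], aligned_b[2], aligned_b[0]]

loss_aligned = contrastive_loss(aligned_a, aligned_b)
loss_shuffled = contrastive_loss(aligned_a, shuffled_b)
print(loss_aligned < loss_shuffled)
```

The loss is lower when views of the same image sit close together in embedding space, which is the behavior a self-supervised model is trained toward.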
“You can use the self-supervised model to curate data because the model learns certain information, certain similarities, what’s related and what’s different,” Heller said.
From open source to commercial solution
While Lightly is open source and free to use, users still have to do a lot of the work to set up the right environment and manage the configuration.
Lightly’s commercial service provides a managed offering with infrastructure, tuned algorithms, and learning frameworks, all configured for users.
“Our main competition today is the in-house tool shop,” said Heller. “We use self-supervised learning to tell you what 1% of the data you should label and use for model training.”
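One way to picture choosing a small share of the data for labeling is greedy farthest-point sampling over embeddings, sketched below. This is a generic diversity-based selection technique used here for illustration only. The article does not say which algorithm Lightly uses, and the function names, the toy embeddings and the cluster layout are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_diverse_subset(embeddings, fraction=0.01):
    """Greedy farthest-point sampling.

    Repeatedly picks the sample farthest from everything selected so far,
    so the chosen subset spreads across the embedding space.
    """
    budget = max(1, round(len(embeddings) * fraction))
    selected = [0]  # start from an arbitrary sample (index 0)
    # Distance from each sample to its nearest selected sample.
    nearest = [euclidean(e, embeddings[0]) for e in embeddings]
    while len(selected) < budget:
        candidate = max(range(len(embeddings)), key=nearest.__getitem__)
        selected.append(candidate)
        for i, e in enumerate(embeddings):
            nearest[i] = min(nearest[i], euclidean(e, embeddings[candidate]))
    return selected


# Hypothetical embeddings: 300 samples in three tight clusters of 100.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
embeddings = [
    [cx + (i % 10) * 0.01, cy + (i // 10) * 0.01]
    for cx, cy in centers
    for i in range(100)
]

selected = select_diverse_subset(embeddings, fraction=0.01)
clusters_covered = {index // 100 for index in selected}
print(selected, clusters_covered)
```

With a 1% budget on this toy data, the three selected samples come from three different clusters, so near-duplicate images within a cluster are not sent for labeling more than once.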
Looking ahead, Heller provocatively predicts that a day may come when data labeling is no longer needed, as unsupervised machine learning continues to improve.
“I think that the demand for labels will decrease significantly in the next few years,” said Heller. “Maybe in the future we won’t need labels anymore.”