
There’s nothing better than a good benchmark to motivate the computer vision field.
That’s why one of the research teams at the Allen Institute for AI, also known as AI2, recently collaborated with the University of Illinois at Urbana-Champaign to develop a new, unified benchmark called GRIT (General Robust Image Task) for general-purpose computer vision models. Their goal is to help AI developers build the next generation of computer vision programs that can be applied to a range of common tasks, a particularly complex challenge.
“We were discussing, as we do weekly, the need to develop more general computer vision systems capable of solving a range of tasks and generalizing in ways that current systems cannot,” said Derek Hoiem, professor of computer science at the University of Illinois at Urbana-Champaign. “We recognized that one of the challenges is that there is no good way to assess a system’s overall vision capabilities. All current benchmarks are designed to evaluate systems that have been specifically trained for that benchmark.”
What general computer vision models should be able to do
Tanmay Gupta, who joined AI2 as a research scientist after completing his PhD at the University of Illinois at Urbana-Champaign, said there have been other attempts to build multitask models that can do more than one thing — but a general-purpose model requires more than just the ability to do three or four different tasks.
“Often you don’t know in advance which tasks the system will have to perform in the future,” he said. “We wanted to architect the model so that anyone from a different background could give natural language instructions to the system.”
For example, he explained, someone could say “Describe the picture” or “Find the brown dog,” and the system could execute that instruction and either return a bounding box (a rectangle around the dog in question) or a caption reading “There is a brown dog playing in a green meadow.” “So that was the challenge: to build a system that could execute instructions, including instructions it has never seen before, and do so across a variety of tasks involving segmentation or bounding boxes or captions or answering questions,” he said.
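To make the idea concrete, here is a minimal, hypothetical sketch of what such an instruction-driven interface could look like. `GeneralVisionModel`, its methods, and the keyword routing are illustrative stand-ins, not the actual GRIT or AI2 API; a real system would parse instructions with a language model rather than keyword matching.

```python
# Hypothetical sketch of an instruction-driven vision interface of the kind
# Gupta describes. All class and method names are illustrative assumptions.
from dataclasses import dataclass
from typing import Union


@dataclass
class BoundingBox:
    x_min: float
    y_min: float
    x_max: float
    y_max: float


class GeneralVisionModel:
    """Toy stand-in for a general-purpose model that routes on instructions."""

    def run(self, image_path: str, instruction: str) -> Union[BoundingBox, str]:
        # A real model would encode the instruction with a language encoder;
        # here we route on keywords purely for illustration.
        if instruction.lower().startswith(("find", "locate")):
            return self._localize(image_path, instruction)
        return self._caption(image_path)

    def _localize(self, image_path: str, instruction: str) -> BoundingBox:
        # Placeholder coordinates; a real system would run a detector here.
        return BoundingBox(0.2, 0.3, 0.6, 0.8)

    def _caption(self, image_path: str) -> str:
        # Placeholder output; a real system would run a captioner here.
        return "There is a brown dog playing in a green meadow."


model = GeneralVisionModel()
print(model.run("meadow.jpg", "Find the brown dog"))    # -> BoundingBox(...)
print(model.run("meadow.jpg", "Describe the picture"))  # -> caption string
```

The key design point, as Gupta describes it, is that the same entry point handles tasks with very different output types, so new tasks can be expressed as new instructions rather than new architectures.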
The GRIT benchmark, Gupta continued, is one way to assess those capabilities, evaluating how robust a system is to distortions in the images and how general it is across different data sources. “Does it solve the problem not just for one or two or ten or twenty different concepts, but for thousands of concepts?” he said.
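A rough sketch of that evaluation idea, under stated assumptions: score the same model with and without image distortions, broken down per concept. The data layout, the `predict` function, and the identity "distortion" below are all illustrative, not GRIT's actual evaluation protocol.

```python
# Hypothetical sketch of a robustness/generality evaluation: per-concept
# accuracy, optionally under a distortion. Names and data are assumptions.
from collections import defaultdict


def evaluate(predict, samples, distort=None):
    """Accuracy per concept over (image, instruction, concept, answer) rows."""
    hits, totals = defaultdict(int), defaultdict(int)
    for image, instruction, concept, expected in samples:
        if distort is not None:
            image = distort(image)  # e.g., blur or noise in a real benchmark
        totals[concept] += 1
        if predict(image, instruction) == expected:
            hits[concept] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Toy usage with a fake predictor and two "concepts".
samples = [
    ("img1.jpg", "Is there a dog?", "dog", "yes"),
    ("img2.jpg", "Is there a cat?", "cat", "no"),
]
predict = lambda image, instruction: "yes"
print(evaluate(predict, samples))                         # clean accuracy
print(evaluate(predict, samples, distort=lambda im: im))  # under a distortion
```

Comparing the clean and distorted scores, concept by concept, is one simple way to quantify both the robustness and the breadth Gupta describes.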
Benchmarks have served as drivers for computer vision research
Benchmarks have been a big driver of computer vision research since the early days, Hoiem said. “If a new benchmark is created and it’s well designed to assess the types of research people are interested in, then it really facilitates that research by making it much easier to compare progress and evaluate innovations without having to re-implement algorithms, which takes a lot of time,” he said.
Computer vision and AI have made great strides in the past decade, he added. “You’re seeing that in smartphones, home assistants, and vehicle safety systems, with AI on the road in ways that weren’t there ten years ago,” he said. “We used to go to computer vision conferences and people would say, ‘What’s new?’ and we’d say, ‘It still doesn’t work’ — but now things are starting to work.”
The downside, however, is that existing computer vision systems are typically designed and trained only for specific tasks. “For example, you could build a system that puts boxes around vehicles, people, and bicycles for a driving application, but if you wanted it to also box motorcycles, you would have to change the code and architecture and retrain it.”
The GRIT researchers wanted to figure out how to build systems that are more human-like, in the sense that they can learn to perform a whole range of different types of tasks. “We don’t have to change our bodies to learn how to do new things,” he said. “We want that kind of generality in AI, where you don’t have to change the architecture but the system can do many different things.”
Benchmark will advance the field of computer vision
The large computer vision research community, which publishes tens of thousands of papers each year, has produced a growing body of work on generalizing vision systems, he added, with many groups reporting results on the same benchmarks.
The researchers hope to host a workshop around the GRIT benchmark and announce it at the 2022 Conference on Computer Vision and Pattern Recognition on June 19-20. “Hopefully that will encourage people to submit their methods and their new models and evaluate them against this benchmark,” Gupta said. “We’re hoping to see a significant amount of work in this direction over the next year and quite a performance improvement from where we are today.”
With the growth of the computer vision community, there are many researchers and industries looking to advance the field, Hoiem said.
“They’re always looking for new benchmarks and new problems to work on,” he said. “A good benchmark can shift a major focus of the field, so this is a great place for us to take on this challenge and motivate the field to build in this exciting new direction.”