Researchers at the University of Pennsylvania and the Allen Institute for Artificial Intelligence have developed a groundbreaking tool that allows open-source AI systems to match or surpass the visual understanding capabilities of proprietary models like GPT-4V and Gemini 1.5 Flash, potentially reshaping the competitive landscape between open and closed AI development.
The tool, called CoSyn (Code-Guided Synthesis), addresses a critical bottleneck in AI development: the scarcity of high-quality training data for teaching machines to understand complex visual information like scientific charts, medical diagrams, and financial documents. Rather than scraping millions of images from the internet — a practice fraught with copyright and ethical concerns — CoSyn leverages the coding abilities of existing language models to generate synthetic training data.
“We lack such data to train the model: documents and charts with rich annotations for training a vision language model to do question answering over those images,” explained Yue Yang, a recent Penn Engineering Ph.D. graduate and co-first author of the research, during an exclusive interview with VentureBeat. “Those images are actually more challenging to annotate than natural photos, like a picture of a dog, a cat, or a house.”
The breakthrough comes as enterprises increasingly seek AI systems capable of understanding and reasoning about complex visual information — capabilities essential for everything from automated document processing to AI agents that can navigate digital interfaces independently. The work was conducted during Yang’s internship with the PRIOR team at the Allen Institute for AI and supported by the Office of the Director of National Intelligence, Intelligence Advanced Research Projects Activity, and the Defense Advanced Research Projects Agency.
How synthetic data generation solves AI’s biggest training challenge
The challenge of training AI to understand text-rich images has long plagued the field. Unlike natural photographs, scientific figures, charts, and documents require extensive annotation work that is both time-consuming and expensive. Traditional approaches have relied on harvesting images and their alt-text descriptions from the internet, but this method produces training data that is often superficial and legally problematic.
CoSyn takes a fundamentally different approach by recognizing that most text-rich images are originally created through code — Python scripts generate charts, LaTeX renders mathematical equations, HTML creates web interfaces. The research team’s insight was to reverse this process: use language models’ proven coding abilities to generate the underlying code, then execute that code to create realistic synthetic images.
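To make the idea concrete, here is a minimal Python sketch of that reversal, not the team’s actual pipeline: a hardcoded string stands in for plotting code that a text-only language model would write, the code is executed to render a synthetic chart, and a question-answer pair is attached from the same underlying data. The file name, chart contents, and QA fields are illustrative assumptions.

```python
# Minimal sketch of the code-to-image idea (illustrative, not CoSyn's real pipeline).
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Stand-in for code a text-only language model would generate when prompted
# with a chart topic; in a real pipeline this string comes from the model.
llm_generated_code = """
import matplotlib.pyplot as plt
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [1.2, 1.8, 1.5, 2.4]
plt.bar(quarters, revenue, color="steelblue")
plt.title("Quarterly Revenue (USD millions)")
plt.ylabel("Revenue")
plt.savefig("synthetic_chart.png", dpi=150)
"""

# Execute the generated code to produce the synthetic training image.
exec(llm_generated_code)

# Because the source code (and its data) is known, rich annotations come for
# free: the ground-truth answer is exact, with no human labeling of pixels.
qa_pair = {
    "image": "synthetic_chart.png",
    "question": "Which quarter had the highest revenue?",
    "answer": "Q4",
}
print(qa_pair)
```

The key design point the sketch illustrates is that the generator controls both the image and the data behind it, so question-answer annotations can be derived programmatically rather than scraped or hand-labeled.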
“One intuition is that images like charts and documents are rendered from programs, from code: we use Python to generate charts, and LaTeX or Word to write our documents,” Yang said. “So how about we go the reverse way and generate the code, because text-only language models have proven to be very good at writing code.”
Chris Callison-Burch, a computer science professor at Penn who co-advised the research, described the approach in simpler terms: “This is like taking a student who’s great at writing and asking them to teach someone how to draw, just by describing what the drawing should look like. We’re essentially transferring the strengths of open-source AI from text to vision.”
CoSyn-trained models outperform GPT-4V and Gemini on key benchmarks