To help open-source AI catch up to proprietary systems like ChatGPT and Gemini, researchers at the University of Pennsylvania and the Allen Institute for AI have created a tool that uses synthetic data to teach AI how to understand complex, text-heavy images. The system, called CoSyn, generates scientific charts, chemical diagrams, and user-interface screenshots—along with millions of training instructions—that dramatically boost performance on image-based language tasks.
Teaching AI to “see” with synthetic vision
Vision-language models, which interpret images alongside text, struggle with specialized formats like medical charts or scientific figures, largely because real training data for them is scarce. To solve this, CoSyn (short for Code-Guided Synthesis) uses AI’s coding abilities to create realistic images from scratch and pair them with tasks like question-answering and caption generation.
In essence, it turns one AI’s strength—code generation—into a learning tool for another. “We’re essentially transferring the strengths of open-source AI from text to vision,” explained Yue Yang, co-first author and a research scientist at Ai2’s PRIOR group.
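The core trick is easiest to see in miniature. The sketch below follows the same loop described above—an LLM writes rendering code, the code is executed to produce the image, and the code itself is then used to generate grounded annotations—but it assumes an OpenAI-compatible client and invented prompts rather than CoSyn’s actual pipeline.

```python
# A minimal sketch of code-guided synthesis, not CoSyn's released pipeline.
# Assumes an OpenAI-compatible client and API key; the prompts and model
# name are illustrative. Executing model-written code is unsafe outside a
# sandbox, and the generated script here needs matplotlib installed.
import subprocess, sys, tempfile

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for any capable code-writing model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# 1. Ask the LLM to *write code that draws the image* rather than
#    produce pixels directly.
code = ask(
    "Write a self-contained matplotlib script that saves a bar chart of "
    "fictional quarterly revenue to 'chart.png'. Return only Python code."
)
code = code.strip().removeprefix("```python").removesuffix("```").strip()

# 2. Render the image by running the generated script.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(code)
subprocess.run([sys.executable, f.name], check=True)

# 3. Because the code fully specifies the image, the same LLM can write
#    grounded captions and Q&A pairs without ever seeing the pixels.
annotations = ask(
    "Here is code that renders a chart:\n\n" + code +
    "\n\nWrite a one-sentence caption and three question-answer pairs "
    "about the chart this code produces."
)
print(annotations)
```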
How CoSyn works and what it creates
CoSyn produced a dataset called CoSyn-400K containing:
- 400,000+ synthetic, text-rich images
- 2.7 million paired instructions (captions, questions, reasoning tasks; a hypothetical record layout follows this list)
- Over a dozen image types, from charts to math plots to UI screens
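To make the pairing concrete, one record in such a dataset might bundle the rendered image with its generating code and several instructions. The field names below are hypothetical, not the released CoSyn-400K schema.

```python
# Hypothetical layout of a single record; field names are illustrative
# and may not match the released CoSyn-400K schema.
record = {
    "image": "chart_000123.png",  # rendered by executing generated code
    "code": "# matplotlib source that drew the image ...",
    "instructions": [
        {"type": "caption",
         "text": "A bar chart of fictional quarterly revenue."},
        {"type": "qa",
         "question": "Which quarter had the highest revenue?",
         "answer": "Q3"},
        {"type": "reasoning",
         "question": "Did revenue grow overall across the year?",
         "answer": "Yes; Q4 exceeds Q1."},
    ],
}
```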
To create this volume and variety, the team used a tool called DataDreamer, which let them scale up image generation through automated prompting. They also embedded character “personas” like a novelist or chemistry teacher into the prompts to ensure diversity in tone, style, and domain coverage.
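The persona idea is simple to sketch. The fragment below is plain Python rather than DataDreamer’s actual API, and the personas and template are invented; it only illustrates how crossing personas with image types multiplies prompt diversity.

```python
# Plain-Python illustration of persona-conditioned prompting; CoSyn
# orchestrates this at scale with DataDreamer, whose API differs.
import itertools

personas = ["a novelist", "a chemistry teacher", "a flight dispatcher"]
image_types = ["bar chart", "chemical structure diagram", "UI screenshot"]

def build_prompt(persona: str, image_type: str) -> str:
    # The persona steers topic, tone, and style, so otherwise identical
    # templates yield visually and textually diverse images.
    return (
        f"Imagine you are {persona}. Invent realistic content such a "
        f"person might work with, then write Python code that renders it "
        f"as a {image_type} and saves it to disk. Return only the code."
    )

# Crossing the two lists yields len(personas) * len(image_types)
# distinct prompts from a handful of entries.
prompts = [build_prompt(p, t)
           for p, t in itertools.product(personas, image_types)]
print(prompts[0])
```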
Outperforming proprietary models
Despite learning only from synthetic data, CoSyn-trained models matched or beat top-tier systems like GPT-4V and Gemini 1.5 Flash on seven major benchmarks. In one standout test, the team trained a model on just 7,000 synthetic nutrition labels for a custom benchmark, NutritionQA; the resulting model outperformed others trained on millions of real images.
“Training AI with CoSyn is incredibly data efficient,” said Mark Yatskar, Yang’s co-advisor and Assistant Professor of Computer and Information Science at Penn. “We’re showing that synthetic data can help models generalize to real-world scenarios.”
Open-source AI, ethical by design
Unlike proprietary models trained on undisclosed or potentially copyrighted data, CoSyn is built entirely from open tools and is freely available to the research community. The researchers hope this transparency will promote ethical development and spark scientific progress.
“It opens the door to AI systems that can reason about scientific documents,” noted Chris Callison-Burch, professor and project advisor. “That could help everyone from college students to researchers.”
Next step: AI that acts, not just sees
Yang and the team plan to extend CoSyn’s capabilities from understanding images to interacting with them—clicking buttons, filling forms, or navigating digital tools. “In the long run, we want AI that can act in the world, not just describe it,” Yang said.
Paper: Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Conference: ACL 2025
Published: May 21, 2025
