
With Right Prompts, AI Can Analyze Big Data Accurately

A high school student and a master’s candidate walk into a biomedical data challenge. It sounds like the setup for a joke, but the punchline is that they won, sort of, with AI doing the heavy lifting on code that took seasoned bioinformaticians months to write by hand. Their AI collaborators knocked it out in under two minutes.

The pair were part of a team at the University of California, San Francisco and Wayne State University that wanted to test whether large language models could handle one of the gnarlier problems in reproductive health research: sifting through vast biological datasets to predict things like gestational age and the risk of preterm birth. These aren’t trivial questions. Roughly 11 per cent of babies worldwide arrive early, and we still don’t have reliable tools to flag which pregnancies are heading that way.

So they pitted eight AI chatbots against results from DREAM challenges, those crowdsourced competitions where teams of data scientists from around the globe spend months building prediction models from the same dataset. More than a hundred groups had previously competed across three such challenges focused on pregnancy, using everything from blood transcriptomics and placental DNA methylation to vaginal microbiome data. The researchers gave each LLM a single natural-language prompt describing the data, the task and what metrics to report, then simply ran whatever code came back. No hand-holding, no iterative refinement. One shot.

Half the AI models fell flat. Four of the eight couldn’t produce code that ran at all.

But the other four, led by OpenAI’s o3-mini-high, managed something rather striking. Across four prediction tasks, the best LLM-generated models matched or beat the median performance of the human DREAM participants. And for one task in particular, predicting placental gestational age from about 350,000 DNA methylation markers, the AI-written code actually outperformed the top human team. The LLM’s ridge regression model achieved an error of 1.12 weeks compared with 1.24 weeks for the best human effort, a statistically significant difference. The human team had spent months developing multi-stage random forest models that incorporated additional clinical information the AI was never even told existed. The LLM just plumped for a simpler approach and it worked better.
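To make the result concrete: ridge regression is ordinary least squares with an L2 penalty, which keeps the fit stable when there are far more features than samples, exactly the situation with ~350,000 methylation markers and a few hundred placentas. The sketch below is illustrative only, using synthetic data at a reduced scale; it is not the team's actual pipeline, and the sample counts, feature counts, and penalty strength are assumptions for the demo.

```python
# Illustrative sketch with synthetic data standing in for methylation beta values.
# Scale is reduced (5,000 features, not ~350,000) so it runs in seconds.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5000
X = rng.normal(size=(n_samples, n_features))

# Synthetic "gestational age" in weeks, driven by a small subset of markers.
true_weights = np.zeros(n_features)
true_weights[:50] = rng.normal(size=50)
y = 30 + 0.1 * (X @ true_weights) + rng.normal(scale=1.0, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The L2 penalty (alpha) is what makes p >> n regression tractable.
model = Ridge(alpha=10.0)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Held-out MAE: {mae:.2f} weeks")
```

The study reported error in weeks, so mean absolute error on a held-out set is the natural metric here; on the real data, tuning `alpha` by cross-validation would be the obvious next step.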

“These AI tools could relieve one of the biggest bottlenecks in data science: building our analysis pipelines,” says Marina Sirota, a professor of pediatrics and interim director of the Bakar Computational Health Sciences Institute at UCSF. “The speed-up couldn’t come sooner for patients who need help now.”

There is, of course, context that tempers the excitement. For three of the four tasks, human teams still came out on top overall; the LLMs matched the median but couldn’t touch the best performers. And the humans had certain advantages, including access to additional demographic features and the ability to submit multiple models and pick the one that scored highest. The LLMs got one go. It’s also worth noting that R code worked far more reliably than Python across the board (14 out of 16 task completions versus seven), partly because Bioconductor packages for R come with detailed code examples that the LLMs had presumably absorbed during training. None of the AIs could handle the trickiest data retrieval task in Python.

What might matter more than raw accuracy, though, is speed. Code that took human participants anywhere from hours to days, and which sat within a three-month competition window, came back from the LLMs in seconds. The entire project, from first prompt to journal submission, took roughly six months. Adi L. Tarca, a professor at Wayne State University who co-led the study, reckons this changes the calculus for researchers without deep programming chops. “Thanks to generative AI, researchers with a limited background in data science won’t always need to form wide collaborations or spend hours debugging code,” he says. “They can focus on answering the right biomedical questions.”

There’s a subtler finding buried in the results too. None of the four successful LLMs committed what’s perhaps the cardinal sin of predictive modelling: leaking information from the test set into training. That sort of contamination is a common source of overstated accuracy in human-built models, and the fact that AI-generated code avoided it suggests something about the quality of the training data these models have absorbed.
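The leakage pitfall is easy to fall into and easy to miss. A common version: standardising or selecting features on the full dataset before splitting, so statistics from the test samples quietly inform the training fit. The sketch below is a generic illustration of the pitfall and the standard fix, not code from the study; the data and model choices are assumptions for the demo.

```python
# Generic illustration of test-set leakage and the standard remedy.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = X[:, 0] + rng.normal(scale=0.5, size=100)

# WRONG: scaling the whole matrix first lets test-fold statistics
# (means, variances) leak into what the model trains on.
X_leaky = StandardScaler().fit_transform(X)

# RIGHT: a Pipeline refits the scaler inside each cross-validation fold,
# so preprocessing only ever sees training data.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV R^2: {scores.mean():.2f}")
```

With mild scaling the difference here is small; with fold-dependent preprocessing such as feature selection it can inflate accuracy dramatically, which is why the absence of leakage in the AI-generated code is notable.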

But the researchers are candid about the limits. The data were all tabular; we don’t yet know how these models would cope with imaging, unstructured clinical notes or the messy longitudinal designs that characterise much of real-world medicine. And there’s a worry about convergence. Three LLMs produced identical models for one task, which is great for reproducibility but could stifle the methodological diversity that makes crowdsourced science valuable in the first place. If everyone’s AI spits out the same ridge regression, you lose the creative outliers.

“This kind of work is only possible with open data sharing, pooling the experiences of many women and the expertise of many researchers,” says Tomiko T. Oskotsky, co-director of the March of Dimes Preterm Birth Data Repository and an associate professor at UCSF. The datasets underpinning this study came from about 1,200 pregnant women whose outcomes had been tracked across nine studies. Getting that data assembled was the hard part; the AI analysis, by comparison, was almost startlingly quick.

For now, the message seems to be that AI won’t replace the bioinformatician, but it could give a capable non-specialist the tools to get started. And in a field where roughly a thousand babies are born prematurely every day in the United States alone, where only 55 per cent of pregnancies deliver within a week of their estimated due date, any tool that accelerates the search for better predictive markers is worth paying attention to. The researchers are already looking ahead to agentic AI systems that could iteratively refine their own models rather than relying on a single prompt, though that raises its own questions about data security and resource management. Still early days. But the direction of travel is clear enough.

Study link: https://www.cell.com/cell-reports-medicine/fulltext/S2666-3791(26)00011-X

