Dear Commons Community,
Major scientific journals and conferences ban crediting an artificial intelligence (AI) program, such as ChatGPT, as an author or reviewer of a study. Computers can’t be held accountable, the thinking goes. But last week, a norm-breaking meeting turned that taboo on its head: All 48 papers presented, covering topics ranging from designer proteins to mental health, were required to list an AI as the lead author and be scrutinized by AIs acting as reviewers. Here is a description of the meeting courtesy of Science.
“The virtual meeting, Agents4Science, was billed as the first to explore a theme that only a year ago might have seemed like science fiction: Can AIs take the lead in developing useful hypotheses, designing and running computations to test them, and writing a paper summarizing the results? And can large language models, the type of AI that powers ChatGPT, then effectively vet the work?
“There’s still some stigma about using AI, and people are incentivized to hide or to minimize it,” a lead conference organizer, AI researcher James Zou of Stanford University, tells Science. The conference aimed “to have this study in the open so that we can start to collect real data, to start to answer these important questions.” Ultimately, the organizers hope a fuller embrace of AI could accelerate science—and ease the burden on peer reviewers facing a ballooning number of manuscripts submitted to journals and conferences.
But some researchers reject the conference’s very premise. “No human should mistake this for scholarship,” said Raffaele Ciriello of the University of Sydney, who studies digital innovation, in a statement released by the Science Media Centre before the meeting. “Science is not a factory that converts data into conclusions. It is a collective human enterprise grounded in interpretation, judgment, and critique. Treating research as a mechanistic pipeline … presumes that the process of inquiry is irrelevant so long as the outputs appear statistically valid.”
Most of the 315 manuscripts submitted to the conference, which attracted 1800 registrants, were reviewed and scored on a six-point scale by three popular large language models—GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4. The results were averaged for each paper, and humans were asked to review 80 papers that passed a threshold score. Based on both the AI and human reviews, organizers accepted 48 papers covering multiple disciplines.
One accepted paper that organizers highlighted was submitted by Sergey Ovchinnikov, a biologist at the Massachusetts Institute of Technology. His team asked advanced versions of ChatGPT to generate amino acid sequences of biologically active proteins with a structure called a four-helix bundle. Scientists typically use specialized software to design proteins. But to Ovchinnikov’s surprise, ChatGPT produced sequences without his team having to refine its query further. He and his human colleagues tested two of the sequences in the lab, confirming that a protein derived from one had a four-helix bundle. Still, Ovchinnikov found ChatGPT’s performance wasn’t flawless. Most of the sequences did not garner “high confidence” on a score predicting whether they would form the desired protein structure.
Data presented at the conference also probed how researchers collaborate with AIs. The organizers asked human authors to report how much work AI contributed in several key areas, including generating hypotheses, analyzing data, and writing the paper. AI did more than half of the hypothesis-generation labor in just 57% of submissions and 52% of accepted papers. But in about 90% of papers, AI played a big role in writing the manuscripts, perhaps because that task is less computationally challenging.
Some human authors who presented at the meeting said their AI partners enabled them to complete in just a few days tasks that usually take much longer, and eased interdisciplinary collaborations with scholars outside their fields. But they also pointed to AI’s drawbacks, including a tendency to misinterpret complex methods, write code that humans needed to debug, and fabricate irrelevant or nonexistent references in papers.
Scientists should remain skeptical about applying AI for tasks that require deep, conceptual reasoning and scientific judgment, according to Stanford computational astrophysicist Risa Wechsler, who reviewed some of the submitted papers.
“I’m genuinely excited about the use of AI for research, but I think that this conference usefully also demonstrates a lot of limitations,” she said during a panel discussion at the meeting. “I am not at all convinced that AI agents right now have the capability to design robust scientific questions that are actually pushing forward the forefront of the field.” One paper she reviewed “may have been technically correct but was neither interesting nor important,” she said. “One of the most important things we teach human scientists is how to have good scientific taste. And I don’t know how we will teach AI that.”
The conference organizers plan to analyze and compare the AI- and human-written reviews. But the remarks garnered by Ovchinnikov’s protein-design paper hinted that people and machines may often diverge. An AI reviewer called it “profound.” But a human deemed it “an interesting proof-of-concept study with some lingering questions.”
Tony