Summarizing complex scientific findings for a non-expert audience is one of a science journalist's most important day-to-day tasks. Generating summaries of complex writing has also been frequently cited as one of the best use cases for large language models (despite some prominent counterexamples).
With all that in mind, the team at the American Association for the Advancement of Science (AAAS) ran an informal year-long study to determine whether ChatGPT could produce the kind of "news brief" paper summaries that its "SciPak" team routinely writes for the journal Science and services like EurekAlert. These SciPak articles are designed to follow a specific and simplified format that conveys crucial information, such as the study's premise, methods, and context, to other journalists who might want to write about it.
Now, in a new blog post and white paper discussing their findings, the AAAS journalists have concluded that ChatGPT can "passably emulate the structure of a SciPak-style brief," but with prose that "tended to sacrifice accuracy for simplicity" and "required rigorous fact-checking by SciPak writers."
"These technologies may have potential as helpful tools for science writers, but they are not ready for 'prime time,' at this point for the SciPak team," AAAS writer Abigail Eisenstadt said.
Where’s the human touch?
From December 2023 to December 2024, AAAS researchers selected up to two papers per week for ChatGPT to summarize using three different prompts of varying specificity. The team focused on papers with difficult elements like technical jargon, controversial insights, groundbreaking discoveries, human subjects, or non-traditional formats. The tests used the "Plus" version of the latest publicly available GPT model at each point in the study period, which generally spanned the eras of GPT-4 and GPT-4o.
In total, 64 papers were summarized, and those summaries were evaluated both quantitatively and qualitatively by the same SciPak writers who had written the original briefs on those papers for the AAAS. The researchers note that this design "could not account for human biases," which we'd argue might be significant among journalists evaluating a tool that threatened to take over one of their core job functions.