August 03, 2023
Edifix has led the way in alerting users to two important types of “reference abuse”: citation of retracted articles, and citation of articles from predatory journals.
Edifix uses various sources to detect retractions, including Crossref, PubMed, and the Retraction Watch database (see our retractions FAQ here), offering what we think is the most comprehensive retraction detection available.
Since 2021, Edifix has also included the Cabells Predatory Reports database to alert users to citations of material from over 15,000 journals that engage in predatory, deceptive, or otherwise unethical publication practices. You can read more about the Edifix + Cabells partnership here.
Beyond specific alerts for retracted references or citations of predatory publications, Edifix results contain additional information that, upon closer review, may indicate problematic or even fake research content. This blog post explores how Edifix results can be used to detect additional issues that may warrant closer review by editorial teams.
2023 has seen the explosion into the public consciousness of ChatGPT and other large language models (LLMs), and these AI applications have been rapidly and widely adopted in educational and research writing settings. Much has already been written about the potential benefits and pitfalls of using LLMs in scholarly publishing, including 14 posts on the Scholarly Kitchen blog alone (listed chronologically under “Further Reading” below).
The rapid embrace of LLMs has brought with it another flavor of potential reference manipulation: fake references.
If you ask ChatGPT what it is good at, among its strengths it will list language generation: “ChatGPT is adept at generating coherent and contextually relevant text. It can assist with writing tasks, creative writing, brainstorming ideas, or even generating code snippets, recipes, or stories.” But it also offers this caveat: “It may occasionally provide incorrect or nonsensical answers, and its responses are generated based on patterns in its training data rather than true understanding.”
Those last words are particularly relevant to this discussion: ChatGPT does not have a true understanding of the questions it is asked or the tasks it is set. Among the “nonsensical answers” that ChatGPT can give, one type especially pertinent to research publishing is its inability to generate relevant and accurate citations.
This failure was highlighted by Curtis Kendrick on the Scholarly Kitchen just two months after the public launch of ChatGPT. When he asked ChatGPT to provide a reference list for a piece it had written on racism and whiteness in academic libraries, the list of 29 references it provided revealed a number of eye-opening problems.
First, half of the citations were from just two journals, and typically these references were incomplete, generally lacking volume and/or issue numbers. Partly this reflects the limitations of the dataset used to train the model, which, for example, had access only to open access articles. Much more worrying was that ChatGPT didn’t always admit to not knowing the answer, sometimes appearing to lie instead. Of the 29 references it came up with, only one was accurate; some contained elements of genuine references but with parts transposed, and others were completely fake.
Our analysis of Edifix results since the rise of ChatGPT suggests that Edifix offers (at least) three clues to the use of AI to generate references. As with the detection of retractions, using Edifix with both Crossref and PubMed Reference Checking gives you the best odds of detecting “fake” references.
The first clue that a reference list may not be what it seems is the overall rate of linking on Crossref and PubMed.
There are some obvious caveats:
Given these known limitations, however, a low link rate may indicate a fake reference list. Depending on the discipline, if the link rate of journal references in an Edifix job seems unusually low, it’s probably worth digging deeper. For example, we typically expect a link rate of ≥95% for journal references in a life sciences article, but perhaps only ≥80% in some engineering disciplines. A link rate below what you typically see may be worth investigating.
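The link-rate check described above can be sketched in a few lines of Python. This is only an illustration, not part of Edifix: the function names, the discipline keys, and the idea of passing in raw counts are assumptions, and the thresholds are simply the typical figures quoted above (95% for life sciences, 80% for some engineering disciplines). Adjust them to match what you normally see in your own journals.

```python
# Hypothetical helper, not an Edifix API: flag a job whose journal-reference
# link rate falls below what is typical for the discipline.

# Typical minimum link rates quoted in this post; tune these for your own titles.
TYPICAL_LINK_RATE = {
    "life_sciences": 0.95,
    "engineering": 0.80,
}


def link_rate(linked: int, journal_refs: int) -> float:
    """Share of journal references that linked to Crossref or PubMed."""
    if journal_refs <= 0:
        raise ValueError("journal_refs must be positive")
    if not 0 <= linked <= journal_refs:
        raise ValueError("linked must be between 0 and journal_refs")
    return linked / journal_refs


def needs_closer_look(linked: int, journal_refs: int, discipline: str) -> bool:
    """True when the link rate is below the typical rate for the discipline."""
    return link_rate(linked, journal_refs) < TYPICAL_LINK_RATE[discipline]


# Made-up counts for illustration: 58 of 60 linked is fine, 40 of 60 is not.
print(needs_closer_look(linked=58, journal_refs=60, discipline="life_sciences"))  # False
print(needs_closer_look(linked=40, journal_refs=60, discipline="life_sciences"))  # True
```

A result of True is a prompt to look more closely, not a verdict, for the reasons given in the caveats above.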
Retraction Watch recently highlighted an extreme example of this phenomenon in the reference list of a now-withdrawn preprint. We scraped the reference list from the PDF, and the results of processing those references with Edifix were eye-opening:
This reference list clearly fails the “typical link rate” test. Of 129 references, only 96 are references to journal articles (you can find the total number of each type of reference by looking at the XML of the job results). Only 2 linked to PubMed and only 14 linked to Crossref, although 8 of those DOIs were in the manuscript already; while most of the DOIs from ChatGPT were valid Crossref DOIs, they did not match the content of the references (see section 2 below).
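To put those counts in proportion, the short Python sketch below works out the rates for this reference list. The numbers are the ones reported above; the variable names are illustrative, and the calculation assumes that the Crossref and PubMed links are all counted against the 96 journal references.

```python
# Counts reported for the withdrawn preprint's reference list.
total_refs = 129
journal_refs = 96
crossref_linked = 14
pubmed_linked = 2
dois_already_in_manuscript = 8  # of the 14 Crossref links

journal_share = journal_refs / total_refs
crossref_rate = crossref_linked / journal_refs
pubmed_rate = pubmed_linked / journal_refs
# Crossref links that reference checking added, beyond DOIs the authors supplied.
newly_matched = crossref_linked - dois_already_in_manuscript

print(f"Journal references: {journal_share:.1%} of the list")  # 74.4%
print(f"Crossref link rate: {crossref_rate:.1%}")              # 14.6%
print(f"PubMed link rate:   {pubmed_rate:.1%}")                # 2.1%
print(f"Newly matched on Crossref: {newly_matched}")           # 6
```

Either rate is far below the 80% to 95% we would typically expect for journal references.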
Another hallmark of AI-generated references is the presence of chimeric references, in which parts of one genuine reference are combined with other parts from an unrelated reference (which may or may not be genuine).
Chimeric references can often be detected using the warnings Edifix generates from Crossref and PubMed reference correction. For example:
In this example, the journal title, volume and page information, and DOI of a genuine OA article have been combined with an author list and article title that return no matches on a Google search.
And these chimeras aren’t limited to two-reference hybrids! This reference combines elements of at least three different genuine references:
The author list consists of the first three authors of a 2007 paper about climate change; an invented page range has been added to a legitimate combination of journal, year, and volume; and the DOI comes from a genuine article in that volume. These elements have all been mashed together with an apparently fictitious article title.
The same reference list showed several other such chimeras. Reviewing the warnings issued by Edifix indicates very clearly that there are significant concerns associated with this reference list that warrant further investigation.
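The kind of cross-check an editor might run on a suspected chimera can be sketched as follows. This is not how Edifix works internally; it is a minimal Python illustration that assumes you already have the reference as cited and the record registered for its DOI as plain dictionaries. The field names, the sample records, and the 0.5 word-overlap cutoff are all hypothetical.

```python
from __future__ import annotations


def words(text: str) -> set[str]:
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of two word sets (0.0 = disjoint, 1.0 = identical)."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def chimera_warnings(cited: dict, registered: dict, cutoff: float = 0.5) -> list[str]:
    """Compare a cited reference with the record registered for its DOI."""
    warnings = []
    if word_overlap(cited["title"], registered["title"]) < cutoff:
        warnings.append("title differs from the record registered for this DOI")
    cited_names = {s.lower() for s in cited["surnames"]}
    registered_names = {s.lower() for s in registered["surnames"]}
    if not cited_names & registered_names:
        warnings.append("no author in common with the record registered for this DOI")
    for field in ("journal", "volume"):
        if cited[field].strip().lower() != registered[field].strip().lower():
            warnings.append(f"{field} differs from the record registered for this DOI")
    return warnings


# Invented records for illustration only: journal and volume agree,
# but the title and authors belong to something else entirely.
cited = {"title": "Deep learning for quarterly revenue forecasting",
         "surnames": ["Alpha", "Beta"], "journal": "Journal of Examples", "volume": "12"}
registered = {"title": "Soil moisture trends in alpine meadows",
              "surnames": ["Gamma", "Delta"], "journal": "Journal of Examples", "volume": "12"}

for warning in chimera_warnings(cited, registered):
    print(warning)
```

A reference whose journal, volume, and DOI agree with a genuine record while its title and authors do not is the pattern described in the examples above.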
Another significant warning relates to invalid DOIs, which can also indicate potentially fake references:
The reference parsing technology behind Edifix can handle references formatted to a whole range of editorial styles and (in many cases) to no recognized style. But there are certain anomalies that will prevent Edifix from parsing a journal reference, including the absence of key pieces of information.
For example, a recent Edifix job showed only 6 of 20 references successfully parsed; closer inspection revealed that the majority of the unparsed references were journal references with no journal title, as in this example:
This is exactly the kind of mistake that ChatGPT is known to make: omitting key pieces of information from a reference because it doesn’t understand what the different elements mean.
A parse rate below 50%, therefore, is another red flag that suggests further analysis is warranted.
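The parse-rate flag is just as easy to express as a check. The Python sketch below is an illustration only: the function name is hypothetical, and the 50% cutoff is the figure suggested above rather than a setting in Edifix.

```python
def low_parse_rate(parsed: int, total: int, cutoff: float = 0.50) -> bool:
    """True when fewer than `cutoff` of the references were parsed."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= parsed <= total:
        raise ValueError("parsed must be between 0 and total")
    return parsed / total < cutoff


# The job described above: 6 of 20 references parsed, a rate of 30%.
print(f"{6 / 20:.0%}")         # 30%
print(low_parse_rate(6, 20))   # True
```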
In isolation, none of the individual references discussed above would necessarily have raised a red flag. But taken together with the other issues highlighted by Edifix in this reference list, enough flags have been raised to warrant a closer look. Just as plagiarism-checking software gives you indicators of possible plagiarism rather than a yes/no answer to the question “Was this article plagiarized?”, Edifix gives you indicators that a closer look is required rather than a yes/no answer to the question “Is this reference list fake?”
These Edifix tools represent welcome reinforcements to our arsenal in the fight against AI-generated references. Remember, as Lionel Fusco said of the struggle against the malign AI program Samaritan in Person of Interest, “No one ever said we were gonna win, but it doesn’t mean you stop fighting.”
Phil Davis, “Did ChatGPT just lie to me?” The Scholarly Kitchen, January 13, 2023
Link: https://www.edifix.com/blog/using-edifix-results-to-detect-research-integrity-issues