Reproducibility problem: The Information Sciences Institute at the University of Southern California has developed a new approach to evaluating the reliability of scientific research by combining AI with knowledge graphs.
What Is Reproducibility?
The scientific method is what holds all scientific inquiry together; it is the problem-solving framework used by everyone from third-grade science students to Nobel laureates. And central to that method is the reproducibility of findings.
First, a definition. For our purposes, “reproducibility” refers to the ability to produce the same output under identical conditions, including but not limited to the same input data, computational steps, methodologies, and code.
If you test X under conditions Y and obtain result Z, you should get Z again when you repeat the experiment. And if another scientist tests X under conditions Y, they should also arrive at Z. That’s the way science goes, baby!
That’s not always the case, though; in fact, it’s rarely true.
The Big Problem in Science Right Now
The scientific world has realized over the past two decades that much scientific research is difficult or impossible to repeat precisely, a problem dubbed “the reproducibility crisis.” In a 2016 survey of more than 1,500 scientists conducted by the journal Nature, 70 percent of researchers said they had tried and failed to reproduce another scientist’s experiment, and more than half said they had tried and failed to reproduce their own.
This is problematic in and of itself, since it casts doubt on the veracity of the findings, but it becomes catastrophic when one considers that subsequent research builds on the foundation laid by the original. If we can’t guarantee that testing X under conditions Y will always yield Z, then we can’t rely on Z to establish anything else.
Jay Pujara, director of the Center on Knowledge Graphs at USC’s Information Sciences Institute (ISI), stated it plainly: “People will not believe in science if we can’t demonstrate that scientific research is reproducible.”
In 2019, Pujara was part of an ISI team that received funding from the Defense Advanced Research Projects Agency (DARPA) under SCORE (Systematizing Confidence in Open Research and Evidence), a DARPA program to develop and deploy automated tools that assign “confidence scores” estimating the likelihood that a given scientific finding can be reproduced. The team’s principal investigators (PIs) were Kristina Lerman and Fred Morstatter.
Assessing Reproducibility
Manually evaluating a paper’s reproducibility is doable, but it is time-consuming and costly. And in many scientific fields, novel results matter more for career advancement than repeating previous experiments, so researchers with limited resources have little incentive to run painstaking replication studies of work related to their own. SCORE set out to find a way to evaluate research reproducibility automatically and at scale, a much-needed service for the scientific community.
Researchers have recently applied machine learning methods to the problem, with the goal of understanding which factors in a paper have the most bearing on its reproducibility. Kexuan Sun, a Ph.D. student in Pujara’s lab, pushed these approaches further by looking at the relationships across publications rather than just the variables within them, something that had never been done before.
Instead of simply reading a document or analyzing its individual elements, “we had new ideas on how to measure reproducibility using knowledge graphs,” Pujara explained. The goal was to understand as much as possible about a publication: “we weren’t going to only look at sample size and p-values, for example.”
The team developed and evaluated this new method and presented the results in a paper, “Assessing Scientific Research Papers using Knowledge Graphs,” at ACM SIGIR 2022 (the Association for Computing Machinery’s Special Interest Group on Information Retrieval conference) in July 2022.
From the Micro to the Macro
To build a massive knowledge graph (KG), a representation of a network of real-world entities and the relationships between them, the ISI team collected data from over 250 million academic papers. The KG was built using a synergistic approach that combined data collection and analysis at both the micro and macro scales.
At the micro level, they looked within individual papers at factors such as sample sizes, effect sizes, and experimental models, characteristics that have been found to affect reproducibility. These fine-grained features were attached as supplementary attributes to entities such as authors and papers.
At the macro level, the team zoomed out to capture metadata about the connections between entities, such as authorship and citations. These overarching relationships form the structural backbone of the network.
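To make the two scales concrete, here is a minimal sketch, not the ISI pipeline, of how such a graph might be assembled in Python using the networkx library; the paper and author identifiers and the feature values are hypothetical.

```python
# Illustrative sketch only: micro-level features become node attributes,
# macro-level relations (authorship, citation) become edges.
import networkx as nx

kg = nx.MultiDiGraph()

# Micro level: per-paper features extracted from the text (made-up values).
kg.add_node("paper:P1", kind="paper", sample_size=120, effect_size=0.42, p_value=0.03)
kg.add_node("paper:P2", kind="paper", sample_size=35, effect_size=0.10, p_value=0.049)
kg.add_node("author:A1", kind="author", name="A. Researcher")

# Macro level: relationships between entities.
kg.add_edge("author:A1", "paper:P1", relation="authored")
kg.add_edge("author:A1", "paper:P2", relation="authored")
kg.add_edge("paper:P2", "paper:P1", relation="cites")

# A downstream model can now see both views: node attributes (micro)
# and graph structure (macro).
print(kg.nodes["paper:P1"])
print(list(kg.edges(data=True)))
```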
So, Did It Work?
In a word, yes! As it turned out, the approach was remarkably synergistic. The micro-level analysis was aided by the relationship information gleaned from the macro-level view, and the macro-level KG analysis benefited in turn from the insights provided by the micro-level text analysis. One of the study’s most important conclusions was that combining the two perspectives yielded insights that went beyond simply adding them together.
Pujara called the result “state-of-the-art performance on predicting reproducibility,” saying, “We found that if we could put both of these things together, combine some data from the text and some features from the knowledge graph, we could do considerably better than either of those methods alone.”
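The fusion idea Pujara describes can be sketched in a few lines. The following toy example is not the model from the SIGIR paper; it simply shows, with synthetic data and scikit-learn, how text-derived and graph-derived feature vectors might be concatenated and fed to a single classifier that predicts whether a finding replicates.

```python
# Toy sketch of feature fusion (synthetic data, not the ISI model).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_papers = 200

text_features = rng.normal(size=(n_papers, 8))    # e.g. sample size, p-values, text embeddings
graph_features = rng.normal(size=(n_papers, 16))  # e.g. KG node embeddings, citation statistics
labels = rng.integers(0, 2, size=n_papers)        # 1 = replicated, 0 = did not (synthetic)

# Fusion: one feature vector per paper, drawn from both views of the data.
X = np.concatenate([text_features, graph_features], axis=1)

clf = LogisticRegression(max_iter=1000).fit(X[:150], labels[:150])
print("held-out accuracy:", clf.score(X[150:], labels[150:]))
```

In the real system, the graph-derived features would come from the knowledge graph described above rather than from random numbers, which is what lets the combined model outperform either source alone.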
From here, the group intends to expand the KG to incorporate more papers and to dig deeper into specific aspects of it. A study in which the team used the knowledge graph to examine the relationship between citations and gender was published in PNAS on September 26, 2022.
Additionally, they hope to release their software to the general public. Pujara has stated, “We’re attempting to construct portals where anyone can evaluate research and its reproducibility,” describing a website where users “could enter in papers and explore the knowledge graph and some of the features.”
Settling the Score with Science
Pujara envisions the findings, ultimately “reproducibility scores” assigned to individual studies, being used by students, the scientific community, and the general public.
“If you’re a freshman in college, this might shed light on how to conduct research more effectively.”
In addition, it has the potential to improve the current system for evaluating scientific studies. A conference or journal may receive tens of thousands of submissions before organizers make their final decisions on which to accept, reject, or send back for revision. Consider the time and effort required to review all of them. A reproducibility score could help sort through those 20,000 papers and flag the ones that matter most.
Finally, the public could benefit from the approach by gaining greater faith in science, which, as we saw during the COVID-19 pandemic, has real consequences for public health. When asked about scientific credibility during the pandemic, Pujara responded, “There was phony research, and then there was genuinely good science.”
Pujara elaborated, saying, “Perhaps we would be able to prevent that type of disagreement if we were able to use these kinds of automated methods to recognize what research was trustworthy and what research we should take with a grain of salt.”
The scientific reproducibility crisis may not be over, but it is being addressed thanks to the persistent efforts of Pujara and the ISI team. And in the future, when we test X under conditions Y, we will have a better sense of how much to trust result Z.