‘KinderMining’: Tackling big data sets by keeping things simple

A research assistant uses a pipette to change media that feed trays of human embryonic stem cell cultures in a UW–Madison research lab. Many of UW stem cell pioneer James Thomson’s landmark discoveries provided the original inspiration for the KinderMiner project. Photo: Jeff Miller

With about 100 lines of code, a Morgridge Institute for Research team has unleashed a fast, simple and predictive text-mining tool that may turbocharge big biomedical pursuits such as drug repurposing and stem cell treatments.

The algorithm, named “KinderMiner” by its inventors, has been put to use exploring one of the largest single archives of research journal papers, Europe PubMed Central. Within hours, it can scan the more than 30 million papers online in Europe PMC and provide ranked associations for select target terms and key phrases.
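
The article does not spell out how the ranking is computed. Purely as an illustration, one simple way to rank associations is to count how often each target term appears in the same document as the key phrase. The sketch below assumes a tiny in-memory corpus and plain substring matching; the function name `rank_cooccurrence` and the sample sentences are hypothetical and are not the team's code.

```python
from collections import Counter


def rank_cooccurrence(documents, key_phrase, target_terms):
    """Rank target terms by the share of their documents that also mention key_phrase.

    Hypothetical illustration: case-insensitive substring matching only,
    with no statistical testing and no synonym handling.
    """
    key = key_phrase.lower()
    term_hits = Counter()  # documents mentioning the term
    both_hits = Counter()  # documents mentioning the term and the key phrase
    for doc in documents:
        text = doc.lower()
        has_key = key in text
        for term in target_terms:
            if term.lower() in text:
                term_hits[term] += 1
                if has_key:
                    both_hits[term] += 1
    ranked = [
        (term, both_hits[term], both_hits[term] / term_hits[term])
        for term in target_terms
        if both_hits[term] > 0
    ]
    # Highest co-occurrence ratio first; ties broken by raw co-occurrence count.
    ranked.sort(key=lambda row: (row[2], row[1]), reverse=True)
    return ranked


# Made-up mini corpus for demonstration only.
corpus = [
    "POU5F1 and SOX2 maintain the pluripotent state.",
    "SOX2 expression in neural progenitors.",
    "POU5F1 is required for a pluripotent cell identity.",
    "GATA4 drives cardiac development.",
]
ranking = rank_cooccurrence(corpus, "pluripotent", ["POU5F1", "SOX2", "GATA4"])
for term, count, ratio in ranking:
    print(f"{term}: {count} co-occurrences, ratio {ratio:.2f}")
```

A real run over tens of millions of papers would need an indexed corpus rather than a Python list, but the ranking idea stays the same.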

“We started this project to try to find a text mining approach that works more effectively for scientists,” says senior author Ron Stewart, associate director of bioinformatics at Morgridge, a biomedical institute affiliated with the University of Wisconsin–Madison. “Most often, researchers are running manual Google searches and combing through millions of hits to find, for example, certain genes that are important to a biological process or disease. It’s often based on hunches and intuition. We’re trying to automate and formalize that process.”

Finn Kuusisto, a postdoctoral researcher at the Morgridge Institute and first author on the KinderMiner paper, presented results Wednesday, March 29, at the American Medical Informatics Association’s annual Joint Summits on Translational Science in San Francisco. The summit showcases new applications in bioinformatics that are improving health care.

“There are other techniques out there that require a lot more data-wrangling,” says Kuusisto. “But in our case, we write about 100 lines of Python code, and our users can be given answers that may significantly speed up their scientific process.”

The scientists emphasize that while their queries focused on biomedicine, KinderMiner can be applied to any discipline; the only constant is the need for a massive corpus to search. The next step will be to create an online search interface available to the scientific community.

To test KinderMiner, the team chose two scientific tasks that are time-consuming and often intractable. The first is identifying relevant transcription factors for reprogramming stem cells, and the second is finding drugs with potential off-label benefits or adverse effects.

For cell reprogramming, there are about 2,000 known transcription factors that might be useful in changing a cell from one state to another, such as creating induced pluripotent stem (iPS) cells from skin cells. The team applied KinderMiner to three reprogramming efforts that are well established in the research literature: creating iPS cells, creating cardiomyocytes, and maturing liver cells.

To show the predictive power of the algorithm, the team censored the literature by date, excluding every paper published from two years before each discovery's publication date onward. They queried only papers published up to 2004 for iPS cells, 2008 for cardiomyocytes and 2009 for liver cells.
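
The date censoring described here amounts to a filter on publication year. The sketch below is a minimal illustration under stated assumptions: the cutoff years come from the article, while the record layout, the sample papers and the function name `censor_by_year` are hypothetical, and "up to" is assumed to include the cutoff year itself.

```python
# Last publication year included for each query, as stated in the article.
CUTOFF_YEAR = {"iPS cells": 2004, "cardiomyocytes": 2008, "liver cells": 2009}


def censor_by_year(papers, cutoff_year):
    """Return only the papers published in or before cutoff_year (assumed inclusive)."""
    return [paper for paper in papers if paper["year"] <= cutoff_year]


# Hypothetical records; a real corpus would carry full text and metadata.
papers = [
    {"id": "A", "year": 2003},
    {"id": "B", "year": 2004},
    {"id": "C", "year": 2006},
    {"id": "D", "year": 2010},
]

visible = censor_by_year(papers, CUTOFF_YEAR["iPS cells"])
print([paper["id"] for paper in visible])
```

Any ranking computed on `visible` can then only draw on what had been published before the discovery, which is what makes the test a prediction rather than a lookup.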

The results in all three tests identified numerous relevant transcription factors in the top 20 hits, again from a potential pool of about 2,000 factors. This is a substantial benefit to wet lab scientists, given that the factors likely need to act in combination. For instance, testing all 2,000 factors four at a time would take roughly 665 billion experiments, clearly outside the realm of possibility.

Stewart notes that KinderMiner ranks the factors, and the important ones are likely to be in the top 10 or 20. If scientists test only the top 10 factors four at a time, that requires a manageable 210 experiments, he says.
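
The experiment counts in this passage are binomial coefficients: the number of ways to choose four factors from a pool when order does not matter. A quick check with Python's standard library (the variable names are illustrative only):

```python
import math

# Every 4-factor combination drawn from all 2,000 candidate factors.
all_factors = math.comb(2000, 4)
# Every 4-factor combination drawn from only the top-ranked hits.
top_ten = math.comb(10, 4)
top_twenty = math.comb(20, 4)

print(f"{all_factors:,}")  # 664,668,499,500
print(top_ten)             # 210
print(top_twenty)          # 4845
```

Even widening the shortlist to the top 20 keeps the count in the thousands, compared with hundreds of billions for the full pool.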

The team compared its results against a state-of-the-art data mining tool called Mogrify, and the KinderMiner results overlapped on a large proportion of accurate hits.

“This is kind of like a ‘time machine’ for biology, where we can go back before any of the big publications came out on reprogramming, and still make a good guess about what genes are most important,” says Stewart.

Stewart works in the Morgridge regenerative biology team led by stem cell pioneer James Thomson, and many of Thomson’s landmark discoveries provided the original inspiration for this project. “It would be great if we could help someone in the Thomson lab or a related lab come up with a discovery that has great clinical benefit — but instead of taking 15 years, we do it in three years.”

The second big test involved scanning Europe PMC to identify drugs that reduce blood glucose. Of the top 50 drugs found, 43 are known diabetes treatments, and the other seven either raise or lower blood glucose as a secondary, off-label effect. Those seven hits are especially important because they suggest the tool can predict candidates for drug repurposing.

Repurposed drugs make up about 30 percent of all new drugs or vaccines approved by the U.S. Food and Drug Administration. David Page, a co-author on the study and a professor of biostatistics and medical informatics at UW–Madison, says he is excited about the potential of KinderMiner to identify promising drugs to repurpose.

“You could spend all your time — and all your students’ time — scanning the literature for this kind of secondary drug effect and only scratch the surface of what’s out there,” Page says. “It’s better to write an automated machine learning package to do it instead.”

Kuusisto and Page have received approval to use approximately 10 million de-identified electronic health records from the Veterans Administration to continue the drug repurposing work, examining several drug effects such as lowering of cholesterol levels or blood pressure.

Morgridge computational biologist John Steill, another co-author of the KinderMining study, is using the tool to improve gene marker lists, which have numerous uses such as classifying cells or samples by cell type and identifying samples that may produce tumors.