Finding the Corpus of Knowledge for Machine Learning/AI In Chemistry
Machine learning for predicting chemistry is an area of intense research and publication. However, because the terminology used to describe this activity is diverse, it can be difficult to identify all of the publications describing the use of computers to predict chemical outcomes or retrosynthesis paths.
The ways authors describe their work include the following terms:
- prediction of complex chemical reactions
- prediction of reaction outcomes
- computer assisted synthesis planning
- computer assisted organic chemistry
- computer assisted (retro)synthetic analysis
As a result, a simple keyword search is unlikely to find all of the papers on the topic.
While citation analysis is most often used to analyze the number of citations a paper gets, I used it to identify a standard corpus of work by analyzing the papers cited by a key starting paper.
A group at AstraZeneca and University of Bern recently published an excellent paper on how datasets influence “computer assisted synthesis planning” tools. This paper referenced the appropriate works and therefore the authors identified a significant corpus of papers relevant for chemistry prediction.
However, to make sure that the full set of work in the field was identified, I used Scopus to look up the papers cited by each of those references, and then a third level of papers cited by those. This provided three generations of citations. As you might expect, the repeated process begins to converge on a finite set of articles because many papers cite each other, so the mutual citations define a corpus of papers in the field. In total, 7,573 papers were identified, connected by 11,678 citations among them.
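The three-generation expansion described above can be sketched as a breadth-first walk over reference lists. This is only a minimal illustration, not the Scopus workflow itself: the `references` dictionary is made-up toy data standing in for whatever a citation database returns, and the names `snowball` and `max_depth` are hypothetical.

```python
from collections import deque


def snowball(seed, references, max_depth=3):
    """Follow reference lists outward from `seed` for up to `max_depth` generations.

    `references` maps a paper ID to the list of paper IDs it cites.
    Returns (depth_by_paper, citation_edges).
    """
    depth = {seed: 0}          # generation at which each paper was first reached
    edges = set()              # (citing paper, cited paper) pairs
    queue = deque([seed])
    while queue:
        paper = queue.popleft()
        if depth[paper] == max_depth:
            continue           # do not expand beyond the last generation
        for cited in references.get(paper, ()):
            edges.add((paper, cited))
            if cited not in depth:
                depth[cited] = depth[paper] + 1
                queue.append(cited)
    return depth, edges


# Toy data (assumption): a starting paper and a few mutually citing papers.
references = {
    "seed": ["A", "B"],
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["D"],
    "D": ["E"],
}

papers, citations = snowball("seed", references)
print(len(papers), "papers,", len(citations), "citations")
```

Because papers already reached are not queued again, mutual citations add edges without adding new papers, which is why the set stops growing quickly in a field where papers cite each other heavily.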
These were filtered to remove “leaf” citations, papers that are cited by only one work, to provide a smaller set of about 500 papers. I then used Cytoscape to visualize the citation graph. In the graph below, the starting paper is at the lower right. Moving left, each circle represents a publication year, and the citation links connect the circles. Each circle's size is proportional to the number of times the paper is cited in this corpus. The color is graded by year, back to the first paper in the dataset from 1971.
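The leaf filter can be sketched as a count of how often each paper is cited within the corpus. This is a rough illustration under stated assumptions: the edge list is toy data, `drop_leaves` is a hypothetical helper, and it applies the filter in a single pass (whether the original analysis repeated the step is not stated). The starting paper is kept explicitly, since nothing in the corpus cites it.

```python
from collections import Counter


def drop_leaves(citations, keep=()):
    """Keep papers cited by at least two works in the corpus, plus any in `keep`.

    `citations` is an iterable of (citing, cited) pairs.
    Returns (kept_papers, kept_citations, cited_count).
    """
    citations = set(citations)
    cited_count = Counter(cited for _, cited in citations)
    kept = {paper for paper, n in cited_count.items() if n >= 2} | set(keep)
    kept_citations = {(a, b) for a, b in citations if a in kept and b in kept}
    return kept, kept_citations, cited_count


# Toy edge list (assumption), written as (citing, cited).
toy_citations = [
    ("seed", "A"), ("seed", "B"),
    ("A", "B"), ("A", "C"),
    ("B", "C"), ("C", "D"),
]

kept, kept_citations, cited_count = drop_leaves(toy_citations, keep=["seed"])
print(sorted(kept))
```

The same `cited_count` values could drive the circle size in a visualization, or a stricter cutoff for selecting the most cited papers in the corpus.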
The time-based graph is eerily similar to the writing used by the aliens in the 2016 movie “Arrival”. In that movie, one can learn to see the past and future by reading this language.
We can do the same here! In the version at the left, I have selected the papers with more than 15 citations in the body of knowledge. These may represent the main corpus of the field. The most significant milestones are highlighted by the mutual citation. The graph should be largely independent of the starting point after the first year.
One more observation is that the number of papers in the field began to increase 10 years ago; before that there were only 4-10 papers per year, rather than the 20-30 we see more recently.
The table below lists the titles of the highlighted works. These are very significant in the field; the method for finding significant works seems to be working.
Significant enabling technologies are identified as important in the analysis:
- SMILES notation for chemical structures
- Connectivity fingerprints
- Lipinski rules for drug-like compounds
Among the first uses of the term “Artificial Intelligence” in this context is by Wipke in 1984. The corpus found by citations includes the ‘classic’ works in the field that are significant efforts in predicting chemistry:
- LHASA (Corey 1972)
- CAMEO (Jorgensen 1990)
- SYNCHEM (Gelernter 1990)
- Route Designer (Law 2009)
- ReactionPredictor (Kayala 2011/12)
- Wei at Harvard (2016)
- Chematica (Synthia) (Szymkuć 2016)
- Coley’s work at MIT (2017)
- Segler and Waller at Münster (2017)
In conclusion, we have developed a method to identify a corpus of papers highly relevant to the field. The method does not rely on expert searching but uses the authors’ domain expertise to define the results. We use citation analysis, but not the normal mathematics for impact factors. In fact, it is not clear that normal citation analysis would identify the same body of work.
One can use these key papers as alerts; papers that cite them are likely relevant to the field.
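As a rough sketch of that alerting idea, a new paper could be flagged when its reference list overlaps the set of key papers. The identifiers below are placeholders, and `cites_key_papers` and `min_hits` are hypothetical names introduced only for illustration.

```python
def cites_key_papers(reference_list, key_papers, min_hits=1):
    """Return True if a paper cites at least `min_hits` of the key papers."""
    hits = set(reference_list) & set(key_papers)
    return len(hits) >= min_hits


# Placeholder identifiers (assumption); real use would rely on DOIs or database IDs.
key_papers = {"key-1", "key-2", "key-3"}
new_papers = {
    "new-a": ["key-1", "other-9"],
    "new-b": ["other-7", "other-8"],
    "new-c": ["key-2", "key-3", "other-1"],
}

flagged = sorted(p for p, refs in new_papers.items()
                 if cites_key_papers(refs, key_papers))
print(flagged)
```

Raising `min_hits` would trade recall for precision: papers citing several key works are more likely to be central to the field than those citing only one.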
The Elsevier services team can use this and other techniques to help identify technologies, authors and institutions for strategic planning, collaborations and other needs.
[This article previously appeared on my LinkedIn page. From time to time, I share my thoughts and learnings about computational chemistry, chemoinformatics, bioinformatics, supercomputing, informatics and data mining there. Follow me on LinkedIn.]