This was first posted on 05/29/2014.

Medicinal chemists are known to display behaviour akin to compulsive gamblers: “this last molecule will solve [fill in the blank] and be the next [candidate] drug”. This can be attributed to a combination of a positive never-give-up attitude and a naive over-confidence in forecasting. To be fair, with more computer power, better algorithms and more data we are indeed constantly getting better at forecasting. But the gains in our ability to predict are still often dwarfed by randomness and the complexity of the problem at hand.

In his wonderful book “The Black Swan”, Nassim Taleb colourfully describes how we humans often over-interpret and come up with stories to convince ourselves that we understand the past. We take sequences of events and weave explanations into them, even though much of what happened was deemed unlikely, if not insane, before it actually happened. Still, we find it rather trivial to post-rationalize and come up with logical explanations that do not seem crazy once the event has happened. This illusion of understanding, and our tendency to simplify, is called the “narrative fallacy”. Take the most famous doping incident in sport, for example. It is obvious to everyone now that Lance Armstrong was doping. But before he was caught, Armstrong’s success was credited to hard work and mental superpower. Few people, if any, suspected that he was doping, and it was only after the facts emerged that people believed it. And now it is obvious to everyone that he had to be doping to win the Tour de France seven (!) times in a row.

The human mind is hard-wired to construct stories around events, making them easier to remember. To stretch it just a little, it’s almost as if we have a mini John Nash sitting on our shoulder (I mean the Russell Crowe version of Nash, who could see patterns where there were no patterns, not the real John Nash, the Nobel prize winner).

Russell Crowe as John Nash in “A Beautiful Mind” can see patterns where there are no patterns (left), whereas the new matched-molecular series tool Matsy can find patterns where there really are patterns (right).

Lo and behold, there are ways to escape the narrative fallacy. By applying hypothesis-based approaches, that is, conducting experiments with testable predictions, knowledge can supersede storytelling. A previous blogpost addresses some of these concerns: Anthony Nicholls at OpenEye and Pat Walters at Vertex are championing better use of statistics and reproducibility in molecular modelling.

The more information one collects and the smarter one uses it, the better the predictions are likely to be. There are, however, entertaining examples of when collecting too much information and being too clever leads to unwanted results. A personal favourite is when the American retail store Target figured out from a teenage girl’s buying patterns that she was pregnant, before her father did (making him quite upset). The way Target went about it can be seen as standard. They maintained a database storing everything their customers bought and used math and statistics to find patterns, which could be condensed into a “pregnancy score”. They identified key products (scent-free soap, bags of cotton balls, hand sanitizers, washcloths etc.), and when these were bought in certain amounts and combinations a pregnancy alert would go off and targeted advertising went out to the customer. Such recommender system approaches are often very useful. Target realized, however, that revealing that they had such information freaks people out.

What has this got to do with drug design then? Well… there are many large databases of biological activity data available that could be mined in a similar fashion, using new algorithms and evaluating the results with statistics. Leveraging such datasets to predict new trends is conceptually very attractive, and is consequently a hot topic in the current medicinal chemistry literature.

In collaboration with Noel O’Boyle and Roger Sayle (NextMove Software) we recently described a matched-molecular series* method that exploits “Big Data” to recommend what R-group to put on next. Matched series may be seen as an extension of the popular matched-molecular pairs approach. A drawback with matched pairs is that, when investigating big sets of biological data, the effects are often normally distributed around zero. We found that certain activity orders of R-groups are preferred, and that the longer the matched series, the more predictive it is. A bit like the pregnancy score. These observations provide medicinal chemists with a knowledge-based recommender tool to help win the jackpot quicker. The tool is called Matsy, and can consequently be used as a (testable) hypothesis generator for molecular design. In case you are interested in finding out more about this approach, the manuscript is freely available here.
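To make the idea a little more concrete, here is a minimal Python sketch of how a matched-series recommender could work in principle. It is not the Matsy implementation: the toy database `SERIES_DB`, its activity values and the function `recommend_next` are all hypothetical and invented purely for illustration. The assumption is simply that, given the activity order observed so far for a few R-groups, one looks up database series that show the same order and counts how often each additional R-group beat the best one tried so far.

```python
from collections import defaultdict

# Hypothetical toy data: each dict is one matched series (same scaffold,
# same attachment point), mapping R-group -> activity (e.g. a pIC50 value).
SERIES_DB = [
    {"H": 5.0, "F": 5.5, "Cl": 6.1, "Br": 6.4},
    {"H": 6.2, "F": 6.3, "Cl": 7.0, "Br": 7.3, "OMe": 6.0},
    {"H": 4.8, "F": 5.1, "Cl": 5.9, "OMe": 6.5},
    {"H": 7.0, "F": 6.1, "Cl": 5.5, "Br": 5.2},
]


def recommend_next(query_order, series_db):
    """Rank R-groups not yet tried.

    query_order lists the R-groups already made, from least to most active.
    Returns (r_group, times_improved, times_observed) tuples, best first.
    """
    improved = defaultdict(int)
    observed = defaultdict(int)
    for series in series_db:
        # The database series must contain every R-group in the query ...
        if not all(r in series for r in query_order):
            continue
        # ... and show them in the same strictly increasing activity order.
        acts = [series[r] for r in query_order]
        if any(a >= b for a, b in zip(acts, acts[1:])):
            continue
        best_so_far = acts[-1]
        for r_group, activity in series.items():
            if r_group in query_order:
                continue
            observed[r_group] += 1
            if activity > best_so_far:
                improved[r_group] += 1
    ranked = [(r, improved[r], observed[r]) for r in observed]
    # Sort by fraction of improvements, then by amount of evidence, then name.
    ranked.sort(key=lambda t: (-t[1] / t[2], -t[2], t[0]))
    return ranked


if __name__ == "__main__":
    for r_group, hits, total in recommend_next(["H", "F", "Cl"], SERIES_DB):
        print(f"{r_group}: improved activity in {hits} of {total} matching series")
```

With this made-up data, the query H < F < Cl matches three of the four series, and Br comes out ahead of OMe. A longer query is a stricter filter, which is one simple way to picture why longer series would carry more signal than a single matched pair.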

Finally, if you would like to hear more on the “narrative fallacy” just listen to athletes/coaches/fans explaining why they lost/won their latest game. We scientists do not do such things, right?

*A matched molecular series describes a set of molecules with the same scaffold but different R-groups at a particular position.