What is Reinforcement Learning and what is it capable of?

Credit: CIO.com

Insilico Medicine created a drug in just 21 days: what usually takes eight years was reduced to three weeks with reinforcement learning. But how?

“We’ve got AI strategy combined with AI imagination,” Insilico CEO Alex Zhavoronkov, told Forbes. The Hong Kong-based medicine company recently posted research that claimed their GENTRL system could identify potential treatments for fibrosis in just 21 days. That’s a level of efficiency that any industry dreams of, let alone healthcare.

Zhavoronkov reportedly became interested in Ian Goodfellow’s work in machine learning. This informed the direction of the company, researching and developing a reinforcement learning AI capable of creating a drug in just three weeks.

The traditional process to develop drug candidates takes over eight years. It costs millions of dollars too, compared to Insilico’s method, which is approximately $150,000 to implement. In order to develop drugs, molecules have to be screened: Insilico’s vision was that if a machine could do this, it would save a lot of time and effort all round.

Insilico Medicine is not the example of what Zhavoronkov describes as a marriage between imagination and strategy. AlphaGo Zero successfully taught itself to improve at the game of Go, combining a neural network with a search algorithm to predict moves. In the paper, ‘Reinforcement learning-based multi-agent system for network traffic signal control’, researchers tested multi-agent reinforcement learning for a more efficient traffic light system.

Even Twitter is set to use reinforcement learning to cut down on fake news.

How does reinforcement learning work?

Reinforcement learning is a seriously powerful AI method and it’s quite independent in comparison to supervised learning. Unlike supervised learning, you needn’t present labelled input or output pairs: a balance between the exploration and exploitation of data is instead the focus.

Consider Pac-Man, for a minute. In the iconic 80s arcade game, the titular character has to collect dots, avoid ghosts and select rewards that flash up on the screen.

Pac-Man is in a perpetual battle of exploration and exploitation. He can choose to exploit the small dots near to him to rack up points and even aim for the bigger dots if they are near to him. However, should he explore the maze a little further, he can pick up even more points from eating the ghosts when he’s energised: this is a risky strategy as, for a while, he’s chasing his predator and could be in danger when the energiser wears off.

The game of Pac-Man has similarities with the essentials of reinforced learning. / Credit: KnowYourMeme

This is an example of the exploitation/exploration trade-off: the idea that a gamble to explore may reward you more. It’s a cornerstone of computer science philosophy.

Supervised learning relies on the data provided: in reinforcement learning, the AI has to pick up the data itself as it goes, rather like Pac-Man has to eat his way through flashing dots. So the actions your AI takes, like Pac-Man, inform the data that gets collected: sometimes it’s worth considering new actions to gather new data – exploring – whereas other times, an AI will exploit the data it has.

To exploit or explore

Choosing whether to exploit or explore randomly is not the most efficient way to produce results. Wouldn’t it be better if an AI could be more accurate – more greedy, in fact – and find the highest value of an action without having to explore so much?

This is what’s known as a Markov Decision Process.

Say the AI is faced with a choice among a number (k) of different actions. After each choice, depending on the action, the AI may get a reward. It’s the AI’s aim to try and receive the biggest reward as possible. This is what’s known as the k-armed bandit problem, a reference to slot machines and a continuation of the arcade theme. The AI keeps pulling on the lever to maximise its jackpot, so to speak.


Reinforcement learning demands an enormous skillset, gargantuanly complex algorithms and accurate simulations of real-world environments.


So, if we can work out the value of a k action, we can always select the action with the highest value. It’s fair to assume we don’t know action values but we can estimate. At any one time, one action must have the greatest estimated value.

These are what are known as “greedy actions”: when you select one of these actions, you are exploiting its knowledge of the values of the actions. If you choose to gamble and go “non-greedy”, this is exploring. Exploitation maximises expected reward, but exploration may produce greater reward in the long run. Exploration is necessary because we can never be sure how precise action-value estimates are. 

Exploration and exploitation revolve around reward and regret; this is true of computer science, ordering something new from the menu or leaving a job you’re happy in for more money. An AI wants to maximise cumulative reward and minimise total regret.

We want algorithms that bring regret closer towards zero: deep neural nets can process extremely complex functions like this.

Reinforced learning is entering the fray

Supervised learning is still the dominant technique in artificial intelligence. Examples of big companies employing reinforcement learning are still pretty rare but are growing steadily: reinforcement learning has long been an academic research subject, shunned in favour of more straightforward frameworks.

If reinforcement learning sounds complex, that’s because it is: very. It demands an enormous skillset, gargantuanly complex algorithms and accurate simulations of real-world environments.

The crux of reinforcement learning is an accessible one, though: a dilemma similar to the ones we face as individuals in our everyday lives. Do we stick or twist? It’s a question we ask ourselves regularly, yet until now, very few have been willing to invest in what has long been seen as a risky technique.

Insilico Medicine is just one recent example of how reinforced learning can lead to incredible new discoveries. Just like with the technique itself, the journey will be formative. Reinforcement learning may be a complex topic, only just stepping into its spotlight, but with risk, there always comes a lot of reward.

Luke Conrad

Technology & Marketing Enthusiast

Custom Software Development

Natalia Yanchii • 04th October 2024

There is a wide performance gap between industry-leading companies and other market players. What helps these top businesses outperform their competitors? McKinsey & Company researchers are confident that these are digital technologies and custom software solutions. Nearly 70% of the top performers develop their proprietary products to differentiate themselves from competitors and drive growth. As...

The Impact of Test Automation on Software Quality

Natalia Yanchii • 04th October 2024

Software systems have become highly complex now, with multiple interconnected components, diverse user interfaces, and business logic. To ensure quality, QA engineers thoroughly test these systems through either automated or manual testing. At Testlum, we met many software development teams who were pressured to deliver new features and updates at a faster pace. The manual...

Custom Software Development

Natalia Yanchii • 03rd October 2024

There is a wide performance gap between industry-leading companies and other market players. What helps these top businesses outperform their competitors? McKinsey & Company researchers are confident that these are digital technologies and custom software solutions. Nearly 70% of the top performers develop their proprietary products to differentiate themselves from competitors and drive growth. As...

Six ways to maintain compliance and remain secure

Patrick Spencer VP at Kiteworks • 16th September 2024

With approximately 3.4 billion malicious emails circulating daily, it is crucial for organisations to implement strong safeguards to protect against phishing and business email compromise (BEC) attacks. It is a problem that is not going to go away. In fact, email phishing scams continue to rise, with news of Screwfix customers being targeted breaking at...

Enriching the Edge-Cloud Continuum with eLxr

Jeff Reser • 12th September 2024

At the global Debian conference this summer, the eLxr Project was launched, delivering the first release of a Debian derivative that inherits the intelligent edge capabilities of Debian, with plans to expand these for a streamlined edge-to-cloud deployment approach. eLxr is an open source, enterprise-grade Linux distribution that addresses the unique challenges of near-edge networks...
The Digital Transformation Expo is coming to London on October 2-3. Register now!