What is Reinforcement Learning and what is it capable of?

Insilico Medicine created a drug in just 21 days: what usually takes eight years was reduced to three weeks with reinforcement learning. But how?

“We’ve got AI strategy combined with AI imagination,” Insilico CEO Alex Zhavoronkov told Forbes. The Hong Kong-based company recently published research claiming that its GENTRL system could identify potential treatments for fibrosis in just 21 days. That’s a level of efficiency that any industry dreams of, let alone healthcare.

Zhavoronkov reportedly became interested in Ian Goodfellow’s work in machine learning. That interest shaped the direction of the company, which went on to research and develop a reinforcement learning AI capable of creating a drug in just three weeks.

The traditional process for developing drug candidates takes over eight years. It also costs millions of dollars, whereas Insilico’s method costs approximately $150,000 to implement. To develop drugs, molecules have to be screened: Insilico’s vision was that if a machine could do this, it would save a lot of time and effort all round.

Insilico Medicine is not the only example of what Zhavoronkov describes as a marriage between imagination and strategy. AlphaGo Zero successfully taught itself to improve at the game of Go, combining a neural network with a search algorithm to predict moves. In the paper ‘Reinforcement learning-based multi-agent system for network traffic signal control’, researchers tested multi-agent reinforcement learning as the basis for a more efficient traffic light system.

Even Twitter is set to use reinforcement learning to cut down on fake news.

How does reinforcement learning work?

Reinforcement learning is a seriously powerful AI method, and it is far more independent than supervised learning. Unlike supervised learning, it needs no labelled input/output pairs: the focus is instead on striking a balance between exploration and exploitation.

Consider Pac-Man, for a minute. In the iconic 80s arcade game, the titular character has to collect dots, avoid ghosts and select rewards that flash up on the screen.

Pac-Man is in a perpetual battle of exploration and exploitation. He can choose to exploit the small dots near to him to rack up points and even aim for the bigger dots if they are near to him. However, should he explore the maze a little further, he can pick up even more points from eating the ghosts when he’s energised: this is a risky strategy as, for a while, he’s chasing his predator and could be in danger when the energiser wears off.

The game of Pac-Man has similarities with the essentials of reinforcement learning. / Credit: KnowYourMeme

This is an example of the exploration/exploitation trade-off: the idea that a gamble to explore may reward you more than sticking with what you already know. It’s a central problem in reinforcement learning.

Supervised learning relies on the data provided: in reinforcement learning, the AI has to pick up the data itself as it goes, rather like Pac-Man has to eat his way through flashing dots. So the actions your AI takes, like Pac-Man, inform the data that gets collected: sometimes it’s worth considering new actions to gather new data – exploring – whereas other times, an AI will exploit the data it has.

To exploit or explore

Choosing whether to exploit or explore randomly is not the most efficient way to produce results. Wouldn’t it be better if an AI could be more accurate – more greedy, in fact – and find the highest value of an action without having to explore so much?

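Decision problems of this kind are usually formalised as a Markov Decision Process: a framework in which an agent repeatedly chooses actions and receives rewards.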

Say the AI is faced with a choice among a number (k) of different actions. After each choice, it may receive a reward that depends on the action taken. The AI’s aim is to collect the biggest reward possible. This is what’s known as the k-armed bandit problem, a reference to slot machines and a continuation of the arcade theme. The AI keeps pulling on the levers to maximise its jackpot, so to speak.
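To make the slot-machine picture concrete, here is a minimal sketch of a k-armed bandit in Python. The class name `KArmedBandit`, its methods and the choice of Gaussian rewards are assumptions made for illustration; nothing here comes from Insilico or any particular library.

```python
import random


class KArmedBandit:
    """Hypothetical k-armed bandit: each arm pays a noisy reward around a hidden mean."""

    def __init__(self, k, seed=0):
        self.rng = random.Random(seed)
        # Hidden true value of each action; the agent never sees these directly.
        self.true_values = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        # Reward is the arm's true value plus unit-variance Gaussian noise (an assumption).
        return self.rng.gauss(self.true_values[action], 1.0)

    def best_action(self):
        # Index of the arm with the highest true value (known only to the simulator).
        return max(range(len(self.true_values)), key=lambda a: self.true_values[a])


bandit = KArmedBandit(k=5, seed=42)
print("reward from arm 0:", bandit.pull(0))
print("best arm (hidden from the agent):", bandit.best_action())
```

The agent only ever sees the rewards returned by `pull`, which is why it has to estimate each arm's value rather than look it up.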


So, if we knew the value of each of the k actions, we could always select the action with the highest value. In practice we don’t know the action values, but we can estimate them from the rewards seen so far. At any one time, at least one action must have the greatest estimated value.

These are what are known as “greedy actions”: when you select one of them, you are exploiting your current knowledge of the action values. If you choose to gamble and go “non-greedy”, you are exploring. Exploitation maximises the expected reward on the next step, but exploration may produce greater reward in the long run. Exploration is necessary because we can never be sure how precise the action-value estimates are.
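One common way to mix greedy and non-greedy choices is the epsilon-greedy rule, which the article does not name but which fits the description above: act greedily most of the time and explore at random with a small probability. The sketch below is illustrative only; the function name `epsilon_greedy`, the Gaussian reward noise and the default parameters are assumptions.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run a hypothetical epsilon-greedy agent on a bandit with the given hidden means."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k  # current estimated value of each action
    counts = [0] * k       # how many times each action has been tried
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any action uniformly at random.
            action = rng.randrange(k)
        else:
            # Exploit: pick the greedy action (highest current estimate).
            action = max(range(k), key=lambda a: estimates[a])

        # Noisy reward around the hidden mean (unit-variance noise is an assumption).
        reward = rng.gauss(true_means[action], 1.0)
        total_reward += reward

        # Incremental sample average: nudge the estimate towards the new reward.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts, total_reward


estimates, counts, total_reward = epsilon_greedy([0.0, 1.0, 3.0])
print("estimated values:", [round(e, 2) for e in estimates])
print("pulls per action:", counts)
```

Setting `epsilon=0` gives a purely greedy agent, which can lock on to a mediocre action if its early estimates happen to be misleading; a small positive `epsilon` keeps every estimate improving over time.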

Exploration and exploitation revolve around reward and regret; this is as true of ordering something new from the menu, or leaving a job you’re happy in for more money, as it is of computer science. An AI wants to maximise cumulative reward and minimise total regret: the reward it gave up by not choosing the best action every time.

We want algorithms that bring regret as close to zero as possible; deep neural networks help because they can approximate the extremely complex functions involved.
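Regret can be made concrete with a few lines of Python. The sketch below uses the usual bandit definition, in which each step's regret is the gap between the best action's expected reward and the expected reward of the action actually chosen. The function name `total_regret` and the example numbers are hypothetical.

```python
def total_regret(true_means, actions):
    """Sum, over all steps, of the expected reward lost by not picking the best action."""
    best = max(true_means)
    return sum(best - true_means[a] for a in actions)


# Three actions with expected rewards 1.0, 2.0 and 4.0 (made-up values).
true_means = [1.0, 2.0, 4.0]

# The agent tried action 0, then the best action twice, then action 1.
actions = [0, 2, 2, 1]

print(total_regret(true_means, actions))  # prints 5.0 (3.0 + 0.0 + 0.0 + 2.0)
```

An agent that always chose the best action would have a total regret of zero; in practice the true means are unknown, so regret can only be measured in simulation or analysed in theory.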

Reinforcement learning is entering the fray

Supervised learning is still the dominant technique in artificial intelligence. Big companies employing reinforcement learning are still pretty rare, though their number is growing steadily: reinforcement learning has long been an academic research subject, shunned in favour of more straightforward frameworks.

If reinforcement learning sounds complex, that’s because it is: very. It demands an enormous skillset, gargantuanly complex algorithms and accurate simulations of real-world environments.

The crux of reinforcement learning is an accessible one, though: a dilemma similar to the ones we face as individuals in our everyday lives. Do we stick or twist? It’s a question we ask ourselves regularly, yet until now, very few have been willing to invest in what has long been seen as a risky technique.

Insilico Medicine is just one recent example of how reinforcement learning can lead to incredible new discoveries. Just like with the technique itself, the journey will be formative. Reinforcement learning may be a complex topic, only just stepping into its spotlight, but with risk, there always comes a lot of reward.

Luke Conrad

Technology & Marketing Enthusiast
