What is Reinforcement Learning and what is it capable of?

Credit: CIO.com

Insilico Medicine created a drug in just 21 days: what usually takes eight years was reduced to three weeks with reinforcement learning. But how?

“We’ve got AI strategy combined with AI imagination,” Insilico CEO Alex Zhavoronkov told Forbes. The Hong Kong-based medicine company recently published research claiming that its GENTRL system could identify potential treatments for fibrosis in just 21 days. That’s a level of efficiency that any industry dreams of, let alone healthcare.

Zhavoronkov reportedly became interested in Ian Goodfellow’s work in machine learning. This informed the direction of the company, which went on to research and develop a reinforcement learning AI capable of creating a drug in just three weeks.

The traditional process to develop drug candidates takes over eight years. It costs millions of dollars too, whereas Insilico’s method costs approximately $150,000 to implement. In order to develop drugs, molecules have to be screened: Insilico’s vision was that if a machine could do this, it would save a lot of time and effort all round.

Insilico Medicine is not the only example of what Zhavoronkov describes as a marriage between imagination and strategy. AlphaGo Zero successfully taught itself to improve at the game of Go, combining a neural network with a search algorithm to predict moves. In the paper ‘Reinforcement learning-based multi-agent system for network traffic signal control’, researchers tested multi-agent reinforcement learning for a more efficient traffic light system.

Even Twitter is set to use reinforcement learning to cut down on fake news.

How does reinforcement learning work?

Reinforcement learning is a seriously powerful AI method and it’s quite independent in comparison to supervised learning. Unlike supervised learning, you needn’t present labelled input/output pairs: the focus is instead on striking a balance between exploration and exploitation.

Consider Pac-Man, for a minute. In the iconic 80s arcade game, the titular character has to collect dots, avoid ghosts and select rewards that flash up on the screen.

Pac-Man is in a perpetual battle of exploration and exploitation. He can choose to exploit the small dots near to him to rack up points and even aim for the bigger dots if they are near to him. However, should he explore the maze a little further, he can pick up even more points from eating the ghosts when he’s energised: this is a risky strategy as, for a while, he’s chasing his predator and could be in danger when the energiser wears off.

The game of Pac-Man has similarities with the essentials of reinforcement learning. / Credit: KnowYourMeme

This is an example of the exploitation/exploration trade-off: the idea that a gamble to explore may reward you more. It’s a cornerstone of computer science philosophy.

Supervised learning relies on the data provided: in reinforcement learning, the AI has to pick up the data itself as it goes, rather like Pac-Man has to eat his way through flashing dots. So the actions your AI takes, like Pac-Man, inform the data that gets collected: sometimes it’s worth considering new actions to gather new data – exploring – whereas other times, an AI will exploit the data it has.

To exploit or explore

Choosing whether to exploit or explore randomly is not the most efficient way to produce results. Wouldn’t it be better if an AI could be more accurate – more greedy, in fact – and find the highest value of an action without having to explore so much?

Problems like this are usually framed as a Markov Decision Process, and the simplest version has only a single state.

Say the AI is faced with a choice among a number (k) of different actions. After each choice, depending on the action, the AI may get a reward. It’s the AI’s aim to try and receive the biggest reward possible. This is what’s known as the k-armed bandit problem, a reference to slot machines and a continuation of the arcade theme. The AI keeps pulling on the lever to maximise its jackpot, so to speak.
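To make the k-armed bandit concrete, here is a minimal sketch of a toy slot machine in Python. The class name `KArmedBandit`, the Gaussian rewards and the seed are all assumptions made for illustration; the article doesn't specify how rewards are generated.

```python
import random


class KArmedBandit:
    """A toy k-armed bandit: each arm pays a noisy reward around a hidden mean."""

    def __init__(self, k, seed=0):
        self.rng = random.Random(seed)
        # Hidden true value of each arm (assumed Gaussian for this sketch).
        # The AI never sees these numbers directly.
        self.true_means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, arm):
        # The reward is the arm's true mean plus unit-variance noise.
        return self.rng.gauss(self.true_means[arm], 1.0)

    def best_arm(self):
        # Index of the arm with the highest true mean (unknown to the AI).
        return max(range(len(self.true_means)), key=lambda a: self.true_means[a])


bandit = KArmedBandit(k=5)
rewards = [bandit.pull(arm=2) for _ in range(3)]  # pull the same lever three times
```

Because each pull is noisy, one reward tells the AI very little about an arm; it has to pull repeatedly to learn which lever is worth sticking with.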


So, if we can work out the value of each of the k actions, we can always select the action with the highest value. It’s fair to assume we don’t know the true action values, but we can estimate them. At any one time, at least one action must have the greatest estimated value.

These are what are known as “greedy actions”: when the AI selects one of them, it is exploiting its current knowledge of the action values. If it chooses to gamble and go “non-greedy”, it is exploring. Exploitation maximises the expected reward on a single step, but exploration may produce greater reward in the long run. Exploration is necessary because we can never be sure how precise the action-value estimates are.
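One common way to mix greedy and non-greedy choices is the epsilon-greedy rule. The article doesn't name it, so treat it as a swapped-in illustration: act greedily most of the time, and explore at random with a small probability `epsilon`. The function name, the Gaussian rewards and the default values below are assumptions.

```python
import random


def epsilon_greedy(true_means, steps=2000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a bandit with the given hidden means."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k  # current estimated value of each action
    counts = [0] * k       # how many times each action has been tried
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(k)  # explore: pick any action at random
        else:
            # Exploit: pick the greedy action (highest current estimate).
            action = max(range(k), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[action], 1.0)  # noisy reward (assumed)
        counts[action] += 1
        # Incremental sample average: nudge the estimate towards the reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total = epsilon_greedy([0.2, 1.0, -0.5])
```

With `epsilon=0` the agent never explores and can lock on to whichever action looked good first; raising `epsilon` buys better estimates at the cost of more deliberately random pulls.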

Exploration and exploitation revolve around reward and regret; this is true of computer science, ordering something new from the menu or leaving a job you’re happy in for more money. An AI wants to maximise cumulative reward and minimise total regret.

We want algorithms that bring regret closer towards zero: deep neural nets can help here because they can approximate the extremely complex functions involved.
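Regret can be put into numbers in a simulation, where the true action values are known. In the minimal sketch below, total regret is the sum, over every step, of the gap between the best action's true value and the true value of the action actually chosen. The function name and the example values are hypothetical.

```python
def total_regret(true_means, actions):
    """Sum of (best true value - chosen action's true value) over all steps."""
    best = max(true_means)
    return sum(best - true_means[a] for a in actions)


true_means = [1.0, 3.0, 2.0]  # hypothetical true action values
always_best = total_regret(true_means, [1, 1, 1])   # optimal action every time
some_mistakes = total_regret(true_means, [0, 2, 1])  # two sub-optimal choices first
```

Here `always_best` is zero and `some_mistakes` is 3.0 (a gap of 2.0, then 1.0, then none). A real AI can't compute this directly because it doesn't know the true values, which is why regret is mainly a yardstick for comparing algorithms.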

Reinforcement learning is entering the fray

Supervised learning is still the dominant technique in artificial intelligence. Examples of big companies employing reinforcement learning are still pretty rare but are growing steadily: reinforcement learning has long been an academic research subject, shunned in favour of more straightforward frameworks.

If reinforcement learning sounds complex, that’s because it is: very. It demands an enormous skillset, gargantuanly complex algorithms and accurate simulations of real-world environments.

The crux of reinforcement learning is an accessible one, though: a dilemma similar to the ones we face as individuals in our everyday lives. Do we stick or twist? It’s a question we ask ourselves regularly, yet until now, very few have been willing to invest in what has long been seen as a risky technique.

Insilico Medicine is just one recent example of how reinforcement learning can lead to incredible new discoveries. Just like with the technique itself, the journey will be formative. Reinforcement learning may be a complex topic, only just stepping into the spotlight, but with risk, there always comes a lot of reward.

Luke Conrad

Technology & Marketing Enthusiast
