Online Courses
Free Tutorials  Go to Your University  Placement Preparation 
Artificial Intelligence(AI) & Machine Learning(ML) Training in Jaipur
Online Training - Youtube Live Class Link
0 like 0 dislike
in Artificial Intelligence(AI) & Machine Learning by Goeduhub's Expert (3.1k points)
edited by

In this article we will  understand Q-learning algorithm and how it work using an maze environment example, bellman equation for Q-learning algorithm and Q-learning definition.

Goeduhub's Online Courses @Udemy

For Indian Students- INR 570/- || For International Students- $12.99/-


Course Name

Apply Coupon


Tensorflow 2 & Keras:Deep Learning & Artificial Intelligence

Apply Coupon


Computer Vision with OpenCV | Deep Learning CNN Projects

Apply Coupon


Complete Machine Learning & Data Science with Python Apply Coupon


Natural Language Processing-NLP with Deep Learning in Python Apply Coupon


Computer Vision OpenCV Python | YOLO| Deep Learning in Colab Apply Coupon


Complete Python Programming from scratch with Projects Apply Coupon

1 Answer

0 like 0 dislike
by Goeduhub's Expert (3.1k points)
edited by
Best answer

To understand Q-learning algorithm let's take an example of a maze environment that the agent needs to explore. Consider the below image. 


In the image we have  A: Agent , H-Hurdle , L: Lost, F: Food  and G: Gold.  The agent need to reach the Gold tile with shortest path in order to accomplish the game and the agent can only move one tile at a time. 

If the agent steps on the lost tile Lost tile, the agent will die. 

Rewards and Penalties for Agent for each state and action. 

  • The Agent will lose one point (-1) for each step, this is done , so that agent take shortest path.
  • The agent will lose 50 (-50) points if the agent lends on Hurdle tile and the game will end. 
  • The agent gain one point (+1) if the agent lends on Food tile. 
  • The agent gain 50 point (+50) if the agent reaches on Gold tile and the game will end.

Now we know the problem, to get a good solution for the problem we need to train our agent in such a way that it can reach the Gold tile with shortest path.

For this we need to create a Q-table to note down the moves of agent's actions and agent's gain.

Q-table :  Q-table is just a lookup table where we calculate the maximum expected future rewards for action at each state. 

Let's try to understand it with the above example 

There are four choices (actions) for the agent to move, if it is at a non-edge tile and three choices (actions) , if it is at edge tile. 

Let's make a Q-table for agent's actions 


Q-table is a iterative process, which is used to maximize expected future reward for agent for each state and action. We try to improve Q-table for each iteration.

How to decide values in Q-table for agent's each action ?

The value of Q-learning can be derived from the Bellman equation and the goal of this algorithm is to maximize the Q value.

Bellman equation (Q-function)


Q-values is calculated using above equation where    st- states ,  at-action , γ-discount factor.

Q-learning algorithm process 


Initialize the Q-table : We initialize Q table with value zero (0) in each cell. In Q table there are n numbers of column and m number of rows. 

n= number of actions

m= number of states

Agent choose a path and perform action: 

This step run under a loop until the time we stop the training. In this step we choose an action in the state based on Q value or Q table. But the problem is initially all values are zeros. 

In this case we have a  concept call exploration and exploitation. In simple terms exploration means exploring environment meaning new opprotunities and exploitation is we are using an opprotunity again and again. 

For example going new restuarant at every weekend is exploration and going same restuarant at every weekend is exploitation.

Agent also work in same way during exploration time agent optimize Q value for maximum future reward. 

In simple terms to short the path Agent try Epsilon greedy strategy concept. The rate of epsilon is high at starting as agent explore the environment and deacrese as agent start exploiting environment.

Epsilon greedy strategy : Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between exploration and exploitation randomly.

Evaluation: In this step we actually first measure the reward and then evaluate Q function.

This process is repeated again and again until the learning is stopped. In this way the Q-Table is been updated and the Q function is maximized.

Q-Learning Algorithm in conclusion: 

Q-learning is a value based reinforcement learning algorithm meaning a off policy reinforcement learning algorithm which is used to find a optimal action selection policy using Q-function and Q-table where our goal is to maximize Q-function by iteratively updating Q-table in bellman equation.


Q(state, action) returns the expected future reward of that action at that state.

Our Mentors(For AI-ML)

Sharda Godara Chaudhary

Mrs. Sharda Godara Chaudhary

An alumna of MNIT-Jaipur and ACCENTURE, Pune


Ms. Nisha

An alumna of IIT-BHU

Related questions


About Us | Contact Us || Terms & Conditions | Privacy Policy || Youtube Channel || Telegram Channel © Social::   |  |