Machine learning is commonly divided into three types of learning, that is, three ways in which a machine learns:

**Supervised learning:** The machine learns under supervision from labeled data. Example: predicting values for new inputs based on what it learned from labeled examples.

**Unsupervised learning:** The machine learns without supervision from unlabeled data, by recognizing patterns in the data. Example: clustering similar items into groups.

**Reinforcement learning:**

Reinforcement learning is a more advanced form of machine learning in which the machine learns differently from both supervised and unsupervised learning.

In reinforcement learning, an agent learns continuously by interacting with its environment. Each action the agent takes earns it a positive or negative reward, and this feedback improves the agent's understanding of the environment and the problem, and so its performance.

Example: a self-driving car.

**Environment:** The space in which an agent operates and learns. It is generally random (stochastic).

**Reward:** The feedback the agent receives for its action.

**Agent:** The entity that explores the environment.

**State:** The current situation of the agent, as returned by the environment.

**Action:** The moves the agent takes based on what it has learned from the environment.
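The way these five pieces fit together can be sketched as a short loop. This is only a minimal illustration, not a driving simulator: the `LineWorld` environment, its reward values, and the `random_agent` function are hypothetical names invented for this example.

```python
import random

class LineWorld:
    """Hypothetical toy environment: positions 0..4 on a line, goal at 4."""

    def __init__(self):
        self.state = 0  # the agent starts at the left end

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else -0.1  # assumed values: small penalty per move
        return self.state, reward, done

def random_agent(state):
    """An agent with no knowledge yet: it picks an action at random."""
    return random.choice([-1, 1])

def run_episode(env, agent, max_steps=1000):
    """Run one episode and return the total reward collected."""
    state, total = env.state, 0.0
    for _ in range(max_steps):
        action = agent(state)                    # agent acts on the current state
        state, reward, done = env.step(action)   # environment returns feedback
        total += reward
        if done:
            break
    return total

random.seed(0)
print(run_episode(LineWorld(), random_agent))
```

The random agent here does not learn anything yet; it only shows where the state, action, and reward flow in each step.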

Reinforcement learning is based on trial and error: the agent is told nothing about the environment or about which actions to take. It learns from the feedback it gets from the environment, that is, the reward.

For example, the agent of a self-driving car receives a negative reward if the car has an accident (gets hit) and a positive reward if it reaches its goals without hitting anything.

To build an optimal **policy** that keeps the self-driving car from getting hit, the agent has to explore more and more states (exploration) while also using what it has already learned to maximize its rewards (exploitation). This is called the exploration vs. exploitation trade-off. The agent has to balance the two to obtain the best long-term reward (value).
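One common way to balance the two is an epsilon-greedy rule, which the text above does not name; it is used here only as an illustration. The function name `epsilon_greedy` and the value estimates are assumptions made up for this sketch.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon the agent explores (random action);
    otherwise it exploits (action with the highest current estimate).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# Hypothetical value estimates for three actions.
estimates = [0.2, 0.8, 0.5]
print(epsilon_greedy(estimates, epsilon=0.0))  # never explores, so it picks action 1
print(epsilon_greedy(estimates, epsilon=0.3))  # explores about 30% of the time
```

A larger `epsilon` means more exploration; a smaller one means the agent mostly repeats what currently looks best.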

**Policy:** A policy is the strategy the agent uses to choose its next action based on the current state.

**Value:** The value is the long-term future reward the agent expects to receive, discounted by the discount factor, as opposed to the immediate short-term reward.
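To make the role of the discount factor concrete, the sketch below computes the discounted sum of one sequence of rewards. The value is the expectation of this quantity; a single made-up reward sequence is used here, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum rewards weighted by gamma**k, where k is the step index."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total

rewards = [1.0, 1.0, 1.0]               # made-up rewards for three steps
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With a discount factor close to 0 the agent cares mostly about the immediate reward; with a factor close to 1 later rewards count almost as much as early ones.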

**Applications of reinforcement learning:**

- Robotics for industrial automation
- Game playing
- Business decision-making
- Traffic signal control
- Robotics control

**Approaches to implement reinforcement learning**

**There are three ways to implement reinforcement learning:**

**1. Value based:** The value-based approach tries to find the optimal value function, that is, the maximum value achievable at a state under any policy. The agent then expects a long-term return from the current state under that policy.

**2. Policy based:** In the policy-based approach, the agent tries to find a policy that gains the maximum future reward directly, without using a value function.

There are two types of policy:

- Deterministic: The policy produces the same action every time the agent is in a given state.
- Stochastic: The action is drawn from a probability distribution over the possible actions.
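The difference between the two policy types can be sketched with plain dictionaries. The states, actions, and probabilities below are hypothetical values chosen for the self-driving example, not taken from any real system.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"red_light": "brake", "green_light": "accelerate"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "red_light": {"brake": 0.9, "accelerate": 0.1},
    "green_light": {"brake": 0.2, "accelerate": 0.8},
}

def act_deterministic(state):
    """Return the single action assigned to this state."""
    return deterministic_policy[state]

def act_stochastic(state, rng=random):
    """Sample an action according to the probabilities for this state."""
    actions = list(stochastic_policy[state])
    weights = [stochastic_policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

print(act_deterministic("red_light"))  # always "brake"
print(act_stochastic("red_light"))     # "brake" most of the time
```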

**3. Model based:**

In the model-based approach, a virtual model of each environment is created, and the agent explores that model to learn the environment.

**Types of Reinforcement learning**

**Positive reinforcement:** It has a positive effect on the agent's behavior, increasing the strength and frequency of that behavior.

**Negative reinforcement:** Negative reinforcement is the opposite of positive reinforcement. It increases the tendency for a specific behavior to occur again by avoiding a negative condition, and it can be more effective than positive reinforcement in some situations.

**Reinforcement learning algorithms**

There are two important learning models in reinforcement learning.

**Markov Decision Process**

In a Markov decision process, the agent constantly interacts with the environment and performs actions. For each action, the environment responds and generates a new state and a reward as feedback to the agent.

The environment is assumed to be fully observable and is formally described as a Markov decision process (MDP).

The Markov decision process is used to describe the environment in reinforcement learning, and almost all RL problems can be formalized as MDPs.

A Markov decision process needs to satisfy the Markov property.

**What is the Markov property?**

It says that the future is independent of the past given the present. That is, if the agent is in the current state S1, performs an action A1, and moves to the state S2, then the transition from S1 to S2 depends only on the current state S1 and the action A1. Future actions and states do not depend on past actions, rewards, or states.
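In symbols, the property can be written as follows. The step index t is notation introduced here for illustration, with S_t and A_t standing for the state and action at step t:

$$
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \ldots, S_t, A_t)
$$

Knowing the whole history on the right-hand side gives no more information about the next state than knowing only the current state and action.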

For example, in a game of chess, the player only needs to consider the current state of the board and future actions, not past actions and states.

**Markov process / Markov chain:** A Markov process is a memoryless process consisting of a sequence of random states S1, S2, S3, … that satisfies the Markov property.

A Markov process (Markov chain) is a tuple (S, P), where:

- S: a finite set of states
- P: the state transition probability
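A Markov chain given as (S, P) can be sampled with a few lines of code. The two-state weather chain below is a hypothetical example with made-up probabilities; the set S is simply the keys of `P`.

```python
import random

# Hypothetical two-state chain: P[s][s2] is the probability of moving s -> s2.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_chain(P, start, steps, rng=random):
    """Sample a state sequence; each next state depends only on the current one."""
    state, path = start, [start]
    for _ in range(steps):
        next_states = list(P[state])
        weights = [P[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path

print(sample_chain(P, "sunny", 5))
```

Note that `sample_chain` never looks at `path` when choosing the next state, which is exactly the memoryless behavior described above.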

**Markov reward process:** A Markov reward process is a Markov chain with reward values.

A Markov reward process is a tuple (S, P, R, γ), where:

- S: a finite set of states
- P: the state transition probability
- R: the reward
- γ: the discount factor
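Given (S, P, R, γ), the long-term value of each state can be estimated by repeatedly applying the relation v(s) = R(s) + γ Σ P(s, s') v(s'). The sketch below assumes a reward that depends only on the current state; the two-state process and its numbers are hypothetical, and `evaluate_mrp` is a name invented for this example.

```python
# Hypothetical two-state Markov reward process.
S = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
R = {"sunny": 1.0, "rainy": -1.0}   # made-up reward for being in each state
gamma = 0.9

def evaluate_mrp(S, P, R, gamma, sweeps=500):
    """Estimate v(s) by iterating v(s) = R(s) + gamma * sum over s2 of P[s][s2] * v(s2)."""
    v = {s: 0.0 for s in S}
    for _ in range(sweeps):
        v = {s: R[s] + gamma * sum(P[s][s2] * v[s2] for s2 in S) for s in S}
    return v

values = evaluate_mrp(S, P, R, gamma)
print(values)
```

Because γ is below 1, each sweep shrinks the remaining error, so the estimates settle after enough sweeps.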

In conclusion, a Markov decision process provides a mathematical framework for modeling **decision making** in situations where outcomes are partly random and partly under the control of a **decision** maker.

Source: Wikipedia, UCL Lecture

**Q-learning Algorithm in Reinforcement Learning**