The Mathematical Backbone of AI Agents: Finite Markov Decision Processes

2026-01-19 • by Aritro Roy, Kiran Ghumare

Introduction

Modern AI agents repeatedly observe the world, make decisions, and adapt based on feedback. At the core of this interaction lies a simple but powerful mathematical abstraction: the Markov Decision Process (MDP). This article builds an intuition-first understanding of finite MDPs, starting from basic decision loops and gradually connecting them to real engineering systems and modern AI agents.

Decision-Making as a Feedback Loop

Consider a game where the goal is to maximize points.
At every turn, the player observes the current situation, chooses an action, the game responds, and feedback is received. Over time, the strategy evolves based on what works.

This pattern - observe, act, receive feedback, adapt - is universal.
A Markov Decision Process (MDP) formalizes this loop for systems that must make sequential decisions under uncertainty.

The Agent - Environment Interface

Reinforcement learning models the world using two entities:

  • Agent: the decision-maker
  • Environment: everything the agent does not control

At each discrete time step t:

  • The agent observes the current state s_t
  • The agent chooses an action a_t
  • The environment returns a reward r_{t+1} and the next state s_{t+1}

Components of a Markov Decision Process

A Markov Decision Process (MDP) is defined by the tuple (S, A, T, R, \pi).

States (S)

A state represents all information required for decision-making at a given time.
Importantly, a state is not reality; it is a useful abstraction of reality.

Actions (A)

Actions are the choices available to the agent in a given state.

Transition Function (T)

The transition dynamics of the environment are captured by:

P(s_{t+1} \mid s_t, a_t)

This probability models uncertainty in how the environment evolves after the agent takes an action.

Rewards (R)

The reward function encodes the objective of the agent:

R(s_t, a_t, s_{t+1})

Rewards specify what the agent should optimize, not how it should behave.

Policy (\pi)

A policy defines the behavior of the agent:

\pi(a \mid s) = P(A_t = a \mid S_t = s)

It describes the probability of selecting each action in a given state.
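As a small illustration (with made-up state and action names), a stochastic policy can be stored as a table of probabilities and sampled directly:

# A stochastic policy as a lookup table: state -> {action: probability}
import random

policy = {
    "s1": {"a1": 0.7, "a2": 0.3},
    "s2": {"a1": 0.1, "a2": 0.9},
}

def sample_action(state):
    # Draw an action according to pi(a | s)
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]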

The Reinforcement Learning Loop

[Figure: the agent-environment interaction loop]

The agent improves its policy over time to maximize long-term reward.

Engineering Intuition

Consider an agent controlling a character in a tactical game.

  • State: health, ammo, enemy positions
  • Action: shoot, advance, take cover
  • Reward: mission progress
  • Policy: learned tactical strategy

Aggressive actions may yield short-term gains but reduce long-term success. MDPs capture this trade-off explicitly.


A Canonical Example: The Recycling Robot

We now consider a classic example from the reinforcement learning literature: the recycling robot.
Despite its simplicity, this example captures the core challenges of sequential decision-making under uncertainty and illustrates the importance of abstraction in state design.

Problem Description

A mobile robot operates in an office environment and collects empty soda cans for recycling.
The robot runs on a rechargeable battery and must decide how to behave based on its current energy level.

Its objective is to maximize the total number of cans collected over time while avoiding complete battery depletion.

Crucially, the learning agent is not the entire robot.
The agent is responsible only for high-level decision-making, while the environment includes low-level dynamics such as motors, sensors, battery discharge, and the behavior of people in the office.

Action Space

At each decision point, the agent may choose one of the following actions:

  • Search: actively roam the office looking for cans. This action has a high probability of yielding reward, but consumes battery power.
  • Wait: remain idle and wait for someone to bring a can. This action yields smaller expected reward and consumes little or no battery.
  • Recharge: move to the charging station and recharge the battery. This action restores energy but produces no immediate reward.

State Space

The agent observes a single abstract state variable: the battery level.

S = \{\text{High}, \text{Low}\}

Here, High indicates sufficient battery charge, while Low indicates that the battery is close to depletion.
Many real-world details, such as exact voltage, distance traveled, or office layout, are deliberately ignored.

Key insight:
A state is not a complete description of the world, but a decision-relevant abstraction of it.

Reward Structure

The reward function reflects the robot's objective of collecting cans efficiently:

  • Search: yields a positive expected reward corresponding to cans collected
  • Wait: yields a small positive reward
  • Recharge: yields zero reward
  • Battery depletion: incurs a large negative reward (failure)

Rewards are stochastic: searching does not guarantee finding a can, and battery discharge is probabilistic.

Transition Dynamics

The environment evolves probabilistically based on the current state and chosen action.

From a High battery state:

  • Search may keep the battery High or transition it to Low
  • Wait leaves the battery High
  • Recharge leaves the battery High

From a Low battery state:

  • Search risks battery depletion and failure
  • Wait leaves the battery Low
  • Recharge transitions the battery to High

State Transition Diagram

[Figure: state transition diagram for the recycling robot]

Policy Intuition

The robot faces a tradeoff between short-term reward and long-term survival. Searching aggressively increases immediate reward but risks battery depletion, while recharging sacrifices short-term gains for future stability.

An optimal policy typically:

  • Searches when the battery level is High
  • Recharges when the battery level is Low
  • Avoids risky actions near failure states

This balance between exploitation and caution is precisely what Markov Decision Processes are designed to capture.

Finite Markov Decision Processes

When both the state space and the action space are finite, the system is called a finite Markov Decision Process (MDP).
Such MDPs are tractable and admit strong theoretical guarantees.

A finite MDP is fully defined by the joint transition distribution:

p(s', r \mid s, a) = P(S_{t+1} = s', \; R_{t+1} = r \mid S_t = s, \; A_t = a)

Returns and Discounting

Agents optimize cumulative return, defined as:

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

The discount factor \gamma \in [0, 1] controls how farsighted the agent is:

  • \gamma \approx 0 emphasizes immediate rewards
  • \gamma \approx 1 emphasizes long-term rewards

The Markov Property

The Markov assumption states that the future depends only on the present state and action, not on the full history.

P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_0, a_0, \dots, s_t, a_t)

In other words, once the current state s_t is known, past states and actions provide no additional information about the future.

While rarely perfect in real-world systems, this assumption enables tractable reasoning, planning, and learning, and forms the foundation of Markov Decision Processes.


The Bellman Expectation Equation

Given a fixed policy \pi, the value function V^\pi(s) represents the expected return when starting from state s and following policy \pi thereafter.

V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

This equation captures the recursive nature of sequential decision-making:
the value of a state equals the expected immediate reward plus the discounted value of future states.

Intuition Behind the Bellman Expectation Equation

The Bellman expectation equation is often perceived as abstract or intimidating when first encountered.
At its core, however, it is simply an application of expected value over a decision tree.

For reference, we restate the equation:

V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

This equation answers a single question:

If the agent starts in state s and follows policy \pi, what is the expected long-term return?

Breaking the Equation into Decisions and Outcomes

The structure of the Bellman expectation equation mirrors a two-level decision process:

  1. The agent first selects an action according to its policy \pi(a \mid s)
  2. The environment then responds by transitioning to a next state s' according to P(s' \mid s, a)

At each resulting outcome, the agent receives an immediate reward and continues from the new state.

Tree-Based Interpretation

Conceptually, evaluating V^\pi(s) is equivalent to computing the expected value of all root-to-leaf paths in a decision tree:

  • The root represents the current state ss
  • The first branching corresponds to action selection under policy Ο€\pi
  • The second branching corresponds to stochastic environment transitions
  • Each leaf contributes an immediate reward plus the discounted value of the next state

Decision Tree Visualization

[Figure: decision tree for the Bellman expectation equation]

Each root-to-leaf path corresponds to one possible future trajectory. The value of the root state is the probability-weighted average of all such paths.

Mapping the Tree to the Equation

Each component of the Bellman expectation equation corresponds directly to an element of the decision tree:

  • V^\pi(s): value at the root node
  • \pi(a \mid s): probability of choosing an action branch
  • P(s' \mid s, a): probability of an outcome branch
  • R(s,a,s'): immediate reward at the leaf
  • \gamma V^\pi(s'): value of the subtree rooted at s'
  • \sum: averaging over all paths
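To make the mapping concrete, the following sketch evaluates the root of a small two-level tree with made-up probabilities, rewards, and next-state values:

# Expected value at the root of a two-level decision tree (illustrative numbers)
gamma_tree = 0.9
pi = {"a1": 0.7, "a2": 0.3}                                         # pi(a | s): action branches
P_tree = {"a1": {"s1": 0.3, "s2": 0.7}, "a2": {"s2": 1.0}}           # P(s' | s, a): outcome branches
R_tree = {("a1", "s1"): 1.0, ("a1", "s2"): 1.5, ("a2", "s2"): 0.5}   # leaf rewards
V_next = {"s1": 2.0, "s2": 4.0}                                      # values of subtrees rooted at s'

# V(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
root_value = sum(
    pi[a] * sum(P_tree[a][s_next] * (R_tree[(a, s_next)] + gamma_tree * V_next[s_next])
                for s_next in P_tree[a])
    for a in pi
)
print(root_value)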

Concrete Example: Recycling Robot

Consider the recycling robot in the High battery state.
Suppose its policy is:

\pi(\text{Search} \mid \text{High}) = 0.7, \qquad \pi(\text{Wait} \mid \text{High}) = 0.3

If the robot chooses Search, the battery transitions according to:

P(\text{Low} \mid \text{High}, \text{Search}) = 0.3, \qquad P(\text{High} \mid \text{High}, \text{Search}) = 0.7

Choosing Wait always keeps the battery in the High state.

The corresponding value computation is:

\begin{aligned}
V^\pi(\text{High}) =\;& 0.7 \Big[ 0.3 \big(R + \gamma V^\pi(\text{Low})\big) + 0.7 \big(R + \gamma V^\pi(\text{High})\big) \Big] \\
&+ 0.3 \Big[ 1.0 \big(R + \gamma V^\pi(\text{High})\big) \Big]
\end{aligned}

This expression is simply the expected value of all branches in the corresponding decision tree.


Key Takeaway

The Bellman expectation equation does not introduce a new kind of logic.
It computes the expected return of a decision tree, compressed into a recursive equation.

The power of Bellman equations lies in their ability to replace exponentially large trees with compact value functions, enabling efficient planning and learning.

Policy Evaluation (Known Model)

When the transition dynamics and reward function of the environment are known, the value function for a fixed policy can be computed iteratively using the Bellman expectation equation.

# Policy Evaluation for the Recycling Robot

states = ["HIGH", "LOW"]
gamma = 0.9

# Transition model
P = {
    ("HIGH", "SEARCH"): [("HIGH", 0.7), ("LOW", 0.3)],
    ("HIGH", "WAIT"): [("HIGH", 1.0)],
    ("HIGH", "RECHARGE"): [("HIGH", 1.0)],
    ("LOW", "SEARCH"): [("LOW", 0.6), ("FAIL", 0.4)],
    ("LOW", "WAIT"): [("LOW", 1.0)],
    ("LOW", "RECHARGE"): [("HIGH", 1.0)],
}

# Reward function
R = {
    ("HIGH", "SEARCH"): 1.5,
    ("HIGH", "WAIT"): 0.5,
    ("HIGH", "RECHARGE"): 0.0,
    ("LOW", "SEARCH"): -2.0,
    ("LOW", "WAIT"): 0.2,
    ("LOW", "RECHARGE"): 0.0,
}

def evaluate_policy(policy, iterations=500):
    # Initialize value function to zero for all states
    V = {s: 0.0 for s in states}

    # Apply repeated Bellman expectation sweeps (a fixed number of iterations)
    for _ in range(iterations):
        for s in states:
            a = policy(s)  # Action chosen under the policy
            expected_value = 0.0
            possible_outcomes = P[(s, a)]

            # Compute expected return over all possible transitions
            for s_next, prob in possible_outcomes:
                immediate_reward = R[(s, a)]
                future_value = V.get(s_next, 0.0)

                expected_value += prob * (
                    immediate_reward + gamma * future_value
                )

            V[s] = expected_value

    return V
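
As a usage sketch, the following evaluates a fixed policy that searches when the battery is high and recharges when it is low, using the model defined above:

# A fixed policy: search when HIGH, recharge when LOW
def conservative_policy(state):
    return "SEARCH" if state == "HIGH" else "RECHARGE"

V = evaluate_policy(conservative_policy)
print(V)  # approximately {'HIGH': 11.8, 'LOW': 10.6} under this model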

Bellman Optimality Equation

To compute optimal behavior, we replace the policy-weighted expectation in the Bellman expectation equation with a maximization over actions.

V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^*(s') \right]

This equation defines the optimal value function.


Intuition Behind the Bellman Optimality Equation

While the Bellman expectation equation evaluates the value of a state under a fixed policy, the Bellman optimality equation answers a stronger question:

What is the best possible long-term value achievable from a state, assuming the agent always acts optimally?

To answer this question, the policy-weighted average over actions is replaced with a maximization:

V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^*(s') \right]

This equation formally defines the optimal value function.


From Expectation to Optimization

The difference between the two Bellman equations is subtle but profound:

  • Bellman Expectation: averages over actions according to a policy
  • Bellman Optimality: assumes the agent always chooses the best possible action

Conceptually, the agent no longer asks:

"What will I do under my current policy?"

Instead, it asks:

"What should I do to maximize long-term reward?"


Decision Tree Interpretation

The Bellman optimality equation can be interpreted as a decision tree in which the agent performs action selection by choosing the best branch:

  • The root represents the current state ss
  • Each possible action corresponds to a subtree
  • The environment expands each subtree probabilistically
  • The agent selects the subtree with the highest expected

Optimal Tree Diagram

[Figure: decision tree for the Bellman optimality equation]

Each action defines a complete expectation over future outcomes. The optimal value function selects the action whose subtree has the highest expected value.

Mapping the Tree to the Equation

Each component of the Bellman optimality equation corresponds directly to an element of the decision tree:

  • V^*(s): value of the best subtree rooted at s
  • \max_a: selection of the best action branch
  • \sum_{s'} P(s' \mid s, a): expectation over environment outcomes
  • R(s,a,s'): immediate reward at leaf nodes
  • \gamma V^*(s'): value of future subtrees

Concrete Example: Recycling Robot

Consider the recycling robot in the Low battery state.

The robot has two reasonable actions:

  • Search: high immediate reward but a significant risk of battery failure
  • Recharge: no immediate reward, but safely restores the battery

The Bellman optimality equation compares the expected return of both actions:

Q^*(\text{Low}, \text{Search}) = \sum_{s'} P(s' \mid \text{Low}, \text{Search}) \left[ R + \gamma V^*(s') \right]

Q^*(\text{Low}, \text{Recharge}) = \sum_{s'} P(s' \mid \text{Low}, \text{Recharge}) \left[ R + \gamma V^*(s') \right]

The optimal value for the Low battery state is then:

V^*(\text{Low}) = \max \Big\{ Q^*(\text{Low}, \text{Search}), Q^*(\text{Low}, \text{Recharge}) \Big\}

Even if searching occasionally yields higher immediate reward, the increased risk of failure reduces its long-term value, making recharging the optimal choice.
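
Plugging in the transition model and rewards from the code above (with \gamma = 0.9), the optimal values work out to roughly V^*(\text{High}) \approx 11.8 and V^*(\text{Low}) \approx 10.6 (this can be verified with the value-iteration code in the next section), giving an illustrative comparison:

Q^*(\text{Low}, \text{Search}) \approx -2.0 + 0.9 \big( 0.6 \times 10.6 + 0.4 \times 0 \big) \approx 3.7

Q^*(\text{Low}, \text{Recharge}) \approx 0.0 + 0.9 \times 11.8 \approx 10.6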

Key Insight

The Bellman optimality equation performs local maximization at each state, assuming all future decisions are also optimal.

Value Iteration

Value iteration applies the Bellman optimality update repeatedly in order to converge to the optimal value function V^*.

# Value Iteration

actions = ["SEARCH", "WAIT", "RECHARGE"]

def value_iteration(iterations=50):
    # Initialize value function to zero for all states
    V = {s: 0.0 for s in states}

    # Repeat Bellman optimality updates
    for _ in range(iterations):

        # Iterate over all states (HIGH, LOW)
        for s in states:
            best_value_found = -float("inf")

            # Evaluate every possible action
            for a in actions:
                current_action_value = 0.0
                possible_outcomes = P[(s, a)]

                for s_next, prob in possible_outcomes:
                    immediate_reward = R[(s, a)]
                    future_value = V.get(s_next, 0.0)

                    current_action_value += prob * (
                        immediate_reward + gamma * future_value
                    )

                # Keep the best action value
                if current_action_value > best_value_found:
                    best_value_found = current_action_value

            # Update the value of the state
            V[s] = best_value_found

    return V
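
Once the values have converged, the optimal policy can be recovered by a greedy one-step lookahead; the following sketch reuses the model and value_iteration defined above:

def greedy_policy(V):
    # For each state, choose the action with the highest one-step lookahead value
    policy = {}
    for s in states:
        best_action, best_value = None, -float("inf")
        for a in actions:
            q = sum(prob * (R[(s, a)] + gamma * V.get(s_next, 0.0))
                    for s_next, prob in P[(s, a)])
            if q > best_value:
                best_action, best_value = a, q
        policy[s] = best_action
    return policy

V_star = value_iteration()
print(greedy_policy(V_star))  # {'HIGH': 'SEARCH', 'LOW': 'RECHARGE'} under this model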

Q-Learning (Unknown Model)

When transition probabilities are unknown, the agent can learn directly from experience using Q-learning.

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

Intuition Behind Q-Learning

When the transition dynamics and reward function of an environment are unknown, planning methods such as value iteration are no longer applicable.
In such settings, the agent must learn directly from interaction with the environment.

Q-learning is a model-free reinforcement learning algorithm that learns optimal behavior purely from experience, without requiring knowledge of transition probabilities or rewards.

From State Values to Action Values

Recall that the optimal value function satisfies the Bellman optimality equation:

V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^*(s') \right]

Q-learning shifts the focus from states to state-action pairs.
Instead of estimating V^*(s) directly, the agent learns the optimal action-value function:

Q^*(s,a) = \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]

Once Q^*(s,a) is known, the optimal policy follows immediately:

\pi^*(s) = \arg\max_a Q^*(s,a)

The Q-Learning Update Rule

Q-learning approximates Q^*(s,a) using incremental updates based on experience.

After observing a single transition (s, a, r, s'), the update is:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

Intuition: Learning from One Step of Experience

Each term in the update rule has a clear interpretation:

  • Q(s,a)Q(s,a): current estimate of the value of taking action aa in state ss
  • rr: immediate reward received
  • max⁑aβ€²Q(sβ€²,aβ€²)\max_{a'} Q(s',a'): estimated value of the best future action
  • Ξ³\gamma: discount factor controlling long-term importance
  • Ξ±\alpha: learning rate controlling how aggressively updates are applied

The quantity inside the brackets is known as the temporal-difference (TD) error:

\delta = r + \gamma \max_{a'} Q(s',a') - Q(s,a)

The agent adjusts its estimate in the direction of this error.
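
A minimal tabular sketch of this update is shown below; it reuses the states, actions, and gamma from the earlier code, while the epsilon-greedy helper and the done flag for terminal transitions are illustrative additions:

# Tabular Q-learning sketch
import random

alpha = 0.1      # learning rate
epsilon = 0.1    # exploration probability

# Q-table initialized to zero for every state-action pair
Q = {(s, a): 0.0 for s in states for a in actions}

def epsilon_greedy(state):
    # Explore with probability epsilon, otherwise act greedily on current estimates
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, done):
    # One sampled Bellman optimality backup for the observed transition (s, a, r, s')
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example: after a failed search in the low-battery state (terminal transition)
# q_update("LOW", "SEARCH", -10, None, done=True)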

Tree-Based Interpretation of Q-Learning

Q-learning can be understood as learning the value of a one-step lookahead tree:

  • The root represents the current state–action pair (s,a)(s,a)
  • The environment expands the tree by transitioning to sβ€²s'
  • The agent assumes optimal behavior beyond sβ€²s'

Unlike Bellman expectation methods, Q-learning involves:

  • No averaging over actions
  • No explicit transition probabilities

The agent samples a single branch of the tree and updates its estimate accordingly.

Q-Learning Tree Diagram

(This can be visualized as a one-step decision tree where the agent evaluates the best possible action from the next state.)


The Q-learning update replaces the value of the root node with:

r + \gamma \max_{a'} Q(s',a')

This is known as a sampled Bellman optimality backup.


Concrete Example: Recycling Robot

Suppose the recycling robot is in the Low battery state and takes the action Search.

  • The robot receives a reward r=βˆ’10r = -10 if the battery fails
  • The episode terminates

The Q-learning update becomes:

Q(\text{Low}, \text{Search}) \leftarrow Q(\text{Low}, \text{Search}) + \alpha \left[ -10 - Q(\text{Low}, \text{Search}) \right]

Repeated experiences like this reduce the estimated value of searching in low-battery states, even if searching occasionally yields a positive reward.


Why Q-Learning Works

Q-learning converges to the optimal action-value function Q^*(s,a) under the following conditions:

  • All state-action pairs are explored infinitely often
  • The learning rate \alpha decays appropriately
  • The environment is stationary

Crucially, Q-learning is:

  • Off-policy: it learns the optimal policy regardless of the agent’s behavior policy
  • Model-free: it does not require knowledge of P(sβ€²βˆ£s,a)P(s' \mid s,a) or R(s,a,sβ€²)R(s,a,s')

Key Insight

Q-learning replaces full expectation over future outcomes with repeated sampling, gradually shaping value estimates toward optimality.

Each update is noisy and local, but over time the agent internalizes the structure of the optimal decision tree.

In essence, Q-learning is Bellman optimality learned one experience at a time.
It converges to the optimal policy without explicit knowledge of the transition model.


Partial Observability

If the agent does not observe the true battery state directly, the Markov assumption is violated.

# Noisy Observation Model
import random

def observe(true_state):
    # With 20% probability, report the wrong battery level
    if random.random() < 0.2:
        return "HIGH" if true_state == "LOW" else "LOW"
    return true_state

This motivates the use of belief states, memory, and recurrent architectures.
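
One simple way to cope is to maintain a belief, a probability distribution over the hidden state, and update it with Bayes' rule after each observation. The sketch below uses the 20% flip rate assumed above and ignores battery dynamics for brevity:

# Belief update over the hidden battery state (transition dynamics ignored for simplicity)
def update_belief(belief, observation):
    # P(observation | state): correct reading with probability 0.8, flipped with 0.2
    likelihood = {s: (0.8 if s == observation else 0.2) for s in ["HIGH", "LOW"]}
    unnormalized = {s: likelihood[s] * belief[s] for s in belief}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}

belief = {"HIGH": 0.5, "LOW": 0.5}              # start maximally uncertain
belief = update_belief(belief, observe("LOW"))  # belief shifts toward the observed reading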


Summary

Starting from a simple finite Markov Decision Process, we derived Bellman equations, implemented planning algorithms, and demonstrated learning from experience.

This progression illustrates how abstract mathematical definitions translate directly into working decision-making systems.


Where MDPs Break Down

Markov Decision Processes struggle with:

  • Partial observability
  • Non-stationary environments
  • Multi-agent interactions
  • Sparse or delayed rewards

These limitations motivate extensions such as Partially Observable MDPs (POMDPs) and hierarchical reinforcement learning.


Conclusion

Finite Markov Decision Processes are not a perfect model of intelligence.
They are a useful abstraction that allows us to reason about decision-making systems.

Understanding MDPs is less about equations and more about understanding how systems make decisions over time.


References

  • Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press.
  • Morales, M. Grokking Deep Reinforcement Learning. Manning Publications.