The Mathematical Backbone of AI Agents: Finite Markov Decision Processes

2026-01-19 • by Aritro Roy, Kiran Ghumare

Introduction

Modern AI agents repeatedly observe the world, make decisions, and adapt based on feedback. At the core of this interaction lies a simple but powerful mathematical abstraction: the Markov Decision Process (MDP). This article builds an intuition-first understanding of finite MDPs, starting from basic decision loops and gradually connecting them to real engineering systems and modern AI agents.

Decision-Making as a Feedback Loop

Consider a game where the goal is to maximize points.
At every turn, the player observes the current situation, chooses an action, the game responds, and feedback is received. Over time, the strategy evolves based on what works.

This pattern - observe, act, receive feedback, adapt - is universal.
A Markov Decision Process (MDP) formalizes this loop for systems that must make sequential decisions under uncertainty.

The Agent - Environment Interface

Reinforcement learning models the world using two entities:

  • Agent: the decision-maker
  • Environment: everything the agent does not control

At each discrete time step t:

  • The agent observes the current state s_t
  • The agent chooses an action a_t
  • The environment returns a reward r_{t+1} and the next state s_{t+1}

Components of a Markov Decision Process

A Markov Decision Process (MDP) is defined by the tuple (S, A, T, R, \pi).

States (S)

A state represents all information required for decision-making at a given time.
Importantly, a state is not reality; it is a useful abstraction of reality.

Actions (A)

Actions are the choices available to the agent in a given state.

Transition Function (T)

The transition dynamics of the environment are captured by:

P(s_{t+1} \mid s_t, a_t)

This probability models uncertainty in how the environment evolves after the agent takes an action.

Rewards (R)

The reward function encodes the objective of the agent:

R(s_t, a_t, s_{t+1})

Rewards specify what the agent should optimize, not how it should behave.

Policy (\pi)

A policy defines the behavior of the agent:

\pi(a \mid s) = P(A_t = a \mid S_t = s)

It describes the probability of selecting each action in a given state.
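As a small illustration (with made-up state and action names), a stochastic policy can be stored as a table of probabilities and sampled directly:

# A stochastic policy as a lookup table: state -> {action: probability}
import random

policy = {
    "s1": {"a1": 0.7, "a2": 0.3},
    "s2": {"a1": 0.1, "a2": 0.9},
}

def sample_action(state):
    # Draw an action according to pi(a | s)
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]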

The Reinforcement Learning Loop

[Figure: the agent-environment interaction loop]

The agent improves its policy over time to maximize long-term reward.

Engineering Intuition

Consider an agent controlling a character in a tactical game.

  • State: health, ammo, enemy positions
  • Action: shoot, advance, take cover
  • Reward: mission progress
  • Policy: learned tactical strategy

Aggressive actions may yield short-term gains but reduce long-term success. MDPs capture this trade-off explicitly.


A Canonical Example: The Recycling Robot

We now consider a classic example from the reinforcement learning literature: the recycling robot.
Despite its simplicity, this example captures the core challenges of sequential decision-making under uncertainty and illustrates the importance of abstraction in state design.

Problem Description

A mobile robot operates in an office environment and collects empty soda cans for recycling.
The robot runs on a rechargeable battery and must decide how to behave based on its current energy level.

Its objective is to maximize the total number of cans collected over time while avoiding complete battery depletion.

Crucially, the learning agent is not the entire robot.
The agent is responsible only for high-level decision-making, while the environment includes low-level dynamics such as motors, sensors, battery discharge, and the behavior of people in the office.

Action Space

At each decision point, the agent may choose one of the following actions:

  • Search: actively roam the office looking for cans. This action has a high probability of yielding reward, but consumes battery power.
  • Wait: remain idle and wait for someone to bring a can. This action yields smaller expected reward and consumes little or no battery.
  • Recharge: move to the charging station and recharge the battery. This action restores energy but produces no immediate reward.

State Space

The agent observes a single abstract state variable: the battery level.

S = \{\text{High}, \text{Low}\}

Here, High indicates sufficient battery charge, while Low indicates that the battery is close to depletion.
Many real-world details, such as exact voltage, distance traveled, or office layout, are deliberately ignored.

Key insight:
A state is not a complete description of the world, but a decision-relevant abstraction of it.

Reward Structure

The reward function reflects the robot's objective of collecting cans efficiently:

  • Search: yields a positive expected reward corresponding to cans collected
  • Wait: yields a small positive reward
  • Recharge: yields zero reward
  • Battery depletion: incurs a large negative reward (failure)

Rewards are stochastic: searching does not guarantee finding a can, and battery discharge is probabilistic.

Transition Dynamics

The environment evolves probabilistically based on the current state and chosen action.

From a High battery state:

  • Search may keep the battery High or transition it to Low
  • Wait leaves the battery High
  • Recharge leaves the battery High

From a Low battery state:

  • Search risks battery depletion and failure
  • Wait leaves the battery Low
  • Recharge transitions the battery to High

State Transition Diagram

[Figure: state transition diagram for the recycling robot]

Policy Intuition

The robot faces a tradeoff between short-term reward and long-term survival. Searching aggressively increases immediate reward but risks battery depletion, while recharging sacrifices short-term gains for future stability.

An optimal policy typically:

  • Searches when the battery level is High
  • Recharges when the battery level is Low
  • Avoids risky actions near failure states

This balance between exploitation and caution is precisely what Markov Decision Processes are designed to capture.

Finite Markov Decision Processes

When both the state space and the action space are finite, the system is called a finite Markov Decision Process (MDP).
Such MDPs are tractable and admit strong theoretical guarantees.

A finite MDP is fully defined by the joint transition distribution:

p(s', r \mid s, a) = P(S_{t+1} = s', \; R_{t+1} = r \mid S_t = s, \; A_t = a)

Returns and Discounting

Agents optimize cumulative return, defined as:

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

The discount factor \gamma \in [0, 1] controls how farsighted the agent is:

  • \gamma \approx 0 emphasizes immediate rewards
  • \gamma \approx 1 emphasizes long-term rewards

The Markov Property

The Markov assumption states that the future depends only on the present state and action, not on the full history.

P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_0, a_0, \dots, s_t, a_t)

In other words, once the current state s_t is known, past states and actions provide no additional information about the future.

While rarely perfect in real-world systems, this assumption enables tractable reasoning, planning, and learning, and forms the foundation of Markov Decision Processes.


The Bellman Expectation Equation

Given a fixed policy \pi, the value function V^\pi(s) represents the expected return when starting from state s and following policy \pi thereafter.

V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

This equation captures the recursive nature of sequential decision-making:
the value of a state equals the expected immediate reward plus the discounted value of future states.

Intuition Behind the Bellman Expectation Equation

The Bellman expectation equation is often perceived as abstract or intimidating when first encountered.
At its core, however, it is simply an application of expected value over a decision tree.

For reference, we restate the equation:

V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

This equation answers a single question:

If the agent starts in state s and follows policy \pi, what is the expected long-term return?

Breaking the Equation into Decisions and Outcomes

The structure of the Bellman expectation equation mirrors a two-level decision process:

  1. The agent first selects an action according to its policy \pi(a \mid s)
  2. The environment then responds by transitioning to a next state s' according to P(s' \mid s, a)

At each resulting outcome, the agent receives an immediate reward and continues from the new state.

Tree-Based Interpretation

Conceptually, evaluating V^\pi(s) is equivalent to computing the expected value of all root-to-leaf paths in a decision tree:

  • The root represents the current state ss
  • The first branching corresponds to action selection under policy Ο€\pi
  • The second branching corresponds to stochastic environment transitions
  • Each leaf contributes an immediate reward plus the discounted value of the next state

Decision Tree Visualization

[Figure: decision tree for the Bellman expectation equation]

Each root-to-leaf path corresponds to one possible future trajectory. The value of the root state is the probability-weighted average of all such paths.

Mapping the Tree to the Equation

Each component of the Bellman expectation equation corresponds directly to an element of the decision tree:

  • V^\pi(s): value at the root node
  • \pi(a \mid s): probability of choosing an action branch
  • P(s' \mid s, a): probability of an outcome branch
  • R(s,a,s'): immediate reward at the leaf
  • \gamma V^\pi(s'): value of the subtree rooted at s'
  • \sum: averaging over all paths
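To make the mapping concrete, the following sketch evaluates the root of a small two-level tree with made-up probabilities, rewards, and next-state values:

# Expected value at the root of a two-level decision tree (illustrative numbers)
gamma_tree = 0.9
pi = {"a1": 0.7, "a2": 0.3}                                         # pi(a | s): action branches
P_tree = {"a1": {"s1": 0.3, "s2": 0.7}, "a2": {"s2": 1.0}}           # P(s' | s, a): outcome branches
R_tree = {("a1", "s1"): 1.0, ("a1", "s2"): 1.5, ("a2", "s2"): 0.5}   # leaf rewards
V_next = {"s1": 2.0, "s2": 4.0}                                      # values of subtrees rooted at s'

# V(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
root_value = sum(
    pi[a] * sum(P_tree[a][s_next] * (R_tree[(a, s_next)] + gamma_tree * V_next[s_next])
                for s_next in P_tree[a])
    for a in pi
)
print(root_value)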

Concrete Example: Recycling Robot

Consider the recycling robot in the High battery state.
Suppose its policy is:

\pi(\text{Search} \mid \text{High}) = 0.7, \qquad \pi(\text{Wait} \mid \text{High}) = 0.3

If the robot chooses Search, the battery transitions according to:

P(\text{Low} \mid \text{High}, \text{Search}) = 0.3, \qquad P(\text{High} \mid \text{High}, \text{Search}) = 0.7

Choosing Wait always keeps the battery in the High state.

The corresponding value computation is:

\begin{aligned}
V^\pi(\text{High}) =\;& 0.7 \Big[ 0.3 \big(R + \gamma V^\pi(\text{Low})\big) + 0.7 \big(R + \gamma V^\pi(\text{High})\big) \Big] \\
&+ 0.3 \Big[ 1.0 \big(R + \gamma V^\pi(\text{High})\big) \Big]
\end{aligned}

This expression is simply the expected value of all branches in the corresponding decision tree.


Key Takeaway

The Bellman expectation equation does not introduce a new kind of logic.
It computes the expected return of a decision tree, compressed into a recursive equation.

The power of Bellman equations lies in their ability to replace exponentially large trees with compact value functions, enabling efficient planning and learning.

Policy Evaluation (Known Model)

When the transition dynamics and reward function of the environment are known, the value function for a fixed policy can be computed iteratively using the Bellman expectation equation.

# Policy Evaluation for the Recycling Robot

states = ["HIGH", "LOW"]
gamma = 0.9

# Transition model
P = {
    ("HIGH", "SEARCH"): [("HIGH", 0.7), ("LOW", 0.3)],
    ("HIGH", "WAIT"): [("HIGH", 1.0)],
    ("HIGH", "RECHARGE"): [("HIGH", 1.0)],
    ("LOW", "SEARCH"): [("LOW", 0.6), ("FAIL", 0.4)],
    ("LOW", "WAIT"): [("LOW", 1.0)],
    ("LOW", "RECHARGE"): [("HIGH", 1.0)],
}

# Reward function
R = {
    ("HIGH", "SEARCH"): 1.5,
    ("HIGH", "WAIT"): 0.5,
    ("HIGH", "RECHARGE"): 0.0,
    ("LOW", "SEARCH"): -2.0,
    ("LOW", "WAIT"): 0.2,
    ("LOW", "RECHARGE"): 0.0,
}

def evaluate_policy(policy, iterations=500):
    # Initialize value function to zero for all states
    V = {s: 0.0 for s in states}

    # Apply repeated Bellman expectation sweeps (a fixed number of iterations)
    for _ in range(iterations):
        for s in states:
            a = policy(s)  # Action chosen under the policy
            expected_value = 0.0
            possible_outcomes = P[(s, a)]

            # Compute expected return over all possible transitions
            for s_next, prob in possible_outcomes:
                immediate_reward = R[(s, a)]
                future_value = V.get(s_next, 0.0)

                expected_value += prob * (
                    immediate_reward + gamma * future_value
                )

            V[s] = expected_value

    return V
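
As a usage sketch, the following evaluates a fixed policy that searches when the battery is high and recharges when it is low, using the model defined above:

# A fixed policy: search when HIGH, recharge when LOW
def conservative_policy(state):
    return "SEARCH" if state == "HIGH" else "RECHARGE"

V = evaluate_policy(conservative_policy)
print(V)  # approximately {'HIGH': 11.8, 'LOW': 10.6} under this model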

Bellman Optimality Equation

To compute optimal behavior, we replace the policy-weighted expectation in the Bellman expectation equation with a maximization over actions.

V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^*(s') \right]

This equation defines the optimal value function.


Intuition Behind the Bellman Optimality Equation

While the Bellman expectation equation evaluates the value of a state under a fixed policy, the Bellman optimality equation answers a stronger question:

What is the best possible long-term value achievable from a state, assuming the agent always acts optimally?

To answer this question, the policy-weighted average over actions is replaced with a maximization:

V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^*(s') \right]

This equation formally defines the optimal value function.


From Expectation to Optimization

The difference between the two Bellman equations is subtle but profound:

  • Bellman Expectation: averages over actions according to a policy
  • Bellman Optimality: assumes the agent always chooses the best possible action

Conceptually, the agent no longer asks:

"What will I do under my current policy?"

Instead, it asks:

"What should I do to maximize long-term reward?"


Decision Tree Interpretation

The Bellman optimality equation can be interpreted as a decision tree in which the agent performs action selection by choosing the best branch:

  • The root represents the current state ss
  • Each possible action corresponds to a subtree
  • The environment expands each subtree probabilistically
  • The agent selects the subtree with the highest expected

Optimal Tree Diagram

[Figure: decision tree for the Bellman optimality equation]

Each action defines a complete expectation over future outcomes. The optimal value function selects the action whose subtree has the highest expected value.

Mapping the Tree to the Equation

Each component of the Bellman optimality equation corresponds directly to an element of the decision tree:

  • V^*(s): value of the best subtree rooted at s
  • \max_a: selection of the best action branch
  • \sum_{s'} P(s' \mid s, a): expectation over environment outcomes
  • R(s,a,s'): immediate reward at leaf nodes
  • \gamma V^*(s'): value of future subtrees

Concrete Example: Recycling Robot

Consider the recycling robot in the Low battery state.

The robot has two reasonable actions:

  • Search: high immediate reward but a significant risk of battery failure
  • Recharge: no immediate reward, but safely restores the battery

The Bellman optimality equation compares the expected return of both actions:

Q^*(\text{Low}, \text{Search}) = \sum_{s'} P(s' \mid \text{Low}, \text{Search}) \left[ R + \gamma V^*(s') \right]

Q^*(\text{Low}, \text{Recharge}) = \sum_{s'} P(s' \mid \text{Low}, \text{Recharge}) \left[ R + \gamma V^*(s') \right]

The optimal value for the Low battery state is then:

V^*(\text{Low}) = \max \Big\{ Q^*(\text{Low}, \text{Search}), Q^*(\text{Low}, \text{Recharge}) \Big\}

Even if searching occasionally yields higher immediate reward, the increased risk of failure reduces its long-term value, making recharging the optimal choice.
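
Plugging in the transition model and rewards from the code above (with \gamma = 0.9), the optimal values work out to roughly V^*(\text{High}) \approx 11.8 and V^*(\text{Low}) \approx 10.6 (this can be verified with the value-iteration code in the next section), giving an illustrative comparison:

Q^*(\text{Low}, \text{Search}) \approx -2.0 + 0.9 \big( 0.6 \times 10.6 + 0.4 \times 0 \big) \approx 3.7

Q^*(\text{Low}, \text{Recharge}) \approx 0.0 + 0.9 \times 11.8 \approx 10.6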

Key Insight

The Bellman optimality equation performs local maximization at each state, assuming all future decisions are also optimal.

Value Iteration

Value iteration applies the Bellman optimality update repeatedly in order to converge to the optimal value function V^*.

# Value Iteration

actions = ["SEARCH", "WAIT", "RECHARGE"]

def value_iteration(iterations=50):
    # Initialize value function to zero for all states
    V = {s: 0.0 for s in states}

    # Repeat Bellman optimality updates
    for _ in range(iterations):

        # Iterate over all states (HIGH, LOW)
        for s in states:
            best_value_found = -float("inf")

            # Evaluate every possible action
            for a in actions:
                current_action_value = 0.0
                possible_outcomes = P[(s, a)]

                for s_next, prob in possible_outcomes:
                    immediate_reward = R[(s, a)]
                    future_value = V.get(s_next, 0.0)

                    current_action_value += prob * (
                        immediate_reward + gamma * future_value
                    )

                # Keep the best action value
                if current_action_value > best_value_found:
                    best_value_found = current_action_value

            # Update the value of the state
            V[s] = best_value_found

    return V
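
Once the values have converged, the optimal policy can be recovered by a greedy one-step lookahead; the following sketch reuses the model and value_iteration defined above:

def greedy_policy(V):
    # For each state, choose the action with the highest one-step lookahead value
    policy = {}
    for s in states:
        best_action, best_value = None, -float("inf")
        for a in actions:
            q = sum(prob * (R[(s, a)] + gamma * V.get(s_next, 0.0))
                    for s_next, prob in P[(s, a)])
            if q > best_value:
                best_action, best_value = a, q
        policy[s] = best_action
    return policy

V_star = value_iteration()
print(greedy_policy(V_star))  # {'HIGH': 'SEARCH', 'LOW': 'RECHARGE'} under this model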

Q-Learning (Unknown Model)

When transition probabilities are unknown, the agent can learn directly from experience using Q-learning.

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

Intuition Behind Q-Learning

When the transition dynamics and reward function of an environment are unknown, planning methods such as value iteration are no longer applicable.
In such settings, the agent must learn directly from interaction with the environment.

Q-learning is a model-free reinforcement learning algorithm that learns optimal behavior purely from experience, without requiring knowledge of transition probabilities or rewards.

From State Values to Action Values

Recall that the optimal value function satisfies the Bellman optimality equation:

V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^*(s') \right]

Q-learning shifts the focus from states to state-action pairs.
Instead of estimating V^*(s) directly, the agent learns the optimal action-value function:

Q^*(s,a) = \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]

Once Q^*(s,a) is known, the optimal policy follows immediately:

\pi^*(s) = \arg\max_a Q^*(s,a)

The Q-Learning Update Rule

Q-learning approximates Q^*(s,a) using incremental updates based on experience.

After observing a single transition (s, a, r, s'), the update is:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

Intuition: Learning from One Step of Experience

Each term in the update rule has a clear interpretation:

  • Q(s,a)Q(s,a): current estimate of the value of taking action aa in state ss
  • rr: immediate reward received
  • max⁑aβ€²Q(sβ€²,aβ€²)\max_{a'} Q(s',a'): estimated value of the best future action
  • Ξ³\gamma: discount factor controlling long-term importance
  • Ξ±\alpha: learning rate controlling how aggressively updates are applied

The quantity inside the brackets is known as the temporal-difference (TD) error:

\delta = r + \gamma \max_{a'} Q(s',a') - Q(s,a)

The agent adjusts its estimate in the direction of this error.
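
A minimal tabular sketch of this update is shown below; it reuses the states, actions, and gamma from the earlier code, while the epsilon-greedy helper and the done flag for terminal transitions are illustrative additions:

# Tabular Q-learning sketch
import random

alpha = 0.1      # learning rate
epsilon = 0.1    # exploration probability

# Q-table initialized to zero for every state-action pair
Q = {(s, a): 0.0 for s in states for a in actions}

def epsilon_greedy(state):
    # Explore with probability epsilon, otherwise act greedily on current estimates
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, done):
    # One sampled Bellman optimality backup for the observed transition (s, a, r, s')
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example: after a failed search in the low-battery state (terminal transition)
# q_update("LOW", "SEARCH", -10, None, done=True)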

Tree-Based Interpretation of Q-Learning

Q-learning can be understood as learning the value of a one-step lookahead tree:

  • The root represents the current state–action pair (s,a)(s,a)
  • The environment expands the tree by transitioning to sβ€²s'
  • The agent assumes optimal behavior beyond sβ€²s'

Unlike Bellman expectation methods, Q-learning involves:

  • No averaging over actions
  • No explicit transition probabilities

The agent samples a single branch of the tree and updates its estimate accordingly.

Q-Learning Tree Diagram

(This can be visualized as a one-step decision tree where the agent evaluates the best possible action from the next state.)


The Q-learning update replaces the value of the root node with:

r + \gamma \max_{a'} Q(s',a')

This is known as a sampled Bellman optimality backup.


Concrete Example: Recycling Robot

Suppose the recycling robot is in the Low battery state and takes the action Search.

  • The robot receives a reward r=βˆ’10r = -10 if the battery fails
  • The episode terminates

The Q-learning update becomes:

Q(\text{Low}, \text{Search}) \leftarrow Q(\text{Low}, \text{Search}) + \alpha \left[ -10 - Q(\text{Low}, \text{Search}) \right]

Repeated experiences like this reduce the estimated value of searching in low-battery states, even if searching occasionally yields a positive reward.


Why Q-Learning Works

Q-learning converges to the optimal action-value function Q^*(s,a) under the following conditions:

  • All state-action pairs are explored infinitely often
  • The learning rate \alpha decays appropriately
  • The environment is stationary

Crucially, Q-learning is:

  • Off-policy: it learns the optimal policy regardless of the agent’s behavior policy
  • Model-free: it does not require knowledge of P(sβ€²βˆ£s,a)P(s' \mid s,a) or R(s,a,sβ€²)R(s,a,s')

Key Insight

Q-learning replaces full expectation over future outcomes with repeated sampling, gradually shaping value estimates toward optimality.

Each update is noisy and local, but over time the agent internalizes the structure of the optimal decision tree.

In essence, Q-learning is Bellman optimality learned one experience at a time.
It converges to the optimal policy without explicit knowledge of the transition model.


Partial Observability

If the agent does not observe the true battery state directly, the Markov assumption is violated.

# Noisy Observation Model
import random

def observe(true_state):
    # With 20% probability, report the wrong battery level
    if random.random() < 0.2:
        return "HIGH" if true_state == "LOW" else "LOW"
    return true_state

This motivates the use of belief states, memory, and recurrent architectures.
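
One simple way to cope is to maintain a belief, a probability distribution over the hidden state, and update it with Bayes' rule after each observation. The sketch below uses the 20% flip rate assumed above and ignores battery dynamics for brevity:

# Belief update over the hidden battery state (transition dynamics ignored for simplicity)
def update_belief(belief, observation):
    # P(observation | state): correct reading with probability 0.8, flipped with 0.2
    likelihood = {s: (0.8 if s == observation else 0.2) for s in ["HIGH", "LOW"]}
    unnormalized = {s: likelihood[s] * belief[s] for s in belief}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}

belief = {"HIGH": 0.5, "LOW": 0.5}              # start maximally uncertain
belief = update_belief(belief, observe("LOW"))  # belief shifts toward the observed reading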


Summary

Starting from a simple finite Markov Decision Process, we derived Bellman equations, implemented planning algorithms, and demonstrated learning from experience.

This progression illustrates how abstract mathematical definitions translate directly into working decision-making systems.


Where MDPs Break Down

Markov Decision Processes struggle with:

  • Partial observability
  • Non-stationary environments
  • Multi-agent interactions
  • Sparse or delayed rewards

These limitations motivate extensions such as Partially Observable MDPs (POMDPs) and hierarchical reinforcement learning.


Conclusion

Finite Markov Decision Processes are not a perfect model of intelligence.
They are a useful abstraction that allows us to reason about decision-making systems.

Understanding MDPs is less about equations and more about understanding how systems make decisions over time.


References

  • Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press.
  • Morales, M. Grokking Deep Reinforcement Learning. Manning Publications.