The Mathematical Backbone of AI Agents: Finite Markov Decision Processes
Introduction
Modern AI agents repeatedly observe the world, make decisions, and adapt based on feedback. At the core of this interaction lies a simple but powerful mathematical abstraction: the Markov Decision Process (MDP). This article builds an intuition-first understanding of finite MDPs, starting from basic decision loops and gradually connecting them to real engineering systems and modern AI agents.
Decision-Making as a Feedback Loop
Consider a game where the goal is to maximize points.
At every turn, the player observes the current situation, chooses an action, the game responds, and feedback is received. Over time, the strategy evolves based on what works.
This pattern - observe, act, receive feedback, adapt - is universal.
A Markov Decision Process (MDP) formalizes this loop for systems that must make sequential decisions under uncertainty.
The Agent-Environment Interface
Reinforcement learning models the world using two entities:
- Agent: the decision-maker
- Environment: everything the agent does not control
At discrete time step $t$:
- The agent observes the state $S_t$
- Chooses an action $A_t$
- The environment returns a reward $R_{t+1}$ and the next state $S_{t+1}$
Components of a Markov Decision Process
A Markov Decision Process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
States ($\mathcal{S}$)
A state represents all information required for decision-making at a given time.
Importantly, a state is not reality; it is a useful abstraction of reality.
Actions ($\mathcal{A}$)
Actions are the choices available to the agent in a given state.
Transition Function ($P$)
The transition dynamics of the environment are captured by:
$$P(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s,\, A_t = a\}$$
This probability models uncertainty in how the environment evolves after the agent takes an action.
Rewards ($R$)
The reward function encodes the objective of the agent:
$$R(s, a) = \mathbb{E}\bigl[R_{t+1} \mid S_t = s,\, A_t = a\bigr]$$
Rewards specify what the agent should optimize, not how it should behave.
Policy ($\pi$)
A policy defines the behavior of the agent:
$$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$$
It describes the probability of selecting each action in a given state.
The Reinforcement Learning Loop
The agent improves its policy over time to maximize long-term reward.
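As a rough sketch, this loop can be written in a few lines of Python. The env and agent objects and their methods (reset, step, act, learn) are illustrative assumptions rather than a specific library API.

# A minimal sketch of the observe-act-feedback loop (env/agent interfaces are illustrative)
def run_episode(env, agent, max_steps=100):
    state = env.reset()  # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                       # choose an action under the current policy
        next_state, reward, done = env.step(action)     # the environment responds
        agent.learn(state, action, reward, next_state)  # adapt based on feedback
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward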
Engineering Intuition
Consider an agent controlling a character in a tactical game.
- State: health, ammo, enemy positions
- Action: shoot, advance, take cover
- Reward: mission progress
- Policy: learned tactical strategy
Aggressive actions may yield short-term gains but reduce long-term success. MDPs capture this trade-off explicitly.
A Canonical Example: The Recycling Robot
We now consider a classic example from the reinforcement learning literature: the recycling robot.
Despite its simplicity, this example captures the core challenges of sequential decision-making under uncertainty and illustrates the importance of abstraction in state design.
Problem Description
A mobile robot operates in an office environment and collects empty soda cans for recycling.
The robot runs on a rechargeable battery and must decide how to behave based on its current energy level.
Its objective is to maximize the total number of cans collected over time while avoiding complete battery depletion.
Crucially, the learning agent is not the entire robot.
The agent is responsible only for high-level decision-making, while the environment includes low-level dynamics such as motors, sensors, battery discharge, and the behavior of people in the office.
Action Space
At each decision point, the agent may choose one of the following actions:
- Search: actively roam the office looking for cans. This action has a high probability of yielding reward, but consumes battery power.
- Wait: remain idle and wait for someone to bring a can. This action yields smaller expected reward and consumes little or no battery.
- Recharge: move to the charging station and recharge the battery. This action restores energy but produces no immediate reward.
State Space
The agent observes a single abstract state variable: the battery level, with state space $\mathcal{S} = \{\text{High}, \text{Low}\}$.
Here, High indicates sufficient battery charge, while Low indicates that the battery is close to depletion.
Many real-world details, such as exact voltage, distance traveled, or office layout, are deliberately ignored.
Key insight:
A state is not a complete description of the world, but a decision-relevant abstraction of it.
Reward Structure
The reward function reflects the robot's objective of collecting cans efficiently:
- Search: yields a positive expected reward corresponding to cans collected
- Wait: yields a small positive reward
- Recharge: yields zero reward
- Battery depletion: incurs a large negative reward (failure)
Rewards are stochastic: searching does not guarantee finding a can, and battery discharge is probabilistic.
Transition Dynamics
The environment evolves probabilistically based on the current state and chosen action.
From a High battery state:
- Search may keep the battery High or transition it to Low
- Wait leaves the battery High
- Recharge leaves the battery High
From a Low battery state:
- Search risks battery depletion and failure
- Wait leaves the battery Low
- Recharge transitions the battery to High
State Transition Diagram
(The dynamics above can be drawn as a two-node diagram over the High and Low battery states, with one probabilistic arc per action and a failure outcome when searching on a Low battery.)
Policy Intuition
The robot faces a tradeoff between short-term reward and long-term survival. Searching aggressively increases immediate reward but risks battery depletion, while recharging sacrifices short-term gains for future stability.
An optimal policy typically:
- Searches when the battery level is High
- Recharges when the battery level is Low
- Avoids risky actions near failure states
This balance between exploitation and caution is precisely what Markov Decision Processes are designed to capture.
Finite Markov Decision Processes
When both the state space and the action space are finite, the system is called a finite Markov Decision Process (MDP).
Such MDPs are tractable and admit strong theoretical guarantees.
A finite MDP is fully defined by the joint transition distribution:
$$p(s', r \mid s, a) = \Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a\}$$
Returns and Discounting
Agents optimize cumulative return, defined as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The discount factor $\gamma \in [0, 1]$ controls how farsighted the agent is:
- $\gamma$ close to 0 emphasizes immediate rewards
- $\gamma$ close to 1 emphasizes long-term rewards
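To make the definition concrete, the return of a finite reward sequence can be computed directly; the sketch below assumes an arbitrary list of rewards and gamma = 0.9.

# Discounted return for a finite reward sequence (a sketch)
def discounted_return(rewards, gamma=0.9):
    G = 0.0
    # Work backwards so each reward ends up discounted by gamma^k
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example: three rewards of 1.0 -> 1.0 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))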
The Markov Property
The Markov assumption states that the future depends only on the present state and action, not on the full history.
In other words, once the current state is known, past states and actions provide no additional information about the future.
While rarely perfect in real-world systems, this assumption enables tractable reasoning, planning, and learning, and forms the foundation of Markov Decision Processes.
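Formally, using the notation above, the Markov property states that conditioning on the full history adds nothing beyond the current state and action:
$$\Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t, A_t\} = \Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0\}$$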
The Bellman Expectation Equation
Given a fixed policy $\pi$, the value function $v_\pi(s)$ represents the expected return when starting from state $s$ and following policy $\pi$ thereafter:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]$$
This equation captures the recursive nature of sequential decision-making:
the value of a state equals the expected immediate reward plus the discounted value of future states.
Intuition Behind the Bellman Expectation Equation
The Bellman expectation equation is often perceived as abstract or intimidating when first encountered.
At its core, however, it is simply an application of expected value over a decision tree.
For reference, we restate the equation:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]$$
This equation answers a single question:
If the agent starts in state $s$ and follows policy $\pi$, what is the expected long-term return?
Breaking the Equation into Decisions and Outcomes
The structure of the Bellman expectation equation mirrors a two-level decision process:
- The agent first selects an action $a$ according to its policy $\pi(a \mid s)$
- The environment then responds by transitioning to a next state $s'$ with reward $r$ according to $p(s', r \mid s, a)$
At each resulting outcome, the agent receives an immediate reward and continues from the new state.
Tree-Based Interpretation
Conceptually, evaluating $v_\pi(s)$ is equivalent to computing the expected value of all root-to-leaf paths in a decision tree:
- The root represents the current state $s$
- The first branching corresponds to action selection under policy $\pi$
- The second branching corresponds to stochastic environment transitions
- Each leaf contributes an immediate reward plus the discounted value of the next state
Decision Tree Visualization
Each root-to-leaf path corresponds to one possible future trajectory. The value of the root state is the probability-weighted average of all such paths.
Mapping the Tree to the Equation
Each component of the Bellman expectation equation corresponds directly to an element of the decision tree:
| Equation Term | Tree Interpretation |
|---|---|
| $v_\pi(s)$ | Value at the root node |
| $\pi(a \mid s)$ | Probability of choosing an action branch |
| $p(s', r \mid s, a)$ | Probability of an outcome branch |
| $r$ | Immediate reward at the leaf |
| $\gamma\, v_\pi(s')$ | Value of the subtree rooted at $s'$ |
| $\sum_a, \sum_{s', r}$ | Averaging over all paths |
Concrete Example: Recycling Robot
Consider the recycling robot in the High battery state.
Suppose its policy selects Search with probability $\pi(\text{Search} \mid \text{High})$ and Wait with probability $\pi(\text{Wait} \mid \text{High})$.
If the robot chooses Search, the battery stays High with probability $\alpha$ and drops to Low with probability $1 - \alpha$, with expected reward $r_{\text{search}}$.
Choosing Wait always keeps the battery in the High state and yields expected reward $r_{\text{wait}}$.
The corresponding value computation is:
$$v_\pi(\text{High}) = \pi(\text{Search} \mid \text{High})\bigl[r_{\text{search}} + \gamma\bigl(\alpha\, v_\pi(\text{High}) + (1 - \alpha)\, v_\pi(\text{Low})\bigr)\bigr] + \pi(\text{Wait} \mid \text{High})\bigl[r_{\text{wait}} + \gamma\, v_\pi(\text{High})\bigr]$$
This expression is simply the expected value of all branches in the corresponding decision tree.
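As a quick numerical sketch, assume (purely for illustration) a 50/50 split between Search and Wait, the 0.7/0.3 transition probabilities and rewards 1.5 and 0.5 from the code model in the next section, $\gamma = 0.9$, and value estimates initialized to zero. One sweep of the Bellman expectation update then gives:
$$v_1(\text{High}) = 0.5\bigl[1.5 + 0.9\,(0.7 \cdot 0 + 0.3 \cdot 0)\bigr] + 0.5\bigl[0.5 + 0.9 \cdot 0\bigr] = 0.75 + 0.25 = 1.0$$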
Key Takeaway
The Bellman expectation equation does not introduce a new kind of logic.
It computes the expected return of a decision tree, compressed into a recursive equation.
The power of Bellman equations lies in their ability to replace exponentially large trees with compact value functions, enabling efficient planning and learning.
Policy Evaluation (Known Model)
When the transition dynamics and reward function of the environment are known, the value function for a fixed policy can be computed iteratively using the Bellman expectation equation.
# Policy Evaluation for the Recycling Robot
states = ["HIGH", "LOW"]
gamma = 0.9
# Transition model
P = {
("HIGH", "SEARCH"): [("HIGH", 0.7), ("LOW", 0.3)],
("HIGH", "WAIT"): [("HIGH", 1.0)],
("HIGH", "RECHARGE"): [("HIGH", 1.0)],
("LOW", "SEARCH"): [("LOW", 0.6), ("FAIL", 0.4)],
("LOW", "WAIT"): [("LOW", 1.0)],
("LOW", "RECHARGE"): [("HIGH", 1.0)],
}
# Reward function
R = {
("HIGH", "SEARCH"): 1.5,
("HIGH", "WAIT"): 0.5,
("HIGH", "RECHARGE"): 0.0,
("LOW", "SEARCH"): -2.0,
("LOW", "WAIT"): 0.2,
("LOW", "RECHARGE"): 0.0,
}
def evaluate_policy(policy, iterations=500):
    # Initialize value function to zero for all states
    V = {s: 0.0 for s in states}
    # Repeat Bellman updates until convergence
    for _ in range(iterations):
        for s in states:
            a = policy(s)  # Action chosen under the policy
            expected_value = 0.0
            possible_outcomes = P[(s, a)]
            # Compute expected return over all possible transitions
            for s_next, prob in possible_outcomes:
                immediate_reward = R[(s, a)]
                future_value = V.get(s_next, 0.0)
                expected_value += prob * (
                    immediate_reward + gamma * future_value
                )
            V[s] = expected_value
    return V
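As a quick sanity check, the evaluator can be run with a simple hand-written policy. The policy below (search when the battery is High, recharge when it is Low) is an illustrative assumption, not part of the model above.

# Example usage with an illustrative deterministic policy
def cautious_policy(s):
    return "SEARCH" if s == "HIGH" else "RECHARGE"

V = evaluate_policy(cautious_policy)
print(V)  # expected-return estimates for HIGH and LOW under this policy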
Bellman Optimality Equation
To compute optimal behavior, we replace the policy-weighted expectation in the Bellman expectation equation with a maximization over actions:
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$
This equation defines the optimal value function.
Intuition Behind the Bellman Optimality Equation
While the Bellman expectation equation evaluates the value of a state under a fixed policy, the Bellman optimality equation answers a stronger question:
What is the best possible long-term value achievable from a state, assuming the agent always acts optimally?
To answer this question, the policy-weighted average over actions is replaced with a maximization:
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$
This equation formally defines the optimal value function $v_*$.
From Expectation to Optimization
The difference between the two Bellman equations is subtle but profound:
- Bellman Expectation: averages over actions according to a policy
- Bellman Optimality: assumes the agent always chooses the best possible action
Conceptually, the agent no longer asks:
"What will I do under my current policy?"
Instead, it asks:
"What should I do to maximize long-term reward?"
Decision Tree Interpretation
The Bellman optimality equation can be interpreted as a decision tree in which the agent performs action selection by choosing the best branch:
- The root represents the current state
- Each possible action corresponds to a subtree
- The environment expands each subtree probabilistically
- The agent selects the subtree with the highest expected value
Optimal Tree Diagram
Each action defines a complete expectation over future outcomes. The optimal value function selects the action whose subtree has the highest expected value.
Mapping the Tree to the Equation
Each component of the Bellman optimality equation corresponds directly to an element of the decision tree:
| Equation Component | Tree Interpretation |
|---|---|
| $v_*(s)$ | Value of the best subtree rooted at $s$ |
| $\max_a$ | Selection of the best action branch |
| $\sum_{s', r} p(s', r \mid s, a)$ | Expectation over environment outcomes |
| $r$ | Immediate reward at leaf nodes |
| $\gamma\, v_*(s')$ | Value of future subtrees |
Concrete Example: Recycling Robot
Consider the recycling robot in the Low battery state.
The robot has two reasonable actions:
- Search: high immediate reward but a significant risk of battery failure
- Recharge: no immediate reward, but safely restores the battery
The Bellman optimality equation compares the expected return of both actions:
$$q(\text{Low}, \text{Search}) = \sum_{s', r} p(s', r \mid \text{Low}, \text{Search})\,\bigl[r + \gamma\, v_*(s')\bigr], \qquad q(\text{Low}, \text{Recharge}) = 0 + \gamma\, v_*(\text{High})$$
The optimal value for the Low battery state is then:
$$v_*(\text{Low}) = \max\bigl\{q(\text{Low}, \text{Search}),\; q(\text{Low}, \text{Recharge})\bigr\}$$
Even if searching occasionally yields higher immediate reward, the increased risk of failure reduces its long-term value, making recharging the optimal choice.
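As a rough illustration with the numbers from the code model above (reward -2.0 for searching on a low battery, 0 for recharging, $\gamma = 0.9$), the very first sweep of value iteration, starting from zero value estimates, already prefers recharging:
$$q_1(\text{Low}, \text{Search}) = -2.0 + 0.9 \cdot 0 = -2.0, \qquad q_1(\text{Low}, \text{Recharge}) = 0 + 0.9 \cdot 0 = 0$$
$$v_1(\text{Low}) = \max\{-2.0,\; 0\} = 0$$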
Key Insight
The Bellman optimality equation performs local maximization at each state, assuming all future decisions are also optimal.
Value Iteration
Value iteration applies the Bellman optimality update repeatedly in order to converge to the optimal value function $v_*$.
# Value Iteration
actions = ["SEARCH", "WAIT", "RECHARGE"]
def value_iteration(iterations=50):
    # Initialize value function to zero for all states
    V = {s: 0.0 for s in states}
    # Repeat Bellman optimality updates
    for _ in range(iterations):
        # Iterate over all states (HIGH, LOW)
        for s in states:
            best_value_found = -float("inf")
            # Evaluate every possible action
            for a in actions:
                current_action_value = 0.0
                possible_outcomes = P[(s, a)]
                for s_next, prob in possible_outcomes:
                    immediate_reward = R[(s, a)]
                    future_value = V.get(s_next, 0.0)
                    current_action_value += prob * (
                        immediate_reward + gamma * future_value
                    )
                # Keep the best action value
                if current_action_value > best_value_found:
                    best_value_found = current_action_value
            # Update the value of the state
            V[s] = best_value_found
    return V
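Once the optimal value function has been approximated, a greedy policy can be read off by choosing, in each state, the action with the highest one-step lookahead value. The helper below is a minimal sketch that reuses the P, R, and gamma definitions from the policy-evaluation section.

# Extract a greedy policy from a value function (a sketch reusing P, R, gamma above)
def greedy_policy(V):
    policy = {}
    for s in states:
        best_action, best_value = None, -float("inf")
        for a in actions:
            action_value = sum(
                prob * (R[(s, a)] + gamma * V.get(s_next, 0.0))
                for s_next, prob in P[(s, a)]
            )
            if action_value > best_value:
                best_action, best_value = a, action_value
        policy[s] = best_action
    return policy

V_star = value_iteration()
print(greedy_policy(V_star))  # with this model: SEARCH when HIGH, RECHARGE when LOW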
Q-Learning (Unknown Model)
When transition probabilities are unknown, the agent can learn directly from experience using Q-learning.
Intuition Behind Q-Learning
When the transition dynamics and reward function of an environment are unknown, planning methods such as value iteration are no longer applicable.
In such settings, the agent must learn directly from interaction with the environment.
Q-learning is a model-free reinforcement learning algorithm that learns optimal behavior purely from experience, without requiring knowledge of transition probabilities or rewards.
From State Values to Action Values
Recall that the optimal value function satisfies the Bellman optimality equation:
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$
Q-learning shifts the focus from states to state-action pairs.
Instead of estimating $v_*(s)$ directly, the agent learns the optimal action-value function:
$$q_*(s, a) = \mathbb{E}\bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\big|\, S_t = s,\, A_t = a\bigr]$$
Once $q_*$ is known, the optimal policy follows immediately:
$$\pi_*(s) = \arg\max_a q_*(s, a)$$
The Q-Learning Update Rule
Q-learning approximates $q_*(s, a)$ using incremental updates based on experience.
After observing a single transition $(s, a, r, s')$, the update is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\,\bigl[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr]$$
Intuition: Learning from One Step of Experience
Each term in the update rule has a clear interpretation:
- $Q(s, a)$: current estimate of the value of taking action $a$ in state $s$
- $r$: immediate reward received
- $\gamma \max_{a'} Q(s', a')$: estimated value of the best future action
- $\gamma$: discount factor controlling long-term importance
- $\alpha$: learning rate controlling how aggressively updates are applied
The quantity inside the brackets is known as the temporal-difference (TD) error:
$$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$$
The agent adjusts its estimate in the direction of this error.
Tree-Based Interpretation of Q-Learning
Q-learning can be understood as learning the value of a one-step lookahead tree:
- The root represents the current state-action pair $(s, a)$
- The environment expands the tree by transitioning to a next state $s'$
- The agent assumes optimal behavior beyond $s'$
Unlike Bellman expectation methods, Q-learning involves:
- No averaging over actions
- No explicit transition probabilities
The agent samples a single branch of the tree and updates its estimate accordingly.
Q-Learning Tree Diagram
(This can be visualized as a one-step decision tree where the agent evaluates the best possible action from the next state.)
The Q-learning update nudges the value of the root node toward the sampled target:
$$r + \gamma \max_{a'} Q(s', a')$$
This is known as a sampled Bellman optimality backup.
Concrete Example: Recycling Robot
Suppose the recycling robot is in the Low battery state and takes the action Search.
- The robot receives a large negative reward (the failure penalty) if the battery is depleted
- The episode terminates
The Q-learning update becomes:
$$Q(\text{Low}, \text{Search}) \leftarrow Q(\text{Low}, \text{Search}) + \alpha\,\bigl[r_{\text{fail}} + \gamma \cdot 0 - Q(\text{Low}, \text{Search})\bigr]$$
where $r_{\text{fail}}$ denotes the failure penalty and the future term is zero because the episode ends.
Repeated experiences like this reduce the estimated value of searching in low-battery states, even if searching occasionally yields a positive reward.
Why Q-Learning Works
Q-learning converges to the optimal action-value function under the following conditions:
- All stateβaction pairs are explored infinitely often
- The learning rate decays appropriately
- The environment is stationary
Crucially, Q-learning is:
- Off-policy: it learns the optimal policy regardless of the agent's behavior policy
- Model-free: it does not require knowledge of the transition function $p(s', r \mid s, a)$ or the reward function
Key Insight
Q-learning replaces full expectation over future outcomes with repeated sampling, gradually shaping value estimates toward optimality.
Each update is noisy and local, but over time the agent internalizes the structure of the optimal decision tree.
In essence, Q-learning is Bellman optimality learned one experience at a time.
It converges to the optimal policy without explicit knowledge of the transition model.
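To make this concrete, here is a minimal tabular Q-learning sketch for the recycling robot. It reuses the P and R dictionaries defined earlier purely as a simulator that the agent samples from; the learner itself never reads the probabilities. The epsilon-greedy exploration, learning rate, and episode/step limits are illustrative assumptions.

# Tabular Q-Learning sketch (P and R above are used only to simulate experience)
import random

def sample_step(s, a):
    # Draw a next state from the simulator; the learner never sees these probabilities
    draw = random.random()
    cumulative = 0.0
    for s_next, prob in P[(s, a)]:
        cumulative += prob
        if draw < cumulative:
            return s_next, R[(s, a)]
    return P[(s, a)][-1][0], R[(s, a)]

def q_learning(episodes=5000, alpha=0.1, epsilon=0.1, max_steps=50):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = "HIGH"
        for _ in range(max_steps):
            # Epsilon-greedy behavior policy; the learning target is greedy (off-policy)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r = sample_step(s, a)
            # Terminal (failure) states contribute no future value
            best_next = 0.0 if s_next == "FAIL" else max(Q[(s_next, act)] for act in actions)
            # Sampled Bellman optimality backup
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            if s_next == "FAIL":
                break
            s = s_next
    return Q

After a few thousand episodes, the learned Q-values should favor searching in the High state and recharging in the Low state, matching the value-iteration result, even though the learner never consulted the transition model.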
Partial Observability
If the agent does not observe the true battery state directly, the Markov assumption is violated.
# Noisy Observation Model
import random

def observe(true_state):
    # With probability 0.2, the sensor reports the wrong battery level
    if random.random() < 0.2:
        return "HIGH" if true_state == "LOW" else "LOW"
    return true_state
This motivates the use of belief states, memory, and recurrent architectures.
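As a sketch of the belief-state idea, the agent can maintain a probability distribution over the true battery level and update it with Bayes' rule whenever a noisy observation arrives. The helper below performs only the measurement update and assumes the 20% error rate used in observe above.

# Bayesian measurement update of a belief over the true battery state (a sketch)
def update_belief(belief, obs, noise=0.2):
    # belief: dict mapping true state ("HIGH"/"LOW") to probability
    likelihood = {s: (1.0 - noise) if s == obs else noise for s in belief}
    unnormalized = {s: likelihood[s] * belief[s] for s in belief}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

# Starting from a uniform belief, observing "LOW" shifts probability toward LOW
print(update_belief({"HIGH": 0.5, "LOW": 0.5}, "LOW"))  # roughly {'HIGH': 0.2, 'LOW': 0.8}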
Summary
Starting from a simple finite Markov Decision Process, we derived Bellman equations, implemented planning algorithms, and demonstrated learning from experience.
This progression illustrates how abstract mathematical definitions translate directly into working decision-making systems.
Where MDPs Break Down
Markov Decision Processes struggle with:
- Partial observability
- Non-stationary environments
- Multi-agent interactions
- Sparse or delayed rewards
These limitations motivate extensions such as Partially Observable MDPs (POMDPs) and hierarchical reinforcement learning.
Conclusion
Finite Markov Decision Processes are not a perfect model of intelligence.
They are a useful abstraction that allows us to reason about decision-making systems.
Understanding MDPs is less about equations and more about understanding how systems make decisions over time.
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Morales, M. (2020). Grokking Deep Reinforcement Learning. Manning Publications.