Stanford AA 228 · Decision Making Under Uncertainty

Algorithmic Trading
with Markov Decision
Processes

A model-based reinforcement learning approach to stock trading that outperforms naive strategies using RSI-based state classification and value iteration.

01

Project Overview

This project develops a low-frequency algorithmic trading strategy designed for investment firms with significant market influence. Unlike high-frequency trading, our approach operates on weekly timeframes, making decisions that can meaningfully impact stock prices.

We frame the trading problem as a Markov Decision Process, using the Relative Strength Index (RSI) to classify market states and computing optimal policies through value iteration. The model-based approach leverages historical data from 2000-2020 to construct explicit transition and reward matrices.

Key insight: While model-free methods like Q-learning require massive datasets, stock trading offers limited data (at most ~252 trading days per year, and only ~52 weekly observations), making model-based approaches more practical and interpretable.

MDP State Space

OVERSOLD: RSI < μ − σ
NEUTRAL: μ − σ ≤ RSI ≤ μ + σ
OVERBOUGHT: RSI > μ + σ

Actions: BUY · HOLD · SELL
02

Technical Approach

State Classification

States derived from RSI (Relative Strength Index), a momentum indicator measuring price change magnitude. Weekly RSI values categorized into three states based on statistical thresholds (mean ± std).
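A minimal sketch of this classification, assuming a standard 14-period RSI computed with simple moving averages; the `rsi` and `classify_states` helpers are illustrative, not the project's exact implementation:

```python
import numpy as np

def rsi(closes, window=14):
    """Relative Strength Index over a rolling window (simple averages)."""
    deltas = np.diff(closes)
    out = np.full(len(closes), np.nan)
    for i in range(window, len(closes)):
        d = deltas[i - window:i]
        gain = d[d > 0].sum() / window
        loss = -d[d < 0].sum() / window
        out[i] = 100.0 if loss == 0 else 100 - 100 / (1 + gain / loss)
    return out

def classify_states(rsi_vals):
    """Map each RSI value to OVERSOLD / NEUTRAL / OVERBOUGHT via mean ± std."""
    vals = rsi_vals[~np.isnan(rsi_vals)]
    mu, sigma = vals.mean(), vals.std()
    def label(r):
        if r < mu - sigma:
            return "OVERSOLD"
        if r > mu + sigma:
            return "OVERBOUGHT"
        return "NEUTRAL"
    return [label(r) for r in vals]
```

Because the thresholds come from each stock's own RSI distribution, the state boundaries adapt automatically to volatile versus stable tickers.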

Action Classification

Actions labeled based on deviation from expected closing price using 12-week trend regression. Buy if price exceeds +0.5σ, Sell if below -0.5σ, Hold within ±0.25σ.
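A sketch of this labeling rule, with two stated assumptions: σ is taken to be the residual standard deviation of the 12-week regression fit, and weeks falling in the band between ±0.25σ and ±0.5σ (which the rule leaves unlabeled) are marked `None`:

```python
import numpy as np

def label_actions(closes, lookback=12, buy_sd=0.5, hold_sd=0.25):
    """Label each week BUY/SELL/HOLD by the deviation of the actual close
    from the close predicted by a linear trend fit over the prior weeks."""
    labels = []
    for t in range(lookback, len(closes)):
        window = closes[t - lookback:t]
        x = np.arange(lookback)
        slope, intercept = np.polyfit(x, window, 1)
        expected = slope * lookback + intercept        # extrapolate one step ahead
        resid_sd = np.std(window - (slope * x + intercept))
        dev = (closes[t] - expected) / resid_sd if resid_sd > 0 else 0.0
        if dev > buy_sd:
            labels.append("BUY")
        elif dev < -buy_sd:
            labels.append("SELL")
        elif abs(dev) <= hold_sd:
            labels.append("HOLD")
        else:
            labels.append(None)  # ambiguous band between the two thresholds
    return labels
```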

Transition Matrix T

Captures P(s'|s,a) — probability of transitioning between states given an action. Built from observed state-action-state frequencies in training data.
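The frequency-counting construction can be sketched as follows; the uniform fallback for (state, action) pairs never observed in training is an assumption, not necessarily the project's choice:

```python
import numpy as np

STATES = ["OVERSOLD", "NEUTRAL", "OVERBOUGHT"]
ACTIONS = ["BUY", "HOLD", "SELL"]

def transition_matrix(states, actions):
    """T[s, a, s'] = P(s' | s, a), estimated from observed frequencies."""
    counts = np.zeros((3, 3, 3))
    for t in range(len(states) - 1):
        if actions[t] is None:
            continue
        s = STATES.index(states[t])
        a = ACTIONS.index(actions[t])
        sp = STATES.index(states[t + 1])
        counts[s, a, sp] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Unseen (s, a) pairs fall back to a uniform distribution over s'.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1 / 3)
```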

Reward Matrix R

Records expected % price change for each state-action pair. Computed as average observed reward for each (state, action) combination.
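The reward estimate is a per-pair average, which can be sketched as:

```python
import numpy as np

STATES = ["OVERSOLD", "NEUTRAL", "OVERBOUGHT"]
ACTIONS = ["BUY", "HOLD", "SELL"]

def reward_matrix(states, actions, returns):
    """R[s, a] = mean observed % price change for each (state, action) pair."""
    sums = np.zeros((3, 3))
    counts = np.zeros((3, 3))
    for s_lbl, a_lbl, r in zip(states, actions, returns):
        if a_lbl is None:
            continue
        s, a = STATES.index(s_lbl), ACTIONS.index(a_lbl)
        sums[s, a] += r
        counts[s, a] += 1
    # Pairs with no observations get reward 0 rather than NaN.
    return np.divide(sums, counts, out=np.zeros((3, 3)), where=counts > 0)
```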

Value Iteration

Iteratively computes optimal utility U(s) for each state until convergence. Discount factor γ=0.9 balances immediate vs. future rewards.
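With T and R in hand, value iteration is a short fixed-point loop; this sketch uses a convergence tolerance of 1e-8, an assumed detail:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Iterate U(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') U(s') ]."""
    U = np.zeros(T.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * E[U(s') | s, a]
        Q = R + gamma * np.einsum("sap,p->sa", T, U)
        U_new = Q.max(axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            return U_new
        U = U_new
```

Because γ < 1 the Bellman update is a contraction, so the loop is guaranteed to converge regardless of the initial utilities.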

Policy Extraction

Optimal policy π*(s) extracted by selecting action maximizing expected utility. Results in one of 3³ = 27 possible policies across the three states.
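Extraction is a single greedy pass over the converged utilities:

```python
import numpy as np

def extract_policy(T, R, U, gamma=0.9):
    """pi*(s) = argmax_a [ R(s,a) + gamma * sum_s' T(s,a,s') U(s') ]."""
    Q = R + gamma * np.einsum("sap,p->sa", T, U)
    return Q.argmax(axis=1)  # one action index per state -> one of 3^3 = 27 policies
```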

BELLMAN OPTIMALITY EQUATION

a*(s) = argmax_a [ R(s,a) + γ Σ_{s'} T(s,a,s') · U(s') ]

a*(s): Optimal action for state s
R(s,a): Expected reward for taking action a in state s
γ = 0.9: Discount factor balancing immediate vs. future rewards
T(s,a,s'): Probability of transitioning to s' after taking a in s
U(s'): Utility of successor state s'
Σ_{s'}: Sum over all possible successor states
03

Results & Analysis

NVDA (NVIDIA)

[Policy histogram: the computed optimal policy achieved the best return of all 27 policies]

JNJ (Johnson & Johnson)

[Policy histogram: the optimal policy underperformed several alternatives]

Aggregate Performance: Top 20 Stocks by Market Cap

Normalized returns averaged across AAPL, MSFT, GOOGL, AMZN, NVDA, TSLA, META, and 13 others

Policy            Normalized Return
Optimal           0.52
sell-buy-hold     0.44
sell-hold-buy     0.43
buy-buy-buy       0.31
hold-sell-sell    0.08
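A ranking like the one above can be sketched by enumerating all 27 deterministic policies on held-out data. This is a heavily simplified stand-in for the project's evaluation: the position mapping (BUY → +1, HOLD → 0, SELL → −1) and the plain cumulative-return score are assumptions, not the normalization actually used.

```python
import itertools
import numpy as np

STATES = ["OVERSOLD", "NEUTRAL", "OVERBOUGHT"]
ACTIONS = ["BUY", "HOLD", "SELL"]
POSITION = {"BUY": 1.0, "HOLD": 0.0, "SELL": -1.0}  # assumed mapping

def backtest(policy, states, returns):
    """Cumulative return of a fixed state->action policy on held-out weeks."""
    pos = np.array([POSITION[policy[s]] for s in states])
    return float(np.sum(pos * np.asarray(returns)))

def rank_policies(states, returns):
    """Score all 3^3 = 27 deterministic policies, best first."""
    scored = []
    for acts in itertools.product(ACTIONS, repeat=3):
        policy = dict(zip(STATES, acts))
        name = "-".join(a.lower() for a in acts)
        scored.append((name, backtest(policy, states, returns)))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```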

✓ Momentum-Driven Stocks

High-volatility, momentum-driven stocks (NVDA, TSLA) showed strongest performance under optimal policy. RSI-based state classification captures short-term sentiment effectively for speculative assets.

✓ Aggregate Outperformance

Optimal policy ranked #1 across all 27 possible policies when averaged over 20 stocks, validating the model-based approach for portfolio-level decision making.

⚠ Stable Stock Limitations

Low-volatility stocks (JNJ, PG) showed weak or negative returns under optimal policy. Prices driven by fundamentals rather than momentum—RSI less predictive.

⚠ Data Constraints

Limited historical data (~1000 weekly observations) constrains transition matrix accuracy. COVID-era data excluded to avoid anomalous patterns.