projects

stock trader rl environment

#python · #fastapi · #pydantic · #docker · #websocket · #openenv · #hugging face spaces

openenv-compliant reinforcement learning environment for evaluating llm trading agents on indian equity markets. qualified for the meta pytorch openenv hackathon finale (top 800 out of 32,000+ teams).

/ what it does

  • simulates daily stock trading on 68 nifty stocks using ~5 years of real historical ohlcv data
  • agents connect via http/websocket, receive market observations with technical indicators, and respond with plain-text trade actions (buy, sell, hold)
  • three difficulty tiers: single stock (20 days), portfolio (30 days), full autonomous (40 days) — each with escalating constraints like transaction costs, slippage, position limits, and regime gates
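the interaction loop above can be sketched in a few lines. the method names and payload shape here are illustrative assumptions, not the environment's real api; the fallback-to-hold parsing matches the documented behavior for invalid actions:

```python
# minimal sketch of an agent episode against the environment.
# env methods and return shapes here are illustrative, not the real schema.
def parse_action(text: str) -> str:
    """map a free-form llm reply to a valid action; anything else is hold."""
    token = text.strip().lower().split()[0] if text.strip() else "hold"
    return token if token in ("buy", "sell", "hold") else "hold"

def run_episode(env, agent, task="single_stock", seed=42):
    """reset with a task and seed, then step until the episode ends."""
    obs = env.reset(task=task, seed=seed)     # text market summary
    done, total_reward = False, 0.0
    while not done:
        action = parse_action(agent(obs))     # llm replies in plain text
        obs, reward, done = env.step(action)  # costs/slippage applied inside
        total_reward += reward
    return total_reward
```

same seed, same task → same market window, which is what makes episodes reproducible for evaluation.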

/ how it works

  • market simulator replays historical price windows with a 50-day lookback buffer for indicator computation
  • feature engine computes rsi, macd, bollinger bands, volume spikes, trend, momentum, and volatility — served as human-readable text summaries for llm agents
  • step-level reward shaping: pnl reward, discipline bonus, regime gate penalty, trade limit violations
  • task-specific graders score the full trajectory on sharpe ratio, discipline, regime compliance, and risk management
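a trajectory grader in that spirit can be sketched as follows. the component weights and penalty thresholds are illustrative assumptions, not the actual grading constants:

```python
# sketch of a deterministic trajectory grader: sharpe minus penalties.
# weights and thresholds below are illustrative, not the real ones.
import statistics

def sharpe(daily_returns, eps=1e-9):
    """annualized sharpe over the episode's daily returns."""
    mu = statistics.mean(daily_returns)
    sd = statistics.pstdev(daily_returns)
    return (mu / (sd + eps)) * (252 ** 0.5)

def grade_trajectory(daily_returns, violations, max_drawdown):
    """deterministic score: performance minus rule-break and risk penalties."""
    score = sharpe(daily_returns)
    score -= 0.5 * violations                      # trade-limit / regime-gate breaches
    score -= 2.0 * max(0.0, max_drawdown - 0.10)   # penalize drawdown past 10%
    return score
```

the key property is determinism: the same trajectory always gets the same score, which is what makes the grader usable as a verifiable reward.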

/ why it matters — rlvr & grpo

  • the grading system is designed as a verifiable reward function (rlvr) — deterministic scores that replace traditional reward models
  • this enables grpo-based training: generate multiple rollouts through the environment, rank them by grader score, and update model weights to favor better trading trajectories
  • no separate reward model or critic needed — the environment's graders are the reward signal
  • the trader agent project is the target policy model — the goal is to train it using this environment's verifiable rewards to improve its trading decisions
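the grader-to-grpo link can be shown concretely: grpo needs no critic because advantages come from normalizing grader scores within a group of rollouts for the same prompt. a sketch of that normalization (not the actual training code):

```python
# sketch of grpo's group-relative advantages: sample several rollouts,
# grade each deterministically, then normalize within the group.
import statistics

def group_advantages(scores, eps=1e-9):
    """(score - group mean) / group std; above-average rollouts get
    positive advantage, so the policy update favors them."""
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mu) / (sd + eps) for s in scores]
```

advantages always sum to ~zero within a group, so the update only cares about which rollouts beat their siblings, not absolute score scale.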

/ what's next (in progress)

  • integrating with unsloth/trl for grpo-based rl training of the trader agent
  • vllm deployment for inference optimization during rollout generation
  • data pipeline for collecting and processing thousands of training rollouts
  • the environment (phase 1) is complete — now building the training loop (phase 2)

/ how it works

01 agent connects via http/websocket and resets environment with a task and seed
02 environment selects a random market window and returns initial observation
03 agent reads market summary with technical indicators and submits trade action
04 environment executes trade with realistic costs/slippage, computes reward, advances to next day
05 at episode end, grader scores the full trajectory on task-specific criteria

/ features

meta pytorch hackathon finale
qualified for the finale (top 800 out of 32,000+ teams). built and presented the environment to meta engineers in bangalore, april 2026.
three difficulty tiers
single stock (easy, 20 days), portfolio (medium, 30 days), and full autonomous (hard, 40 days) with escalating constraints — transaction costs, slippage, position limits, trade caps, and regime gates.
verifiable reward design (rlvr)
deterministic grading functions that score agents on sharpe ratio, discipline, regime compliance, and risk management. designed to serve as verifiable rewards for grpo-based rl training — no separate reward model needed.
real market data & technical analysis
68 nifty stocks with ~5 years of daily ohlcv data. a feature engine computes rsi, macd, bollinger bands, volume spikes, trend, momentum, and volatility from a 50-day lookback buffer.
llm-native interface
plain-text action space (buy, sell, hold) and human-readable market summaries — any llm can act as an agent without special tooling. invalid actions gracefully default to hold.
seed-reproducible episodes
fully deterministic episodes for reproducible evaluation. same seed produces same market window and sequence.

nyc eta engine

#python · #pytorch · #lightgbm · #pandas · #mlflow · #docker · #hugging face hub

3-model ensemble (neural net + lightgbm + ft-transformer) for nyc taxi eta prediction. trained on 37m trips, achieves 252.7s mae — 28% better than xgboost baseline, inference under 5ms.

/ what it does

  • predicts taxi trip duration given pickup zone, dropoff zone, timestamp, and passenger count using 37 million real nyc yellow taxi trips from 2023
  • learns all spatial relationships from trip data via zone embeddings — no external geography, shapefiles, or hardcoded coordinates. if zone ids mapped to a different city, the model would work equally well
  • serves predictions in under 5ms on cpu with a 3-model ensemble, packaged in a ~500mb docker container

/ the ensemble

  • model 1: dual-branch embedding network (560k params) — zone embeddings with hash-based pair embedding, 24 continuous features, residual blocks. best at smooth interpolation for common routes
  • model 2: lightgbm (81 trees) — gradient-boosted trees with zone ids as native categoricals. near-zero bias (-6s) on rare pairs where the neural net struggles (-106s bias)
  • model 3: ft-transformer (406k params) — implemented from scratch, each feature projected into a 128-dim token, 3-layer self-attention with [cls] aggregation. positive bias (+65s) offsets nn's negative bias
  • ensemble weights optimized via grid search on full dev set: 0.6 nn + 0.2 lgbm + 0.2 ft-transformer
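a toy version of that grid search, assuming predictions and labels are plain lists (the real search runs over the 1.23m-row dev set):

```python
# sketch of the ensemble weight search: sweep weight triples on a grid
# (summing to 1) and keep the combination with lowest dev-set mae.
def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def search_weights(p_nn, p_lgb, p_ft, y, step=0.1):
    best = (float("inf"), None)
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            w1, w2 = i * step, j * step
            w3 = 1.0 - w1 - w2            # weights constrained to sum to 1
            blend = [w1 * a + w2 * b + w3 * c
                     for a, b, c in zip(p_nn, p_lgb, p_ft)]
            err = mae(blend, y)
            if err < best[0]:
                best = (err, (w1, w2, w3))
    return best
```

with a 0.1 step this is only 66 candidate triples, so exhaustive search on the full dev set is cheap.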

/ feature engineering

  • 14 zone-pair statistics with bayesian shrinkage — smooths sparse pairs toward pickup-zone mean with a fallback hierarchy: pair → pickup zone → dropoff zone → global mean
  • 6 traffic-regime time buckets (late night, early morning, am rush, midday, pm rush, evening) with per-regime pair statistics
  • 10 temporal features: cyclical hour/dow/month encoding, rush hour flags, night flags, normalized minute-of-day
  • zone-pair median alone (296.7s) beats xgboost (351s) with zero ml — the signal is in the feature engineering
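the shrinkage rule itself is one line: a count-weighted blend of the pair mean and its prior. the prior strength k below is an illustrative hyperparameter, not the tuned value:

```python
# sketch of bayesian shrinkage for sparse zone pairs: the pair mean is
# pulled toward the pickup-zone prior, more strongly when the pair has
# few trips. k (prior strength) is illustrative, not the tuned value.
def shrunk_mean(pair_sum, pair_count, prior_mean, k=30):
    """count-weighted blend; equals the prior exactly when count is 0,
    which is what prevents cold-start failures on unseen pairs."""
    return (pair_sum + k * prior_mean) / (pair_count + k)
```

the fallback hierarchy applies the same idea recursively: the pair shrinks toward the pickup-zone mean, which itself shrinks toward the global mean.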

/ results

  • 252.7s mae — 28% better than xgboost baseline (351s). ensemble reduced mae from 261s (best single model) to 253s
  • nn: 261.2s (precision on common routes) | lgbm: 261.7s (low bias on rare pairs) | ft: 284.7s (different error pattern, bias offset)
  • diagnostic-driven tuning identified rare-pair bias as the true bottleneck — no amount of nn tuning could fix it, lightgbm solved it
  • inference under 5ms per request, total model weights 6.3mb (2.3 + 2.4 + 1.6)

/ how it works

01 download and clean 37m nyc taxi trips (2023), split temporally into train/dev
02 compute zone-pair statistics with bayesian shrinkage across 6 traffic regimes
03 train neural net (37m rows, huber loss), lightgbm (10m rows, mae), ft-transformer (10m rows, l1)
04 optimize ensemble weights via grid search on full 1.23m dev set
05 evaluate on held-out dev set, pick best checkpoint by mae (not training loss)

/ features

3-model ensemble
neural net + lightgbm + ft-transformer with complementary strengths. each model has a different inductive bias — embeddings vs tree splits vs self-attention. ensemble reduced mae from 261s to 253s.
ft-transformer from scratch
feature tokenizer transformer (gorishniy et al., neurips 2021). each feature projected into a 128-dim token, [cls] token aggregates via 3-layer self-attention. captures cross-feature interactions the mlp misses.
learned zone embeddings
50-dim embeddings for 266 zones learn spatial relationships purely from trip patterns. no external geography needed — model is transferable to any city with zone ids.
bayesian shrinkage for sparse pairs
handles rare and unseen zone pairs gracefully. shrinkage prior smooths toward pickup-zone mean; fallback hierarchy prevents cold-start failures.
diagnostic-driven tuning
deep diagnostics (parameter health, rare-pair analysis, regularization checks) revealed rare-pair bias as the true bottleneck. prevented wasted experiments on architecture changes.
memory-efficient training
37m rows processed in 2m-row chunks. keeps memory under 6gb, enabling free-tier gpu training on colab/kaggle t4.
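the chunking idea in miniature: stream fixed-size row ranges and fold each chunk into running statistics instead of materializing the full dataset (sizes and shapes illustrative):

```python
# sketch of memory-bounded processing: iterate 2m-row chunks and
# accumulate incremental statistics, never holding 37m rows at once.
def iter_chunks(n_rows, chunk_size=2_000_000):
    """yield (start, stop) row ranges covering the dataset."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

def running_mean(chunks_of_values):
    """incremental mean over chunks; memory stays O(chunk), not O(dataset)."""
    total, count = 0.0, 0
    for values in chunks_of_values:
        total += sum(values)
        count += len(values)
    return total / count
```

the same pattern works with pandas' `chunksize` readers; any statistic that decomposes into per-chunk partial sums (means, counts, pair aggregates) can be folded this way.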

autonomous trader agent

#python · #fastapi · #postgresql · #docker · #github actions · #zerodha kite connect

autonomous trading system for indian equity markets using cross-sectional reversal scoring on 96 nifty stocks. backtested 8.6% cagr with 60% win rate over 5.4 years.

/ the strategy

  • cross-sectional reversal — ranks 96 nifty stocks by magnitude of decline over a 5-21 day lookback, buys the most oversold, holds for 5 trading days
  • the edge is behavioral: panic selling pushes stocks below fair value, creating a mean-reversion opportunity that algorithms can't easily arbitrage away
  • information coefficient: +0.020 (large-cap), +0.025 (midcap) — a small but consistent edge compounded over thousands of trades
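the scoring rule can be sketched directly: rank the universe by lookback return and take the most negative. the data shape below is illustrative; the real scanner spans 5-21 day windows across 96 symbols:

```python
# sketch of cross-sectional reversal scoring: rank symbols by decline
# over the lookback window and return the most oversold names.
def reversal_candidates(closes, lookback=10, top_k=5):
    """closes: {symbol: [daily closes]}; returns top_k biggest decliners."""
    scores = {}
    for sym, px in closes.items():
        if len(px) > lookback:
            scores[sym] = (px[-1] - px[-1 - lookback]) / px[-1 - lookback]
    # most negative lookback return = most oversold = highest priority
    return sorted(scores, key=scores.get)[:top_k]
```

"cross-sectional" means a stock is scored relative to the rest of the universe on the same day, not against its own history alone.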

/ research journey

  • tested 6 strategies systematically before finding the edge
  • 5 failed: intraday ml prediction, breakout detection, 5-min mean reversion, 30-min trend following, cross-sectional ml — indian large-cap stocks are too efficient at intraday resolution
  • daily reversal was the only signal that survived — driven by human psychology, not technical patterns
  • evolved through 4 versions of allocation logic, each improving capital efficiency — the underlying signal never changed

/ how it works

  • 3-state regime classifier (bull/neutral/weak) using nifty vs 50-dma, momentum, and market breadth with a 2-day persistence filter
  • adaptive confidence scoring: continuous 0-1 score combining ic, rolling win rate, momentum, and breadth for smooth capital allocation
  • risk controls: regime-based exposure gates, soft drawdown dampening, recovery boost, kill switches on declining win rates or negative ic, panic filters
  • a/b pipeline testing with independent scan intervals, capital pools, and paper broker instances for isolated comparison

/ results

  • backtested over 5.4 years (oct 2020 – jan 2025): 8.6% cagr, 42% total return, 60% win rate
  • survived the 2025-26 bear market with 6.5% cagr and 9-16% max drawdown
  • large-cap returns: +38% | midcap returns: +108% (2.8x higher)
  • ~52% average capital deployment — the rest held as a protective cash buffer

/ what's next

  • this is the target policy model for rl training — the stock-trader-env project provides the verifiable reward environment
  • goal: use grpo to train the agent's decision-making on thousands of simulated rollouts, optimizing for sharpe ratio and risk discipline
  • replacing rule-based scoring with a learned policy that adapts to market conditions

/ how it works

01 regime classifier evaluates market conditions (bull/neutral/weak)
02 confidence scorer computes allocation weight from ic, win rate, momentum, breadth
03 reversal scanner ranks stocks by decline magnitude across lookback windows
04 risk guardian validates exposure limits, drawdown gates, and kill switches
05 trade executor places orders via zerodha kite connect (cnc for swing holding)

/ features

cross-sectional reversal scoring
ranks 96 nifty stocks by decline magnitude. information coefficient: +0.020 (large-cap), +0.025 (midcap). exploits behavioral overreaction — a structural edge driven by psychology, not patterns algorithms can arbitrage away.
3-state regime classifier
classifies market as bull (65-85% exposure), neutral (50-75%), or weak (8-40%) using nifty vs 50-dma, momentum, and breadth. 2-day persistence filter prevents whipsawing.
adaptive confidence scoring
continuous 0-1 scoring combining information coefficient, rolling win rate, momentum, and market breadth. replaces hard thresholds for smoother capital allocation.
a/b pipeline testing
two independent pipelines with separate scan intervals and capital pools. each pipeline runs its own paper broker instance for isolated comparison.
risk management layers
regime-based exposure gates, soft drawdown dampening (gentle in bull, aggressive in weak), recovery boost when signal improves during drawdown recovery, and kill switches that pause trading on declining win rates or negative ic.
research-driven development
tested 6 strategies systematically before finding the edge. 5 failed (ml prediction, breakouts, intraday mean reversion, trend following, cross-sectional ml). every version improvement came from better capital allocation — the signal never changed.