A two-tier machine learning system that classifies every MLB pitch into meaningful outcomes using real Statcast data (2021–2024).
This work addresses a core problem in baseball analytics: predicting the outcome of a pitch from the data available. It is designed to serve both predictive and strategic functions, such as scouting and game-calling analysis.
Data source: pybaseball API for the 2021–2024 regular seasons.

Features were designed to reflect both context and pitch physics.
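
For the data step, a minimal sketch of pulling Statcast pitches through pybaseball might look like the following. The date windows, the helper names `season_windows` and `load_regular_season`, and the choice to filter on the `game_type` column are assumptions made for illustration; the project's actual loading code may differ.

```python
from typing import Iterator, Tuple


def season_windows(first_year: int = 2021, last_year: int = 2024) -> Iterator[Tuple[str, str]]:
    """Yield one deliberately broad (start, end) date window per season.

    The March 1 to November 30 bounds are an assumption: they are wide enough
    to cover a full season, so non-regular-season games are filtered out later.
    """
    for year in range(first_year, last_year + 1):
        yield (f"{year}-03-01", f"{year}-11-30")


def load_regular_season(start_dt: str, end_dt: str):
    """Download pitch-level Statcast rows and keep regular-season games only."""
    # Imported lazily so the date helper above works without pybaseball installed.
    from pybaseball import statcast

    pitches = statcast(start_dt=start_dt, end_dt=end_dt)
    # Statcast marks regular-season games with game_type == "R".
    return pitches[pitches["game_type"] == "R"]
```

Calling `load_regular_season(*window)` for each window from `season_windows()` and concatenating the results would give one pitch-level table for all four seasons. The download is large, so caching the result to disk is worth considering.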
The system uses a two-tiered prediction pipeline. Tier 1 determines the broad outcome, and Tier 2 determines the details of a ball in play.

               ┌──────────────┐
               │ Input Pitch  │
               └──────┬───────┘
                      │
               ┌──────▼──────┐
               │  Tier 1 ML  │
               └──────┬──────┘
         Ball/Strike  │  Ball In Play
                      ▼
                ┌─────┴─────┐
                │ Tier 2 ML │
                └───────────┘
    (Single / Double / Triple / HR / Out)

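
The routing in the diagram can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the label strings, the `classify_pitch` helper, and the toy stand-in models are all hypothetical. In the real pipeline each tier would be a trained classifier.

```python
from typing import Any, Callable, Mapping

Pitch = Mapping[str, Any]
Model = Callable[[Pitch], str]

IN_PLAY = "in_play"  # hypothetical Tier 1 label for a ball put in play


def classify_pitch(pitch: Pitch, tier1: Model, tier2: Model) -> str:
    """Run Tier 1 first; consult Tier 2 only when the ball is put in play."""
    broad = tier1(pitch)
    if broad != IN_PLAY:
        return broad        # e.g. "ball" or "strike": no further detail needed
    return tier2(pitch)     # e.g. "single", "double", "triple", "home_run", "out"


# Toy stand-ins so the sketch runs; the keys "contact" and "exit_velo"
# are made up for this example and are not the project's features.
def toy_tier1(pitch: Pitch) -> str:
    return IN_PLAY if pitch["contact"] else "ball"


def toy_tier2(pitch: Pitch) -> str:
    return "home_run" if pitch["exit_velo"] >= 105 else "out"


print(classify_pitch({"contact": True, "exit_velo": 110.0}, toy_tier1, toy_tier2))
```

Keeping the two tiers behind the same callable interface is what makes it easy to swap one model family for another at either tier.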
All nine Tier 1 × Tier 2 pairings of XGBoost, LightGBM, and CatBoost were evaluated.
| Tier 1 | Tier 2 | Tier 1 Accuracy | Tier 1 F1 | Tier 2 Accuracy | Tier 2 F1 |
|---|---|---|---|---|---|
| xgboost | xgboost | 0.724 | 0.715 | 0.766 | 0.744 |
| xgboost | catboost | 0.724 | 0.715 | 0.610 | 0.652 |
| xgboost | lightgbm | 0.724 | 0.715 | 0.629 | 0.667 |
| lightgbm | xgboost | 0.715 | 0.713 | 0.766 | 0.744 |
| lightgbm | catboost | 0.715 | 0.713 | 0.610 | 0.652 |
| lightgbm | lightgbm | 0.715 | 0.713 | 0.629 | 0.667 |
| catboost | xgboost | 0.706 | 0.703 | 0.766 | 0.744 |
| catboost | catboost | 0.706 | 0.703 | 0.610 | 0.652 |
| catboost | lightgbm | 0.706 | 0.703 | 0.629 | 0.667 |
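
In the table, the Tier 2 columns repeat unchanged for every Tier 1 choice, which is consistent with each tier being scored on its own and the pairings then being enumerated. The sketch below shows that pattern in plain Python. The metric helpers are generic definitions, and macro averaging for F1 is an assumption, since the averaging scheme is not stated here. The score dictionaries simply restate the figures from the table.

```python
from itertools import product


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging; may differ from the project's)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# (accuracy, F1) per model, copied from the results table.
tier1_scores = {"xgboost": (0.724, 0.715), "lightgbm": (0.715, 0.713), "catboost": (0.706, 0.703)}
tier2_scores = {"xgboost": (0.766, 0.744), "catboost": (0.610, 0.652), "lightgbm": (0.629, 0.667)}

# Each tier is scored once; the nine rows are the Cartesian product.
rows = [
    (name1, name2, *scores1, *scores2)
    for (name1, scores1), (name2, scores2) in product(tier1_scores.items(), tier2_scores.items())
]

for row in rows:
    print(row)
```

Scoring each tier once and taking the product avoids retraining anything per pairing. An end-to-end score would instead require feeding Tier 1's predicted in-play pitches into Tier 2.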
This project simulates real-world decision systems in professional baseball. The best pairing (XGBoost at both tiers) reaches 0.724 accuracy at Tier 1 and 0.766 at Tier 2, which suggests that predictive pitch modeling at scale is feasible using Statcast data alone.
All code, data processing functions, and model training pipelines are available here: GitHub Repository.