Pitch Outcome Prediction

Introduction and Motivation

This project addresses a core problem in baseball analytics: predicting the outcome of a pitch from available data. It is designed to serve both predictive and strategic functions, such as scouting and game-calling analysis.

Data Collection and Modeling System

Data Source: Statcast data via pybaseball API for 2021–2024 regular seasons
Two-Tier Classifier:
- Tier 1: Classifies each pitch as Ball, Strike, or Ball In Play
- Tier 2: For a Ball In Play only, classifies the pitch further as Single, Double, Triple, HR, or Out
Best-Performing Models: gradient boosting models (XGBoost, LightGBM, CatBoost)
Evaluation: Weighted F1 Score and Accuracy
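
As a rough illustration of how the two tiers' labels could be derived, the sketch below maps Statcast's `description` and `events` fields to Tier 1 and Tier 2 classes. The grouping itself is an assumption (for example, counting fouls as strikes and every non-hit ball in play as an Out); the project's actual mapping may differ. The function names are hypothetical.

```python
# Assumed grouping of Statcast `description` values into Tier 1 classes.
BALL_DESCRIPTIONS = {"ball", "blocked_ball", "pitchout"}
STRIKE_DESCRIPTIONS = {
    "called_strike", "swinging_strike", "swinging_strike_blocked",
    "foul", "foul_tip", "foul_bunt", "missed_bunt", "bunt_foul_tip",
}
IN_PLAY_DESCRIPTIONS = {"hit_into_play"}

# Statcast `events` values that count as hits; anything else in play is
# treated as an Out here (an illustrative simplification).
HIT_EVENTS = {"single": "Single", "double": "Double",
              "triple": "Triple", "home_run": "HR"}


def tier1_label(description):
    """Return 'Ball', 'Strike', 'Ball In Play', or None if unmapped."""
    if description in BALL_DESCRIPTIONS:
        return "Ball"
    if description in STRIKE_DESCRIPTIONS:
        return "Strike"
    if description in IN_PLAY_DESCRIPTIONS:
        return "Ball In Play"
    return None  # e.g. hit_by_pitch; such rows would be dropped


def tier2_label(description, events):
    """Return the Tier 2 class for a ball in play, otherwise None."""
    if tier1_label(description) != "Ball In Play":
        return None
    return HIT_EVENTS.get(events, "Out")
```

Rows for which `tier1_label` returns None would simply be excluded from training under this sketch.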

Feature Engineering

Features were designed to reflect both context and pitch physics:

Game Context: inning, outs, runners on base, score differential
Pitch Features: speed, spin rate, axis, location, movement
Sequence Dynamics: pitch number in at-bat, ball-strike count
Advanced: release extension, effective speed, and (Tier 2 only) launch angle and launch speed
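
The feature groups above could be assembled from a single Statcast record along the following lines. This is a minimal sketch: the dictionary keys on the right are Statcast column names, but `build_features`, the output feature names, and the sign convention for the score differential (batting team minus fielding team) are assumptions rather than the project's actual code.

```python
def build_features(pitch, include_contact=False):
    """Flatten one Statcast-style pitch record (a dict) into model features."""

    def occupied(base_column):
        # Statcast stores a runner ID when the base is occupied and a missing
        # value otherwise; `value == value` is False only for NaN.
        value = pitch.get(base_column)
        return int(value is not None and value == value)

    features = {
        # Game context
        "inning": pitch["inning"],
        "outs": pitch["outs_when_up"],
        "runner_on_1b": occupied("on_1b"),
        "runner_on_2b": occupied("on_2b"),
        "runner_on_3b": occupied("on_3b"),
        # Assumed sign convention: batting team minus fielding team.
        "score_diff": pitch["bat_score"] - pitch["fld_score"],
        # Pitch features: speed, spin, location, movement
        "release_speed": pitch["release_speed"],
        "release_spin_rate": pitch["release_spin_rate"],
        "spin_axis": pitch["spin_axis"],
        "plate_x": pitch["plate_x"],
        "plate_z": pitch["plate_z"],
        "pfx_x": pitch["pfx_x"],
        "pfx_z": pitch["pfx_z"],
        # Sequence dynamics
        "pitch_number": pitch["pitch_number"],
        "balls": pitch["balls"],
        "strikes": pitch["strikes"],
        # Advanced
        "release_extension": pitch["release_extension"],
        "effective_speed": pitch["effective_speed"],
    }
    if include_contact:
        # Contact features exist only for balls in play, so they are
        # added for Tier 2 only.
        features["launch_angle"] = pitch["launch_angle"]
        features["launch_speed"] = pitch["launch_speed"]
    return features
```

Calling `build_features(pitch)` would give the Tier 1 inputs, and `build_features(pitch, include_contact=True)` the Tier 2 inputs.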

Model Architecture

The system uses a two-tiered prediction pipeline. Tier 1 determines the broad outcome, and Tier 2 determines the details of a ball in play.

          ┌──────────────┐
          │ Input Pitch  │
          └──────┬───────┘
                 │
          ┌──────▼──────┐
          │  Tier 1 ML  │
          └──────┬──────┘
     Ball/Strike │ Ball In Play
                 ▼
           ┌─────┴─────┐
           │ Tier 2 ML │
           └───────────┘
    (Single / Double / Triple / HR / Out)
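
The routing in the diagram can be sketched as a small function. Here `tier1_model` and `tier2_model` are hypothetical callables standing in for the trained classifiers (any fitted model wrapped to return a class label would do); this is an illustration of the control flow, not the project's implementation.

```python
def predict_pitch(tier1_model, tier2_model, tier1_features, tier2_features):
    """Run the two-tier pipeline for one pitch.

    Returns (tier1_outcome, tier2_outcome); tier2_outcome is None unless
    Tier 1 predicts a ball in play.
    """
    tier1_outcome = tier1_model(tier1_features)
    if tier1_outcome != "Ball In Play":
        # Ball or Strike: the pipeline stops at Tier 1.
        return tier1_outcome, None
    # Ball In Play: hand off to Tier 2 for the detailed result.
    return tier1_outcome, tier2_model(tier2_features)
```

For example, with stand-in models `lambda x: "Ball In Play"` and `lambda x: "Double"`, the function returns `("Ball In Play", "Double")`; with a Tier 1 stand-in that returns `"Strike"`, Tier 2 is never called.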

Performance Summary

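Evaluation covered all 9 combinations of Tier 1 × Tier 2 models. Tier 2 metrics are identical for a given Tier 2 model regardless of the Tier 1 model it is paired with, so each tier's scores can be read independently.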

Tier 1	Tier 2	Tier 1 Accuracy	Tier 1 F1	Tier 2 Accuracy	Tier 2 F1
xgboost	xgboost	0.724	0.715	0.766	0.744
xgboost	catboost	0.724	0.715	0.610	0.652
xgboost	lightgbm	0.724	0.715	0.629	0.667
lightgbm	xgboost	0.715	0.713	0.766	0.744
lightgbm	catboost	0.715	0.713	0.610	0.652
lightgbm	lightgbm	0.715	0.713	0.629	0.667
catboost	xgboost	0.706	0.703	0.766	0.744
catboost	catboost	0.706	0.703	0.610	0.652
catboost	lightgbm	0.706	0.703	0.629	0.667
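
For reference, the two metrics in the table can be computed as in the sketch below, which also enumerates the 9 model pairings. It is written in plain Python for clarity and assumes the standard support-weighted definition of F1; the project may well have used a library implementation instead.

```python
from collections import Counter
from itertools import product


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# The 3 x 3 grid of (Tier 1, Tier 2) model names used in the table.
MODEL_NAMES = ["xgboost", "lightgbm", "catboost"]
combinations = list(product(MODEL_NAMES, repeat=2))
```

Weighted F1 is not bounded above by accuracy, which is why some Tier 2 rows above show an F1 higher than the accuracy.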

Key Insights

XGBoost consistently performs best in both tiers (Tier 1: 0.724 accuracy, 0.715 F1; Tier 2: 0.766 accuracy, 0.744 F1), suggesting strong generalization and calibration
Tier 2 accuracy of 76.6% for batted-ball outcomes is competitive with proprietary models
Model generalizes across seasons
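
How cross-season generalization was checked is not described here. One common approach, shown purely as an assumption, is to hold out the most recent season and train on the earlier ones. The sketch uses Statcast's `game_year` column; `split_by_season` is a hypothetical helper.

```python
def split_by_season(rows, test_season):
    """Train on seasons before `test_season`; test on `test_season` itself."""
    train = [row for row in rows if row["game_year"] < test_season]
    test = [row for row in rows if row["game_year"] == test_season]
    return train, test
```

With 2021–2024 data, `split_by_season(rows, 2024)` would train on 2021–2023 and evaluate on 2024.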

Conclusion

This project emulates the kind of decision-support systems used in professional baseball. With consistent performance across both tiers, it demonstrates the feasibility of predictive pitch modeling at scale using only Statcast data.

Code

All code, data processing functions, and model training pipelines are available here: GitHub Repository.