MLB Pitch Outcome Prediction

A two-tier machine learning system that classifies every MLB pitch into meaningful outcomes using real Statcast data (2021–2024).

Introduction and Motivation

Accurately predicting the outcome of a pitch from available data is a core problem in baseball analytics. This system is designed to serve both predictive and strategic functions, making it suitable for scouting and game-calling analysis.

Data Collection and Modeling System

Feature Engineering

Features were designed to reflect both game context and pitch physics.
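As an illustration only, the sketch below shows what a split between context features and physics features could look like for one pitch record. The input keys follow common Statcast column names (`balls`, `strikes`, `outs_when_up`, `release_speed`, `release_spin_rate`, `pfx_x`, `pfx_z`). The derived feature names and the function `build_features` are hypothetical and are not taken from the project's code.

```python
import math


def build_features(pitch: dict) -> dict:
    """Derive illustrative context and physics features from one pitch record.

    The output names are hypothetical; they are not the project's actual features.
    """
    balls, strikes = pitch["balls"], pitch["strikes"]
    return {
        # Context: the game situation in which the pitch is thrown.
        "count": f"{balls}-{strikes}",
        "two_strikes": strikes == 2,
        "outs": pitch["outs_when_up"],
        # Physics: how the ball leaves the hand and moves.
        "velocity": pitch["release_speed"],
        "spin_rate": pitch["release_spin_rate"],
        "total_movement": math.hypot(pitch["pfx_x"], pitch["pfx_z"]),
    }


# Made-up example record, for demonstration only.
example_pitch = {
    "balls": 1, "strikes": 2, "outs_when_up": 1,
    "release_speed": 95.2, "release_spin_rate": 2350,
    "pfx_x": -0.6, "pfx_z": 0.8,
}
features = build_features(example_pitch)
```

Keeping context and physics features in one flat dictionary makes it easy to hand the same record to either tier of the pipeline.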

Model Architecture

The system uses a two-tiered prediction pipeline. Tier 1 classifies the broad outcome (ball/strike versus ball in play). When Tier 1 predicts a ball in play, Tier 2 classifies the result as a single, double, triple, home run, or out.

          ┌──────────────┐
          │ Input Pitch  │
          └──────┬───────┘
                 │
          ┌──────▼──────┐
          │  Tier 1 ML  │
          └──────┬──────┘
     Ball/Strike │ Ball In Play
                 ▼
           ┌─────┴─────┐
           │ Tier 2 ML │
           └───────────┘
    (Single / Double / Triple / HR / Out)
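The routing in the diagram can be sketched as a small wrapper around two classifiers. This is a minimal sketch under stated assumptions, not the project's implementation: the class `TwoTierPredictor`, the label strings, the toy rules, and the feature keys (`in_zone`, `swing`, `exit_velocity`) are all hypothetical stand-ins for trained models and real features.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Pitch = Mapping[str, float]
IN_PLAY = "ball_in_play"  # hypothetical label for the Tier 1 "Ball In Play" class


@dataclass
class TwoTierPredictor:
    """Route a pitch through Tier 1 and, only for balls in play, through Tier 2."""

    tier1: Callable[[Pitch], str]  # returns "ball", "strike", or IN_PLAY
    tier2: Callable[[Pitch], str]  # returns "single", "double", "triple", "hr", or "out"

    def predict(self, pitch: Pitch) -> str:
        broad = self.tier1(pitch)
        if broad != IN_PLAY:
            return broad  # ball or strike: Tier 2 is never consulted
        return self.tier2(pitch)


# Toy rules standing in for trained classifiers (assumptions for illustration).
def toy_tier1(pitch: Pitch) -> str:
    if pitch["swing"] and pitch["in_zone"]:
        return IN_PLAY
    return "strike" if pitch["in_zone"] else "ball"


def toy_tier2(pitch: Pitch) -> str:
    return "single" if pitch["exit_velocity"] >= 95 else "out"


predictor = TwoTierPredictor(tier1=toy_tier1, tier2=toy_tier2)
taken_outside = predictor.predict({"in_zone": 0, "swing": 0, "exit_velocity": 0.0})
hard_contact = predictor.predict({"in_zone": 1, "swing": 1, "exit_velocity": 101.3})
```

Because each tier is just a callable, any of the three model families can be swapped into either slot without changing the routing logic.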
  

Performance Summary

All nine Tier 1 × Tier 2 model combinations (XGBoost, LightGBM, and CatBoost in each tier) were evaluated.

Tier 1     Tier 2     Tier 1 Accuracy   Tier 1 F1   Tier 2 Accuracy   Tier 2 F1
xgboost    xgboost    0.724             0.715       0.766             0.744
xgboost    catboost   0.724             0.715       0.610             0.652
xgboost    lightgbm   0.724             0.715       0.629             0.667
lightgbm   xgboost    0.715             0.713       0.766             0.744
lightgbm   catboost   0.715             0.713       0.610             0.652
lightgbm   lightgbm   0.715             0.713       0.629             0.667
catboost   xgboost    0.706             0.703       0.766             0.744
catboost   catboost   0.706             0.703       0.610             0.652
catboost   lightgbm   0.706             0.703       0.629             0.667
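In the table above, the Tier 2 metrics are the same for a given Tier 2 model regardless of which Tier 1 model it is paired with, which suggests the two tiers were scored independently. Under that assumption, the nine rows are the Cartesian product of three per-tier results. The sketch below rebuilds the grid from the reported numbers; the function names `build_grid` and `best_pairing` are hypothetical.

```python
from itertools import product

# Per-tier metrics copied from the table above: (accuracy, F1).
TIER1 = {
    "xgboost": (0.724, 0.715),
    "lightgbm": (0.715, 0.713),
    "catboost": (0.706, 0.703),
}
TIER2 = {
    "xgboost": (0.766, 0.744),
    "catboost": (0.610, 0.652),
    "lightgbm": (0.629, 0.667),
}


def build_grid(tier1: dict, tier2: dict) -> list:
    """Return one row per (Tier 1 model, Tier 2 model) pairing."""
    return [
        (t1, t2, *tier1[t1], *tier2[t2])
        for t1, t2 in product(tier1, tier2)
    ]


def best_pairing(rows: list) -> tuple:
    """Pick the row with the highest Tier 1 accuracy, then Tier 2 accuracy."""
    return max(rows, key=lambda row: (row[2], row[4]))


rows = build_grid(TIER1, TIER2)
best = best_pairing(rows)
print(len(rows))   # 9
print(best[:2])    # ('xgboost', 'xgboost')
```

If the tiers are indeed independent, only six model fits (three per tier) are needed to fill in all nine rows.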

Key Insights

Conclusion

This project simulates real-world decision systems in professional baseball. The best configuration (XGBoost in both tiers) reaches 0.724 accuracy in Tier 1 and 0.766 accuracy in Tier 2, which demonstrates the feasibility of predictive pitch modeling at scale using Statcast data alone.

Code

All code, data processing functions, and model training pipelines are available here: GitHub Repository.