How we trained our base MLB model

This guide walks through the steps and code we used to train our base MLB model, from data retrieval through model training and evaluation.

GitHub Repo > https://github.com/sportstensor/MLB

1. Data Collection and Preprocessing

The process begins with collecting MLB game data using the build_db() function in database_creator.py. This function:

  • Retrieves data for MLB seasons from 2000 to the current year (excluding the shortened 2020 season) using an API.

  • Extracts relevant information for each game, including scores, hits, and team IDs.

  • Saves this data to a CSV file named 'mlb_fixture_details.csv'.
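
To make the flow concrete, here is a minimal sketch of this step. It assumes the public MLB Stats API's schedule endpoint and illustrative field names; the repo's build_db() may use a different API or schema, and per-game hits would need an extra lookup.

```python
import csv
import datetime

import requests

# Public MLB Stats API schedule endpoint (an assumption about the data source)
SCHEDULE_URL = "https://statsapi.mlb.com/api/v1/schedule"

def fetch_season_games(season: int) -> list:
    """Fetch regular-season games for one MLB season."""
    params = {"sportId": 1, "season": season, "gameType": "R"}
    resp = requests.get(SCHEDULE_URL, params=params, timeout=30)
    resp.raise_for_status()
    games = []
    for day in resp.json().get("dates", []):
        for game in day.get("games", []):
            teams = game["teams"]
            games.append({
                "date": game["officialDate"],
                "home_team_id": teams["home"]["team"]["id"],
                "away_team_id": teams["away"]["team"]["id"],
                "home_score": teams["home"].get("score"),
                "away_score": teams["away"].get("score"),
            })
    return games

rows = []
for season in range(2000, datetime.date.today().year + 1):
    if season == 2020:  # the shortened 2020 season is skipped
        continue
    rows.extend(fetch_season_games(season))

with open("mlb_fixture_details.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```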

2. Feature Engineering

The combined_db_creation() function in database_creator.py processes the raw data to create features for the model:

  • Calculates running totals for each team (runs scored, runs against, hits, games played, etc.).

  • Computes derived features like run difference, hits per game, win/loss ratio, and average score.

  • Incorporates Elo ratings from an external source ('mlb_elo.csv').

  • Creates a new CSV file 'mlb_model_ready_data_comb.csv' with all the engineered features.
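
A condensed pandas sketch of the running-totals idea (column names are illustrative, not the repo's exact schema). The shift(1) keeps each game's features strictly pre-game:

```python
import pandas as pd

df = pd.read_csv("mlb_fixture_details.csv").sort_values("date")

def add_running_stats(df: pd.DataFrame, side: str) -> pd.DataFrame:
    grp = df.groupby(f"{side}_team_id")
    # cumsum().shift(1) excludes the current game, so features use only past results
    df[f"{side}_runs_for"] = grp[f"{side}_score"].transform(
        lambda s: s.cumsum().shift(1)
    )
    df[f"{side}_games_played"] = grp.cumcount()
    # NaN for a team's first game in the sample would need handling
    df[f"{side}_avg_score"] = df[f"{side}_runs_for"] / df[f"{side}_games_played"]
    return df

df = add_running_stats(df, "home")
df = add_running_stats(df, "away")

elo = pd.read_csv("mlb_elo.csv")  # join on date/team; depends on the Elo file's schema
df.to_csv("mlb_model_ready_data_comb.csv", index=False)
```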

3. Data Preparation

The get_data() function in retrieve_data.py:

  • Loads the prepared data from 'mlb_model_ready_data_comb.csv'.

  • Removes outliers by filtering out games where the score is above the 95th percentile.
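
The filter itself is a few lines of pandas (column names assumed):

```python
import pandas as pd

df = pd.read_csv("mlb_model_ready_data_comb.csv")
# Keep games at or below the 95th percentile of runs scored on each side
home_cut = df["home_score"].quantile(0.95)
away_cut = df["away_score"].quantile(0.95)
df = df[(df["home_score"] <= home_cut) & (df["away_score"] <= away_cut)]
```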

4. Data Scaling

The scale_data() function in retrieve_data.py:

  • Separates features (X) and target variables (y).

  • Uses MinMaxScaler to scale all features and target variables to a range of 0 to 1.

  • Returns the scaled data and the scaler objects for later use.
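
Continuing the sketch, the key detail is keeping separate fitted scalers for X and y so predictions can be inverse-transformed later (variable names are assumptions):

```python
from sklearn.preprocessing import MinMaxScaler

target_cols = ["home_score", "away_score"]
numeric = df.select_dtypes("number")  # scaling applies to numeric columns only
feature_cols = [c for c in numeric.columns if c not in target_cols]
X = numeric[feature_cols].to_numpy()
y = numeric[target_cols].to_numpy()

x_scaler, y_scaler = MinMaxScaler(), MinMaxScaler()
X_scaled = x_scaler.fit_transform(X)  # each column mapped to [0, 1]
y_scaled = y_scaler.fit_transform(y)
```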

5. Model Architecture

The model is defined in the load_or_run_model() function in model.py:

  • It's a simple Sequential model using Keras.

  • The model has an input layer matching the number of features and a single dense layer with 2 units (for home and away scores) using ReLU activation.
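
The described architecture is only a few lines of Keras:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(X_scaled.shape[1],)),
    # 2 units: predicted home and away scores; ReLU keeps outputs non-negative
    keras.layers.Dense(2, activation="relu"),
])
```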

6. Model Training

Also in the load_or_run_model() function:

  • The data is split into training and test sets (80/20 split).

  • The model is compiled using Adam optimizer and mean squared error loss.

  • Early stopping and model checkpointing are used to prevent overfitting and save the best model.

  • The model is trained for up to 150 epochs with a batch size of 32.
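
A sketch of that training setup; callback arguments such as patience and the validation split are assumptions:

```python
from sklearn.model_selection import train_test_split
from tensorflow import keras

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_scaled, test_size=0.2, random_state=42
)

model.compile(optimizer="adam", loss="mse")
callbacks = [
    keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    keras.callbacks.ModelCheckpoint("basic_model_comb.keras", save_best_only=True),
]
model.fit(
    X_train, y_train,
    validation_split=0.1,  # early stopping monitors val_loss by default
    epochs=150,
    batch_size=32,
    callbacks=callbacks,
)
```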

7. Model Evaluation

After training, the model is evaluated on the test set:

  • Predictions are made and then inverse-transformed back to the original scale.

  • Various metrics are calculated:

    • Percentage of correct score predictions

    • Percentage of correct outcome predictions (win/loss/draw)

    • Mean Squared Error (MSE)

    • Mean Absolute Error (MAE)

    • R-squared (R²) score
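
These metrics map directly onto a few lines of NumPy and scikit-learn (a sketch; the rounding behavior is an assumption):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

pred = y_scaler.inverse_transform(model.predict(X_test))
true = y_scaler.inverse_transform(y_test)
pred_r = np.rint(pred)  # round to whole runs before comparing

exact_pct = np.mean(np.all(pred_r == true, axis=1)) * 100
outcome_pct = np.mean(
    np.sign(pred_r[:, 0] - pred_r[:, 1]) == np.sign(true[:, 0] - true[:, 1])
) * 100

print(f"Correct scores: {exact_pct:.1f}%, correct outcomes: {outcome_pct:.1f}%")
print("MSE:", mean_squared_error(true, pred))
print("MAE:", mean_absolute_error(true, pred))
print("R²:", r2_score(true, pred))
```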

8. Model Persistence

The trained model is saved to a file named 'basic_model_comb.keras' for later use.
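
Saving the model is one call; the scalers also need to survive to prediction time, and joblib is one option for that (an assumption, not necessarily the repo's approach):

```python
import joblib

model.save("basic_model_comb.keras")          # native Keras format
joblib.dump(x_scaler, "x_scaler.joblib")      # scalers are reused at prediction time
joblib.dump(y_scaler, "y_scaler.joblib")
```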

9. Making Predictions

The prep_pred_input() function in retrieve_data.py prepares input data for making predictions:

  • For past games, it retrieves the actual game data from the database.

  • For future games, it uses current team statistics to create the input features.

  • In both cases, the input data is scaled using the same scalers used during training.
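
A sketch of the future-game path; get_current_team_stats() is a hypothetical helper standing in for the repo's stats lookup:

```python
import numpy as np

def prep_pred_input(home_id: int, away_id: int) -> np.ndarray:
    # get_current_team_stats() is hypothetical: it would return the same
    # engineered features computed during training, as a list of floats
    home_stats = get_current_team_stats(home_id)
    away_stats = get_current_team_stats(away_id)
    row = np.array([home_stats + away_stats])  # shape (1, n_features)
    return x_scaler.transform(row)             # the scaler fitted at training time
```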

10. Deployment

The activate() function in miner_dashboard.py demonstrates how to use the trained model:

  • It loads the trained model and scalers.

  • Prepares input data for a list of fixtures.

  • Makes predictions and adjusts them to ensure a winner (no ties in baseball).

  • Prints the predicted scores along with actual scores for comparison (if available).
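
A sketch of the prediction and tie-break step; rounding and the "bump the stronger raw score" rule are assumptions about activate()'s exact logic:

```python
import numpy as np

raw = y_scaler.inverse_transform(model.predict(x_input))  # x_input from prep_pred_input()
home, away = (int(s) for s in np.rint(raw[0]))

if home == away:  # break ties: regular MLB games cannot end level
    if raw[0, 0] >= raw[0, 1]:
        home += 1
    else:
        away += 1

print(f"Predicted score: {home}-{away}")
```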

This base MLB model provides a foundation for predicting game scores. It uses historical game data and current team statistics to make predictions, with a simple neural network architecture. The model's performance can be further improved by tuning hyperparameters, adding more complex layers, or incorporating additional relevant features.
