How we trained our base MLS model

This guide walks through how the base MLS model was trained, from data collection to prediction.

GitHub repo: https://github.com/sportstensor/MLS

1. Data Retrieval and Preparation

The process begins with the get_data function in retrieve_data.py. This function is responsible for collecting and organizing MLS fixture data:

  • It scrapes data from FBRef for the past 5 years of MLS matches (excluding 2020).

  • For each match, it collects:

    • Date

    • Home Team

    • Away Team

    • Home Team Goal Difference (before the match)

    • Away Team Goal Difference (before the match)

    • Home Team Score

    • Away Team Score

  • The data is stored in an Excel file (mls_fixture_data.xlsx).
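
The "goal difference before the match" fields can be illustrated with a small sketch. This is not the code from retrieve_data.py: the function name, the row keys, and the team names are hypothetical, and it assumes the goal difference is a running total over the matches supplied (the repository may reset it per season or read it from FBRef directly).

```python
from collections import defaultdict


def add_pre_match_goal_difference(matches):
    """Return one row per match with each team's goal difference before kickoff.

    `matches` is a list of (date, home, away, home_score, away_score) tuples,
    assumed to be sorted by date. All names here are illustrative.
    """
    running = defaultdict(int)  # team -> goal difference accumulated so far
    rows = []
    for date, home, away, home_score, away_score in matches:
        # Record the totals as they stood BEFORE this match was played.
        rows.append({
            "date": date,
            "home_team": home,
            "away_team": away,
            "home_gd": running[home],
            "away_gd": running[away],
            "home_score": home_score,
            "away_score": away_score,
        })
        # Only now fold this result into the running totals.
        margin = home_score - away_score
        running[home] += margin
        running[away] -= margin
    return rows


matches = [
    ("2023-03-01", "Team A", "Team B", 2, 0),
    ("2023-03-08", "Team B", "Team C", 1, 1),
    ("2023-03-15", "Team C", "Team A", 0, 3),
]
rows = add_pre_match_goal_difference(matches)
```

Recording the totals before updating them keeps the current match's result out of its own input features.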

2. Data Scaling

The scale_data function in retrieve_data.py prepares the data for model input:

  • It scales the input features (team IDs, goal differences) and target variables (scores) using MinMaxScaler.

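  • This normalization puts every feature and target on a common 0-to-1 range (the MinMaxScaler default), which generally helps gradient-based training converge.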
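
To make the scaling step concrete, here is a plain-Python sketch of the min-max formula, (value - min) / (max - min), applied column by column. It stands in for MinMaxScaler rather than reproducing scale_data; the helper names and the sample feature rows are made up for illustration.

```python
def fit_min_max(rows):
    """Return a (min, max) pair for each column of equal-length rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def transform(rows, bounds):
    """Scale every column to [0, 1]; a constant column maps to 0.0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, bounds)
        ])
    return scaled


# Assumed column order: home team ID, away team ID,
# home goal difference, away goal difference.
features = [
    [0, 1, -4, 6],
    [2, 0, 0, -2],
    [1, 2, 4, 2],
]
bounds = fit_min_max(features)
scaled = transform(features, bounds)
```

The bounds are fitted once and reused, which is why the scalers have to be saved alongside the model.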

3. Model Architecture

The model is defined in the load_or_run_model function in model.py:

  • It's a simple Sequential model using Keras.

  • The model consists of an input layer and a single dense layer with 2 units (one per predicted score, home and away) and ReLU activation.

  • The model uses Adam optimizer and Mean Squared Error as the loss function.
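
The following sketch shows what a single dense layer with 2 units and ReLU activation computes, together with the Mean Squared Error loss. It is written in plain Python instead of Keras so the arithmetic is visible. The four-input layout (two team IDs, two goal differences) is an assumption based on the features listed earlier, the weights are invented, and the Adam update step is not shown.

```python
def dense_relu(inputs, weights, biases):
    """Forward pass of one dense layer: relu(inputs . weights + biases).

    `weights` has one row per input feature and one column per output unit.
    """
    outputs = []
    for unit in range(len(biases)):
        total = biases[unit]
        for value, row in zip(inputs, weights):
            total += value * row[unit]
        outputs.append(max(0.0, total))  # ReLU clips negative sums to zero
    return outputs


def mean_squared_error(predicted, actual):
    """Average of the squared differences between prediction and target."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical weights: 4 scaled inputs -> 2 outputs (home score, away score).
weights = [
    [0.5, -0.25],
    [-0.5, 0.25],
    [0.25, 0.0],
    [0.0, 0.25],
]
biases = [0.25, 0.25]

x = [1.0, 0.0, 0.5, 1.0]  # one scaled input row
prediction = dense_relu(x, weights, biases)
loss = mean_squared_error(prediction, [0.75, 0.25])
```

With no hidden layers, the model is close to a linear regression on the scaled inputs, with ReLU keeping predicted scores from going negative.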

4. Training Process

The training process, also in load_or_run_model, includes:

  • Splitting the data into training and test sets (80% train, 20% test).

  • Using Early Stopping to prevent overfitting, with a patience of 6 epochs.

  • Training for a maximum of 150 epochs with a batch size of 32.

  • Saving the best model based on the lowest loss.
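
The split and the early-stopping rule can be sketched without any training framework. This is a simplified stand-in for what Keras callbacks do, not the code in load_or_run_model: the function names are hypothetical, the split here keeps the original row order (the repository may shuffle), and the guide does not state whether training or validation loss is monitored.

```python
def split_train_test(rows, train_fraction=0.8):
    """Split rows in order: the first 80% for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def train_with_early_stopping(epoch_losses, patience=6, max_epochs=150):
    """Replay a list of per-epoch losses and apply the early-stopping rule.

    Returns (best_epoch, best_loss, epochs_run). A real training loop would
    compute each loss by running one epoch instead of reading it from a list.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(epoch_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # New best: this is where the model would be saved to disk.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_loss, epochs_run


train_rows, test_rows = split_train_test(list(range(10)))

losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.6, 0.63, 0.64, 0.65, 0.5]
result = train_with_early_stopping(losses)
```

In this replay the best loss occurs at epoch 3, and training stops at epoch 9 after six epochs without improvement, so the later 0.5 is never reached.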

5. Model Evaluation

After training, the model is evaluated on the test set:

  • It calculates the percentage of correct score predictions and correct outcome predictions.

  • It computes Mean Squared Error, Mean Absolute Error, and R-squared values for both home and away score predictions.
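
The metrics listed above can be computed as in the sketch below. It is illustrative only: the function names and report keys are hypothetical, and it assumes predictions are rounded to whole goals before the correct-score and correct-outcome checks, which the guide does not specify.

```python
def outcome(home, away):
    """Return 'H', 'D', or 'A' for a home win, draw, or away win."""
    if home > away:
        return "H"
    if home < away:
        return "A"
    return "D"


def evaluate(predicted, actual):
    """Compare predicted and actual (home, away) score pairs."""
    n = len(actual)
    # Assumption: round to whole goals before comparing scorelines.
    rounded = [(round(h), round(a)) for h, a in predicted]
    pairs = list(zip(rounded, actual))
    report = {
        "correct_score_pct": 100 * sum(p == t for p, t in pairs) / n,
        "correct_outcome_pct":
            100 * sum(outcome(*p) == outcome(*t) for p, t in pairs) / n,
    }
    for index, side in enumerate(("home", "away")):
        pred = [p[index] for p in predicted]
        true = [t[index] for t in actual]
        mean_true = sum(true) / n
        ss_res = sum((t - p) ** 2 for p, t in zip(pred, true))
        ss_tot = sum((t - mean_true) ** 2 for t in true)
        report[side] = {
            "mse": ss_res / n,
            "mae": sum(abs(t - p) for p, t in zip(pred, true)) / n,
            # R-squared is undefined when every true value is identical.
            "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
        }
    return report


predicted = [(2.0, 1.0), (1.0, 1.0), (0.0, 2.0), (3.0, 0.0)]
actual = [(2, 1), (0, 0), (1, 2), (1, 0)]
report = evaluate(predicted, actual)
```

In this made-up sample every outcome is right but only one exact scoreline is, which shows why the two percentages are reported separately.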

6. Model Usage

The trained model is used in the activate function in miner_dashboard.py:

  • It loads the trained model and scalers.

  • For given fixtures, it prepares the input data using the prep_pred_input function.

  • It then uses the model to predict scores for these fixtures.

  • The predicted scores are inverse-transformed to the original scale.
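
The final inverse-transform step reverses the min-max formula: scaled * (max - min) + min. The sketch below is a plain-Python illustration, not the code in miner_dashboard.py. The score bounds are invented, and rounding to whole, non-negative goals is an added assumption.

```python
def inverse_min_max(scaled_value, lo, hi):
    """Map a value from the [0, 1] range back to the original [lo, hi] range."""
    return scaled_value * (hi - lo) + lo


def to_scoreline(scaled_prediction, score_bounds):
    """Convert a scaled (home, away) prediction into whole-number goals."""
    goals = [
        inverse_min_max(value, lo, hi)
        for value, (lo, hi) in zip(scaled_prediction, score_bounds)
    ]
    # Assumption: report whole, non-negative goals.
    return tuple(max(0, round(g)) for g in goals)


# Hypothetical bounds fitted on training scores: home 0 to 8, away 0 to 4.
score_bounds = [(0, 8), (0, 4)]
scoreline = to_scoreline([0.25, 0.25], score_bounds)
```

The bounds must be the ones fitted during training; refitting on new fixtures would map the same model output to different scores.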

7. Additional Notes

  • The model uses a unique numerical encoding for each team, based on their historical presence in the MLS.

  • The data collection process uses proxy rotation to avoid IP blocks during web scraping.

  • The model can handle both historical and future fixtures, adjusting its input preparation accordingly.
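
As an illustration of the team encoding mentioned in the first note, the sketch below assigns each team an integer ID in order of first appearance in the fixture list. This ordering is an assumption about what "historical presence" means. The function name and team names are hypothetical, and the repository's actual scheme may differ.

```python
def build_team_ids(fixtures):
    """Assign each team a unique integer ID in order of first appearance.

    `fixtures` is a list of (home_team, away_team) pairs.
    """
    team_ids = {}
    for home, away in fixtures:
        for team in (home, away):
            # setdefault leaves the existing ID untouched for known teams.
            team_ids.setdefault(team, len(team_ids))
    return team_ids


fixtures = [
    ("Team A", "Team B"),
    ("Team C", "Team A"),
    ("Team B", "Team C"),
]
team_ids = build_team_ids(fixtures)
encoded = [(team_ids[home], team_ids[away]) for home, away in fixtures]
```

Future fixtures must reuse the same mapping, since the model only learned the IDs it saw during training.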

This base model is a starting point for predicting MLS match outcomes and can be refined in future iterations.
