How we trained our base MLS model

This guide explains how the base MLS model was trained, from data collection through to prediction.

GitHub repo: https://github.com/sportstensor/MLS

1. Data Retrieval and Preparation

The process begins with the get_data function in retrieve_data.py. This function is responsible for collecting and organizing MLS fixture data:

It scrapes data from FBRef for the past 5 years of MLS matches (excluding 2020).
For each match, it collects:
- Date
- Home Team
- Away Team
- Home Team Goal Difference (before the match)
- Away Team Goal Difference (before the match)
- Home Team Score
- Away Team Score
The data is stored in an Excel file (mls_fixture_data.xlsx).
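As a rough illustration of what one row of this dataset holds, the sketch below derives the pre-match goal differences from a list of results. The `Fixture` class, its field names, and `add_pre_match_goal_difference` are hypothetical and are not taken from retrieve_data.py. It also assumes that goal difference accumulates across every match supplied, whereas the real scraper may reset it each season or read it directly from FBRef.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Fixture:
    """One row of fixture data (hypothetical field names)."""
    match_date: date
    home_team: str
    away_team: str
    home_score: int
    away_score: int
    home_gd_before: int = 0
    away_gd_before: int = 0


def add_pre_match_goal_difference(fixtures):
    """Fill in each team's running goal difference as it stood before kickoff."""
    running = defaultdict(int)  # team name -> cumulative goal difference so far
    for fx in sorted(fixtures, key=lambda f: f.match_date):
        # Record the values before this match's result is counted.
        fx.home_gd_before = running[fx.home_team]
        fx.away_gd_before = running[fx.away_team]
        margin = fx.home_score - fx.away_score
        running[fx.home_team] += margin
        running[fx.away_team] -= margin
    return fixtures


fixtures = add_pre_match_goal_difference([
    Fixture(date(2023, 3, 1), "A", "B", 2, 0),
    Fixture(date(2023, 3, 8), "B", "A", 1, 1),
    Fixture(date(2023, 3, 15), "A", "C", 0, 3),
])
```

Recording the goal difference before adding the current result matters: it keeps the match's own outcome out of its input features.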

2. Data Scaling

The scale_data function in retrieve_data.py prepares the data for model input:

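It scales the input features (team IDs and goal differences) and the target variables (scores) using scikit-learn's MinMaxScaler.
Bringing every column into the same range (0 to 1 by default) keeps large-valued features such as team IDs from dominating small-valued ones during training.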
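To make the scaling step concrete, here is a minimal pure-Python sketch of the min-max formula, x' = (x - min) / (max - min), applied column by column. It mirrors what MinMaxScaler computes with its default 0 to 1 range, but the function names `fit_min_max` and `transform` are illustrative and the repository itself relies on the scikit-learn class.

```python
def fit_min_max(rows):
    """Return the per-column minimum and maximum of a list of rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform(rows, mins, maxs):
    """Scale every column into [0, 1] using the fitted minimum and maximum."""
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has zero range; map it to 0.0 to avoid dividing by zero.
            (v - lo) / (hi - lo) if hi != lo else 0.0
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


# Two illustrative columns: a team ID and a goal difference.
rows = [[0, -4], [10, 6], [5, 1]]
mins, maxs = fit_min_max(rows)
scaled = transform(rows, mins, maxs)
```

The minimum and maximum should be fitted on the training data and reused unchanged at prediction time, so that the same raw value always maps to the same scaled value.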

3. Model Architecture

The model is defined in the load_or_run_model function in model.py:

It is a simple Keras Sequential model.
The model consists of an input layer and a single dense layer with 2 units (one per predicted score, home and away) and ReLU activation.
The model uses the Adam optimizer and Mean Squared Error as the loss function.
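Since the network is a single dense layer, its forward pass is small enough to write out by hand. The sketch below computes `relu(x · kernel + bias)` for two output units and the mean squared error against a target, in plain Python rather than Keras. The four-feature input (two team IDs and two goal differences) is inferred from the scaling step above, and the weights are made-up numbers, not the trained values.

```python
def relu(x):
    """Rectified linear unit: negative values are clipped to zero."""
    return max(0.0, x)


def dense_forward(features, kernel, bias):
    """Forward pass of one dense layer; kernel[i][j] links input i to unit j."""
    outputs = []
    for j in range(len(bias)):
        z = sum(f * kernel[i][j] for i, f in enumerate(features)) + bias[j]
        outputs.append(relu(z))
    return outputs


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Scaled inputs: home team ID, away team ID, home goal diff, away goal diff.
features = [0.2, 0.6, 0.5, 0.4]
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [0.0, -2.0]]  # illustrative weights
bias = [0.1, 0.0]

prediction = dense_forward(features, kernel, bias)   # [home score, away score], scaled
loss = mean_squared_error([0.8, 0.5], prediction)
```

Because ReLU never returns a negative number, the predicted scaled scores are bounded below by zero, which suits goal counts.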

4. Training Process

The training process, also in load_or_run_model, includes:

Splitting the data into training and test sets (80% train, 20% test).
Using Early Stopping to prevent overfitting, with a patience of 6 epochs.
Training for a maximum of 150 epochs with a batch size of 32.
Saving the best model based on the lowest loss.
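The split and the stopping rule can be sketched without any ML framework. The code below is an illustration only: `train_test_split` and `run_with_early_stopping` are hypothetical helpers, the shuffle before splitting is an assumption (the source does not say whether the data is shuffled), and the loss values are invented. Which loss the repository monitors (training or validation) is not stated, so the sketch simply takes a sequence of per-epoch losses.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def run_with_early_stopping(loss_per_epoch, max_epochs=150, patience=6):
    """Return (best_epoch, best_loss, epochs_run) for a sequence of epoch losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch, loss in enumerate(loss_per_epoch):
        if epoch >= max_epochs:
            break
        epochs_run = epoch + 1
        if loss < best_loss:
            # This is the point at which the best model would be saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            break
    return best_epoch, best_loss, epochs_run


train, test = train_test_split(list(range(100)))
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.76, 0.9, 0.5]
best_epoch, best_loss, epochs_run = run_with_early_stopping(losses)
```

In this example the loss stops improving after epoch 2, so training halts six epochs later and never reaches the lower value at the end of the list. That is the trade-off a patience setting makes.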

5. Model Evaluation

After training, the model is evaluated on the test set:

It calculates the percentage of correct score predictions and correct outcome predictions.
It computes Mean Squared Error, Mean Absolute Error, and R-squared values for both home and away score predictions.
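The metrics listed above can be sketched in a few lines of plain Python. The function names are illustrative, and rounding the raw predictions to whole goals before comparing them is an assumption, since the source does not say how predicted scores are matched against actual ones.

```python
def outcome(home, away):
    """Classify a scoreline as a home win, away win, or draw."""
    if home > away:
        return "home"
    if away > home:
        return "away"
    return "draw"


def accuracy_percentages(actual, predicted):
    """Return (% exact scores correct, % outcomes correct)."""
    # Assumption: raw predictions are rounded to whole goals before comparing.
    rounded = [(round(h), round(a)) for h, a in predicted]
    n = len(actual)
    exact = sum(a == r for a, r in zip(actual, rounded))
    same_outcome = sum(outcome(*a) == outcome(*r) for a, r in zip(actual, rounded))
    return 100 * exact / n, 100 * same_outcome / n


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R-squared for one target (home or away scores)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


actual = [(2, 1), (0, 0), (1, 3), (1, 0)]
predicted = [(1.8, 1.2), (0.4, 1.3), (0.9, 2.2), (1.1, 0.3)]
score_pct, outcome_pct = accuracy_percentages(actual, predicted)
home_metrics = regression_metrics([a[0] for a in actual], [p[0] for p in predicted])
```

Outcome accuracy is always at least as high as exact-score accuracy, because every exactly predicted score also has the correct outcome.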

6. Model Usage

The trained model is used in the activate function in miner_dashboard.py:

It loads the trained model and scalers.
For given fixtures, it prepares the input data using the prep_pred_input function.
It then uses the model to predict scores for these fixtures.
The predicted scores are inverse-transformed to the original scale.
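The prediction flow described above (scale the inputs, run the model, undo the scaling on the outputs) can be outlined as follows. This is a sketch under stated assumptions: `predict_fixture` is a hypothetical helper, the model is replaced by a stub function, and the value ranges are invented. The real code uses the saved scalers and the trained Keras model.

```python
def scale(value, lo, hi):
    """Min-max scale a raw value into [0, 1]."""
    return (value - lo) / (hi - lo)


def unscale(value, lo, hi):
    """Inverse of scale(): map a [0, 1] value back to the original range."""
    return value * (hi - lo) + lo


def predict_fixture(model, raw_features, feature_ranges, target_ranges):
    """Scale inputs, run the model, and inverse-transform the predicted scores."""
    x = [scale(v, lo, hi) for v, (lo, hi) in zip(raw_features, feature_ranges)]
    y_scaled = model(x)
    return [unscale(v, lo, hi) for v, (lo, hi) in zip(y_scaled, target_ranges)]


def stub_model(x):
    """Stand-in for the trained model; returns fixed scaled scores."""
    return [0.5, 0.25]


home_score, away_score = predict_fixture(
    stub_model,
    raw_features=[3, 17, 5, -2],                          # team IDs and goal diffs
    feature_ranges=[(0, 30), (0, 30), (-20, 20), (-20, 20)],
    target_ranges=[(0, 6), (0, 4)],                       # illustrative score ranges
)
```

Without the final inverse transform, the outputs would stay in the 0 to 1 range and could not be read as goal counts.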

7. Additional Notes

The model uses a unique numerical encoding for each team, based on their historical presence in the MLS.
The data collection process uses proxy rotation to avoid IP blocks during web scraping.
The model can handle both historical and future fixtures, adjusting its input preparation accordingly.
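One simple way to build the numerical team encoding mentioned above is to assign IDs in order of first appearance in the fixture history. This is only a guess at what "historical presence" means here. The repository may derive the IDs differently, and `build_team_ids` is a hypothetical name.

```python
def build_team_ids(fixtures):
    """Map each team name to a unique integer, in order of first appearance."""
    team_ids = {}
    for home, away in fixtures:
        for team in (home, away):
            # setdefault only assigns a new ID the first time a team is seen.
            team_ids.setdefault(team, len(team_ids))
    return team_ids


team_ids = build_team_ids([
    ("LA Galaxy", "Seattle"),
    ("Seattle", "Portland"),
    ("Portland", "LA Galaxy"),
])
```

Whatever scheme is used, the mapping has to be stored and reused at prediction time, since the model only sees the numbers.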

This base model provides a foundation for predicting MLS match outcomes, which can be further refined and improved in future iterations.
