How we trained our base MLS model
Here's a guide explaining how the base MLS model was trained:
GitHub repo: https://github.com/sportstensor/MLS
1. Data Retrieval and Preparation
The process begins with the get_data function in retrieve_data.py. This function is responsible for collecting and organizing MLS fixture data:
It scrapes data from FBRef for the past 5 years of MLS matches (excluding 2020).
For each match, it collects:
Date
Home Team
Away Team
Home Team Goal Difference (before the match)
Away Team Goal Difference (before the match)
Home Team Score
Away Team Score
The data is stored in an Excel file (mls_fixture_data.xlsx).
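The "before the match" goal differences are worth illustrating, since they must not include the result of the match they describe. Below is a minimal sketch, assuming the value is a running goal difference accumulated over earlier matches in date order. The function name, the dictionary keys, and the sample fixtures are hypothetical and are not taken from retrieve_data.py.

```python
from collections import defaultdict


def add_pre_match_goal_difference(matches):
    """Attach each team's running goal difference as it stood BEFORE the match.

    `matches` is assumed to be sorted by date. Keys are illustrative only.
    """
    running_gd = defaultdict(int)  # team name -> goal difference so far
    rows = []
    for match in matches:
        # Record the values before this match's result is applied.
        rows.append({
            **match,
            "home_gd": running_gd[match["home"]],
            "away_gd": running_gd[match["away"]],
        })
        diff = match["home_score"] - match["away_score"]
        running_gd[match["home"]] += diff
        running_gd[match["away"]] -= diff
    return rows


# Made-up fixtures, for illustration only.
matches = [
    {"date": "2023-03-01", "home": "Team A", "away": "Team B", "home_score": 2, "away_score": 0},
    {"date": "2023-03-08", "home": "Team B", "away": "Team C", "home_score": 1, "away_score": 1},
    {"date": "2023-03-15", "home": "Team C", "away": "Team A", "home_score": 0, "away_score": 3},
]
rows = add_pre_match_goal_difference(matches)
```

Recording the goal differences before updating them keeps the current match's score out of its own input features.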
2. Data Scaling
The scale_data function in retrieve_data.py prepares the data for model input:
It scales the input features (team IDs, goal differences) and target variables (scores) using MinMaxScaler.
Putting every feature and target on a common scale keeps large-valued inputs from dominating training, which generally helps the model converge.
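The scaling step can be sketched with scikit-learn's MinMaxScaler, which by default maps each column to the range [0, 1]. This is an illustration under assumptions, not the code from scale_data: the column order, the use of two separate scalers, and the sample values are all hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed column order: home_team_id, away_team_id, home_gd, away_gd.
X = np.array([
    [0, 1, 5, -3],
    [2, 0, -1, 4],
    [1, 2, 0, 0],
], dtype=float)

# Assumed target order: home_score, away_score.
y = np.array([
    [2, 0],
    [1, 1],
    [0, 3],
], dtype=float)

# Separate scalers (an assumption) so predictions can later be mapped
# back to goals using only the target scaler.
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

X_scaled = x_scaler.fit_transform(X)  # each column now spans [0, 1]
y_scaled = y_scaler.fit_transform(y)
```

The fitted scalers need to be kept alongside the model, because the same transformation has to be applied to new fixtures at prediction time.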
3. Model Architecture
The model is defined in the load_or_run_model function in model.py:
It's a simple Sequential model using Keras.
The model consists of an input layer and a single dense layer with 2 units and ReLU activation.
The model uses Adam optimizer and Mean Squared Error as the loss function.
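The architecture described above is small enough to write out in a few lines of Keras. The sketch below assumes four input features (two team IDs and two goal differences) and uses a hypothetical build_model helper. The actual code in load_or_run_model may differ in details such as input shape or optimizer settings.

```python
from tensorflow import keras


def build_model(n_features: int = 4) -> keras.Model:
    """Sequential model with a single 2-unit ReLU dense layer.

    n_features=4 is an assumption: home ID, away ID, home GD, away GD.
    """
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        # Two outputs: scaled home score and scaled away score.
        keras.layers.Dense(2, activation="relu"),
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


model = build_model()
```

With four inputs and two output units, this model has only 10 trainable parameters (8 weights and 2 biases), so it is close to a linear regression with outputs clipped at zero.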
4. Training Process
The training process, also in load_or_run_model, includes:
Splitting the data into training and test sets (80% train, 20% test).
Using Early Stopping to prevent overfitting, with a patience of 6 epochs.
Training for a maximum of 150 epochs with a batch size of 32.
Saving the best model based on the lowest loss.
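To make the stopping rule concrete, here is a framework-free sketch of how a patience of 6 and a 150-epoch cap interact with keeping the best model. It replays a made-up sequence of per-epoch losses rather than training anything, and the function name is hypothetical. In Keras, this behavior is typically provided by the EarlyStopping and ModelCheckpoint callbacks; which loss the repository monitors is not stated here.

```python
def run_early_stopping(epoch_losses, patience=6, max_epochs=150):
    """Return (best_epoch, best_loss, last_epoch_run) for a loss sequence."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    last_epoch = 0
    for epoch, loss in enumerate(epoch_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            # New best: this is where the model would be checkpointed.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_loss, last_epoch


# Made-up losses: the best value arrives at epoch 3, then training stalls.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.6, 0.63, 0.64, 0.65, 0.5]
best_epoch, best_loss, last_epoch = run_early_stopping(losses)
```

In this example training stops at epoch 9, six epochs after the best loss at epoch 3, so the lower loss at epoch 10 is never reached. That is the trade-off a patience value controls.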
5. Model Evaluation
After training, the model is evaluated on the test set:
It calculates the percentage of correct score predictions and correct outcome predictions.
It computes Mean Squared Error, Mean Absolute Error, and R-squared values for both home and away score predictions.
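The evaluation metrics listed above can be sketched in plain Python. This version assumes predictions are rounded to whole goals before exact scores and outcomes are compared, which the source does not state explicitly. The function names and sample numbers are hypothetical.

```python
def outcome(home, away):
    """1 for a home win, 0 for a draw, -1 for an away win."""
    return (home > away) - (home < away)


def evaluate(actual, predicted):
    """actual: list of (home, away) goals; predicted: list of (home, away) floats."""
    n = len(actual)
    # Assumption: predictions are rounded to whole goals for accuracy checks.
    rounded = [(round(h), round(a)) for h, a in predicted]
    metrics = {
        "score_accuracy": sum(r == tuple(t) for r, t in zip(rounded, actual)) / n,
        "outcome_accuracy": sum(
            outcome(*r) == outcome(*t) for r, t in zip(rounded, actual)
        ) / n,
    }
    for idx, side in enumerate(("home", "away")):
        y = [t[idx] for t in actual]
        p = [q[idx] for q in predicted]
        mean_y = sum(y) / n
        ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
        ss_tot = sum((a - mean_y) ** 2 for a in y)
        metrics[f"{side}_mse"] = ss_res / n
        metrics[f"{side}_mae"] = sum(abs(a - b) for a, b in zip(y, p)) / n
        # R-squared is undefined when every actual value is identical.
        metrics[f"{side}_r2"] = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return metrics


# Made-up test-set values, for illustration only.
actual = [(2, 1), (0, 0), (1, 3), (1, 1)]
predicted = [(1.8, 0.9), (0.4, 0.2), (1.2, 1.6), (2.3, 0.7)]
metrics = evaluate(actual, predicted)
```

Outcome accuracy is always at least as high as exact-score accuracy, because every correct score also implies the correct winner or draw.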
6. Model Usage
The trained model is used in the activate function in miner_dashboard.py:
It loads the trained model and scalers.
For given fixtures, it prepares the input data using the prep_pred_input function.
It then uses the model to predict scores for these fixtures.
The predicted scores are inverse-transformed to the original scale.
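The last step, mapping scaled predictions back to goals, can be sketched as follows. The hard-coded scaled_predictions array stands in for the model's output, and the scaler is fitted on made-up scores rather than loaded from disk. Both are assumptions for illustration, as is the final rounding to whole goals.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up historical (home_score, away_score) pairs used to fit the scaler.
historical_scores = np.array([[0, 0], [4, 2], [2, 1]], dtype=float)
y_scaler = MinMaxScaler().fit(historical_scores)

# Stand-in for the model's scaled output on two fixtures.
scaled_predictions = np.array([[0.5, 0.5], [0.25, 1.0]])

# Undo the min-max scaling to get predictions in goals.
predicted_goals = y_scaler.inverse_transform(scaled_predictions)

# Assumption: round to whole goals to report a scoreline.
rounded_scores = np.rint(predicted_goals).astype(int)
```

The scaler used here must be the one fitted during training. A scaler fitted on different data would map the same model output to different scores.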
7. Additional Notes
The model uses a unique numerical encoding for each team, based on their historical presence in the MLS.
The data collection process uses proxy rotation to avoid IP blocks during web scraping.
The model can handle both historical and future fixtures, adjusting its input preparation accordingly.
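As an illustration of the team encoding mentioned in the notes above, the sketch below assigns each team an integer ID in order of first appearance in the fixture list. This is only one plausible reading of "historical presence". The repository's actual scheme is not described here, and the helper name and fixtures are hypothetical.

```python
def build_team_ids(matches):
    """Map each team name to a unique integer, in order of first appearance."""
    team_ids = {}
    for match in matches:
        for team in (match["home"], match["away"]):
            # setdefault leaves existing IDs untouched.
            team_ids.setdefault(team, len(team_ids))
    return team_ids


# Made-up fixtures, for illustration only.
fixtures = [
    {"home": "Team A", "away": "Team B"},
    {"home": "Team B", "away": "Team C"},
]
team_ids = build_team_ids(fixtures)
```

Whatever the scheme, the mapping has to stay fixed between training and prediction so that each ID always refers to the same team.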
This base model provides a foundation for predicting MLS match outcomes, which can be further refined and improved in future iterations.