A framework for applying Machine Learning in sport scenarios

Wellington Martins dos Santos
Feb 8, 2022

The proposed framework for sport scenarios covers the following topics:

(1) Domain understanding; (2) Data understanding; (3) Data preparation and feature extraction, divided into defining subsets of data and data preprocessing (match-related versus external features); (4) Sport prediction model evaluation (model performance, training and testing data); and finally (5) Model deployment.

1 Domain understanding: This requires knowing the problem, the modelling goal and the specific characteristics of the sport;

2 Data understanding: Consider how the data will be obtained, for example from publicly available datasets, and what type of data is being dealt with: players, training data, match performance over the years. All this information is important for deciding which analyses are possible. The class variable is also important. In a classification problem, for example, the outcome will be win, lose or draw. In a numeric problem, the target can be the points margin, that is, the difference between home points and away points;
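As a minimal sketch of the two kinds of target described above, the snippet below derives a class variable (win, lose or draw, seen from the home team) and a numeric target (points margin) from final scores. The function names, the dictionary keys and the scores are assumptions made for illustration; they do not come from any particular dataset.

```python
def match_outcome(home_points: int, away_points: int) -> str:
    """Class variable for a classification problem, from the home team's view."""
    if home_points > away_points:
        return "win"
    if home_points < away_points:
        return "lose"
    return "draw"


def points_margin(home_points: int, away_points: int) -> int:
    """Numeric target: difference between home points and away points."""
    return home_points - away_points


# Made-up scores, for illustration only.
matches = [
    {"home_points": 24, "away_points": 17},
    {"home_points": 10, "away_points": 10},
    {"home_points": 3, "away_points": 21},
]

labels = [match_outcome(m["home_points"], m["away_points"]) for m in matches]
margins = [points_margin(m["home_points"], m["away_points"]) for m in matches]
print(labels)   # ['win', 'draw', 'lose']
print(margins)  # [7, 0, -18]
```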

3 Data preparation and feature extraction: Regarding features, it seems reasonable to split the data into match-related and external features. Match-related features describe what happens during the sporting event itself (meters gained, passes made, and so on), that is, anything that can be counted during the event. External features, on the other hand, are known prior to the upcoming event (distance travelled, meters covered, number of exercise routines and so forth);
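One possible way to keep the two feature groups apart is sketched below. The column names are hypothetical, loosely based on the examples in the text, and the record is made up.

```python
# Hypothetical column names, loosely based on the examples in the text.
MATCH_RELATED = ("meters_gained", "passes_made")
EXTERNAL = ("distance_travelled_km", "exercise_routines")


def split_features(record: dict) -> tuple:
    """Separate one record into match-related and external features."""
    match_part = {name: record[name] for name in MATCH_RELATED}
    external_part = {name: record[name] for name in EXTERNAL}
    return match_part, external_part


# Made-up record, for illustration only.
record = {
    "meters_gained": 312,
    "passes_made": 145,
    "distance_travelled_km": 430,
    "exercise_routines": 6,
}

match_part, external_part = split_features(record)
print(match_part)     # {'meters_gained': 312, 'passes_made': 145}
print(external_part)  # {'distance_travelled_km': 430, 'exercise_routines': 6}
```

Keeping the groups separate makes it easier to treat them differently later, since only the external features are known before the event.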

4 Sport prediction model (performance, training and testing data): The first step is to select candidate models by reviewing past literature. Experimenting with the preselected models on a subset of the data then shows which technique works best. Model performance is evaluated with a standard classification (confusion) matrix and classification accuracy, or with the Receiver Operating Characteristic (ROC) curve when the data is highly imbalanced. Regarding training and testing, both cross-validation and a separate training/test split can be used, as long as the order of the data is preserved: upcoming matches are predicted based on past events/matches. It is also important to split training versus testing data based on the problem being solved, as follows:

Prior seasons may not be relevant for predicting matches in future seasons, particularly in sports where team rosters and strengths can change significantly from year to year. Training on earlier seasons and testing on a later one may therefore not give a reliable picture of model performance (although this could be mitigated to some extent if player-level data is included, so that player changes are captured from season to season).
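The order-preserving split and the evaluation measures mentioned above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the match records are made up, the helper names are hypothetical, and the "model" is a majority-class baseline standing in for whichever candidate model is actually being evaluated.

```python
from collections import Counter

LABELS = ["win", "draw", "lose"]


def chronological_split(matches, test_size):
    """Keep the oldest matches for training and the most recent for testing."""
    if not 0 < test_size < len(matches):
        raise ValueError("test_size must be between 1 and len(matches) - 1")
    ordered = sorted(matches, key=lambda m: m["date"])  # ISO dates sort as text
    return ordered[:-test_size], ordered[-test_size:]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(actual, predicted):
    """Share of predictions that match the actual outcome."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up matches, deliberately out of order.
matches = [
    {"date": "2021-03-06", "outcome": "win"},
    {"date": "2021-02-20", "outcome": "lose"},
    {"date": "2021-02-06", "outcome": "win"},
    {"date": "2021-03-20", "outcome": "draw"},
    {"date": "2021-04-03", "outcome": "win"},
]

train, test = chronological_split(matches, test_size=2)

# Stand-in model: always predict the most frequent outcome in the training set.
majority = Counter(m["outcome"] for m in train).most_common(1)[0][0]
actual = [m["outcome"] for m in test]
predicted = [majority] * len(test)

print(confusion_matrix(actual, predicted))  # [[1, 0, 0], [1, 0, 0], [0, 0, 0]]
print(accuracy(actual, predicted))          # 0.5
```

Because the split is made by date rather than at random, no match in the test set is older than a match in the training set.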

5 Model deployment: After setting all the important parameters of the model, test it with new training data. It is then possible to make the model available online, where it dynamically receives input data: match/event and external features.
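A deployed model that receives match/event and external features as input could look roughly like the sketch below. Everything here is an assumption for illustration: the JSON layout, the feature names, the `handle_request` helper and the threshold rule that stands in for a trained model.

```python
import json


def stand_in_model(features: dict) -> str:
    """Hypothetical rule used in place of a trained model."""
    return "win" if features["avg_meters_gained"] > 300 else "lose"


def handle_request(payload: str, model=stand_in_model) -> str:
    """Parse incoming JSON, merge both feature groups and return a prediction."""
    data = json.loads(payload)
    missing = [key for key in ("match", "external") if key not in data]
    if missing:
        return json.dumps({"error": "missing feature groups: " + ", ".join(missing)})
    features = {**data["match"], **data["external"]}
    return json.dumps({"prediction": model(features)})


# Made-up request, for illustration only.
request = json.dumps({
    "match": {"avg_meters_gained": 320},
    "external": {"distance_travelled_km": 430},
})
print(handle_request(request))  # {"prediction": "win"}
```

In a real deployment this function would sit behind a web endpoint, and the stand-in rule would be replaced by the model selected in step 4.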

Conclusion

Prediction in sport scenarios requires good accuracy. Results are obtained using mathematical and statistical models and verified by a domain expert. This article summarizes SRP-CRISP-DM, a framework for Sport Result Prediction based on the Cross-Industry Standard Process for Data Mining (CRISP-DM).

Bunker and Thabtah (2017). A machine learning framework for sport result prediction.


Wellington Martins dos Santos

Sports Scientist and Physical Trainer so far, but I still wanna be a Functional Fitness Athlete and programmer. 27 years old