Logical Architecture of AI Prediction in Sports Betting

2023 has been hailed as the year of artificial intelligence, and AI technology is changing our lives at an unprecedented pace. Mysports.AI combines advanced technologies, including artificial intelligence, machine learning, and professional sports data integration, to push predictive analytics to new heights.

Traditional sports betting predictions often rely on historical statistics and manual analysis. This approach has its value, but its limitations raise a question: is there a better way? The rise of AI technology has changed this landscape. It can provide more accurate predictions of game outcomes and deeper data insights that help bettors make wiser decisions.

AI stands out from traditional methods because of its data processing capabilities and its ability to learn from data. It can process large volumes of data, extract key information, and predict future match results from that information. This automated analysis works at a scale that manual analysis cannot match, and it has achieved significant success in multiple fields.

As game data is continuously updated, Mysports.AI provides predictive data for major leagues, including but not limited to the NBA, MLB, English Premier League, Ligue 1, La Liga, Bundesliga, MLS, European competitions (Champions League, Europa League), NHL, and NFL. Coverage will gradually expand to smaller leagues in the future.

nba
mlb
mls
epl
ligue1
laliga
serie_a
bundesliga
uefa_champions_league
uefa_europa_league
nhl

Machine learning is widely applied across sectors internationally, and AI tools continue to improve in capability and performance. This is a trend that professional analysts and bettors should watch closely. Our testing indicates that, compared with relying solely on human professional analysis, AI machine learning improves predictive accuracy by 15% on average.

As a result, as a professional analyst or bettor, you can now engage in betting with more confidence.

Profiting with AI machine learning requires three key elements: the predicted win rate from deep learning, real-time odds from betting platforms, and backtested betting strategies. With the AI-predicted win rate and real-time odds, you can calculate the expected value of a bet: a positive value means that betting over the long term can yield a return, while a negative value indicates losses in the long run. This approach applies to many sports, including basketball, baseball, soccer, ice hockey, tennis, and cricket.
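The expected-value check described above can be sketched in a few lines of Python. This is only an illustration, assuming decimal odds (the total payout per unit staked, stake included) and a model-estimated win probability. The function name and the sample numbers are hypothetical and are not taken from the Mysports.AI platform.

```python
def expected_value(win_probability: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of a single bet.

    Assumes decimal odds: a winning bet returns stake * decimal_odds
    (stake included), and a losing bet forfeits the stake.
    """
    if not 0.0 <= win_probability <= 1.0:
        raise ValueError("win_probability must be between 0 and 1")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    profit_if_win = stake * (decimal_odds - 1.0)
    loss_if_lose = stake
    return win_probability * profit_if_win - (1.0 - win_probability) * loss_if_lose


# Hypothetical example: the model gives a team a 55% chance at odds of 2.00.
ev = expected_value(0.55, 2.00)
print(f"Expected value per unit staked: {ev:+.3f}")
```

A positive result suggests the bet is worth considering under these assumptions; a negative one suggests skipping it.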

Taking the NBA as an example, using machine learning for win rate prediction involves the following key steps:

1. Data Collection: Gather relevant game data for the NBA.

2. Data Cleaning and Preprocessing: Clean the data to remove errors or inconsistencies, and normalize the data features.

3. Feature Engineering: Isolate meaningful features that can assist the model in predicting game outcomes.

4. Data Analysis: Use machine learning models to analyze the data, and fine-tune features to achieve more accurate backtesting results.

Fetching NBA Season Data

We draw on extensive and detailed NBA data from Basketball-Reference and Stats.nba.com. These resources cover every game from 1946 to 2023 and provide in-depth team and player statistics. Both sites support customizable date ranges, so you can retrieve exactly the data you need. The database includes over 3 million data entries covering critical statistics such as wins, losses, total points, rebounds, assists, turnovers, steals, three-point shooting percentage, free throws, and more. It is a dream source for sports data analysts and enthusiasts who want to study game and player performance in depth.

    'PName': 'Player_Name',
    'POS': 'Position',
    'Team': 'Team_Abbreviation',
    'Age': 'Age',
    'GP': 'Games_Played',
    'W': 'Wins',
    'L': 'Losses',
    'Min': 'Minutes_Played',
    'PTS': 'Total_Points',
    'FGM': 'Field_Goals_Made',
    'FGA': 'Field_Goals_Attempted',
    'FG%': 'Field_Goal_Percentage',
    '3PM': 'Three_Point_FG_Made',
    '3PA': 'Three_Point_FG_Attempted',
    '3P%': 'Three_Point_FG_Percentage',
    'FTM': 'Free_Throws_Made',
    'FTA': 'Free_Throws_Attempted',
    'FT%': 'Free_Throw_Percentage',
    'OREB': 'Offensive_Rebounds',
    'DREB': 'Defensive_Rebounds',
    'REB': 'Total_Rebounds',
    'AST': 'Assists',
    'TOV': 'Turnovers',
    'STL': 'Steals',
    'BLK': 'Blocks',
    'PF': 'Personal_Fouls',
    'FP': 'NBA_Fantasy_Points',
    'DD2': 'Double_Doubles',
    'TD3': 'Triple_Doubles',
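A mapping like the one above is typically used to rename raw column abbreviations to descriptive names. The sketch below shows one way to apply it to a single record in plain Python. The constant `COLUMN_NAMES`, the helper `rename_columns`, and the sample row are hypothetical, and only a subset of the mapping is repeated to keep the example short.

```python
# Subset of the abbreviation-to-name mapping listed above.
COLUMN_NAMES = {
    'PName': 'Player_Name',
    'Team': 'Team_Abbreviation',
    'GP': 'Games_Played',
    'PTS': 'Total_Points',
    'FG%': 'Field_Goal_Percentage',
}


def rename_columns(row: dict, mapping: dict) -> dict:
    """Return a copy of `row` with known abbreviations replaced by descriptive names.

    Keys that are not in the mapping are kept unchanged.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


# Hypothetical raw record; 'EXTRA' is not in the mapping and is left as-is.
raw_row = {'PName': 'Example Player', 'Team': 'XYZ', 'GP': 10, 'PTS': 200, 'FG%': 45.0, 'EXTRA': 1}
print(rename_columns(raw_row, COLUMN_NAMES))
```

With Pandas, the equivalent for a whole table is `df.rename(columns=COLUMN_NAMES)`.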

Basketball-Reference is a static website that provides rich NBA data, meaning the data is embedded directly in the HTML served to the browser. Our platform offers a simple yet effective way to fetch and analyze this data: we use Python's requests library to retrieve the HTML and then parse and extract the tables we need with BeautifulSoup and Pandas (pd.read_html). This means you don't have to handle the web pages yourself; we have already taken care of this tedious task.
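The parsing step can be illustrated without any network access. The sketch below is not the platform's pipeline: it uses only Python's standard-library `html.parser` on a made-up HTML snippet so that it runs offline. The class `TableParser`, the function `parse_table`, and the sample table are all hypothetical. In practice, the HTML string would come from `requests.get(url).text`, and `pandas.read_html` can extract the tables directly.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row currently being built
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html: str) -> list:
    """Return table rows as dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical snippet standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>W</th><th>L</th></tr>
  <tr><td>AAA</td><td>50</td><td>32</td></tr>
  <tr><td>BBB</td><td>41</td><td>41</td></tr>
</table>
"""

records = parse_table(SAMPLE_HTML)
print(records)
```

All values come back as strings, so numeric columns still need converting during data cleaning.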

Data Cleaning

In the field of machine learning, data cleaning is a critically important step that directly impacts the performance of models and the accuracy of predictions. Data cleaning refers to the processing and transformation of raw data to ensure data quality, reliability, and consistency. Raw data may contain various quality issues, which can stem from multiple factors, including data input errors, missing data, duplicate data, outliers, and more. These issues can have a detrimental impact on the performance of machine learning models and, therefore, need to be addressed.

We rigorously clean player and team statistics for each NBA season to ensure data quality and reliability. We remove data that may reveal game outcomes, so that such leaked values do not unduly influence predictions. We also eliminate redundant features to avoid high correlations between them, such as those among field goal percentage, two-point percentage, and three-point percentage. This process is time-consuming and labor-intensive, but it is crucial for the success of machine learning.

Data Cleaning Steps:

Step 1: Missing Data Handling: Missing data can be handled in several ways, including deleting records with missing values, filling in missing values, or using machine learning models to predict them.

Step 2: Data Normalization: Normalization rescales data to a common range without distorting the relative differences between values. It is particularly important for machine learning models that rely on distance calculations, such as KNN and SVM.

Step 3: Data Standardization: Standardization rescales data to have a mean of zero and a standard deviation of one. It is particularly important for machine learning models trained with gradient descent, such as linear regression and logistic regression.

Step 4: Data Encoding: Encoding transforms categorical data into numerical data. Most machine learning models require numerical input, and the choice of encoding matters especially for models that rely on distance calculations, such as KNN and SVM.
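The four cleaning steps can be sketched with plain Python on a toy column of values. This is a simplified illustration rather than the platform's cleaning code: the helper names and the sample data are hypothetical, and mean imputation is just one of the missing-data options mentioned in Step 1.

```python
from statistics import mean, pstdev


def fill_missing(values, fill=None):
    """Replace None entries with `fill` (default: mean of the observed values)."""
    observed = [v for v in values if v is not None]
    fill_value = mean(observed) if fill is None else fill
    return [fill_value if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(categories):
    """Encode category labels as one-hot rows (columns sorted by label)."""
    labels = sorted(set(categories))
    return [[1 if c == label else 0 for label in labels] for c in categories]


points = [100, None, 110, 120]          # hypothetical team scores with a gap
filled = fill_missing(points)           # None is replaced by the mean of 100, 110, 120
print(filled)
print(min_max_normalize(filled))
print(standardize(filled))
print(one_hot(["G", "C", "F", "G"]))    # hypothetical position labels
```

Libraries such as scikit-learn provide equivalent transformers, which are usually preferable on real datasets.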

Feature Engineering

Feature engineering is of paramount importance in sports analysis. It involves comparing team performance metrics to identify the key factors that determine wins and losses, along with their relative weights. The approach can be likened to monster battles: whatever the type of monster, we analyze attributes such as attack power, defense, agility, magic, and skills. Later, without knowing the monsters' identities, we can compare their attribute values to predict which side is more likely to win. This method sets aside individual reputation and relies purely on data. Taking the NBA as an example, our deep learning work has identified several key features that are crucial for predicting game outcomes:

1. Elo Rating

Elo Rating is widely regarded as one of the best ways to measure a team's strength from game results. The concept is straightforward: its only inputs are the final score of each game and where and when the game was played. A team's Elo Rating is adjusted after every game. A team gains Elo points when it wins, and it gains more if it was the underdog or won by a larger margin. Elo is a zero-sum system, so the points one team gains are lost by its opponent. All teams typically start at the same baseline rating, such as 1500 points. The rating change for each game depends on the final score, which team was the underdog, and the location of the game. In summary, Elo Rating is a more refined win-loss record that captures game outcomes more comprehensively.

The Elo Rating formula is as follows. Given a team's current Elo Rating, its rating after the next game is defined as:

Elo_new = Elo_old + K * (Result - WinProbability)

Elo_new is the new Elo Rating of the team after the game.

Elo_old is the previous Elo Rating of the team.

K is a constant that determines the impact of the game's outcome on the Elo Rating adjustment.

Result is the actual result of the game (1 for a win, 0 for a loss).

WinProbability is the estimated probability of the team winning the game.

This formula allows us to adjust a team's Elo Rating based on the actual results after each game, providing a more accurate reflection of their actual strength. This method can be used to measure quality wins and losses and provides a fair rating system, even when considering varying team strengths.
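The update rule above can be sketched in Python. The article does not specify how WinProbability is computed or what value K takes, so this sketch assumes the standard logistic Elo expectation on a 400-point scale and a hypothetical K of 20. The `home_advantage` parameter is likewise an assumption, and the margin-of-victory adjustment mentioned earlier is not modeled here.

```python
def win_probability(elo_team: float, elo_opponent: float, home_advantage: float = 0.0) -> float:
    """Standard logistic Elo expectation on a 400-point scale (assumed here)."""
    diff = (elo_team + home_advantage) - elo_opponent
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def update_elo(elo_team: float, elo_opponent: float, team_won: bool,
               k: float = 20.0, home_advantage: float = 0.0):
    """Apply Elo_new = Elo_old + K * (Result - WinProbability) to both teams.

    k=20 is an assumed value; the article only says K is a constant.
    """
    expected = win_probability(elo_team, elo_opponent, home_advantage)
    result = 1.0 if team_won else 0.0
    delta = k * (result - expected)
    # Zero-sum: whatever one team gains, the other loses.
    return elo_team + delta, elo_opponent - delta


# Hypothetical ratings: a 1500-rated underdog beats a 1600-rated favorite.
new_underdog, new_favorite = update_elo(1500.0, 1600.0, team_won=True)
print(round(new_underdog, 2), round(new_favorite, 2))
```

Because the underdog's win probability is below 0.5, its win moves the ratings by more than half of K, which is how upsets earn extra points.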

Elo Ratings also carry over from one season to the next, pulled partway back toward the league average. Not all teams are created equal: excellent teams tend to maintain their strength or decline only gradually, and few teams rise or fall abruptly. If R represents a team's final Elo for one season, its Elo Rating at the beginning of the next season is approximately:

(R x 0.75) + (0.25 x 1505)
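As a quick worked example of the carry-over formula, the sketch below applies it to three hypothetical season-ending ratings. The function name and the sample ratings are illustrative only; the 0.75 weight and the 1505 mean come from the formula above.

```python
def carry_over_elo(final_elo: float, weight: float = 0.75, mean_elo: float = 1505.0) -> float:
    """Regress a season-ending Elo toward the league mean for the next season."""
    return final_elo * weight + (1.0 - weight) * mean_elo


# Hypothetical season-ending ratings.
for final in (1700.0, 1505.0, 1350.0):
    print(final, "->", carry_over_elo(final))
```

Strong teams start the new season somewhat lower and weak teams somewhat higher, while a team at exactly 1505 is unchanged.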

In practice, you can track this indicator over time by selecting three random teams to monitor. By doing so, you can quickly gain valuable insights into the overall strength of the teams throughout the season.

elo_rating

Here we can see a strong correlation between a team's Elo Rating and its performance in a given season. The Elo peaks of the Golden State Warriors and the Cleveland Cavaliers in the years they met in the NBA Finals are evident. We can also observe what most basketball analysts noted at the time: the Western Conference was significantly tougher than the Eastern Conference, as the Elo impact of quality wins against the Cavaliers suggests. We can also see how these teams declined quickly after their championship seasons as they struggled with roster turnover and injuries.

2. Recent Team Performance (Average Statistics from the Last 10 Games)

To calculate the average statistics from the last ten games, we need to obtain game data, including scoring, rebounds, assists, turnovers, blocks, steals, and various other statistics. This data can be acquired from game records or databases. Next, we use a simple function to compute the average value for each feature and store these values in a new data frame. This new data frame will contain the average statistical features for each team.
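The "simple function" for last-ten-game averages might look like the sketch below. It is written in plain Python so it has no dependencies; with Pandas, a grouped rolling mean would do the same job. The function name and sample data are hypothetical, and excluding the current game from its own average is an assumption made here so that the feature does not leak the result being predicted.

```python
from collections import defaultdict, deque


def rolling_averages(games, window=10):
    """For each game, average each stat over the team's previous `window` games.

    `games` is a chronologically ordered list of dicts with a "team" key and
    numeric stat keys. The current game is excluded from its own average.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    features = []
    for game in games:
        past = history[game["team"]]
        stats = {k: v for k, v in game.items() if k != "team"}
        if past:
            averages = {k: sum(g[k] for g in past) / len(past) for k in stats}
        else:
            averages = {k: None for k in stats}  # no prior games yet
        features.append({"team": game["team"], **averages})
        past.append(stats)
    return features


# Hypothetical box scores for one team, oldest first.
games = [
    {"team": "AAA", "PTS": 100, "REB": 40},
    {"team": "AAA", "PTS": 110, "REB": 44},
    {"team": "AAA", "PTS": 120, "REB": 48},
]
for row in rolling_averages(games, window=2):
    print(row)
```

The example uses a window of 2 only to keep the data small; the default of 10 matches the feature described above.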

When calculating these average statistics, selecting which features to include is essential. Some statistics may better reflect a team's performance than others. During feature selection, various methods can be used, such as correlation analysis, Principal Component Analysis (PCA), and information gain. These methods help determine which features have the highest information value and select them for calculating the average statistics. In addition to calculating average statistics, more complex time-series models can be applied to further analyze team performance. These models may include AutoRegressive Integrated Moving Average (ARIMA) and Long Short-Term Memory networks (LSTM). These models account for the impact of time, capture trends and seasonal variations, and provide more accurate predictions.

The analysis of recent team performance can also be conducted using machine learning models. These models can reference the results of feature engineering and consider the complex relationships between different variables more comprehensively. Options for such models include Support Vector Machines (SVM), decision trees, random forests, and deep learning models. These models can be used for predicting game outcomes, analyzing team performance trends, and formulating strategic recommendations.

3. Recent Player Performance (Average Statistics from the Last 10 Games)

In the competitive world of the National Basketball Association (NBA), understanding a player's recent performance is one of the keys to a team's success. Player performance statistics provide deep insights into their skills, trends, and strengths, and help predict future game outcomes. This article will explore how to evaluate NBA player performance using the average statistics from their most recent 10 games, and we'll analyze a few NBA players as examples.

To calculate the recent 10-game average statistics for NBA players, we need to gather detailed data from each game, including points, rebounds, assists, and more.

This data can typically be obtained from the nba.com/stats website or from data providers. We organize it into a data frame in which each row represents a game and each column represents a statistical feature such as points or rebounds. Then we use a simple function to calculate the average value of each feature and save these averages into a new data frame, which contains the average statistical features for each player. For example, let's look at the recent 10-game average statistics for two NBA players, LeBron James and Stephen Curry. These statistics can help us understand their performance trends.

LeBron James' Recent 10-Game Average Statistics:

Average Points: 28.5 points

Average Rebounds: 7.8 rebounds

Average Assists: 7.2 assists

Average Turnovers: 2.3 turnovers

Average Blocks: 1.1 blocks

Average Steals: 1.5 steals

Stephen Curry's Recent 10-Game Average Statistics:

Average Points: 31.2 points

Average Rebounds: 5.6 rebounds

Average Assists: 6.8 assists

Average Turnovers: 2.1 turnovers

Average Blocks: 0.3 blocks

Average Steals: 1.7 steals

When calculating average statistics, choosing which features to include is crucial. Different features can reflect different player skills and strengths. Some players excel in scoring, while others may focus more on rebounds or assists. Therefore, in feature selection, we can consider selecting the most representative features to better understand a player's performance. This can be achieved through methods like correlation analysis, Principal Component Analysis (PCA), and information gain.

4. Player Season Performance (Previous Season & Current Season)

To gain a comprehensive understanding of a player's performance throughout the season, various factors must be considered, including a player's average statistics, injuries, and playing time. These factors play a crucial role in assessing a player's actual value and contribution to the team. In this article, we will explore how to best synthesize and analyze this data to gain a better understanding of a player's on-court performance.

Average Statistics:

A player's average statistics are a key indicator for evaluating their performance. These statistics typically include points, assists, rebounds, steals, blocks, and turnovers, among others. While these numbers provide information about a player's overall performance in games, they need to be interpreted with care as they can be influenced by a player's playing time and position. For example, a scoring guard may have a higher average in points, while a center may excel in rebounds and blocks. Additionally, average statistics can be affected by a team's tactical approach and adjustments. If a team focuses on teamwork and passing, a player's assist average may be higher. Thus, these factors need to be considered when analyzing a player's average statistics to ensure accurate evaluation.

Injury Status:

Injuries are a common issue that players face during a season and can significantly impact their performance. It is essential to consider a player's injury status when evaluating their performance. In some cases, a player may miss several games due to an injury, which would lower their average statistics. In other instances, a player might return from an injury but not perform as well as before. Understanding a player's injury status is crucial for an accurate assessment of their actual value. Teams usually report a player's injury status, including the specific body parts affected and estimated recovery time. This information is valuable for fans and analysts as it provides insights into whether a player can participate in games and return to their best form.

Playing Time:

A player's playing time during a season is another critical factor. Different players receive varying amounts of playing time, which affects their average statistics. Starters typically play more, resulting in higher averages in points, assists, and rebounds. Conversely, bench players may get only limited time on the court, leading to lower statistics. Playing time is also influenced by game situations. If a team holds a comfortable lead, it may rest its starters and give more minutes to bench players, whose statistics may then rise simply because they played longer. Analyzing playing time therefore gives a better understanding of performance. A player who excels in limited minutes shows high efficiency, while a player with average output over extended minutes may need further analysis to determine whether their performance is consistent.

Position and Playing Style:

A player's position and the playing style of their team also affect their performance. Different positions require different skills and responsibilities. For example, point guards are typically responsible for scoring and assisting on offense, while centers focus on rebounding and defense. A player's position should therefore be considered when evaluating their performance. Teams also adopt different tactics and styles: one may emphasize teamwork and passing, while another prioritizes individual scoring. A player who excels on one team may be only average on another, because their skills and style suit one system better. We also include each player's season averages. Unlike teams, players get injured or rotate in and out of the lineup, so for us it is more important to understand how a player performs in an individual game compared to their average level. We use this later in our model to see whether it can produce accurate predictions at the team level.

Win-Loss Record:

A team's win-loss record also influences a player's performance. In a winning team, players typically feel more confident and perform better. Conversely, in a team on a losing streak, players may feel added pressure, which can affect their performance. Win-loss records also impact a player's statistics. In a game where a team is leading, they may choose to slow down the pace, reducing a player's points and assists statistics. On the other hand, if a team is trailing, they may intensify their offensive efforts, resulting in higher statistics for the players. Analyzing a player's performance in different game situations provides deeper insights. We can examine a player's statistics in both winning and losing games to understand if there are significant differences in their performance. This helps gain a better understanding of a player's mentality and adaptability.

By considering a player's average statistics, injury status, playing time, position and playing style, and game situation, we can gain a more comprehensive understanding of a player's performance throughout the season. These factors are interconnected and collectively impact a player's actual value and contribution to the team.

5. Player Efficiency Rating

It is useful to create an indicator that combines seemingly unrelated statistics so that player performance can be normalized and compared, much as Elo Rating does for teams. We use Hollinger's Player Efficiency Rating (PER) to compare and predict team performance based on the players' total PER score. In the NBA, a player's statistics can easily be inflated or deflated by factors such as whether they face bench players or starters, the number of games played, and the playing time the team allocates. Because of these distortions, we do not want to rely solely on raw averages. PER addresses the issue by weighting selected in-game statistics and dividing by minutes played, producing an indicator of performance relative to time on the court.

For each player, we add a PER column in a given game based on the following formula:

PER = (FGM x 85.910 + Steals x 53.897 + 3PTM x 51.757 + FTM x 46.845 + Blocks x 39.190 + Offensive_Rebounds x 39.190 + Assists x 34.677 + Defensive_Rebounds x 14.426 - Turnovers x 53.897) x (1 / Minutes)
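The formula above translates directly into a small function. The weights in this sketch are copied from the formula as written, and the stat keys follow the abbreviations used earlier in the article. The function name, the sample box score, and the handling of zero minutes are hypothetical choices made for illustration.

```python
# Weights copied from the formula above; turnovers carry a negative weight.
PER_WEIGHTS = {
    "FGM": 85.910,
    "STL": 53.897,
    "3PM": 51.757,
    "FTM": 46.845,
    "BLK": 39.190,
    "OREB": 39.190,
    "AST": 34.677,
    "DREB": 14.426,
    "TOV": -53.897,
}


def simple_per(box_score: dict) -> float:
    """Weighted sum of box-score stats divided by minutes played."""
    minutes = box_score["Min"]
    if minutes <= 0:
        return 0.0  # assumption: players who did not play get a rating of 0
    weighted = sum(weight * box_score.get(stat, 0) for stat, weight in PER_WEIGHTS.items())
    return weighted / minutes


# Hypothetical single-game line for one player.
line = {"Min": 30, "FGM": 8, "STL": 1, "3PM": 2, "FTM": 4,
        "BLK": 1, "OREB": 1, "AST": 5, "DREB": 6, "TOV": 3}
print(round(simple_per(line), 2))
```

Summing this value over a team's players gives the total PER used in the team-level comparison.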

Data Analysis

Our data analysis primarily revolves around using Elo Rating as our test metric. Essentially, can we be confident that Elo correlates and aligns correctly with other statistical data? Furthermore, is it more appropriate to use team statistical data (Elo Rating) or average player statistical data (PER rating) to predict game outcomes?

First, let's explore the Elo Rating density for each NBA season as a whole. This tells us some information about the level of parity across the entire league: if we see Elo Ratings close to a normal distribution, it indicates relatively well-matched teams in the league. Otherwise, we will observe significant disparities and the development of super teams.

elo_desities

Pictured: Twelve Seasons of League Elo Density

Next, we move from the league-wide view to how Elo Rating tracks the performance of individual teams against other statistics.

We can see a certain correlation between a team's average score and its Elo Rating: the higher the average score within the game window, the higher the Elo Rating tends to be. However, Elo can also differ significantly between teams with similar scoring. To better understand the relationship, we compared average scores against the league-wide season average, which lets us determine whether scoring raises Elo when the scores are high relative to the rest of the league. For this purpose, let's take the same team within the same season and plot the distribution of its scores relative to its opponents'.

last_ten_avg_point

This confirms our hypothesis: when a team's distribution of average scores sits above its opponents', or is more concentrated at the same or higher levels, its Elo for that season is higher. In seasons where a team's scores are roughly equal to or lower than its opponents', its Elo is lower. Average score is therefore a reliable determinant for predicting game outcomes, but it works better when expressed relative to opponents. This shows that Elo is a better predictor of the winning side than raw scores, because Elo is itself a relative measure.

Turning from team statistics to player statistics, we next examine whether Elo tracks player performance as well as it tracks team performance. We used a similar method, plotting Elo Rating for the same random teams, this time against the PER rating instead of average scores.

elo

From the plotted data, we can see that total PER has no significant correlation with a team's strength relative to its opponents. Scoring tracks Elo better. This makes sense: a player's efficiency does not necessarily translate into the most points, and outscoring the opponent is what decides games and therefore moves Elo.

We can further understand this by plotting the Orlando Magic's relative average PER rating to the opponents' in the same given season and find that the team's PER average or median has almost no relation to the team's strength.

elo_1

Predicting Game Results Based on Individual Player Statistics and Scores

Before running the model, we need to clean the data a bit. For some games in this dataset, we have player statistics for one team but not the other – often just the first game of the season for that other team. So, we will remove all such games from the dataset.

For player ratings, we use a linear regression model instead of logistic regression, since we want to predict a range of possible values (scores) rather than just a win or loss. The RMSE (Root Mean Square Error) across all our players is 5.56, which corresponds to a player making or missing about 2-3 more shots per game than their average.
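For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it on made-up numbers; the function name and the sample points are hypothetical and unrelated to the 5.56 figure reported above.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical actual vs. predicted points for four player-games.
actual_points = [22, 15, 30, 8]
predicted_points = [18, 17, 25, 12]
print(round(rmse(actual_points, predicted_points), 2))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.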

To test the results, we sum the predicted scores for each team in each game and compare them to the actual scores. Picking the winner from the predicted scores and comparing it with the actual winner gives an accuracy of 58.66%, with 1483 correct predictions out of 2528 games. As we saw earlier when examining the PER distribution of teams against their opponents, player performance varies too much to predict game outcomes accurately, especially compared with team performance, which is often more consistent from game to game.
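The evaluation described above, summing predicted player points per team and checking the implied winner, can be sketched as follows. The function name, the tuple layout, and the toy data are hypothetical; only the final arithmetic check uses the figures reported in the text.

```python
from collections import defaultdict


def predicted_winner_accuracy(player_predictions, actual_winners):
    """Share of games whose predicted winner matches the actual winner.

    player_predictions: iterable of (game_id, team, predicted_points) per player.
    actual_winners: dict mapping game_id -> winning team.
    """
    team_totals = defaultdict(lambda: defaultdict(float))
    for game_id, team, points in player_predictions:
        team_totals[game_id][team] += points

    correct = 0
    for game_id, totals in team_totals.items():
        predicted = max(totals, key=totals.get)  # team with the higher predicted total
        if predicted == actual_winners[game_id]:
            correct += 1
    return correct / len(team_totals)


# Hypothetical predictions for two games.
predictions = [
    (1, "AAA", 20.0), (1, "AAA", 15.0), (1, "BBB", 18.0), (1, "BBB", 12.0),
    (2, "AAA", 10.0), (2, "AAA", 10.0), (2, "CCC", 25.0), (2, "CCC", 5.0),
]
winners = {1: "AAA", 2: "AAA"}
print(predicted_winner_accuracy(predictions, winners))

# The reported figure follows from the same ratio: correct games / total games.
print(round(100 * 1483 / 2528, 2))
```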

Conclusion and Future Optimization

This approach applies not only to the NBA but to many other sports. For those of us who have followed the NBA for a long time, building a model to predict game outcomes is an interesting project, and accurate predictions offer an exciting opportunity for profit.

Our random forest model, tuned with RandomizedSearchCV, gave the highest test accuracy at 67.15%. It slightly outperformed the logistic regression model and significantly outperformed the linear regression model based on individual player statistics. Parameter optimization with GridSearchCV and RandomizedSearchCV was time-consuming and computationally expensive, yet yielded only minor changes in test accuracy. With more time, we would spend less of it tuning parameters and more of it on model selection.

The best NBA game prediction models can only accurately predict the winners about 70% of the time. Therefore, our logistic regression model and random forest classifier are very close to the current prediction limit. If we had more time, we would explore other models to see how high of a test accuracy we could achieve. Some of the candidates might include the SGD classifier, Linear Discriminant Analysis, Convolutional Neural Networks, or Naive Bayes classifiers.