Andrew Lau
There's a great powerlifting data set that collates hundreds of thousands of powerlifting competition results. I thought I'd do a bit of exploration and fit a quick and dirty XGBoost model to see if I can gather some insight on the impact on the total lifted in competition of:
I'm curious about the impacts of the above features on the typical powerlifter. Of course, one of the issues with this dataset is the self-selection bias of lifters (e.g. lifters tend to compete when they feel strong/uninjured). Particularly when looking at the impact of age, older lifters may just choose not to lift after years of accumulated injuries and declining performance and may not appear in the dataset.
The problem here is rather straightforward (there is a lot of clean data (>2m rows), there aren't that many features available) and I'm just curious about the "pure effect" of the features on the target. To make the problem even simpler, the general impact of features like age, gender, drug testing and bodyweight are known within the powerlifting community (e.g. we know that in general males lift more and heavier lifters lift more).
Thus, a quick and dirty model will suffice (e.g. I will just use the tried and tested XGBoost with no feature engineering, basic pre-processing, minimal hyperparameter tuning).
Data sourced from: https://www.openpowerlifting.org/
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
import seaborn as sns
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.metrics import mean_squared_error
import shap
import statsmodels.api as sm
from sklearn.model_selection import RandomizedSearchCV
import pickle
pd.set_option('display.max_columns', None)
N_JOBS = 6
# import the data
data = pd.read_csv("openpowerlifting-2021-01-20.csv",
low_memory=False)
data.shape
display(data)
Let's have a play around with the data and see what it looks like.
Let's take a cheeky look at a random subset of the data.
data_EDA = data.sample(n=100_000, random_state=1)
data_EDA.TotalKg.hist()
# Let's check out are target, the comp total
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(15,5))
#, legend=True after updating pandas
data_EDA.Best3SquatKg.hist(ax=ax1)
data_EDA.Best3BenchKg.hist(ax=ax2)
data_EDA.Best3DeadliftKg.hist(ax=ax3)
data_EDA.TotalKg.hist(ax=ax4)
Distribution of the target as well as the best bench, squat and deadlift is consistent with my anectoal experience and seems reasonable.
data.describe()
# there seems to be negatives in the weights lifted. I'm guessing the negatives are bad lifts
Let's plot some one-way charts to give us a feel of certain features on the target. Of course, these won't be "pure effects" as such, as they will be confounding by correlated variables (mix effects - for example a one-way chart of total lifted vs. weight may overestimate the pure effect of weight as higher weights are correlated with being male).
def one_way(x_var, y_var, data, axis):
ax = sns.regplot(x=x_var, y=y_var, data=data, lowess=True, scatter_kws={"alpha":0.1},
line_kws={"color": "red"}, ax=axis)
Let's have a look at the impact of some features on amount lifted
data_subset = data_EDA.loc[data_EDA.Tested == "Yes"].sample(n=25_000, random_state=1)
fig, ((ax1, ax2, ax3, ax4), (ax5, ax6, ax7, ax8), (ax9, ax10, ax11, ax12)) = plt.subplots(3, 4, figsize=(17.5,10))
one_way("Age", "Best3SquatKg", data=data_subset[data_subset.Best3SquatKg > 0],
axis=ax1)
one_way("BodyweightKg", "Best3SquatKg", data=data_subset[data_subset.Best3SquatKg > 0],
axis=ax2)
sns.boxplot(x="Sex", y="Best3SquatKg", data=data_subset[data_subset.Best3SquatKg > 0], ax=ax3)
sns.boxplot(x="Equipment", y="Best3SquatKg", data=data_subset[data_subset.Best3SquatKg > 0], ax=ax4)
one_way("Age", "Best3BenchKg", data=data_subset[data_subset.Best3BenchKg > 0],
axis=ax5)
one_way("BodyweightKg", "Best3BenchKg", data=data_subset[data_subset.Best3BenchKg > 0],
axis=ax6)
sns.boxplot(x="Sex", y="Best3BenchKg", data=data_subset[data_subset.Best3BenchKg > 0], ax=ax7)
sns.boxplot(x="Equipment", y="Best3BenchKg", data=data_subset[data_subset.Best3BenchKg > 0], ax=ax8)
one_way("Age", "Best3DeadliftKg", data=data_subset[data_subset.Best3DeadliftKg > 0],
axis=ax9)
one_way("BodyweightKg", "Best3DeadliftKg", data=data_subset[data_subset.Best3DeadliftKg > 0],
axis=ax10)
sns.boxplot(x="Sex", y="Best3DeadliftKg", data=data_subset[data_subset.Best3DeadliftKg > 0], ax=ax11)
sns.boxplot(x="Equipment", y="Best3DeadliftKg", data=data_subset[data_subset.Best3DeadliftKg > 0], ax=ax12)
plt.show()
The average lifter appears to peak around the age of 30, before declining over time. The deterioration as the average lifter ages is better than what I was expecting. As expected bodyweight has a significant impact on weight lifted. Again, I note that these are averages and are impacted by conflating/correlated variables. As expected, males tend to lift more and equipped lifters also tend to lift more.
features_list = ['Age',
# 'AgeClass',
# 'Bench1Kg',
# 'Bench2Kg',
# 'Bench3Kg',
# 'Bench4Kg',
# 'Best3BenchKg',
# 'Best3DeadliftKg',
# 'Best3SquatKg',
# 'BirthYearClass',
'BodyweightKg',
'Country',
# 'Date',
# 'Deadlift1Kg',
# 'Deadlift2Kg',
# 'Deadlift3Kg',
# 'Deadlift4Kg',
# 'Division',
# 'Dots',
'Equipment',
'Event',
# 'Federation',
# 'Glossbrenner',
# 'Goodlift',
# 'MeetCountry',
# 'MeetName',
# 'MeetState',
# 'MeetTown',
# 'Name',
'ParentFederation',
# 'Place',
'Sex',
# 'Squat1Kg',
# 'Squat2Kg',
# 'Squat3Kg',
# 'Squat4Kg',
# 'State',
# 'T',
'Tested',
# 'TotalKg',
# 'WeightClassKg',
# 'Wilks'
]
# feature selection
X = data[features_list]
# one hot encoding on full dataset to avoid categorical levels not in training but in valid/test
X = pd.get_dummies(X, drop_first=True)
X_col_names = X.columns
y = data["TotalKg"]
# removing rows where there isn't a total, the age and bodyeight are invalid
y_notna = ~y.isna()
X_valid_age = X.Age > 10
X_valid_weight = X.BodyweightKg > 20
mask = y_notna & X_valid_age & X_valid_weight
y = y[mask]
X = X[mask]
print("na in y_train:", y.isna().sum())
print("na in X_train:", np.sum(np.isnan(X)))
# train test split
# 20% test, 20% validation, 60% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
data.shape
# imputing missing values. not needed for XGB, used for testing out other ML models that can't handle missing data
# imp = SimpleImputer(missing_values=np.nan, strategy='median')
# X_train = imp.fit_transform(X_train)
# X_valid = imp.fit_transform(X_valid)
# X_test = imp.fit_transform(X_test)
We now fit a model to see get a better understanding of the "pure effects" of the features. There is quite a bit of data and the problem is fairly straightforward, so we just default to the trusty XGBoost model. We conduct a light random gridsearch. Early stopping rounds and a validation data set were used to help find the optimal number of estimators at each fit.
# fitted this earlier on Google Cloud Platform
# xgb_1 = XGBRegressor(objective="reg:squarederror", n_estimators=500, n_jobs=16)
# eval_set = [(X_valid, y_valid)]
# gbm_param_grid = {
# 'max_depth': [2, 6],
# 'learning_rate': [0.01, 0.1],
# 'gamma': [0, 0.5, 2],
# 'colsample_bytree':[0.8, 1]
# }
# fit_params={"early_stopping_rounds":25,
# "eval_set" : [[X_valid, y_valid]]
# }
# # Perform random search: grid_mse
# random_search = RandomizedSearchCV(xgb_1,
# cv=4,
# param_distributions=gbm_param_grid,
# n_iter=5,
# n_jobs=2,
# verbose=False,
# random_state=123,
# )
# random_search.fit(X_train, y_train, **fit_params)
# pickle.dump(random_search, open("random_search.pickle", 'wb'))
random_search = pickle.load(open("random_search.pickle", 'rb'))
random_search.best_estimator_.get_params()
best_iter = random_search.best_estimator_.best_iteration
y_train_pred = random_search.best_estimator_.predict(X_train, ntree_limit=best_iter)
y_valid_pred = random_search.best_estimator_.predict(X_valid, ntree_limit=best_iter)
y_test_pred = random_search.best_estimator_.predict(X_test, ntree_limit=best_iter)
print("MSE on training data:", mean_squared_error(y_train, y_train_pred))
print("MSE on validation data:", mean_squared_error(y_valid, y_valid_pred))
print("MSE on test data:", mean_squared_error(y_test, y_test_pred))
def mean_absolute_percentage_error(y_true, y_pred):
return np.mean(abs(y_true / y_pred - 1))
print("MAPE on training data:", mean_absolute_percentage_error(y_train, y_train_pred))
print("MAPE on validation data:", mean_absolute_percentage_error(y_valid, y_valid_pred))
print("MAPE on test data:", mean_absolute_percentage_error(y_test, y_test_pred))
The mean squared error for the validation/test dataset aren't significantly worse than the the training data. No signs of overfitting here. MSE is a little hard to interpret, so we also look at MAPE (mean absolute percentage error) to get a feel of the average percentage prediction error. Around 13-14% on all splits of the data, seems like a reasonable model.
We compare the average fitted and actuals by key factors to see if there are any segments that do not fit well. The fit looks quite good. The fit deteriorates a bit for higher ages/weights, but they are low exposure segments.
pd.DataFrame({"factor":X_valid.loc[:, "Age"].round(), "y_true":y_valid, "y_pred":y_valid_pred}).groupby("factor").mean().plot()
pd.DataFrame({"factor":X_valid.loc[:, "BodyweightKg"].round(), "y_true":y_valid, "y_pred":y_valid_pred}).groupby("factor").mean().plot()
pd.DataFrame({"factor":X_valid.loc[:, "Sex_M"], "y_true":y_valid, "y_pred":y_valid_pred}).groupby("factor").mean().plot()
The next diagnostic we look at is the quantile plot. We compare predicted and actuals by quantiles.
def quantile_plot(y_true, y_pred, quantiles=10, scale=1.0, title=None, return_df=False):
"""
description: plots a quantile plot based on sorting observations by y_pred
and grouping them into quantiles.
inputs:
y_true - array like object of actuals
y_pred - array like object of predicted values
quantiles - number of quantiles/bins to separate the data
scale - the predicted values can be scaled to focus on the ability of
the model to rank the observations
title - plot title
return_df - False. Whether to return a pandas dataframe with the quantiles
returns:
DataFrame with y_true and y_pred grouped into quantiles sorted by y_pred
"""
if scale == "auto":
scale_used = sum(y_true) / sum(y_pred)
else:
scale_used = scale
df = pd.DataFrame({'Actual':np.array(y_true).reshape(-1),
'Predicted':np.array(y_pred).reshape(-1) * scale_used})
df = df.groupby(pd.qcut(df.rank(method="first").loc[:, "Predicted"], quantiles)).mean()
df.loc[:, "Quantile"] = (list(range(1, quantiles + 1)))
df = df.set_index("Quantile")
df.plot()
plt.gca().set_ylabel("Target")
plt.xticks(rotation = 90)
if title:
plt.title(title)
else:
plt.title("Predicted vs. Observed (" + str(quantiles) + " quantiles)")
if return_df:
return df
quantile_plot(y_train, y_train_pred, quantiles=100, scale=1.0, title="training")
quantile_plot(y_valid, y_valid_pred, quantiles=100, scale=1.0, title="validation")
quantile_plot(y_valid, y_valid_pred, quantiles=100, scale=1.0, title="test")
Quantile plot looks good - the predictions are close to the actuals in each quantile and there is decent separation between the high and low quantiles
def gini_score(y_true, y_pre, **kwargs):
"""
description: model scoring metric. returns gini index.
input: y_true, y_pre
returns: gini score
adapted from: https://zhiyzuo.github.io/Plot-Lorenz/
"""
sorted_arr = np.array([x for _, x in sorted(zip(y_true, y_pre))])
n = sorted_arr.size
coef_ = 2. / n
const_ = (n + 1.) / n
weighted_sum = sum([(i + 1) * yi for i, yi in enumerate(sorted_arr)])
return coef_ * weighted_sum / (sorted_arr.sum()) - const_
def lorenz_curve(y_true, y_pred):
"""
description: plots the lorenz curve based on y_true and y_pred.
input: y_true, y_pre
adapted from: https://zhiyzuo.github.io/Plot-Lorenz/
"""
X = np.array([x for _,x in sorted(zip(y_true, y_pred))])
X_lorenz = X.cumsum() / X.sum()
X_lorenz = np.insert(X_lorenz, 0, 0)
X_lorenz[0], X_lorenz[-1]
fig, ax = plt.subplots(figsize=[6,6])
## scatter plot of Lorenz curve
ax.scatter(np.arange(X_lorenz.size)/(X_lorenz.size-1), X_lorenz,
marker='x', color='orange', s=5, label="Predicted Separation")
# repeat for perfect separation
X = np.array([x for _,x in sorted(zip(y_true, y_true))])
X_lorenz = X.cumsum() / X.sum()
X_lorenz = np.insert(X_lorenz, 0, 0)
X_lorenz[0], X_lorenz[-1]
## scatter plot of Lorenz curve
ax.scatter(np.arange(X_lorenz.size)/(X_lorenz.size-1), X_lorenz,
marker='x', color='red', s=5, label="Perfect Separation")
## line plot of equality\
ax.plot([0,1], [0,1], color='k')
ax.set_title("Gini Index - Predicted: %.4f, Gini Index Perfect: %.4f" % (gini_score(y_true, y_pred), gini_score(y_true, y_true)))
ax.legend()
lorenz_curve(y_train, y_train_pred)
lorenz_curve(y_valid, y_valid_pred)
lorenz_curve(y_test, y_test_pred)
# compute SHAP values
explainer = shap.TreeExplainer(random_search.best_estimator_)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, features=X_train, feature_names=X_col_names, plot_type="bar")
The most important feature is the event, which makes sense, given the drastic differences in totals from a full powerlifting meet (squat, bench and deadlift) and a bench only meet. It's interesting how gender came up above bodyweight.
shap.summary_plot(shap_values, features=X_train, feature_names=X_col_names)
The above chart shows the impact of the top features and the distributions of their SHAP values. When the colours mix, it implies an interaction with something else.
Let's look at some SHAP partial dependencies to analyse the pure effects of each feature
shap.dependence_plot("rank(0)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
shap.dependence_plot("rank(1)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
shap.dependence_plot("rank(2)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
shap.dependence_plot("rank(3)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
shap.dependence_plot("rank(4)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
Event_SBD is such a strong factor, it's interacting with quite a lot here. Interesting how lifters between the ages of 20-40 (in their prime), doing full powerlifting meets tended to lift more (above the impact of the full powerlifting meet).
These clouds of SHAP values are a little harder to interpret, so I've averaged it out in the graphs below.
def shap_plot_mean(var):
shap_df = pd.DataFrame({"SHAP Value":shap_values[:,X_col_names.get_indexer([var])[0]], var:X_train.iloc[:,X_col_names.get_indexer([var])[0]]}).round().groupby(var).mean()
shap_df.plot()
shap_plot_mean("Age")
shap_plot_mean("BodyweightKg")
shap_plot_mean("Sex_M")
Based off the above, you'll only taper off in performance after the age of 35 (noting the self-selection bias of the data), not suprisingly bodyweight increases weight lifted (albeit with diminishing returns), and the typical male powerlifter lifts about 140kg more than the typical female powerlifter.