An Exploration of the Open Powerlifting dataset

Andrew Lau
There's a great powerlifting data set that collates hundreds of thousands of powerlifting competition results. I thought I'd do a bit of exploration and fit a quick and dirty XGBoost model to see if I can gather some insight on the impact on the total lifted in competition of:

  • age, particularly when people tend to "peak" in their lifting careers
  • bodyweight
  • gender
  • equipment

I'm curious about the impacts of the above features on the typical powerlifter. Of course, one of the issues with this dataset is the self-selection bias of lifters (e.g. lifters tend to compete when they feel strong/uninjured). Particularly when looking at the impact of age, older lifters may just choose not to lift after years of accumulated injuries and declining performance and may not appear in the dataset.

The problem here is rather straightforward (there is a lot of clean data (>2m rows), there aren't that many features available) and I'm just curious about the "pure effect" of the features on the target. To make the problem even simpler, the general impact of features like age, gender, drug testing and bodyweight are known within the powerlifting community (e.g. we know that in general males lift more and heavier lifters lift more).

Thus, a quick and dirty model will suffice (e.g. I will just use the tried and tested XGBoost with no feature engineering, basic pre-processing, minimal hyperparameter tuning).

Data sourced from: https://www.openpowerlifting.org/

In [34]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
import seaborn as sns
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.metrics import mean_squared_error
import shap
import statsmodels.api as sm
from sklearn.model_selection import RandomizedSearchCV
import pickle

pd.set_option('display.max_columns', None)

N_JOBS = 6
In [35]:
# import the data
data = pd.read_csv("openpowerlifting-2021-01-20.csv",
                  low_memory=False)
In [36]:
data.shape
Out[36]:
(2301966, 41)
In [37]:
display(data)
Name Sex Event Equipment Age AgeClass BirthYearClass Division BodyweightKg WeightClassKg Squat1Kg Squat2Kg Squat3Kg Squat4Kg Best3SquatKg Bench1Kg Bench2Kg Bench3Kg Bench4Kg Best3BenchKg Deadlift1Kg Deadlift2Kg Deadlift3Kg Deadlift4Kg Best3DeadliftKg TotalKg Place Dots Wilks Glossbrenner Goodlift Tested Country State Federation ParentFederation Date MeetCountry MeetState MeetTown MeetName
0 Dominik Gabriel M B Raw 17.0 16-17 14-18 T 16-17 102.5 110 NaN NaN NaN NaN NaN 120.0 130.0 -140.0 NaN 130.0 NaN NaN NaN NaN NaN 130.0 1 79.18 78.36 74.83 58.87 NaN Slovakia NaN WUAP WUAP 2015-06-16 Czechia NaN Praha European Championships
1 Marek Herák M B Multi-ply 19.0 18-19 19-23 T 18-19 59.8 60 NaN NaN NaN NaN NaN 165.0 -175.0 175.0 -180.0 175.0 NaN NaN NaN NaN NaN 175.0 1 148.11 149.71 146.22 84.82 NaN Slovakia NaN WUAP WUAP 2015-06-16 Czechia NaN Praha European Championships
2 Miroslav Adamove M B Multi-ply 18.0 18-19 14-18 T 18-19 87.7 90 NaN NaN NaN NaN NaN 250.0 -270.0 -270.0 NaN 250.0 NaN NaN NaN NaN NaN 250.0 1 163.81 161.77 155.22 85.72 NaN Slovakia NaN WUAP WUAP 2015-06-16 Czechia NaN Praha European Championships
3 Kamil Lipiňski M B Multi-ply 19.0 18-19 19-23 T 18-19 89.5 90 NaN NaN NaN NaN NaN -170.0 -180.0 -180.0 NaN NaN NaN NaN NaN NaN NaN NaN DQ NaN NaN NaN NaN NaN Poland NaN WUAP WUAP 2015-06-16 Czechia NaN Praha European Championships
4 Gabriel Kováč M B Multi-ply 20.0 20-23 19-23 Juniors 81.6 82.5 NaN NaN NaN NaN NaN 260.0 280.0 -290.0 NaN 280.0 NaN NaN NaN NaN NaN 280.0 1 190.86 188.82 181.77 100.89 NaN Slovakia NaN WUAP WUAP 2015-06-16 Czechia NaN Praha European Championships
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2301961 Fady Meilad M SBD Raw 34.5 24-34 24-39 Open 94.5 105 200.0 210.0 -225.0 NaN 210.0 150.0 160.0 -165.0 NaN 160.0 230.0 250.0 -255.0 NaN 250.0 620.0 2 391.49 386.58 369.78 80.48 Yes Egypt NaN AfricanPF IPF 2015-10-24 Morocco NaN Meknes African Classic Powerlifting Championships
2301962 Mohamed Omar M SBD Raw 30.5 24-34 24-39 Open 110.3 120 150.0 NaN NaN NaN 150.0 -100.0 100.0 115.0 NaN 115.0 -230.0 240.0 -265.0 NaN 240.0 505.0 2 298.79 296.94 283.83 60.95 Yes Egypt NaN AfricanPF IPF 2015-10-24 Morocco NaN Meknes African Classic Powerlifting Championships
2301963 Boudjemaa Nait Ladjemil M SBD Raw 33.5 24-34 24-39 Open 116.4 120 270.0 290.0 -300.0 NaN 290.0 170.0 177.5 182.5 NaN 182.5 280.0 300.0 312.5 NaN 312.5 785.0 1 455.52 454.68 435.41 92.48 Yes Algeria NaN AfricanPF IPF 2015-10-24 Morocco NaN Meknes African Classic Powerlifting Championships
2301964 Mohamed Eldeken M SBD Raw 38.5 35-39 24-39 Open 128.3 120+ 150.0 NaN NaN NaN 150.0 100.0 NaN NaN NaN 100.0 150.0 NaN NaN NaN 150.0 400.0 2 224.87 226.78 216.80 45.19 Yes Egypt NaN AfricanPF IPF 2015-10-24 Morocco NaN Meknes African Classic Powerlifting Championships
2301965 Mohamed Bouafia M SBD Raw 38.0 35-39 24-39 Open 130.5 120+ 350.0 375.0 390.0 NaN 390.0 210.0 220.0 222.5 NaN 222.5 335.0 355.0 -367.5 NaN 355.0 967.5 1 541.12 546.84 522.26 108.52 Yes Algeria NaN AfricanPF IPF 2015-10-24 Morocco NaN Meknes African Classic Powerlifting Championships

2301966 rows × 41 columns

Exploratory data analysis

Let's have a play around with the data and see what it looks like.

Target distribution

Let's take a cheeky look at a random subset of the data.

In [38]:
data_EDA = data.sample(n=100_000, random_state=1)
In [39]:
data_EDA.TotalKg.hist()
Out[39]:
<AxesSubplot:>
In [40]:
# Let's check out are target, the comp total
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(15,5))

#, legend=True after updating pandas
data_EDA.Best3SquatKg.hist(ax=ax1)
data_EDA.Best3BenchKg.hist(ax=ax2)
data_EDA.Best3DeadliftKg.hist(ax=ax3)
data_EDA.TotalKg.hist(ax=ax4)
Out[40]:
<AxesSubplot:>

Distribution of the target as well as the best bench, squat and deadlift is consistent with my anectoal experience and seems reasonable.

In [41]:
data.describe()
# there seems to be negatives in the weights lifted. I'm guessing the negatives are bad lifts
Out[41]:
Age BodyweightKg Squat1Kg Squat2Kg Squat3Kg Squat4Kg Best3SquatKg Bench1Kg Bench2Kg Bench3Kg Bench4Kg Best3BenchKg Deadlift1Kg Deadlift2Kg Deadlift3Kg Deadlift4Kg Best3DeadliftKg TotalKg Dots Wilks Glossbrenner Goodlift
count 1.405630e+06 2.272788e+06 548366.000000 541887.000000 525775.000000 5965.000000 1.512067e+06 917694.000000 906210.000000 876436.000000 16381.000000 2.019193e+06 629286.000000 616535.000000 587107.000000 15711.000000 1.638153e+06 2.137665e+06 2.118599e+06 2.118599e+06 2.118599e+06 1.929263e+06
mean 3.100391e+01 8.405366e+01 112.103324 92.430403 32.275855 72.392890 1.749615e+02 83.474699 55.112395 -18.091854 23.887011 1.193173e+02 159.314554 129.661601 16.392483 77.594206 1.893182e+02 3.755357e+02 2.731590e+02 2.721775e+02 2.566639e+02 6.390920e+01
std 1.318477e+01 2.273521e+01 144.052702 169.183631 196.035332 188.238942 6.995331e+01 104.345559 129.211113 143.203392 164.077946 5.483921e+01 109.758061 160.004725 212.291169 188.829659 6.252989e+01 2.069084e+02 1.302450e+02 1.297588e+02 1.234186e+02 1.619867e+01
min 0.000000e+00 1.510000e+01 -555.000000 -580.000000 -600.500000 -550.000000 -5.080200e+02 -502.500000 -575.000000 -575.000000 -500.000000 -5.225000e+02 -461.000000 -470.000000 -587.500000 -461.000000 -4.100000e+02 1.000000e+00 6.800000e-01 6.700000e-01 6.400000e-01 5.000000e-01
25% 2.100000e+01 6.705000e+01 87.500000 70.000000 -160.000000 -100.000000 1.225000e+02 55.000000 -52.500000 -140.000000 -125.000000 7.711000e+01 125.000000 115.000000 -205.000000 -100.000000 1.400000e+02 2.000000e+02 1.407300e+02 1.402100e+02 1.327900e+02 5.219000e+01
50% 2.750000e+01 8.196000e+01 145.000000 142.900000 110.000000 131.000000 1.700000e+02 105.000000 95.000000 -60.000000 75.000000 1.150000e+02 180.000000 177.500000 117.500000 142.500000 1.900000e+02 3.500000e+02 2.938400e+02 2.929600e+02 2.735200e+02 6.362000e+01
75% 3.900000e+01 9.890000e+01 200.000000 202.500000 190.000000 200.000000 2.200000e+02 145.000000 145.000000 120.000000 155.000000 1.542200e+02 225.000000 230.000000 205.000000 206.380000 2.350000e+02 5.307000e+02 3.715400e+02 3.700700e+02 3.504900e+02 7.517000e+01
max 9.800000e+01 2.602000e+02 555.000000 577.500000 560.000000 592.390000 5.810500e+02 467.500000 487.500000 478.540000 487.610000 5.034900e+02 450.000000 460.400000 457.500000 440.500000 4.604000e+02 1.407500e+03 7.952200e+02 7.933300e+02 7.569000e+02 1.464900e+02

One-way analysis

Let's plot some one-way charts to give us a feel of certain features on the target. Of course, these won't be "pure effects" as such, as they will be confounding by correlated variables (mix effects - for example a one-way chart of total lifted vs. weight may overestimate the pure effect of weight as higher weights are correlated with being male).

In [42]:
def one_way(x_var, y_var, data, axis):
    ax = sns.regplot(x=x_var, y=y_var, data=data, lowess=True, scatter_kws={"alpha":0.1},
                     line_kws={"color": "red"}, ax=axis)

Let's have a look at the impact of some features on amount lifted

In [43]:
data_subset = data_EDA.loc[data_EDA.Tested == "Yes"].sample(n=25_000, random_state=1)

fig, ((ax1, ax2, ax3, ax4), (ax5, ax6, ax7, ax8), (ax9, ax10, ax11, ax12)) = plt.subplots(3, 4, figsize=(17.5,10))

one_way("Age", "Best3SquatKg", data=data_subset[data_subset.Best3SquatKg > 0],
       axis=ax1)
one_way("BodyweightKg", "Best3SquatKg", data=data_subset[data_subset.Best3SquatKg > 0],
       axis=ax2)
sns.boxplot(x="Sex", y="Best3SquatKg", data=data_subset[data_subset.Best3SquatKg > 0], ax=ax3)
sns.boxplot(x="Equipment", y="Best3SquatKg", data=data_subset[data_subset.Best3SquatKg > 0], ax=ax4)


one_way("Age", "Best3BenchKg", data=data_subset[data_subset.Best3BenchKg > 0],
       axis=ax5)
one_way("BodyweightKg", "Best3BenchKg", data=data_subset[data_subset.Best3BenchKg > 0],
       axis=ax6)
sns.boxplot(x="Sex", y="Best3BenchKg", data=data_subset[data_subset.Best3BenchKg > 0], ax=ax7)
sns.boxplot(x="Equipment", y="Best3BenchKg", data=data_subset[data_subset.Best3BenchKg > 0], ax=ax8)

one_way("Age", "Best3DeadliftKg", data=data_subset[data_subset.Best3DeadliftKg > 0],
       axis=ax9)
one_way("BodyweightKg", "Best3DeadliftKg", data=data_subset[data_subset.Best3DeadliftKg > 0],
       axis=ax10)
sns.boxplot(x="Sex", y="Best3DeadliftKg", data=data_subset[data_subset.Best3DeadliftKg > 0], ax=ax11)
sns.boxplot(x="Equipment", y="Best3DeadliftKg", data=data_subset[data_subset.Best3DeadliftKg > 0], ax=ax12)

plt.show()

The average lifter appears to peak around the age of 30, before declining over time. The deterioration as the average lifter ages is better than what I was expecting. As expected bodyweight has a significant impact on weight lifted. Again, I note that these are averages and are impacted by conflating/correlated variables. As expected, males tend to lift more and equipped lifters also tend to lift more.

Pre-processing

Feature selection and missing data imputation

I'm just fitting a quick and dirty model here, so I'm removing duplicated/redudant features and not bothering with any feature engineering.

In [44]:
features_list = ['Age',
#  'AgeClass',
#  'Bench1Kg',
#  'Bench2Kg',
#  'Bench3Kg',
#  'Bench4Kg',
#  'Best3BenchKg',
#  'Best3DeadliftKg',
#  'Best3SquatKg',
#  'BirthYearClass',
 'BodyweightKg',
 'Country',
#  'Date',
#  'Deadlift1Kg',
#  'Deadlift2Kg',
#  'Deadlift3Kg',
#  'Deadlift4Kg',
#  'Division',
#  'Dots',
 'Equipment',
 'Event',
#  'Federation',
#  'Glossbrenner',
#  'Goodlift',
#  'MeetCountry',
#  'MeetName',
#  'MeetState',
#  'MeetTown',
#  'Name',
 'ParentFederation',
#  'Place',
 'Sex',
#  'Squat1Kg',
#  'Squat2Kg',
#  'Squat3Kg',
#  'Squat4Kg',
#  'State',
#  'T',
 'Tested',
#  'TotalKg',
#  'WeightClassKg',
#  'Wilks'
                ]
In [45]:
# feature selection
X = data[features_list]

# one hot encoding on full dataset to avoid categorical levels not in training but in valid/test
X = pd.get_dummies(X, drop_first=True)
X_col_names = X.columns

y = data["TotalKg"]

# removing rows where there isn't a total, the age and bodyeight are invalid
y_notna = ~y.isna()
X_valid_age = X.Age > 10
X_valid_weight = X.BodyweightKg > 20

mask = y_notna & X_valid_age & X_valid_weight

y = y[mask]
X = X[mask]

print("na in y_train:", y.isna().sum())
print("na in X_train:", np.sum(np.isnan(X)))
na in y_train: 0
na in X_train: Age                       0
BodyweightKg              0
Country_Afghanistan       0
Country_Algeria           0
Country_American Samoa    0
                         ..
ParentFederation_WRPF     0
ParentFederation_WUAP     0
ParentFederation_XPC      0
Sex_M                     0
Sex_Mx                    0
Length: 224, dtype: int64
In [46]:
# train test split
# 20% test, 20% validation, 60% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
In [47]:
data.shape
Out[47]:
(2301966, 41)
In [48]:
# imputing missing values. not needed for XGB, used for testing out other ML models that can't handle missing data
# imp = SimpleImputer(missing_values=np.nan, strategy='median')
# X_train = imp.fit_transform(X_train)
# X_valid = imp.fit_transform(X_valid)
# X_test = imp.fit_transform(X_test)

Fitting a model - XGBoost

We now fit a model to see get a better understanding of the "pure effects" of the features. There is quite a bit of data and the problem is fairly straightforward, so we just default to the trusty XGBoost model. We conduct a light random gridsearch. Early stopping rounds and a validation data set were used to help find the optimal number of estimators at each fit.

In [49]:
# fitted this earlier on Google Cloud Platform
# xgb_1 = XGBRegressor(objective="reg:squarederror", n_estimators=500, n_jobs=16)
# eval_set = [(X_valid, y_valid)]


# gbm_param_grid = {
#     'max_depth': [2, 6],
#     'learning_rate': [0.01, 0.1],
#     'gamma': [0, 0.5, 2], 
#     'colsample_bytree':[0.8, 1]
# }

# fit_params={"early_stopping_rounds":25,
#             "eval_set" : [[X_valid, y_valid]]
#            }

# # Perform random search: grid_mse
# random_search = RandomizedSearchCV(xgb_1, 
#                                    cv=4,
#                                    param_distributions=gbm_param_grid, 
#                                    n_iter=5,
#                                    n_jobs=2,
#                                    verbose=False, 
#                                    random_state=123,
#                                   )

# random_search.fit(X_train, y_train, **fit_params)
In [50]:
# pickle.dump(random_search, open("random_search.pickle", 'wb'))
random_search = pickle.load(open("random_search.pickle", 'rb'))
In [51]:
random_search.best_estimator_.get_params()
Out[51]:
{'objective': 'reg:squarederror',
 'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'gpu_id': -1,
 'importance_type': 'gain',
 'interaction_constraints': '',
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 6,
 'min_child_weight': 1,
 'missing': nan,
 'monotone_constraints': '()',
 'n_estimators': 500,
 'n_jobs': 16,
 'num_parallel_tree': 1,
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'subsample': 1,
 'tree_method': 'exact',
 'validate_parameters': 1,
 'verbosity': None}

Diagnostics

MSE and MAPE

In [52]:
best_iter = random_search.best_estimator_.best_iteration
y_train_pred = random_search.best_estimator_.predict(X_train, ntree_limit=best_iter)
y_valid_pred = random_search.best_estimator_.predict(X_valid, ntree_limit=best_iter)
y_test_pred = random_search.best_estimator_.predict(X_test, ntree_limit=best_iter)

print("MSE on training data:", mean_squared_error(y_train, y_train_pred))
print("MSE on validation data:", mean_squared_error(y_valid, y_valid_pred))
print("MSE on test data:", mean_squared_error(y_test, y_test_pred))

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(abs(y_true / y_pred - 1))

print("MAPE on training data:", mean_absolute_percentage_error(y_train, y_train_pred))
print("MAPE on validation data:", mean_absolute_percentage_error(y_valid, y_valid_pred))
print("MAPE on test data:", mean_absolute_percentage_error(y_test, y_test_pred))
MSE on training data: 4349.150095184332
MSE on validation data: 4446.135423677414
MSE on test data: 4410.636658830921
MAPE on training data: 0.13601618227032103
MAPE on validation data: 0.13763610656733644
MAPE on test data: 0.13728857659985094

The mean squared error for the validation/test dataset aren't significantly worse than the the training data. No signs of overfitting here. MSE is a little hard to interpret, so we also look at MAPE (mean absolute percentage error) to get a feel of the average percentage prediction error. Around 13-14% on all splits of the data, seems like a reasonable model.

One-ways by key factors

We compare the average fitted and actuals by key factors to see if there are any segments that do not fit well. The fit looks quite good. The fit deteriorates a bit for higher ages/weights, but they are low exposure segments.

In [53]:
pd.DataFrame({"factor":X_valid.loc[:, "Age"].round(), "y_true":y_valid, "y_pred":y_valid_pred}).groupby("factor").mean().plot()
pd.DataFrame({"factor":X_valid.loc[:, "BodyweightKg"].round(), "y_true":y_valid, "y_pred":y_valid_pred}).groupby("factor").mean().plot()
pd.DataFrame({"factor":X_valid.loc[:, "Sex_M"], "y_true":y_valid, "y_pred":y_valid_pred}).groupby("factor").mean().plot()
Out[53]:
<AxesSubplot:xlabel='factor'>

Quantile plot

The next diagnostic we look at is the quantile plot. We compare predicted and actuals by quantiles.

In [54]:
def quantile_plot(y_true, y_pred, quantiles=10, scale=1.0, title=None, return_df=False):
    """ 
    description: plots a quantile plot based on sorting observations by y_pred
        and grouping them into quantiles.
    inputs:
        y_true - array like object of actuals
        y_pred - array like object of predicted values
        quantiles - number of quantiles/bins to separate the data
        scale - the predicted values can be scaled to focus on the ability of
            the model to rank the observations
        title - plot title
        return_df - False. Whether to return a pandas dataframe with the quantiles
    returns:
        DataFrame with y_true and y_pred grouped into quantiles sorted by y_pred
    """
    if scale == "auto":
        scale_used = sum(y_true) / sum(y_pred)
    else:
        scale_used = scale

    df = pd.DataFrame({'Actual':np.array(y_true).reshape(-1),
                                  'Predicted':np.array(y_pred).reshape(-1) * scale_used})
    df = df.groupby(pd.qcut(df.rank(method="first").loc[:, "Predicted"], quantiles)).mean()
    df.loc[:, "Quantile"] = (list(range(1, quantiles + 1)))
    df = df.set_index("Quantile")

    df.plot()
    plt.gca().set_ylabel("Target")    
    plt.xticks(rotation = 90)
    
    if title:
        plt.title(title)    
    else:
        plt.title("Predicted vs. Observed (" + str(quantiles) + " quantiles)")

    if return_df:
        return df

quantile_plot(y_train, y_train_pred, quantiles=100, scale=1.0, title="training")
quantile_plot(y_valid, y_valid_pred, quantiles=100, scale=1.0, title="validation")
quantile_plot(y_valid, y_valid_pred, quantiles=100, scale=1.0, title="test")

Quantile plot looks good - the predictions are close to the actuals in each quantile and there is decent separation between the high and low quantiles

Lorenz curve

In [55]:
def gini_score(y_true, y_pre, **kwargs):
    """
    description: model scoring metric. returns gini index.
    input: y_true, y_pre
    returns: gini score
    adapted from: https://zhiyzuo.github.io/Plot-Lorenz/
    """
    sorted_arr = np.array([x for _, x in sorted(zip(y_true, y_pre))])
    n = sorted_arr.size
    coef_ = 2. / n
    const_ = (n + 1.) / n
    weighted_sum = sum([(i + 1) * yi for i, yi in enumerate(sorted_arr)])

    return coef_ * weighted_sum / (sorted_arr.sum()) - const_
def lorenz_curve(y_true, y_pred):
    """
    description: plots the lorenz curve based on y_true and y_pred.
    input: y_true, y_pre
    adapted from: https://zhiyzuo.github.io/Plot-Lorenz/
    """
    X = np.array([x for _,x in sorted(zip(y_true, y_pred))])

    X_lorenz = X.cumsum() / X.sum()
    X_lorenz = np.insert(X_lorenz, 0, 0)
    X_lorenz[0], X_lorenz[-1]
    fig, ax = plt.subplots(figsize=[6,6])
    ## scatter plot of Lorenz curve
    ax.scatter(np.arange(X_lorenz.size)/(X_lorenz.size-1), X_lorenz,
               marker='x', color='orange', s=5, label="Predicted Separation")

    # repeat for perfect separation
    X = np.array([x for _,x in sorted(zip(y_true, y_true))])

    X_lorenz = X.cumsum() / X.sum()
    X_lorenz = np.insert(X_lorenz, 0, 0)
    X_lorenz[0], X_lorenz[-1]

    ## scatter plot of Lorenz curve
    ax.scatter(np.arange(X_lorenz.size)/(X_lorenz.size-1), X_lorenz,
               marker='x', color='red', s=5, label="Perfect Separation")

    ## line plot of equality\
    ax.plot([0,1], [0,1], color='k')
    ax.set_title("Gini Index - Predicted: %.4f, Gini Index Perfect: %.4f" % (gini_score(y_true, y_pred), gini_score(y_true, y_true)))
    ax.legend()
    
lorenz_curve(y_train, y_train_pred)
lorenz_curve(y_valid, y_valid_pred)
lorenz_curve(y_test, y_test_pred)

Interpreting results

Now that we're happy with this model, let's draw some insight from it!

Feature importance

In [56]:
# compute SHAP values
explainer = shap.TreeExplainer(random_search.best_estimator_)
shap_values = explainer.shap_values(X_train)
In [57]:
shap.summary_plot(shap_values, features=X_train, feature_names=X_col_names, plot_type="bar")

The most important feature is the event, which makes sense, given the drastic differences in totals from a full powerlifting meet (squat, bench and deadlift) and a bench only meet. It's interesting how gender came up above bodyweight.

In [58]:
shap.summary_plot(shap_values, features=X_train, feature_names=X_col_names)

The above chart shows the impact of the top features and the distributions of their SHAP values. When the colours mix, it implies an interaction with something else.

  • A full powerlifting meeting has a positive impact on total
  • Being male has a positive impact on total
  • A heavier bodyweight has a positive impact on total
  • Age is too heavily interacted with other features to discern a simple effect here
  • Raw (minimal equipment) has a negative impact on total

Partial dependencies

Let's look at some SHAP partial dependencies to analyse the pure effects of each feature

In [59]:
shap.dependence_plot("rank(0)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
shap.dependence_plot("rank(1)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
shap.dependence_plot("rank(2)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
shap.dependence_plot("rank(3)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
shap.dependence_plot("rank(4)", shap_values, X_train, feature_names = X_col_names, alpha=0.01)
Passing parameters norm and vmin/vmax simultaneously is deprecated since 3.3 and will become an error two minor releases later. Please pass vmin/vmax directly to the norm when creating it.
Passing parameters norm and vmin/vmax simultaneously is deprecated since 3.3 and will become an error two minor releases later. Please pass vmin/vmax directly to the norm when creating it.
Passing parameters norm and vmin/vmax simultaneously is deprecated since 3.3 and will become an error two minor releases later. Please pass vmin/vmax directly to the norm when creating it.
Passing parameters norm and vmin/vmax simultaneously is deprecated since 3.3 and will become an error two minor releases later. Please pass vmin/vmax directly to the norm when creating it.

Event_SBD is such a strong factor, it's interacting with quite a lot here. Interesting how lifters between the ages of 20-40 (in their prime), doing full powerlifting meets tended to lift more (above the impact of the full powerlifting meet).

These clouds of SHAP values are a little harder to interpret, so I've averaged it out in the graphs below.

In [60]:
def shap_plot_mean(var):
    shap_df = pd.DataFrame({"SHAP Value":shap_values[:,X_col_names.get_indexer([var])[0]], var:X_train.iloc[:,X_col_names.get_indexer([var])[0]]}).round().groupby(var).mean()
    shap_df.plot()
    
shap_plot_mean("Age")
shap_plot_mean("BodyweightKg")
shap_plot_mean("Sex_M")

Based off the above, you'll only taper off in performance after the age of 35 (noting the self-selection bias of the data), not suprisingly bodyweight increases weight lifted (albeit with diminishing returns), and the typical male powerlifter lifts about 140kg more than the typical female powerlifter.