The code for this example can be found on my GitHub.
Introduction
Gradient boosting methods are widely used in machine learning. The idea is fairly straightforward: an ensemble of decision trees is built sequentially, with each new tree correcting the errors of the previous ones, its contribution scaled by a learning rate. Gradient boosting algorithms seek to minimize an objective function.
Most common machine learning algorithms use a similar kind of objective function, grounded in a frequentist approach to statistics. A Bayesian approach instead treats the model parameters as probability distributions rather than returning a single point estimate.
We can use a Bayesian approach to determine the model parameters. This approach allows us to incorporate our prior beliefs about the shape of the trees and the overall ensemble structure into the model. The algorithm then updates these priors based on the data using an MCMC back-fitting technique. Markov chain Monte Carlo (MCMC) is commonly used in Bayesian statistics to approximate complex distributions and thereby estimate the posterior distribution of the model parameters. For further details see here.
Bayesian Additive Regression Trees (BART) utilizes similar tree boosting methods but through a Bayesian framework where predictions are drawn from a posterior distribution, which is a probability distribution of model parameters given the observed data. The BART model can be expressed as follows:
$$ \begin{equation} Y = f(X) + \varepsilon \approx T^M_1(X) + T^M_2(X) + \ldots + T^M_m(X) + \varepsilon, \hspace{0.5cm} \varepsilon \sim N_n(0, \sigma^2 I_n) \end{equation} $$
\(Y\) represents an \(n \times 1\) vector of the response variable and \(X\) is the \(n \times p\) matrix of predictor columns. \(\varepsilon\) is the \(n \times 1\) error vector. \(m\) is the number of regression trees, each consisting of a tree structure denoted \(T\) and terminal-node (leaf) parameters denoted \(M\); together the structure and leaf parameters form an entire tree \(T^M\).
Each regression tree partitions the predictor space into smaller regions via its splitting nodes, and summing the trees' contributions builds up a picture of the overall prediction surface. Each individual tree is a weak learner; adding many of them together yields the full model. The data pass through each tree and only the residual flows on to the next tree. To address overfitting, the Bayesian approach penalizes overly complex trees through regularization priors, in much the same way that the regularization term does in boosting algorithms.
BART is a non-parametric regression model. Non-parametric methods are commonly used when the relationships between variables are complex and difficult to express in a fixed functional form. They can also be useful for exploratory data analysis, as they help identify patterns and relationships that may not be apparent from simple summary statistics.
Approach
I’ve gathered some economic data from the FRED site (the “fredr” R package provides access to the FRED database via an API), and I’m developing a BART model to forecast the US unemployment rate.
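The data pull itself isn’t shown here, but as a rough sketch it could look like the following. The helper function and the API-key placeholder are my own assumptions, and any transformations applied to the raw series (for example, the CPI column below looks like a year-over-year change) are not reproduced:
library(fredr)
library(purrr)
library(dplyr)
fredr_set_key("YOUR_FRED_API_KEY")  # personal API key from fred.stlouisfed.org
# hypothetical helper: download one series at monthly frequency and keep date + value
get_series <- function(id) {
  x <- fredr(series_id = id,
             observation_start = as.Date("2004-01-01"),
             observation_end = as.Date("2022-12-01"),
             frequency = "m")
  out <- x[, c("date", "value")]
  names(out)[2] <- id
  out
}
ids <- c("UNRATE", "T10Y3M", "STLFSI4", "EFFR",
         "T5YIFR", "REAINTRATREARAT1YE", "CPIAUCSL")
data_frame <- reduce(map(ids, get_series), left_join, by = "date")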
My dataset can be seen below. The observations run from 2004 until the end of 2022; UNRATE is the target variable \(y\) and the remaining columns are the independent variables \(x\).
> str(data_frame)
tibble [228 × 8] (S3: tbl_df/tbl/data.frame)
$ date : Date[1:228], format: "2004-01-01" "2004-02-01" ...
$ UNRATE (unemployment rate) : num [1:228] 5.7 5.6 5.8 5.6 5.6 5.6 5.5 5.4 5.4 5.5 ...
$ T10Y3M (10Y-3M spread) : num [1:228] 3.25 3.14 2.87 3.39 3.68 3.45 3.14 2.78 2.44 2.3 ...
$ STLFSI4 (Fed St. Louis Stress Index) : num [1:228] -0.42 -0.494 -0.446 -0.597 -0.501 ...
$ EFFR (Effective Federal Funds Rate) : num [1:228] 1 1.01 1 1.01 1 1.03 1.27 1.43 1.62 1.76 ...
$ T5YIFR (5y5y Forward Infl. Expectations) : num [1:228] 2.49 2.43 2.45 2.52 2.74 2.65 2.55 2.49 2.4 2.36 ...
$ REAINTRATREARAT1YE (1-year Real Interest Rate) : num [1:228] -0.1377 -0.4014 -0.0643 -0.4123 -0.5069 ...
$ CPIAUCSL (CPI-Index) : num [1:228] 2.03 1.69 1.74 2.29 2.9 ...
I split the data into training and testing sets using the R package “caret”: 20% of the data goes into the test set, while the remaining 80% goes into the training set.
library(caret)
y <- data_frame$UNRATE
df <- within(data_frame, rm(UNRATE, date))  # drop the response and the date column (6 predictors remain)
set.seed(42)
test_inds <- createDataPartition(y = 1:length(y), p = 0.2, list = FALSE)
df_train <- df[-test_inds, ];  y_train <- y[-test_inds]
df_test  <- df[test_inds, ];   y_test  <- y[test_inds]
Now we run bartMachine on the training data:
library(bartMachine)
bart_machine <- bartMachine(df_train, y_train)
By default, the MCMC sampler uses 250 burn-in iterations and 1,000 post-burn-in samples with 50 trees. All of these parameters can be specified manually.
In BART, the protection against overfitting comes from the regularization priors described above rather than from explicit L1/L2 penalty terms; the L1 and L2 values reported in the summary below are the in-sample absolute and squared error losses. The Pseudo-Rsq is the analogue of R-squared for this non-linear model and can be interpreted in the same way.
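For reference (my own note, not part of the bartMachine output), the Pseudo-Rsq is, to my understanding, computed from the in-sample fit as one minus the ratio of the sum of squared residuals to the total sum of squares:
$$ \text{Pseudo-}R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} $$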
The p-value of the Shapiro-Wilk test refers to the distribution of the residuals. If the p-value is less than or equal to the significance level (usually 0.05), we reject the null hypothesis and conclude that the residuals are not normally distributed.
> summary(bart_machine)
bartMachine v1.3.3.1 for regression
training data size: n = 180 and p = 6
built in 1.1 secs on 3 cores, 50 trees, 250 burn-in and 1000 post. samples
sigsq est for y beforehand: 2.141
avg sigsq estimate after burn-in: 0.06986
in-sample statistics:
L1 = 17.18
L2 = 2.65
rmse = 0.12
Pseudo-Rsq = 0.9967
p-val for shapiro-wilk test of normality of residuals: 0.81473
p-val for zero-mean noise: 0.91896
We can use the “rmse_by_num_trees” function to find the optimal number of trees for the model. I’ve given it a sequence from 15 to 75 trees in increments of 5, with 3 replicates per tree size. The resulting RMSE-by-trees chart illustrates the out-of-sample predictive capacity of our model. With additional hyperparameter optimization we could build an even better bartMachine model in the future.
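The call could look like this (the argument values are reconstructed from the description above, and the argument names follow the bartMachine documentation as I understand it):
rmse_by_num_trees(bart_machine,
                  tree_list = seq(15, 75, by = 5),  # candidate numbers of trees: 15, 20, ..., 75
                  num_replicates = 3)               # replicates per tree size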
As you can see, the chart shows the RMSE for each candidate number of trees. We can then take the number of trees with the minimum RMSE and run bartMachine again. The curve looks fairly flat; increasing the number of trees did not particularly improve performance.
bart_machine <- bartMachine(df_train, y_train, num_trees=20)
Using the “plot_convergence_diagnostics” function we can see how the MCMC performs. Overall, the tree-node statistics stay relatively constant across iterations. In the top-left panel, the sigsq estimate converges after roughly 200 iterations and stays inside the interval. The three subsequent segments separated by gray lines are the post-burn-in iterations from each of the three computing cores used to fit the model.
Next up, the “check_bart_error_assumptions” chart shows us the normality of the errors using a QQ-plot. The residuals look normally distributed, so no adjustment is needed.
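Both diagnostics are single calls on the fitted model:
plot_convergence_diagnostics(bart_machine)    # MCMC diagnostics, including the sigsq trace discussed above
check_bart_error_assumptions(bart_machine)    # QQ-plot and residual checks for the error assumptions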
Lastly, we will see how well our model performs in-sample and out-of-sample.
Bayesian statistics uses credible intervals instead of confidence intervals. A credible interval is a range of values within which the true parameter lies with a given probability, given the observed data and the assumptions of the model. The width of a credible interval reflects the amount of uncertainty in the prediction, with wider intervals indicating higher uncertainty. We can adjust the credible level manually (the default is 95%); a lower level produces narrower grey bars around each prediction. The out-of-sample chart is the more important one, since it gives us an impression of how well the model performs on the testing data. For the out-of-sample chart, roughly 85% of the test observations fall inside their 90% credible intervals.
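These charts can be produced with bartMachine’s plot_y_vs_yhat function; a sketch of the in-sample and out-of-sample calls (argument names to the best of my knowledge):
plot_y_vs_yhat(bart_machine, credible_intervals = TRUE)    # in-sample fit with credible intervals
plot_y_vs_yhat(bart_machine, Xtest = df_test, ytest = y_test,
               credible_intervals = TRUE)                  # out-of-sample fit on the test data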
Prediction intervals are drawn from the posterior distribution of the MCMC process described earlier. They are wider than credible intervals because they also reflect the uncertainty of the error term. A prediction interval tells us about the precision of an individual prediction, while a credible interval gives us information about the likely range of the true parameter value. As you can see below, the posterior of each prediction widens as the unemployment rate increases, reflecting the uncertainty around the predicted value. If the posterior predictive distribution is wide or has high variance, the interval for that prediction will be larger. This is common in models with high levels of noise or when the data points are widely spread out. It is therefore important to consider the overall distribution of the target values and the model’s accuracy when interpreting the width of the posterior.
I’ve chosen a credible level of 90% since the posterior distributions widen at higher unemployment rates, reflecting the high degree of uncertainty around these values. A lower credible level results in narrower intervals that capture fewer of the observed values.
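As a quick sanity check (my own addition, not part of the original workflow), the coverage on the test set can be computed directly from bartMachine’s interval function:
# 90% credible intervals for each test observation; the result has a lower- and an upper-bound column
ci_90 <- calc_credible_intervals(bart_machine, new_data = df_test, ci_conf = 0.90)
mean(y_test >= ci_90[, 1] & y_test <= ci_90[, 2])   # share of test points falling inside their interval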
Lastly, we can calculate some metrics to compare our bartMachine predictions with the test set.
I’ve inserted the out-of-sample predictions into a data frame and then calculated the metrics below.
y_pred <- predict(bart_machine, df_test)
summary(lm(y_test~y_pred))$r.squared
sqrt(mean((y_test - y_pred)^2))
cor.test(y_test, y_pred, method=c("pearson")) # Pearson correlation
[1] 0.9364641 # R-squared
[1] 0.5188173 # RMSE
Pearson's product-moment correlation
data: y_test and y_pred
t = 26.038, df = 46, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9428118 0.9818702
sample estimates:
cor
0.9677108
Additional Diagnostic Charts
In addition, bartMachine allows us to produce the same kind of variable importance plot as xgboost. Running the “investigate_var_importance” function shows which variables have the most influence on the model’s predictions.
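The call itself is a one-liner; the replicate count used for averaging is my assumption, not the value from the original run:
investigate_var_importance(bart_machine,
                           num_replicates_for_avg = 20)  # average the variable inclusion proportions over 20 replicates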
Then we can choose the most important variable (in this case the EFFR rate) and look at the partial dependence plot. The PD-plot shows how the target variable’s predicted values change as the selected input variable varies. The partial dependence is plotted in black, with the default 95% credible intervals plotted in blue. Points are plotted at percentiles of the predictor’s values (by default the 5th, 10th, 20th, ..., 90th and 95th percentiles). We can see that as the EFFR increases, the partial effect decreases, which indicates a negative relationship between the input variable and the target variable. In other words, as the value of X increases, the predicted value decreases: in this sample, a higher Federal Funds Rate is associated with lower predicted unemployment.
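The chart comes from bartMachine’s pd_plot function; the column name “EFFR” is my assumption based on the FRED series ID shown earlier:
pd_plot(bart_machine, j = "EFFR")   # partial dependence of the predicted unemployment rate on the EFFR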
Note: interpreting partial dependence plots requires considering the specific context of the data and the model.
Conclusion
Overall, BART is an interesting algorithm with some unique capabilities, but its suitability will depend on the complexity of the dataset and the task at hand. Sometimes a Bayesian approach is preferred because it does not return a single best value, but rather a range of possible values determined by the posterior distribution.
BART can be compared to gradient boosting algorithms like XGBoost or CatBoost. BART may be more appropriate when dealing with complex nonlinear relationships, while XGBoost may be better suited for simpler problems where speed and scalability are important.
References
Coqueret, G., & Guida, T. (2022, October 18). Machine Learning for Factor Investing. http://www.mlfactor.com/bayes.html
Kapelner, A., & Bleich, J. (2016). bartMachine: Machine Learning with Bayesian Additive Regression Trees. Journal of Statistical Software, 70(4). https://doi.org/10.18637/jss.v070.i04
Koehrsen, W. (2018, April 14). Introduction to Bayesian Linear Regression. https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7
Mamun, O. (2021, May 3). A Primer to Bayesian Additive Regression Tree with R. https://towardsdatascience.com/a-primer-to-bayesian-additive-regression-tree-with-r-b9d0dbf704d