694: CatBoost: Powerful, efficient ML for large tabular datasets
This is 5 Minute Friday on CatBoost.
Welcome back to the Super Data Science Podcast.
Today's episode is dedicated to CatBoost, which is short for Category and Boosting.
This is a powerful open-source tree-boosting algorithm that has been garnering a lot of
attention recently in the machine learning community.
CatBoost has been around since 2017 when it was released by Yandex, a tech giant based
in Moscow.
I've provided links to the original paper, as well as to all of the official technical
documentation in the show notes, in case you'd like to dig into either of those.
In a nutshell, CatBoost, like the more established and regularly Kaggle-leaderboard-topping approaches
XGBoost and LightGBM, is at its heart a decision-tree algorithm that leverages gradient boosting.
If you're unfamiliar with these gradient-boosted tree approaches, check out episode 681 with
XGBoost expert Matt Harrison to understand them.
Relatively briefly here, however, tree-boosting algorithms like CatBoost, XGBoost, and LightGBM
follow a three-step approach.
In Step 1, we initialize the model.
We typically start with a simple decision tree, and the prediction made
by this initial model is considered the baseline.
In Step 2, we iterate.
In each subsequent iteration, a new decision tree is added to the ensemble,
making this loosely like a random forest of decision trees.
The training of each new tree focuses on the instances that the current ensemble
misclassified or predicted poorly; in gradient boosting, the new tree is fit to the errors
of the previous ensemble, so it is built specifically to minimize those errors.
It's this focus on error minimization that makes tree-boosting algorithms so powerful and
efficient.
All right, so that's Step 2.
So in Step 1 we initialize the model, in Step 2 we have iterative training, and then
in Step 3 we combine everything together into one big ensemble, again loosely like
a random forest.
The predictions from all of the trees in the ensemble are combined to form the
final prediction.
The combination mechanism varies depending on the algorithm and task, but typically involves
averaging or weighted averaging.
All right, and so this comparison that I've made a couple of times between the ensemble
and random forests isn't exactly correct.
It's just kind of a loose way of saying that we're combining decision trees together.
In random forests the trees are trained independently on random subsets of the data, whereas in
these gradient-boosted approaches they're trained sequentially, each one correcting the previous
ones, but hopefully the comparison conveys the general concept of taking
a bunch of decision trees and ensembling them together.
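To make those three steps a little more concrete, here's a minimal sketch of gradient boosting for regression in plain Python, using scikit-learn's DecisionTreeRegressor as the weak learner. This is generic gradient boosting rather than CatBoost's own implementation, and the choice of a mean baseline, a tree depth of 3, and a learning rate of 0.1 are just illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1):
        # Step 1: initialize the model with a simple baseline prediction
        baseline = np.mean(y)
        prediction = np.full(len(y), baseline)
        trees = []
        # Step 2: iterate, fitting each new tree to the errors (residuals)
        # of the current ensemble
        for _ in range(n_rounds):
            residuals = y - prediction
            tree = DecisionTreeRegressor(max_depth=3)
            tree.fit(X, residuals)
            prediction += learning_rate * tree.predict(X)
            trees.append(tree)
        return baseline, trees

    def predict_gradient_boosting(X, baseline, trees, learning_rate=0.1):
        # Step 3: combine the baseline and every tree's contribution
        # into the final prediction
        prediction = np.full(X.shape[0], baseline)
        for tree in trees:
            prediction += learning_rate * tree.predict(X)
        return prediction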
Okay, so that tree boosting explains the "boost" part of CatBoost.
Now let's dig into the "cat" part, the category part, of the CatBoost name.
That comes from CatBoost's superior handling of categorical features.
If you've trained models with categorical data before, you've likely experienced the
tedium of preprocessing and feature engineering with categorical data.
CatBoost comes to the rescue here, efficiently dealing with categorical variables by implementing
a novel algorithm that eliminates the need for extensive preprocessing or manual feature
engineering.
CatBoost handles categorical features automatically by employing techniques such as target encoding
and one-hot encoding.
Really quickly, one-hot encoding is where we represent all of the possible categories
for a given variable as a vector of zeros, except that the single category present in
a given row of our data is set to one, hence "one-hot."
And target encoding, also known as mean encoding, simply involves replacing
a categorical feature with the mean of the target variable for that category.
These two things together, one-hot encoding and target encoding, are what CatBoost
implements automatically within its novel approach to tree boosting.
And so if you're interested in learning more about one-hot encoding and target encoding,
I've included links to those approaches in the show notes for you to check out.
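In case it helps, here's a quick toy illustration of both encodings in pandas. The "city" feature, the prices, and the column names are made up purely for the example; this just shows the mechanics that CatBoost handles for you under the hood.

    import pandas as pd

    df = pd.DataFrame({
        "city": ["Paris", "Tokyo", "Paris", "Lima"],
        "price": [100.0, 250.0, 120.0, 80.0],  # the target variable
    })

    # One-hot encoding: one 0/1 column per category, with a single 1 per row
    one_hot = pd.get_dummies(df["city"], prefix="city")

    # Target (mean) encoding: replace each category with the mean of the
    # target variable for that category
    target_means = df.groupby("city")["price"].mean()
    df["city_target_encoded"] = df["city"].map(target_means)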
But yeah, back to CatBoost: in addition to using those approaches, one-hot encoding and
target encoding, for its superior handling of categorical features, CatBoost also makes
use of something called ordered boosting. This is a specialized boosting scheme that computes
the residuals for each training example using only the examples that come before it in a
random ordering of the data, which helps CatBoost avoid the target leakage and prediction shift
that other gradient-boosting approaches like XGBoost and LightGBM can be susceptible to.
In addition, CatBoost makes use of symmetric (also called oblivious) decision trees, in which
every node at a given level of the tree splits on the same feature and threshold. This enables
CatBoost to have a faster training time relative to XGBoost and a comparable training time to
LightGBM, and LightGBM is famous for its speed, so that's impressive.
And then on top of all that, CatBoost also has built-in regularization techniques, such
as the well-known L2 regularization approach, in addition to the ordered boosting and the
symmetric trees already discussed.
Altogether, the L2 regularization, the ordered boosting, and the symmetric trees
make CatBoost less likely to overfit to training data relative to other kinds of
boosted-tree algorithms, which can be prone to overfitting.
All of this means that CatBoost may be the best-performing option for a broad range
of tasks, including classification, regression, ranking, and recommendation systems.
If you're working with categorical variables, then it's an even better bet for you.
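To give you a feel for how little preprocessing is involved, here's a minimal sketch of training a CatBoost classifier on a small made-up dataset with categorical columns. The feature names and values are invented for illustration, but cat_features, l2_leaf_reg, boosting_type, and depth are real CatBoost parameters corresponding to the categorical handling, L2 regularization, ordered boosting, and symmetric trees just discussed.

    import pandas as pd
    from catboost import CatBoostClassifier

    # Raw string categories can be passed in directly; no manual encoding needed
    X_train = pd.DataFrame({
        "color": ["red", "blue", "red", "green", "blue", "green"],
        "body_style": ["sedan", "suv", "suv", "sedan", "sedan", "suv"],
        "year": [2018, 2020, 2015, 2021, 2017, 2019],
    })
    y_train = [0, 1, 0, 1, 0, 1]

    model = CatBoostClassifier(
        iterations=200,                        # number of boosting rounds
        depth=6,                               # depth of the symmetric (oblivious) trees
        l2_leaf_reg=3.0,                       # built-in L2 regularization
        boosting_type="Ordered",               # the ordered boosting scheme
        cat_features=["color", "body_style"],  # which columns are categorical
        verbose=False,
    )
    model.fit(X_train, y_train)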
If you're working with natural-language data, that's no problem either, because character
strings can be vectorized into numbers.
All right, so hopefully you're excited about using CatBoost now if you hadn't heard
of it before, or if you hadn't dug into it much before, and if you are, remember that
it's open source, so it's completely free.
In addition to that, installation is really easy.
It can be installed in all of the most popular data science environments, such as Python,
R, and Apache Spark, and you can even use it from the command line.
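In the Python environment, for example, getting set up is typically a single command at your terminal (assuming you use pip as your package manager):

    pip install catboost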
It includes GPU acceleration, allowing you to train models faster and handle large
datasets, including large-scale machine learning tasks that you need to spread across
multiple GPUs. And CatBoost allows for interpretability: it has built-in support for
SHAP values, so that you can understand the contribution of
each model feature and explain the model's output.
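Here's a rough sketch of those last two points, using placeholder random data: the task_type and devices arguments switch training onto one or more GPUs, and get_feature_importance with type="ShapValues" returns the built-in SHAP values. This assumes you have a CUDA-capable GPU available; drop task_type to train on the CPU instead.

    import numpy as np
    from catboost import CatBoostRegressor, Pool

    # Placeholder numeric data, just to make the sketch self-contained
    X = np.random.rand(100, 5)
    y = np.random.rand(100)

    # task_type="GPU" trains on the GPU; devices="0:1" would spread the job
    # across two cards for larger, multi-GPU workloads
    model = CatBoostRegressor(iterations=300, task_type="GPU", devices="0", verbose=False)
    model.fit(X, y)

    # Built-in SHAP values: one row per example, one column per feature,
    # plus a final column holding the expected (baseline) prediction
    shap_values = model.get_feature_importance(Pool(X, y), type="ShapValues")
    print(shap_values.shape)  # (100, 6)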
All right, I hope you're excited to jump on the bandwagon and try out CatBoost for modeling
tabular data if you haven't already.
If you're working with raw media inputs, such as images, video, or audio, or you're building
generative models, you're probably still going to want to use deep learning for all those
kinds of things.
So when you've got images, video, or audio, or you're going to be generating something like
natural language or an image or whatever, you're probably going to want to use deep learning.
But if you're working with tabular data, like you'd find in a spreadsheet, a boosted-tree
approach like CatBoost is likely the way to go.
All right, that's it for this week.
Thanks again to Shaan Khosla on the data science team at my machine learning company, Nebula,
for providing the topic idea today and some of the content of today's episode through
a recent edition of his excellent Let's Talk Text newsletter.
And thanks again for tuning in.
Until next time, my friend, keep on rockin' it out there, and I'm looking forward to
enjoying another round of the Super Data Science podcast with you very soon.