694: CatBoost: Powerful, efficient ML for large tabular datasets

This is 5 Minute Friday on CatBoost. Welcome back to the Super Data Science Podcast. Today's episode is dedicated to CatBoost, which is short for "Category" and "Boosting." This is a powerful open-source tree-boosting algorithm that has been garnering a lot of attention recently in the machine learning community. CatBoost has been around since 2017, when it was released by Yandex, a tech giant based in Moscow. I've provided links to the original paper, as well as to all of the official technical documentation, in the show notes, in case you'd like to dig into either of those.

In a nutshell, CatBoost, like the more established and regularly Kaggle-leaderboard-topping approaches XGBoost and LightGBM, is at its heart a decision-tree algorithm that leverages gradient boosting. If you're unfamiliar with these gradient-boosted tree approaches, check out episode 681 with XGBoost expert Matt Harrison to understand them. Briefly here, however, tree-boosting algorithms like CatBoost, XGBoost, and LightGBM follow a three-step approach.

In Step 1, we initialize the model. We typically start with a simple decision tree, and the prediction made by this initial model is considered the baseline.

In Step 2, we iterate. In each subsequent iteration, a new decision tree is added to an ensemble, making this kind of like a random forest of decision trees. This iterative training process involves adjusting the weights of training examples, with a focus on the misclassified or poorly predicted instances. That means each new tree is built to minimize the errors of the previous ensemble, and it's this focus on error minimization that makes tree-boosting algorithms so powerful and efficient.

In Step 3, we combine everything together into a big ensemble, again, like a random forest. The predictions from all of the trees in the ensemble are combined to form the final prediction. The combination mechanism varies depending on the algorithm and task, but typically involves averaging or weighted averaging. (There's a short illustrative code sketch of this loop a little further below.)

Now, this comparison I've made a couple of times with random forests isn't exactly correct. It's just a loose way of saying that we're combining decision trees together. The way we do it in random forests is different from the way we do it in these gradient-boosted approaches, but hopefully it conveys the general concept of taking a bunch of decision trees and ensembling them together.

Okay, so boosting, tree boosting, explains the "Boost" part of CatBoost. Now let's dig into the "Cat" part, the category part of the CatBoost name. That comes from CatBoost's superior handling of categorical features. If you've trained models with categorical data before, you've likely experienced the tedium of preprocessing and feature engineering with categorical data. CatBoost comes to the rescue here, efficiently dealing with categorical variables by implementing a novel algorithm that eliminates the need for extensive preprocessing or manual feature engineering. CatBoost handles categorical features automatically by employing techniques such as target encoding and one-hot encoding.
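To make that three-step boosting loop concrete, here's a minimal sketch of gradient boosting for regression. It's purely illustrative, using scikit-learn decision trees rather than CatBoost's actual implementation, and the function and parameter names are my own.

```python
# A minimal sketch of the gradient-boosting loop for regression, for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    # Step 1: initialize the model with a constant baseline prediction.
    baseline = y.mean()
    prediction = np.full(len(y), baseline, dtype=float)
    trees = []

    # Step 2: iterate, fitting each new tree to the errors (residuals)
    # of the current ensemble so it focuses on poorly predicted examples.
    for _ in range(n_rounds):
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)

    # Step 3: the final model combines the baseline and all trees.
    def predict(X_new):
        return baseline + learning_rate * sum(t.predict(X_new) for t in trees)

    return predict
```

CatBoost, XGBoost, and LightGBM all elaborate on this same basic loop: they fit each tree to gradients of a chosen loss function (for squared error, those gradients are just the residuals above) and layer many engineering optimizations on top.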
Really quickly, one-hot encoding is where we represent all of the possible categories for a given variable as a vector of zeros, except that for the single category represented by a given row of our data, we set the value to one, hence "one-hot." Target encoding, also known as mean encoding, simply involves replacing a categorical feature with the mean of the target variable for that category. These two things together, one-hot encoding and target encoding, are what CatBoost builds automatically into its novel approach to tree boosting. If you're interested in learning more about one-hot encoding and target encoding, I've included links to those approaches in the show notes for you to check out.

But yeah, back to CatBoost. In addition to using one-hot encoding and target encoding to get superior handling of categorical features, CatBoost also makes use of something called ordered boosting. This is a specialized boosting scheme in which the statistics for each training example are computed using only the examples that come before it in a random ordering, which helps CatBoost avoid the target leakage and prediction shift that can affect other gradient-boosting approaches like XGBoost and LightGBM.

In addition, CatBoost makes use of symmetric decision trees, also called oblivious trees, where the same split is used across an entire level of the tree and the tree depth is fixed. This enables CatBoost to have a faster training time relative to XGBoost and a comparable training time to LightGBM, and LightGBM is famous for its speed, so that's impressive.

On top of all that, CatBoost also has built-in regularization techniques, such as the well-known L2 regularization approach, as well as the ordered boosting and symmetric trees already discussed. Altogether, the L2 regularization, the ordered boosting, and the symmetric trees make CatBoost unlikely to overfit to training data relative to other kinds of boosted-tree algorithms, which can be prone to overfitting.

All of this means that CatBoost may be the best-performing option for a broad range of tasks, including classification, regression, ranking, and recommendation systems. If you're working with categorical variables, it's an even better bet for you. And if you're working with natural language data, no problem, because character strings can be vectorized into numbers.

All right, so hopefully you're excited about using CatBoost now if you hadn't heard about it before, or if you hadn't dug into it much before. If you are, remember that it's open source, so it's completely free. In addition, installation is really easy: it can be installed in all of the most popular data science environments, such as Python, R, and Apache Spark, and you can even use it on the command line. It includes GPU acceleration, allowing you to train models faster and handle large datasets, even scaling across multiple GPUs for large-scale machine learning tasks. And CatBoost allows for interpretability: it has built-in SHAP values, so you can understand the contribution of each model feature and explain the model's output.

All right, I hope you're excited to jump on the bandwagon and try out CatBoost for modeling tabular data, if you haven't already. If you're working with raw media inputs, such as images, videos, or audio, or with generative models, you're probably still going to want to use deep learning for those kinds of things.
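To show how little preprocessing is involved in practice, here's a short sketch of typical CatBoost usage on tabular data with categorical columns. The tiny DataFrame and column names are made up for illustration; the key point is that you pass the categorical columns via cat_features and skip manual one-hot or target encoding.

```python
# A short sketch of typical CatBoost usage; the dataset here is invented for illustration.
from catboost import CatBoostClassifier, Pool
import pandas as pd

df = pd.DataFrame({
    "city": ["Moscow", "Toronto", "Moscow", "Berlin"],  # categorical feature
    "plan": ["free", "pro", "pro", "free"],             # categorical feature
    "age":  [25, 41, 33, 29],                           # numeric feature
    "churned": [0, 1, 0, 1],                            # binary target
})

X, y = df.drop(columns="churned"), df["churned"]
cat_features = ["city", "plan"]  # no manual one-hot or target encoding needed

model = CatBoostClassifier(
    iterations=200,
    depth=6,                 # symmetric (oblivious) trees of fixed depth
    l2_leaf_reg=3.0,         # built-in L2 regularization
    # task_type="GPU",       # uncomment to train on a GPU if one is available
    verbose=False,
)
model.fit(X, y, cat_features=cat_features)

# SHAP values for interpretability: one row per example, one column per
# feature plus a final column for the expected (base) value.
pool = Pool(X, y, cat_features=cat_features)
shap_values = model.get_feature_importance(pool, type="ShapValues")
print(shap_values.shape)  # (4, 4): 3 features + the base value
```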
So when you've got images, video, or audio, or you're going to be generating something like natural language or an image, you're probably going to want to use deep learning. But if you're working with tabular data, like you'd find in a spreadsheet, a boosted-tree approach like CatBoost is likely the way to go. All right, that's it for this week. Thanks again to Sean Kostler on the data science team at my machine learning company, Nebula, for providing the topic idea today and some of the content of today's episode through a recent edition of his excellent Let's Talk Text newsletter. And thanks again for tuning in. Until next time, my friend, keep on rockin' it out there, and I'm looking forward to enjoying another round of the Super Data Science podcast with you very soon.