717: Overcoming Adversaries with A.I. for Cybersecurity, with Dr. Dan Shiebler

This is episode number 717 with Dr. Dan Shiebler, Head of Machine Learning and AI at Abnormal Security. Today's episode is brought to you by Grafbase, the unified data layer, by ODSC, the Open Data Science Conference, and by Modelbit for deploying models in seconds. Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple. Welcome back to the Super Data Science Podcast. Today, the wildly intelligent and clear-speaking Dr. Dan Shiebler returns to the show for his fifth visit. Dan is Head of Machine Learning at Abnormal Security, a cybercrime detection firm that has grown to over $100 million in annual recurring revenue in just four years, and there he manages a team of over 50 engineers. Previously, he worked at Twitter, first as a staff machine learning engineer and then as an ML engineering manager. He holds a PhD in AI theory from the University of Oxford and obtained a perfect 4.0 GPA in his computer science and neuroscience joint bachelor's from Brown University. Today's episode is on the technical side, so it might appeal most to hands-on practitioners like data scientists and ML engineers, but anyone who'd like to understand the state of the art in cybersecurity should give it a listen. In this episode, Dan details the machine learning approaches needed to tackle the uniquely adversarial application of cybercrime detection, talks about how to carry out real-time ML modeling and what his PhD research on category theory entailed and how it applies to the real world, and he opines on the major problems facing humanity in the coming decades that he thinks AI will be able to help with and those that he thinks it won't. All right, you ready for this absorbing episode? Let's go.
Dan, welcome back yet again to the Super Data Science Podcast. I guess you're in New York as usual. That's right. Thanks, Jon. Happy to be back. Yes, we've got an exciting episode today. So previously you've been on the show going all the way back to episode number 59, and then you came back while Kirill was still hosting; that was episode number 345. My first time hosting you was episode 451, and then in episode number 630, you did a Five-Minute Friday-style episode where you answered a question specifically about resilient machine learning, which we'll actually build upon a bit more in today's episode. And yeah, something kind of cool for our listeners to check out: if you don't watch the video version, today I'm filming from Detroit, and this hotel that I'm in, the Foundation Hotel Detroit, it's wild. You know, I was just expecting to record in my hotel room, but as I was leafing through the hotel booklet, I saw they have a dedicated podcast studio. So I've actually got this suite with an on-air sign behind me. Other than that, it's just a quiet room. There's lots of curtains and stuff. I'm using all my own equipment, but yeah, it's kind of a cool look for the video today. So we've got tons of content for you, Dan, building on the resilient ML stuff a bit, and focusing on what you've been doing since your last episode. It's been several years now since your full-length episode with me, and in that time there have been a lot of changes. Most notably, you're working at a firm called Abnormal Security, and so you're addressing the high-stakes challenges of cybercrime over there with machine learning. So what makes this particular adversarial machine learning challenge unique? You know, you're not just building a machine learning model that is acting in a vacuum. It's very much the opposite.
The models that you build, people are trying to reverse engineer them on a regular basis to be able to overcome the kind of security that you're developing with your ML models. So how is this kind of adversarial machine learning challenge unique relative to the other kinds of machine learning models that you've built historically? Totally. So Abnormal Security is a company that builds detection systems for identifying cyberattacks that are coming in through email and through accounts that people have on various SaaS platforms, including email, but also things like your Slack, your Okta, your Zoom. And there are really two kinds of attacks that we're most concerned about. One is that an account has been compromised, and we're trying to identify that this account has been taken over: the attacker has gotten the credentials, and now the person who's operating this account is no longer the account owner; it's the attacker. The other is inbound attacks. There's an email message, generally, or sometimes other types of messages, where an attacker is transmitting a payload, which could be a phishing link, or a message that's eliciting the recipient to update bank account information, or perhaps malware, or anything else that's the initial vector to begin a cyberattack. And so the machine learning models that we build operate at the level of individual events, which are the messages that are being sent, the sign-in events that we're observing for the accounts, and a number of other kinds of events. At each of these events, we're trying to identify: is this an attack? Is this malicious, or is this normal behavior? And this is a very adversarial situation, because the person on the other end, the attacker, is going out of their way to try to cloak their actions. They're trying to make the messages that they're sending look as similar as possible to safe business messages.
They're trying to sign in utilizing infrastructure and technology that allows them to cloak the fact that they are attackers, or hide their identity and obfuscate anything that they're doing so that it looks like a normal individual. And so the machine learning models that we're utilizing need to take advantage of the things that the attacker might not know but that we know, for instance, or build something that is resilient to the different kinds of modifications the attacker might utilize, and really get at the heart of what separates normal business traffic and communications from what the attacker is doing. Yeah, yeah, yeah. So I imagine this involves a broad range of different kinds of models. I know you've mentioned online, in some of the research we dug up, that some of these models are probabilistic models that are relatively straightforward and, I imagine, relatively efficient, all the way up to large language models, which presumably are a lot more expensive to run and aren't necessarily as fast at inference time. So given the kinds of attacks that you're trying to identify, how do you decide what kind of model you're going to use for a particular type of threat? So there are really three kinds of models that we utilize, which each try to capture something a little bit different and have different trade-offs in terms of what they have access to, the cost that they require in order to utilize them, the speed at which they can be invoked, and their efficacy and the range of different kinds of attacks that they can effectively capture. Everything that we build, all of the models we build, are powered by aggregate signals, which are the most important component of our approach toward cybersecurity. Basically, this is a special type of feature that we build over raw data that then powers all the different kinds of models.
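The per-entity aggregate signals Dan is describing can be sketched roughly as follows. This is a toy illustration only: the class, the attribute names, and the email address are all invented, not Abnormal's actual feature pipeline.

```python
from collections import defaultdict


class EntityAggregator:
    """Toy per-entity aggregate: for each recipient, track how often a
    given attribute (e.g. a sender domain or an HTML tag) has appeared
    in their historical mail. Illustrative only."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def observe(self, recipient, attributes):
        """Record one observed email and its extracted attributes."""
        self.totals[recipient] += 1
        for attr in attributes:
            self.counts[recipient][attr] += 1

    def frequency(self, recipient, attr):
        """Fraction of this recipient's history with this attribute --
        a simple probabilistic signal derived from raw events."""
        total = self.totals[recipient]
        return self.counts[recipient][attr] / total if total else 0.0


agg = EntityAggregator()
agg.observe("alice@example.com", {"sender:acme.com", "has_attachment"})
agg.observe("alice@example.com", {"sender:acme.com"})
agg.observe("alice@example.com", {"sender:unknown.biz", "has_link"})

print(agg.frequency("alice@example.com", "sender:acme.com"))    # 2/3
print(agg.frequency("alice@example.com", "sender:unknown.biz"))  # 1/3
```

An email from a never-before-seen sender would score 0.0 here, which is exactly the kind of "this doesn't match the historical picture" signal the heuristics and trained models can then consume.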
And so this is the sort of foundation of our detection strategy. These are aggregates over raw email sign-ins and other kinds of raw events at individual entity levels. So for example, we would aggregate for a particular person all of the emails that that person has received, and be able to say things like: how many times have they received an email that has this header in it, or this kind of phrase in it, or this kind of attachment, that's routed through this IP, utilizes this infrastructure, has this HTML tag? Each of these different kinds of little individual signals could lead to identifying some information about this email or about the sign-in events. We aggregate at the level of each person who's receiving, each piece of infrastructure that's sending, and the IPs and domains that messages and sign-ins get routed through. And through this, we build a historical picture, basically a summarization of everything that's happened up until this point, that serves as our foundational feature infrastructure. So this is a very structured way of building representations of features, and it means that there are now a number of different ways that we can utilize these derived signals in models effectively. The simplest thing for us to do is heuristics: heuristics and rules built on top of these signals. This is already a very heavily data-driven approach. Fundamentally, these aggregate signals themselves are basically simple models. They are basically probabilistic models that demonstrate, given some condition, what percentage of the time X is true. You could construct these heuristics and rules to look very similar to a Bayesian network on top of these individual aggregate signals, with different sorts of conditionals that you're applying and different kinds of derived probabilities that you're building on top of it. The next level of sophistication is basic trained models.
So this would be things like logistic regressions, XGBoost, and we like deep and cross networks as our neural network architecture of choice for this kind of problem. Deep and cross? Deep and cross networks. Yes, it's a network architecture. It's very popular in ad tech, and we have a number of people at Abnormal who have previously worked in ad tech. Basically, it's a type of neural network where you consume both raw signals, which it feeds through a deep layer, as well as cross signals, where you build derived signals from your individual raw features and then feed those derived signals in. You learn the derivation of your cross signals and then feed that into a deep network. So the cross layer functions in a way where it can take things like, here's a frequency that some attribute is true, and here's a boolean signal that says yes, true or not true, and then you can do a multiplication of these in order to build a derived signal, for instance. The space of potential cross features that you can build is very, very large, and utilizing this network architecture allows you to attend to the specific cross features that are most valuable. So it allows you to remove a little bit of the work required to build sophisticated cross features without having a giant parameter space. It's nice for cases where you have both deep embeddings and a lot of boolean and continuous features that you're consuming at the same time, and you want to do something a little bit different with the dense continuous signals within an embedding versus the individual boolean and continuous signals that represent more sparse information. The deep and cross network enables that; it's like an inductive bias that's built into that kind of architecture.
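The cross layer Dan describes can be sketched in a few lines. This uses the DCN-v2-style formulation, x_next = x0 * (W @ x_l + b) + x_l, where the elementwise product with the original input x0 is what builds the explicit feature interactions. It's a toy illustration in plain Python, not Abnormal's implementation.

```python
def cross_layer(x0, xl, W, b):
    """One cross layer (DCN-v2 style): x_{l+1} = x0 * (W @ xl + b) + xl.
    Multiplying elementwise by x0 builds explicit feature interactions,
    e.g. a frequency signal crossed with a boolean presence signal."""
    n = len(x0)
    # Affine transform of the current layer's input: W @ xl + b
    Wx = [sum(W[i][j] * xl[j] for j in range(n)) + b[i] for i in range(n)]
    # Elementwise interaction with x0, plus a residual connection
    return [x0[i] * Wx[i] + xl[i] for i in range(n)]


# Toy input: feature 0 is a frequency (0.25), feature 1 a boolean presence (1.0).
x0 = [0.25, 1.0]
# Identity weights and zero bias make the interaction easy to read:
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
out = cross_layer(x0, x0, W, b)
print(out)  # [0.25*0.25 + 0.25, 1.0*1.0 + 1.0] = [0.3125, 2.0]
```

With identity weights, the first cross layer already contains the pairwise products of the inputs; stacking further cross layers yields higher-order interactions, which is why a small stack can replace a lot of hand-built cross features.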
We utilize normal feed-forward networks as well in this intermediate category of models that we train, but we like the deep and cross because we've seen good performance with it, and there are nice implementations online. Yeah, that's very cool. So I hadn't heard of this kind of cross layer in a deep neural network before. In my mind, I imagine the workhorse layer, the dense layer, as being capable of doing some of the things that you're describing. So a dense layer should be able to, in many circumstances, identify whether a cross of two input features is together creating a lot of signal, because that denseness means that it recombines any possible inputs from the preceding layer. So it sounds like this cross layer, as opposed to being a general-purpose dense layer that happens to be able to do those kinds of multi-term interactions, is explicitly designed to do that. Yeah, so it's similar to how a convolutional neural network is inherently less expressive than a feed-forward neural network but is still more performant on image tasks than a raw feed-forward network. It's embedding the inductive bias that these particular kinds of multiplications between your features, to cross them, are a useful thing to do for that category of feature. And so we utilize it when we have signals where there's one signal that tracks the frequency of an event and then another signal that tracks the presence of that event, where these are two features that really only make sense when they're combined together and are very difficult to cross with other kinds of signals; their poignancy relies on their combination. And so building those crosses explicitly through the cross network is useful for that kind of application. Very cool. This episode is brought to you by Grafbase.
Grafbase is the easiest way to unify, extend, and cache all your data sources via a single GraphQL API deployed to the edge closest to your web and mobile users. Grafbase also makes it effortless to turn OpenAPI or MongoDB sources into GraphQL APIs. Not only that, but the Grafbase command-line interface lets you build locally, and when deployed, each Git branch automatically creates a preview deployment API for easy testing and collaboration. That sure sounds great to me. Check Grafbase out yourself by signing up for a free account at grafbase.com. That's g-r-a-f-b-a-s-e dot com. Well, so I digressed a bit into this. So you were saying that there are three kinds of model that you use. You described the first, the heuristic, rules-based models, and then you were talking about intermediate-complexity machine learning models, so things like random forests, logistic regression models, and these deep and cross neural networks. So yeah, I don't know if I missed anything there, if you wanted to go any deeper on that one, or if you wanted to jump down to model type number three. Yeah, so model type number three is large language models. We utilize both the out-of-the-box OpenAI APIs for certain tasks as well as our own fine-tuned variants; we've utilized Falcon and Llama and fine-tuned those for a few different tasks. And when you think about these three different categories, they grow in the cost required to run them, with increased latency and decreased speed, and they have different characteristics in terms of their ease of use. The first and third categories are perhaps the easiest to use and modify, because large language models you can repurpose with prompt engineering, and rules you can repurpose by tweaking things, whereas the intermediate category of deep neural networks and such really requires retraining in order to incorporate new information.
And so all three have pros and cons and can be applied to different types of use cases and challenges within the ecosystem of different kinds of attacks we're trying to catch for different customers. Nice, that's a really good high-level summary of the kinds of models that you work with. And yeah, it's interesting to think about how that third tier, those large language models, have become so complex now that, as you say, and I hadn't thought of it this way before, they become as easy to use as a simple heuristic model, because you just change your prompt and they're so flexible. You don't need to be retraining the entire model. You know, maybe in that third category you could also potentially be inserting some PEFT layers, and those are then very fast to fine-tune. You could have this huge architecture, like you mentioned Falcon, it's a 40-billion-parameter model, but you could use parameter-efficient fine-tuning, PEFT, to fine-tune it to some specific task of yours. Maybe you just have a few hundred or a few thousand examples of some task that you'd like it to specialize in, and you can train that in minutes or hours, even though the architecture is so gigantic, because there might only be a million or so, or, you know, a hundred million parameters that you're training with this parameter-efficient fine-tuning technique, as opposed to trying to do the whole 40 billion. It is definitely the case, we've observed at least, that once you go down the route of fine-tuning these models, you lose some of their generalizability and ability to adapt them to different tasks.
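To put rough numbers on the parameter-efficient fine-tuning John mentions, here is back-of-envelope arithmetic for a LoRA-style adapter on a single, hypothetical 8192x8192 projection matrix; the dimensions and rank are illustrative, not tied to Falcon's actual architecture.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for one LoRA adapter: a d_in x rank matrix A
    plus a rank x d_out matrix B, whose product approximates the update."""
    return d_in * rank + rank * d_out


# Hypothetical numbers for a single large projection layer:
d_in = d_out = 8192
full = d_in * d_out                                # full fine-tuning of this one matrix
lora = lora_param_count(d_in, d_out, rank=8)       # low-rank adapter instead

print(full)          # 67108864 weights to train
print(lora)          # 131072 weights to train
print(full // lora)  # 512x fewer trainable parameters for this layer
```

This is why fine-tuning can finish in minutes or hours even on a very large base model: the frozen 40 billion parameters still do the forward pass, but gradient updates touch only the small adapter matrices.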
We find these models to be sort of uber-classifiers that can be applied to classification tasks, taking their size, their really deep understanding of raw fundamental concepts, and their ability to reason as bases for being applied to representations of our data in a text form that they can understand, and that they're then fine-tuned to understand better. We have a couple of different kinds of message classification tasks that we operate: just identifying whether or not something is an attack, as well as identifying attacker objectives, and triaging messages that are submitted to a phishing mailbox product that we operate as well. Each of these is a slightly different kind of task requiring slightly different kinds of behavior, and each has involved some amount of human interaction in the past, and that's where we're trying to incorporate large language models, to reduce the human burden in those areas, because the cost characteristics of these models make them very, very difficult for us to utilize in an application like scanning every sign-in or every email that we process. It's really cost-prohibitive to do something like that with models of this size, but something that already involves some human interaction is much more manageable to incorporate these models into. Nice, yeah, that makes a lot of sense. Large language models being able to augment or automate something where a human would be required is probably going to be more cost-effective, whereas trying to have huge volumes of emails be processed by an LLM would be crazy, crazy expensive. So one of the big things about training any machine learning model, particularly when we're talking about that intermediate tier, your second tier.
So, you know, the random forests, logistic regression, those now-classical deep learning architectures: one of the big things is that, in your kind of scenario, you'll have some true state of the world that you're trying to model. You have correct labels that you're trying to guess with your machine learning model. And anytime we're trying to do that classification, we end up in machine learning with some false positives and some false negatives. Obviously, we want to try to minimize both, but in your context in cybersecurity, is one of those, false positives or false negatives, kind of worse than the other? And do you try to minimize one in particular? It's really a balance, to be fair. I mean, I think the worst thing that can happen is that you miss a really serious attack and it causes a lot of damage to customers. So in that sense, false negatives are more of an existential problem; the worst kind of false negative is the worst of all. But a high rate of false positives is equally bad, because a business can't operate if people are being stopped by their security solutions from engaging in normal business. And so customers will end up putting in overrides and ignoring remediation criteria, and then they'll expose themselves to exactly those kinds of really bad false negatives, and we'll have no ability to control for it at all. So yeah. The boy who cried wolf kind of scenario. Absolutely. Cool. Yeah, that's interesting. In my head, I was expecting you to answer that question and just say that false negatives are the worst, we've got to make sure we avoid those. But yeah, of course, if your clients are getting false positives all the time, then they're just going to ignore your tool, and then they're going to miss the real deal. So in the last few years, I understand that the threat landscape has changed a fair bit.
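The balance Dan describes can be framed as choosing a decision threshold to minimize expected cost, where a missed attack is weighted far more heavily than a false alarm. Every number below, scores, labels, and the 100x cost ratio, is made up purely for illustration.

```python
def expected_cost(threshold, scores_labels, cost_fp, cost_fn):
    """Total cost of flagging everything scored at or above `threshold`.
    scores_labels: list of (model_score, is_attack) pairs."""
    cost = 0.0
    for score, is_attack in scores_labels:
        flagged = score >= threshold
        if flagged and not is_attack:
            cost += cost_fp  # a blocked legitimate email (false positive)
        elif not flagged and is_attack:
            cost += cost_fn  # a missed attack (false negative)
    return cost


# Toy scores; a missed attack is assumed 100x costlier than a false alarm.
data = [(0.95, True), (0.80, True), (0.60, False), (0.30, False), (0.10, False)]
for t in (0.5, 0.7, 0.9):
    print(t, expected_cost(t, data, cost_fp=1, cost_fn=100))
# A too-low threshold pays in false positives; a too-high one pays
# much more in missed attacks. Here 0.7 minimizes the total cost.
```

The "boy who cried wolf" effect is what this simple model leaves out: in practice a high false-positive rate also erodes trust, effectively raising the false-negative cost later on.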
So how have you had to adapt your models at Abnormal Security to handle those new challenges? So traditionally, cybersecurity solutions function by identifying indicators of compromise and stopping threats based on matching indicators of compromise between a particular threat and a new thing that may or may not be a threat, like a new message or a new kind of sign-in. An indicator of compromise in this case is a smoking gun: a link that is known to be bad, a domain that is known to be bad, an IP that has poor reputation, an attachment whose hash matches some known malware. There are many different kinds of indicators of compromise. But what has happened is that the cost and ease with which attackers can switch the routes and tools that they're utilizing has simply gone down. Attackers have had better and better access to systems that have allowed them to evade these kinds of indicator-of-compromise recognition and send out attacks that don't match the patterns of any previous attacks, at much, much larger scale and with a much, much higher degree of ability to avoid detection. And this has certainly gotten substantially worse with the introduction of generative AI tooling. Generative AI tooling in particular enables the personalization of attacks to a particular recipient, by combining something like somebody's LinkedIn profile and integrating that seamlessly, entirely automated, into social engineering scams that are highly targeted for that person. And this both evades the indicator-of-compromise-style checks for the templates that phishing emails would normally match, and increases the degree to which these kinds of messages and attacks look legitimate to the recipients. So our strategy at Abnormal is to avoid an over-focus on indicators of compromise as the core tenet of our approach.
Our strategy is instead to focus on identifying abnormalities in individual pieces of communication, emails and sign-ins, that make them different from normal business communication, and rather than try to root-cause an attack, instead try to spot things that don't look like normal, safe communication. So rather than "is attack," we use "not safe" as our core strategy and core objective. This enables us to be much more resilient to changes that attackers can make to their attacks to try to avoid indicators of compromise, and it also enables us to play to the greatest advantage that security defenders have over security attackers, which is knowledge of the targets. The attacker who's attacking somebody doesn't know what's in that person's inbox. They don't know what emails that person received yesterday. They may know a little bit about the target if they've utilized open-source intelligence, but they're unlikely to know nearly as much as a security solution that's plugged into that person's accounts and has access to that company's data and information. And by leveraging this advantage, this information asymmetry that defenders have access to, we're able to most effectively fight back against attackers. And this extends very, very naturally to the growing threat posed by generative AI tech. Fascinating, and very well said; you have such a crisp way of speaking, it's so easy to understand you. Thanks. So yeah, I mean, this is now getting a little bit into the future, maybe, although maybe not that far in the future. Do you ever worry about how generative AI, like, I don't know, some kind of open-source alternative of, like, GPT-5 or GPT-6, you know, something of that kind of capability that might be here in a few years, that's open source and so can be used for malicious purposes?
Do you ever, you know, worry about LLMs being able to go beyond the kinds of attacks you're describing here, like this personalization, which allows for the automation of, say, phishing attacks, where instead of needing to have a human researching somebody and coming up with points for a phishing email that might make them feel like this is a trusted entity, the LLM can now do that automatically? But in the future, with some kind of open-source, maliciously usable GPT-5 or GPT-6 variant, this might be able to do much more. Like, it might be able to plan attacks; it's not just generating the text, but in some ways it's like an independent malicious actor that some malicious human can kind of just set in motion and say, you know, here's some money, get as much money as possible back. Yeah, is that something you ever spend time thinking about, or is that just too far out? I think multi-stage planning with deep reasoning is very, very difficult. I think it's substantially more difficult than solving a range of different problems.
So I am less concerned about this from a sort of general existential-threat perspective. But that said, I think in cybersecurity there's a heuristic that you can utilize for identifying what kinds of attacks you'll see in the future that has proven to be pretty effective, and this falls closely within it, which is that cybercriminals are financially motivated, by and large. Not every cybercriminal is financially motivated; there are state actors that exist as well, but they are a tiny percentage of the overall set of cyberattacks. The vast majority of cyberattacks are sent by people who are trying to receive a return on an investment. They've spent some money to invest in technology to cloak their identity, technology to acquire internet assets that they'll utilize to send out attacks, these are domains and IPs and types of internet connections and ISP variants, and they will try to get a return on the money they're spending. And the attack strategies that enable them to get a return on the money they're spending have become more and more sophisticated. In the past, if you were going to do something like spear phishing, you'd need to spend a great deal of time investigating your targets, and that time is money, basically, because you assume you're getting paid on some hourly basis. Think of yourself as a cybercriminal comparing what you'd get paid at McDonald's to looking up someone's LinkedIn and utilizing it to generate spear-phishing emails. If a tool lets you send out ten spear-phishing emails in the time it previously would have taken you to send one, now you're going to be able to start sending more of these. And there are certain kinds of attacks that are very sophisticated that exist already, that we see: these types of vendor-fraud attacks, where an attacker will compromise a legitimate vendor's account, which is a very expensive thing to do. Purchasing an account, an email address of someone in billing at a Fortune 500 company, on the dark web,
that's a very expensive asset that you're unlikely to have for very long, because the company likely has a security team that's going to find you. So it's a short-lived, expensive asset that an attacker is acquiring, and they'll attempt to get as much money out of it as possible before they lose access to that asset. These kinds of attacks are very sophisticated and very difficult to detect, but we do see some of them, and we do build models and systems to detect them. And it's a reasonable heuristic that things we see a small amount of now, because of their sophistication and because of the amount of money that attackers need to spend in order to generate them, will become cheaper for attackers to send in the future. As technology advances, as AI advances, as cybercrime develops a larger ecosystem of tooling and systems, attackers will be able to send more sophisticated attacks at lower and lower price points, which will mean that the things we see on, you know, maybe a once-a-week basis will become things we see every day, or things we see ten or twenty of a day. And I think that this case you're describing right now, with an agent that is operating the planning and prospecting of attacks on a multi-stage basis, where first it sends a series of phishing emails at a person who's in billing at some vendor in order to get access to their account, then it has access to that account, then it sends messages from that account to the various customers of that vendor to tell them to update their bank account info, this kind of complex, sophisticated, multi-stage attack reaching lower price points, I think it's feasible to imagine that that could happen. And the best way to protect against it is to take seriously the attacks that we see rarely today, with the expectation that they will become more and more common in the future.
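The return-on-investment heuristic Dan outlines reduces to one line of arithmetic. The cost, success-rate, and payout figures below are invented purely to show how cheaper tooling can flip an attack from unprofitable to profitable.

```python
def attacker_roi(cost_per_attempt, success_rate, payout):
    """Expected return per attempt for a financially motivated attacker."""
    return success_rate * payout - cost_per_attempt


# Hypothetical: generative tooling cuts the cost of one highly targeted
# spear-phishing attempt 10x, with the success rate and payout unchanged.
before = attacker_roi(cost_per_attempt=500, success_rate=0.001, payout=100_000)
after = attacker_roi(cost_per_attempt=50, success_rate=0.001, payout=100_000)
print(before)  # about -400: a loss, so the attack stays rare
print(after)   # about +50: a profit, so volume scales up
```

This is the mechanism behind "rare attacks become common": nothing about the attack itself changed, only the price point at which it breaks even.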
Be where our data-centric future comes to life at ODSC West 2023, from October 30th to November 2nd. Join thousands of experts and professionals in person or virtually as they all converge and learn the latest in deep learning, large language models, natural language processing, generative AI, and other topics driving our dynamic field. Network with fellow AI pros, invest in yourself with their wide range of training, talks, and workshops, and unleash your potential at the leading machine learning conference. Open Data Science Conferences are often the highlight of my year; I always have an incredible time, and we've filmed many Super Data Science episodes there. Now you can use the code SUPER at checkout and you'll get an additional 15% off your pass at odsc.com. Nice, yeah, that is a very sensible heuristic; as soon as you started to explain it, I was like, yeah, that makes a lot of sense. So that certainly is something to keep an eye on. I guess, yeah, we don't know how quickly, if ever, machines are going to have that multi-stage planning capability, but I don't know, with how blown away I was by the jump from GPT-3.5 to GPT-4, I'm like, you know, being surprised should be unsurprising. Yeah, okay, so clearly that kind of heuristic is something that's useful for helping you figure out what kinds of models you might need to start prototyping now. I understand that you also do head-to-head competitor comparisons on a weekly basis, so how does that help you with refining your models as well? Totally, so I'll just talk a little bit about the process. Most companies of decent size need to spend a decent amount of money on email security. Emails are the primary vector by which large businesses get attacked with malware, phishing, invoice fraud, etc., and there are a number of different ways that businesses can try to protect themselves. The most common way is purchasing solutions like Abnormal Security, and we have many competitors that offer similar products that try to protect
customers from these kinds of attacks. And because of the sheer volume of these attacks and the length of time that email security has been around, it has been a product category for one of the longest time periods among SaaS products, measured in a time frame of decades rather than years like most SaaS products, we have a pretty easy-to-understand way to compare two products: you simply install both and you see which one catches more attacks and which one generates fewer false positives. Very simple to see, very simple to evaluate. And every week, our sales team works with customers to install Abnormal Security in their environments and compare us against either the customer's current email security solution or competitor email security solutions the customer is also considering. Normally, customers will consider a number of different solutions at different price points and observe which ones require the most effort for them to manage, which is basically the same thing as false positives: fewer false positives means less effort to manage, and which ones protect them the best, which is the same thing as false negative rate: with a lower false negative rate, you're better protecting the employees at the business. And if we are able to find attacks that no other solution finds and generate fewer false positives than other solutions, then we'll win the deal and our revenue will increase, and if we're not, then we won't win the deal. And so this is a very simple and exciting space to be in as a machine learning engineer, because it's relatively rare that you get to build technology that is placed immediately into such a clear-cut competitive environment, where you are immediately tested not only against adversaries but also against other solutions attempting to do the exact same thing that you're doing. You see very, very quickly how good your system is, and you can measure that immediately in terms of the dollar value that businesses will pay to remove
their current solution and replace it with Abnormal Security. So this serves as a strong rallying point and motivating function for the detection team and for Abnormal Security as a whole. Nice, that is a really cool process, and probably, not probably, definitely, the kind of process that you described there is something that's easy for me to imagine for my business, and probably a lot of other people could imagine for theirs: comparing false positive and false negative rates against your competitors. Probably a lot of clients or prospective clients would be able to estimate how much each false positive costs them, how much each false negative costs them, and just be able to determine, okay, going with this product at this price point, I'm going to save this much overall, and that's the best one to go with. Yep. Okay, so tying back to your most recent Super Data Science episode, number 630, where we were talking about resilient machine learning, maybe you could quickly recap what that resilience means, this idea of having a robust machine learning system, and how is that particularly important in cybersecurity? So resilient machine learning means, as you say, a robust machine learning system, and specifically building your engine so that it is unlikely to fail catastrophically. There will always be problems that you face. Sometimes these problems are acute problems where a single system goes down: perhaps there's an outage in a data service, someone pushes bad code, some type of data gets deleted accidentally. Sometimes it's changes in the underlying data distribution on your side: perhaps you onboard a new customer that's in a new industry that you've never seen before, or you have some kind of change in the way that you're categorizing the events that you're seeing, such that it changes the underlying data that powers your features, your aggregates for instance. And sometimes it's adversarial; in cyber
security, this third category is constant. Attackers are changing what they're doing every week and every day in order explicitly to fight against the system that you're building. So there's a lot of strategies that you can apply to build this kind of failure resilience into your machine learning systems, to make it so that when things change, your system doesn't break with them. This includes the data distribution shift that is normally thought of as a core problem within all machine learning systems: you train on one set of data, you launch, now there's a new set of data, and you have to deal with that. So that's one part of it, but it also incorporates things like feature dropout, where you have certain areas or signals that you rely on that are not available in certain environments or circumstances, and your system needs to be able to operate even when these kinds of outages occur, and you still need to be able to provide protection for your customers. Nice, and so that makes a lot of sense in cybersecurity, but then outside of cybersecurity, why might our listeners be interested in the concept of resilient machine learning, related to whatever kind of data science modeling they do, or kind of software engineering, or the kinds of systems that they architect? So building systems that are resilient to changes in your customer distribution is a constant issue that every data scientist faces, especially at a growing business. When you have your initial set of customers, and there's an initial behavior that's present in the kinds of data that you're seeing, you want to build your systems so that when that distribution changes, when new customers are onboarded, you are able to quickly adapt to these new distributions. And there's two main principles that you can apply towards having this kind of quick adaptation to new customers.
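The feature-dropout resilience described above can be sketched in a few lines of Python. This is a hypothetical toy illustration, not Abnormal Security's actual system; the feature names and default values are invented:

```python
# Hypothetical sketch of feature-dropout resilience: if an upstream signal
# fails to compute (service outage, signal unavailable in some environment),
# the scorer substitutes a neutral default instead of failing the prediction.
# Feature names and defaults below are invented for illustration.
FEATURE_DEFAULTS = {
    "sender_reputation": 0.5,        # neutral prior when the reputation service is down
    "emails_from_sender_30d": 0.0,   # assume no history if the aggregate is missing
    "recipient_open_rate": 0.5,
}

def build_feature_vector(raw: dict) -> list[float]:
    """Return features in a fixed order, filling in the neutral default for
    any signal that dropped out, so the model can still score the message."""
    return [raw.get(name, default) for name, default in FEATURE_DEFAULTS.items()]

# During a reputation-service outage, scoring still proceeds:
vector = build_feature_vector({"emails_from_sender_30d": 12.0})
# -> [0.5, 12.0, 0.5]
```

The key design choice is that degradation is explicit and bounded: a missing signal pushes the model toward its prior rather than taking the whole detection pipeline down.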
One is fast retraining: maybe you build a machine learning model, you train it on your data, then you have new data coming in; if you have assembled a concrete, coherent data pipeline and data labeling pipeline, then you'll be able to retrain your model, and sometimes you can even automate the retraining process, depending on the nature of your data and the environment you're operating in. Another approach, and we lean into both of these approaches at Abnormal Security, but one more approach that I think is very under-discussed for quick adaptation, and has a lot of usability in this kind of case, is to utilize features that represent the data distribution itself. To make this clear: rather than have a categorical feature to represent something like a user, like this is this user's ID, we're going to represent them with a single value that goes into a one-hot lookup, where you're sort of expecting the model to memorize this user, and if this user changes their behavior in the future you need to retrain the model to update it, an alternative is to utilize features like: what is the number of accounts that this user has followed in the last 10 days, to give a Twitter example, or what are the topics of the tweets this user has liked in the last seven days? These are features, but they're features that represent current data; they represent the past. At Abnormal, this is the aggregate features we were talking about earlier: how many emails this person received from this kind of account at this time of day in the past. This is a feature representation of the current information. So in the case where you have one customer that you're building a model on, and then another customer that gets onboarded, even if that second customer is a very different distribution, maybe that first customer only had a 10-person customer service team and this new customer has a 500-person customer service team, if you've
represented what it means to be a customer service agent in terms of these kinds of signals, like how frequently does this person receive emails from the outside, as opposed to things like memorizing who these individual people are, or even memorizing a categorical signal like "is customer service," then you'll be able to better adapt to these kinds of new circumstances, because your features themselves will be modified and will adapt to this new distribution. Deploying machine learning models into production doesn't need to require hours of engineering effort or complex homegrown solutions. In fact, data scientists may now not need engineering help at all. With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy in your notebook, and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com, that's m-o-d-e-l-b-i-t.com. Nicely said, lots of practical tips there for any of our listeners on resilient ML. When did you start getting into this? Is this something that you started getting into back at Twitter? It's not related to your PhD stuff directly, is it? It was pretty important at Twitter. At Twitter, one of the core issues that I faced within the revenue science organization, which was the organization that operates the machine learning models for ad serving, was fast performance for a new ad campaign. So customers would launch ad campaigns that have a number of different creatives, line items that combine a number of different creatives, and want to target a particular audience. What we need to do is very quickly identify what are the types of users for whom these ads will be most poignant, and we don't have substantial categorical information about the ads themselves, and even for the users themselves, their ad interaction
behavior can change quite quickly. If they previously were in a situation where they weren't getting any ads they were interested in, and now suddenly, say, they're really into sports and sports betting advertising has suddenly been legalized and now we can show sports betting ads, for instance, that changes their behavior. So being able to represent the most recent picture of behavior at each of these categorical signals is very, very critical towards out-of-the-box performance, towards being able to give that kind of quick turnover. Advertisers would generally tolerate worse performance for a couple of days, but not for a couple of weeks, after beginning a campaign, and for that fast adaptation you can't really rely on training a model that's going to have the capacity to capture that, at that kind of scale, that quickly. Nice, yeah, that makes a lot of sense. So on the note of reaction times and speed, another obviously super critical thing, whether it was Twitter before or Abnormal Security now, is the real-time nature of processing. I imagine it's super critical in both situations; it's hard to say it's more important in one or the other. Obviously in a social media platform people are expecting news in real time, for example; they're expecting updates from people that they're following in real time. But with cybersecurity, arguably there's a bigger danger to not being real-time. So obviously it's super important in cybersecurity to have real-time processing as well. Are you able to go into any particular kinds of infrastructure or technologies or techniques that you employ to handle massive traffic in real time? Yeah, so our most direct approach towards real-time information is aggregates. For model retraining, we utilize Airflow to instrument model retraining on a weekly basis, because we're trying to take advantage more of customer shifts than attacker shifts. Attacker shifts
can happen much faster than that, and so we utilize our aggregate engine for identifying and adapting to attacker shifts. This is at the IOC level: when we miss an attack, or if we see a particular IOC within a net-new attack that we've now caught, being able to ensure that we catch everything else that has that same IOC. So basically, utilizing a combination of abnormality to catch the first attack, and then the IOC to catch everything else that looks similar, we need to very quickly identify, okay, this signal is now something that we've seen in a malicious message, and we need to distribute this out everywhere else. To make this very concrete, I'll give the situation and then talk about the technology. Say an attacker has purchased a domain, and they're now utilizing that domain to send out messages that include a malicious link with that domain in them. Maybe they send out a hundred messages that include this domain, and maybe we were able to identify that some set of these messages are malicious, by looking at the differences between the way this message was sent and the kinds of messages that the recipient normally receives. But maybe we don't do that for every one of these hundred messages; maybe ten of these messages hit people who receive a lot of messages that look really sketchy but are totally normal, and because of that, we're not able to spot for those ten people that this message was bad, but we have seen for the other 90 that it was bad, because those were sent to people who receive mainly normal messages. So now we have this new piece of information, which is that this domain is bad, and we have these messages, which we wouldn't be able to identify as bad without this piece of information, now at risk of hitting these individual users. So this is a case where we need to react very, very quickly to pull this message and stop it from doing
damage, because we've identified that this indicator of compromise is bad by leveraging this information, and now we need to act on it. So we utilize a Redis-based key-value store to track these types of indicators of compromise. We stratify based on every kind of decision that our system makes, and track each of the different types of indicators of compromise you can extract from messages or sign-ins in this system, and utilize a triggering replay system: based on a last-N aggregate within Redis, when any of these individual counts gets triggered, we then submit, from the last-N Redis aggregates, back to our core reprocessing system. Very cool, very cool example. You said a term in there which maybe you did define and I just missed it, IOC. Yes, so indicator of compromise. Oh yeah, so you talked about that earlier in the episode, but I wasn't used to it as an acronym yet. Nice. Yes, it's an acronym that I'd never heard of before going into the security world, but it's constantly bandied about. It really just means anything that could indicate that something is bad, and generally it's referring to IP addresses and domains and email addresses and file hashes and things like that, but there's a lot of other things it could refer to as well. Nice, and so you talked about this a little bit earlier, but maybe we can dig into it a bit more. When we talk about real-time processing, you've now covered that things like this Redis key-value store allow you to do that efficiently, and in previous answers you talked about resilient machine learning being adaptable. In practice, how does that mean you need to be updating your models? Is there a routine to updating machine learning models, or is it event-driven? How does that work? So we've built an auto-retraining framework that enables us to retrain our models on a regular cadence. We maintain a
large number of different machine learning models, which we retrain on different cadences. Our auto-retraining pipeline covers our core models, our most important models, that we hook up into it, and it's a series of different steps to do an auto-retrain. First, we collect all of the data that it's going to utilize; we need to process it and extract features from that data; we need to actually run the training process; and then the most important stage is evaluation. We need to identify that if we take the model that's currently deployed and turn it off and turn this new one on, we're not going to suddenly flood a bunch of customers with false positives, we're not going to stop catching attacks that we're currently catching, and we're not going to dramatically increase our cost or latency or anything else. So we have a large suite of tests that run simulations with this new model in place of the old model. This is a pretty heavy, expensive process, which is why we don't set this up for every single model we deploy, only our most important, critical models. For faster adaptation, we primarily rely on aggregates for capturing changes in data distribution, and we utilize auto-retraining as a way to re-adapt as customer distributions shift over time and to take on new signals. One thing that's relatively interesting about our process is that we are constantly adding new signals. We're constantly identifying: what's a new kind of aggregate to build, what's a new kind of data source to subscribe to, to be able to understand more about the indicators of compromise within emails or sign-in events; what are new ways that we can transform, apply natural language processing, apply clustering techniques, to better understand each piece of data we process. Each one of these signals is something that could be useful in a model retraining, and we set up our auto-retraining process so that it automatically consumes certain kinds of signals that the team
adds. So we're able to operate in a mechanism where one group is building new signals and then immediately setting up heuristics around those signals, to utilize as heuristic kinds of models, and the auto-retraining process picks up these signals automatically into the models it regularly retrains. In this way we are able to most efficiently have this feedback loop between the very hands-on work to optimize a signal, so that the signal is powerful enough to work in a heuristic, and that signal then being incorporated into our next automated retraining for our core machine learning. Nice, so in the software engineering world there is a term CI/CD, continuous integration, continuous deployment, that is a very common practice these days, and so, as the analog for what you're describing, could we call that CI/CT, continuous integration, continuous training, for a lot of these core models that are in your auto-retraining framework? So, to be honest, I would say no. I generally think of continuous training as being a somewhat separate thing, where you're really looking at less than a 10-to-20-minute difference between when a sample shows up and when the weight update has been applied to the model that's deployed in production. At Twitter, we had several systems that utilized this framework, where we did have what you would call CI/CT: we had models that were deployed where the time between when a person clicked on an ad, or chose not to click on an ad, and when that fact had been propagated into a feature update or a back-propagation gradient step for the model that serves ads, was less than 20 minutes. At Abnormal, it's going to be a substantially longer period of time because of our auto-retraining, but there is a very fast turnaround time for that information being incorporated; it goes through the aggregate signals. It goes through the fact that, after this message is sent, we'll extract all these signals from it, update the
aggregates, the features, and the next predictions are different. So you can think of it as all sort of the same thing when you blur your eyes and take a step back, whether you're applying this update to the features or applying this update to the weights of the model, but at Abnormal our only real-time updates are being applied to features, whereas I think of continuous training as referring to real-time updates being applied to the weights. Yeah, so in CI/CT, like you were doing at Twitter, you're talking about some actual training of the model weights, like a back-propagation step, whereas the kind of retraining that you're doing with your auto-retraining framework is more holistic: it's going all the way back to feature creation and aggregation, and then you can take advantage of the kinds of cross terms that you were describing way back earlier in the episode being able to be recreated afresh. So it's a more comprehensive retraining; it's not just one step of backprop. Yeah, that's right. Cool. So I don't know how much you can get into this kind of thing, but I can at least ask: are you able to give examples of instances where a cybersecurity system would miss a threat or identify a false positive, and then require you as a human, or your team, to come in and make some changes to address that kind of miss?
So yeah, I can give one example; I'll talk about it a little bit vaguely. There's a type of pattern that we observe in cybersecurity, things that are sometimes referred to as the Nigerian prince scam, which is essentially a type of scam where you begin by saying, I want to give you money in some way or another, and when the person then engages, they trick them into giving bank account details. Sometimes this is considered to be a less harmful scam, because you're just trying to steal money, you're not necessarily trying to steal credentials that would allow you to advance into the business, but many, many things that begin with "I want to give you money" may end up with malware, credentials, bank account information, many very valuable things being stolen. So this is a very important type of attack that we need to defend against. However, there was one case where we integrated with a church, and this church received a lot of messages from people in other countries saying something along the lines of, hi, here's a donation for $10,000, for $5,000, I want to give this money to you. These kinds of messages had many attributes similar to what you would see in Nigerian prince scams: they were sent from previously unknown senders, from shady parts of the world or shady infrastructure, offering money to the recipients. So this is a clear-cut case of false positives going crazy and being totally unmanageable from the perspective of this customer's security team. The strategy that we apply to this, and we have a few different types of approaches, but the most scalable, best approach, is to start by trying to figure out what it is about these messages that makes our models flag them, extract that as a signal, build an aggregate for that signal keyed on the user, or keyed on the recipient, so, like, how frequently does this recipient receive messages that have this signal in them, and then retrain the models
with that signal. Going through this process enables the models to stop flagging these kinds of messages, because now you've extracted away what is suspicious and taught the model that this type of suspiciousness is not something to block this message for, for this user. So in this case it's this multi-stage process, extract signal, build aggregates, retrain model with aggregates, that's the recipe that we utilize when handling issues like this one. Sweet, yeah, thank you for being able to get into that example. And maybe I'm just going off on a tangent here, but one of the things that I remember about the Nigerian prince scams, which I don't see as much anymore, is that back when I did used to get them, there were lots of spelling mistakes, the grammar was poor, and so I was like, man, these are so bad, how do these ever work? But then I later learned that them being bad is a feature, not a bug, because you're actually trying to find the most gullible people, and the most gullible people will fall for an email that looks terrible. There's a lot of different attacker philosophies on how to approach this, and certainly there are scam emails that are structured in a way such that they filter out people who won't end up falling for them, and therefore save the attacker time on the escalation stages, because what will happen is you'll have to talk to the attacker, and they don't want to spend that time and effort. Everything kind of comes back to this: most attacker behaviors can be explained by thinking about a simple cost-benefit analysis from the attacker's perspective. They want to maximize the number of dollars they get for every minute they spend operating, and time they need to spend on the phone with you is time they only want to spend if they think they have a decent chance of convincing you to give them your money.
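The extract-signal, build-aggregate, retrain recipe described above can be sketched roughly like this. It is a toy illustration with invented signal and recipient names, not Abnormal Security's implementation:

```python
from collections import defaultdict

# Toy sketch of the "build an aggregate keyed on the recipient" step:
# track how often each recipient sees a given suspicion signal, so a
# retrained model can learn that a signal which is alarming for most
# users is routine for this one. Signal names are invented.
class RecipientAggregates:
    def __init__(self):
        self.signal_counts = defaultdict(int)  # (recipient, signal) -> count
        self.totals = defaultdict(int)         # recipient -> messages seen

    def update(self, recipient: str, signals: set[str]) -> None:
        """Record one inbound message and the signals extracted from it."""
        self.totals[recipient] += 1
        for signal in signals:
            self.signal_counts[(recipient, signal)] += 1

    def rate(self, recipient: str, signal: str) -> float:
        """Fraction of this recipient's messages carrying the signal;
        0.0 for a recipient we have never seen."""
        total = self.totals[recipient]
        return self.signal_counts[(recipient, signal)] / total if total else 0.0

agg = RecipientAggregates()
# A church treasurer routinely receives unsolicited donation offers:
for _ in range(9):
    agg.update("treasurer@example.org", {"stranger_offers_money"})
agg.update("treasurer@example.org", set())
agg.rate("treasurer@example.org", "stranger_offers_money")  # -> 0.9
```

Feeding this per-recipient rate into the next retraining is what lets the model learn "block this pattern, except for users who see it benignly all the time," without hand-written exceptions.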
Yeah, so cool that we can go into a specific example like that in a bit more detail. So one big thing that's changed for you since we did a full-length episode is you are now Dr. Dan Shiebler. You finished your PhD work at Oxford University, and during that time you were looking into applications of category theory to machine learning, and you did define that for us back in episode 451, when we had that most recent full-length episode, several years ago now. My memory from then is that category theory had a lot of applications to clustering in particular, and from everything that you've been saying so far, it seems to me like clustering could be something that's very useful for identifying cybersecurity risks, because there are going to be particular kinds of features, like in the Nigerian prince scam, where poor grammar and spelling could be a feature that helps a clustering model identify Nigerian prince emails as opposed to emails that are not, and maybe you could even be using clustering to identify new kinds of threats that you haven't seen before, you know, just like, oh, this is an interesting cluster over here, it seems to correlate with this kind of attack. So where I'm getting at with my question is, first of all, congrats on the PhD, and is there any way that category theory applies to the kind of work that you're doing now at Abnormal Security? On a high level, yes, in that there's a great benefit towards being able to look at the kinds of problems that we face through the template of how they fit into general categories of problems, and then identify what's worked for other types of problems that share these characteristics with cybersecurity. A lot of the challenges we face, as you described, are in the realm of clustering, and one thing that I studied a great deal in my PhD was the relationship between clustering and manifold learning. So manifold
learning is embeddings: things like vectors and vector databases, the types of ways that you can represent some kind of entity as a dense vector related to other entities, query for them, group them together, and understand their behavior and characteristics in this lower-dimensional form, are all general characteristics that apply to a number of applications. Building a manifold on which you would project your entities basically means building embeddings for the data that you're working with, which in the case we're operating in is things like employees, IPs, domains, links, attachments, devices, vendors, companies; these are the sort of core nouns that we reason about, each of which are things for which we derive embeddings, and we group them together. When you identify a new domain, you want to understand what are the other domains that have similar characteristics to this one, so this is something where we can go through the process of deriving an embedding for it and then feeding it into a model that knows how to process embeddings of domains, or identifying how the structure of this domain enables us to cluster it in a group with other domains. The derivation of these kinds of strategies, and how you would utilize them to build this kind of approach, is something I would say benefited me a great deal as we set out our strategy. Very cool, yeah, it's nice that that academic stuff can actually be useful in practice, and it's amazing to me that you did a PhD while working full-time in really challenging roles, first at Twitter and then at Abnormal Security. It's an amazing accomplishment. I really felt like I had my plate full doing my Oxford PhD all on its own, and to some extent, having been in industry for over a decade post-PhD, I'd love to be able to have the space to do a PhD full-time; I think I'd really relish that a lot more than I did when I was much
younger, because you see all these real-world applications now, and there's so many questions that I have that I'd love to have an infinite amount of time to dig into. So we dug up in our research on you that you do still manage to find some time for other things. For example, from your about-me page, it looks like you have an interest in both math and history podcasts, and that kind of leads me to some more open-ended questions, because something else that I know about you, Dan, is that you're really big into fitness. Even just before we started recording, and if you're watching the video version of this there's almost no way to tell, the only thing that kind of gives it away is that the camera seems to be on a slightly shaky surface, and the reason why is that Dan is on a treadmill right now. So before we hit the record button, Dan was actually walking as he was talking to me, as we were catching up and getting set up here. You walk five or six miles a day, it sounds like, during a full work day; that's a cool hack there. But I also know that you live in New York like I do, and you like riding around on your bike. So this is all related to the math and history podcasts, and the kind of reflection that you do, thinking about time passing, particularly with the history podcasts: what do you think about how AI might impact things like urban planning or transportation, particularly maybe in the context of climate change? I'm curious if you have any interesting thoughts on how AI and machine learning might transform our urban world over the coming decades. It's a good question, and to be honest, I'm not tremendously optimistic. I'm very optimistic about a lot of things, but with our urban world, I think in many cases
the largest problems that we face are due to people problems rather than technical problems, and I think that AI is a powerful tool for solving technical problems and a less powerful tool for solving people problems. So one example: I had a professor back when I was at Brown, an incredibly brilliant guy, Philip Klein, a professor at Brown, and one problem I remember he was working on was applying graph coloring algorithms to gerrymandering problems, so identifying how you would most equitably assign voting districts to a particular region based on where people live, populations. And I remember thinking, this will never happen or ever come into play, because nobody has an incentive to make things the most equitable way; the incentives are to try to benefit whichever policy you're trying to get passed or person you're trying to put into power, and that's always how the decisions will be made for these kinds of things. Perhaps that's a very cynical perspective on this particular area, but I think the human angle on things like city construction is perhaps too dominant for technical approaches to really have the same kind of transformative power that they'll have in other areas, at least in the immediate future. Okay, yeah, that's a good answer. I still, you know, maybe there's tangential ways; I guess things like, to the extent that AI could be helpful in keeping the plasma contained in a nuclear fusion reaction, I guess we could have a lot more abundant energy, but in terms of actually, like, urban planning... There's tons of applications in energy, and even global warming; things like simulations are possible, and chemical development is something that's tremendously enabled by simulations, as well as
all sorts of different areas in engineering and manufacturing. There's many things in the world of atoms where advances in machine learning technology and AI technology have already shown tremendous advances, and will continue to do so, but there's always a tension between what is possible, what technology makes cheap and efficient and effective, and what are the incentives and structures of the society that we need to operate in. Great answer. So as we start to wrap up a little bit here, it's clear to me, and probably to a lot of our listeners, that you have a tremendous breadth of knowledge. Do you have any particular tips for us on how we can stay updated and ensure we're continuously learning, I guess both inside of our field in data science and AI, but maybe outside of it as well? I think trying things out and exploring new technology when it becomes available, just opening up some little projects and trying to challenge yourself to build something. There's really nothing that lets you learn about something better than trying to build. There's something about putting yourself in a situation where you need to demonstrate the knowledge that you've acquired that lets you understand it at a really deep level. So I like looking at various GitHub repos, cloning them, making small changes, and building little toys for different applications when I want to explore a new kind of technology, and I find that that's really the best way to challenge yourself and to grow in new areas technically.
Yeah, great answer, and definitely one that I agree with as well. That's always the thing for me: just reading a book, especially a technical one... reading for pleasure is one thing, but when it's about learning some new machine learning approach, I definitely prefer saying, okay, I want to learn this cool thing, what's something I can do with it? It could even be as simple as finding someone else's Jupyter notebook, where I can use that notebook and, like you're saying, make small changes, upload my own dataset or something, and just see how things go.

All right, so Dan, you probably remember from your previous appearances on this show that before I let a guest go, we like to ask for a book recommendation. Have you got anything new for us?

I've recently read a few history books. Barbarians, Marauders, and Infidels, I want to say, is the name of one, an incredible book on medieval warfare. It covers a number of different types and locations of battles, and the broad themes of how warfare changed from the fall of the Roman Empire to the fall of Constantinople: the introduction of the Magyars, the Vikings, and the Arabs as three different groups that dramatically changed the landscape of the areas they operated in; the different weaponry; the rise of artillery; the rises and falls of different kinds of projectile weaponry; the different roles of the horse and the boat. It's really a fascinating survey of a fascinating and complex time in history, and of what it says about the people who lived then and how their lives are similar to and different from those of people today.

It ties together a lot of your interests there: you've got security, you've got history. Nice, that sounds like a great
recommendation, and it's amazing that you can offer such a detailed account of what's covered in the book on a whim. Thank you very much for that suggestion, Dan, and thank you very much for a wonderful episode. Maybe we can check in again in a few years once more on how things are going; with your very articulate way of speaking about such technical concepts, no doubt our audience will be craving that again.

Sounds good. Thanks, John. I'm really glad to be here today.

Oh, and also, before you leave: in the meantime, between now and that inevitable Super Data Science episode, maybe episode 1,000 or 900 or something, if listeners want to be following your thoughts, what's the best way for them to do that?

Probably my Twitter or my LinkedIn, I would say. I'm @dshieble on Twitter: it's just the letter D and then my last name without the R, eight characters.

Nice, we'll be sure to include that in the show notes, and yes, now I really will let you go. Thank you very much for being on the show, and we'll catch up with you again soon.

Thanks, John.

What an impressive, confident speaker; it's always awesome to catch up with Dan. I hope you enjoyed the conversation. In today's episode, Dan filled us in on the heuristic, intermediate ML models, as well as the large language models, that they develop at Abnormal Security to identify cybersecurity risks in messages. He talked about how false negatives are individually the biggest classification error to avoid in cybersecurity, but how false positives accumulate to create a dangerous boy-who-cried-wolf situation as well. He talked about how Redis key-value stores and an auto-retraining framework allow for efficient on-the-fly model updates, how the clustering associated with category theory is useful in real-world applications, and how AI is great at solving technical problems but not always human problems, like those associated with urban planning and politics. As always, you can get all the show notes, including the
transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Dan's social media profiles, as well as my own, at superdatascience.com/717.

Beyond social media, another way we can interact is coming up on November 8th, when I'll be hosting a virtual half-day conference on building commercially successful LLM applications. It'll be interactive and practical, and it'll feature some of the most influential people in the large language model space as speakers, including some who have been on this show. It'll be live on the O'Reilly platform, which many employers and universities provide free access to; otherwise, you can grab a free 30-day trial of O'Reilly using our special code SDSPOD23. We've got a link to that code ready for you in the show notes.

All right, thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you, and thanks, of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another absorbing episode for us today. You can support this show by checking out our sponsors' links, by sharing, by reviewing, by subscribing, but most of all, just by keeping on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, my friend, keep on rockin' it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.
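As a postscript for hands-on listeners: the "Redis key-value store plus auto-retraining" pattern from the recap can be sketched in a few lines. The code below is a hypothetical illustration, not Abnormal Security's actual framework; a plain Python dict stands in for a Redis client, and `publish_model` and `score` are made-up helper names. The pattern: a retraining job publishes fresh model weights under a versioned key and bumps a "latest" pointer, while the serving path reads that pointer on every request, so new models take effect without a redeploy.

```python
import json

# Stand-in for a Redis client: store[key] plays the role of SET/GET.
store = {}

def publish_model(version, weights):
    """Retraining job: write versioned weights, then flip the pointer."""
    store[f"model:{version}"] = json.dumps(weights)
    store["model:latest"] = str(version)

def score(features):
    """Serving path: look up the latest model on every single request."""
    version = store["model:latest"]
    weights = json.loads(store[f"model:{version}"])
    # Toy linear model over named features.
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

publish_model(1, {"num_links": 0.5, "sender_is_new": 2.0})
print(score({"num_links": 3, "sender_is_new": 1}))  # prints 3.5 (v1 weights)
publish_model(2, {"num_links": 0.25, "sender_is_new": 3.0})
print(score({"num_links": 3, "sender_is_new": 1}))  # prints 3.75 (v2 picked up automatically)
```

In production, the dict reads and writes would become `GET`/`SET` calls against a shared Redis instance, which is what makes each model swap immediately visible to every serving replica at once.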