717: Overcoming Adversaries with A.I. for Cybersecurity, with Dr. Dan Shiebler
This is episode number 717 with Dr. Dan Shiebler, head of Machine Learning and AI at
Abnormal Security. Today's episode is brought to you by Grafbase, the Unified Data Layer,
by ODSC, the Open Data Science Conference, and by Modelbit for deploying models in seconds.
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science
industry. Each week, we bring you inspiring people and ideas to help you build a successful
career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make
the complex simple. Welcome back to the Super Data Science Podcast today, the wildly intelligent
and clear-speaking Dr. Dan Shiebler returns to the show for his fifth visit.
Dan is head of Machine Learning at Abnormal Security, a cybercrime detection firm that has
grown to over $100 million in annual recurring revenue in just four years. And there he manages
a team of over 50 engineers. Previously, he worked at Twitter first as a staff machine learning
engineer and then as an ML engineering manager. He holds a PhD in AI theory from the University of
Oxford and obtained a perfect 4.0 GPA in his computer science and neuroscience joint bachelors
from Brown University. Today's episode is on the technical side, so it might appeal most to hands-on
practitioners like data scientists and ML engineers, but anyone who'd like to understand the state
of the art in cyber security should give it a listen. In this episode, Dan details the machine
learning approaches needed to tackle the uniquely adversarial application of cybercrime detection,
talks about how to carry out real-time ML modeling, what his PhD research on category theory entailed
and how it applies to the real world, and he opines on the major problems facing humanity in the coming
decades that he thinks AI will be able to help with, and those that he thinks AI won't.
All right, you ready for this absorbing episode? Let's go.
Dan, welcome back yet again to the Super Data Science podcast. I guess you're in New York as
usual. That's right. Thanks, John. Happy to be back. Yes, we've got an exciting episode today, so
previously you've been on the show going all the way back to episode number 59,
and then you came back while Kirill was still hosting; that was episode number 345.
My first time hosting you was episode 451, and then episode number 630, you did a five-minute
Friday style episode where you answered a question specifically about resilient machine learning,
which we'll actually build upon a bit more in today's episode. And yeah, something kind of cool
for our listeners to check out. If you don't watch the video version, today I'm filming from
Detroit, and this hotel that I'm in, the Foundation Hotel Detroit, it's wild. Like, you know,
I was just expecting to record in my hotel room, but I was leafing through the hotel
booklet, and they have a dedicated podcast studio. So I'm actually, I've got this sweet "On Air" sign
behind me. Other than that, it's like, it's just a quiet room. There's like lots of curtains and
stuff. I'm using all my own equipment, but yeah, it's kind of a cool look for the video today.
So yeah, so we've got tons of content for you, Dan, building on the resilient ML stuff a bit,
and focusing on what you've been doing since your last episode. So it's been several years now,
since your full length episode with me. And in that time, there's been a lot of changes.
Most notably, you're working at a firm called Abnormal Security. And so you're addressing the
high-stakes challenges of cybercrime over there with machine learning. So what makes this particular
adversarial machine learning challenge... where, you know, you're not just building a machine learning
model that is acting in a vacuum. It's very much the opposite: the models that you build, people
are trying to reverse engineer them on a regular basis to be able to overcome the kind of security
that you're developing with your ML models. So this kind of adversarial scenario, this adversarial
machine learning challenge, how is that unique relative to the other kinds of machine learning models
that you've built historically? Totally. So Abnormal Security is a company that builds detection systems
for identifying cyberattacks that are coming in through email and through accounts that people have
on various SaaS platforms, including email, but things like your Slack, your Okta, your Zoom.
And so there's really two kinds of attacks we're most concerned about. One is an account has been
compromised, and we're trying to identify that this account has been taken over: the attacker has
gotten the credentials, and now the person who's operating this account is no longer the account
owner, it's the attacker who compromised it. And the other option is inbound attacks. There's an email
message generally, or sometimes other types of messages, where an attacker is transmitting a payload,
which could be a phishing link, or a message that's eliciting the recipient to update bank
account information, or perhaps malware, or anything else that's the initial vector to begin a cyber
attack. And so the machine learning models that we build operate at the level of individual events,
which are the messages that are being sent, the sign-in events that we're observing for the
accounts and a number of other kinds of events. And at each of these events we're trying to identify
is this an attack? Is this malicious, or is this normal behavior? And this is a very adversarial
situation, because the person on the other end, the attacker, is going out of their way to try to
cloak their actions. They're trying to make the messages that they're sending look as similar as
possible to safe business messages. They're trying to sign in utilizing infrastructure and technology
that allows them to cloak the fact that they are attackers, to hide their identity, and to obfuscate
anything that they're doing so that it looks like a normal individual. And so the machine learning
models that we're utilizing need to take advantage of the things that we know that the attacker might
not know, for instance, and we try to build something that is resilient to the different
kinds of modifications the attacker might utilize, and that can really get at the heart of what
separates normal business traffic and communications from what the attacker is able to
do. Yeah, yeah, yeah. So I imagine this involves a broad range of different kinds of
models. So I know you mentioned online, and in some of our research we dug out, that some of these
models involve probabilistic models that are relatively straightforward and, I imagine, relatively
efficient, all the way up to large language models, which presumably are, you know, a lot more
expensive to run and aren't necessarily as fast at inference time. So given the kinds of attacks
that you're trying to identify, how do you decide what kind of model you're going to be using for
a particular type of threat? So there's really three kinds of models that we utilize, which
each try to capture something a little bit different and have different trade-offs
in terms of what they have access to, the cost that they require in order to utilize
them, the speed at which they can be invoked, and their efficacy at the range of different
kinds of attacks that they can capture. So everything that we build, all of the models we
build are powered by aggregate signals which are the most important component of our approach toward
cyber security. So basically this is a special type of feature that we build over raw data that
then powers all the different kinds of our models. And so this is a sort of foundation of our
detection strategy. And so these are aggregates over raw emails, sign-ins, and other kinds of raw
events at individual entity levels. So for example, we would aggregate for a particular person
all of the emails that that person has received and be able to say things like: how many times have
they received an email that has this header in it, or this kind of phrase in it, or this kind of
attachment, that's routed through this IP, utilizes this infrastructure, has this HTML tag,
each of these different kinds of little individual signals that could lead to identifying some
information about this email or about the sign-in events. We aggregate at the level of each person
who's receiving, each piece of infrastructure that's sending, and each piece of infrastructure,
the IPs and domains, that messages and sign-ins get routed through. And through this we build this
historical picture, basically a summarization of everything that's happened up until this point,
that serves as our foundational feature infrastructure. So this is a very structured way
of building representations of features. And so it means that there's now a number of different ways
that we can utilize these derived signals in models effectively. And so the simplest thing for us
to do is heuristics, heuristics and rules built on top of these signals. This is already a very heavily
data-driven approach. Fundamentally, these aggregate signals themselves are basically simple models.
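As a rough sketch of the kind of per-entity aggregate signal Dan is describing (all names here are hypothetical, not Abnormal's actual feature infrastructure), a running per-person counter already behaves like a simple conditional-probability model:

```python
from collections import defaultdict

class AggregateSignal:
    """Hypothetical per-entity aggregate: tracks how often a condition
    has held across all events observed for an entity (e.g. a recipient)."""

    def __init__(self):
        self.seen = defaultdict(int)      # total events per entity
        self.matched = defaultdict(int)   # events where the condition held

    def update(self, entity, condition_held):
        # Fold one raw event (an email, a sign-in) into the running counts.
        self.seen[entity] += 1
        if condition_held:
            self.matched[entity] += 1

    def frequency(self, entity):
        # Approximates P(condition | entity), i.e. "what fraction of this
        # person's emails were routed through this IP / carried this header".
        if self.seen[entity] == 0:
            return 0.0
        return self.matched[entity] / self.seen[entity]

# Example: how often has alice received mail routed through a given IP?
sig = AggregateSignal()
for sender_ip in ["1.2.3.4", "1.2.3.4", "9.9.9.9", "1.2.3.4"]:
    sig.update("alice", condition_held=(sender_ip == "1.2.3.4"))
print(sig.frequency("alice"))  # 0.75
```

Rules and heuristics can then be layered on top by comparing such frequencies against thresholds or combining them as conditionals, which is the Bayesian-network-like structure Dan mentions next.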
They are basically probabilistic models that demonstrate, given some condition, what percentage
of the time X is true. You could construct these heuristics and rules to look very similar to
a Bayesian network on top of these individual aggregate signals, with different sorts of
conditionals that you're applying and different kinds of derived probabilities that you're building
on top of it. The next level of sophistication is basic trained models. So this would be things
like logistic regressions, XGBoost, and we like deep and cross networks as our neural network
architecture of choice for this kind of problem. Deep and cross? Deep and cross networks. Yes,
so it's a network architecture. It's very popular in ad tech. We have a number of people at
Abnormal who have previously worked in ad tech. Basically it's a type of neural network where you consume
both raw signals, utilizing a deep layer, as well as cross signals, where basically you build
derived signals from your individual raw features. You learn
the derivation of your cross signals and then feed that into a deep network. So the cross layer
functions in a way where it can take things like, here's a frequency that some attribute is true,
and here's a boolean signal that says true or not true, and then you can do, like, a multiplication
of these in order to build a derived signal, for instance. So these are sorts of cross features.
The space of potential cross features that you can build is very, very large, and utilizing
this network architecture allows you to attend to the specific cross features that are most valuable.
So it allows you to sort of remove a little bit of the work required to
build sophisticated cross features without having a giant parameter space. So it's nice for cases
where you have both deep embeddings and a lot of boolean and continuous features that you're
consuming at the same time, and you want to do something a little bit different
with, like, the dense continuous signals within an embedding versus the individual boolean and
continuous signals that represent more sparse information. And so the deep and cross network
enables you to do that; it's like an inductive bias that's built into that kind of architecture.
We utilize normal feed-forward networks as well in this intermediate category
of models that we train, but we like the deep and cross because we've seen good performance with
it, and there are nice implementations online. Yeah, that's very cool. So I hadn't heard of this
kind of cross layer specifically in a deep neural network before. In my mind, I kind of
imagine the workhorse dense layer as being capable of doing some of the things
that you're describing. So a dense layer should be able to, in many circumstances, identify
whether a cross of two input features is together creating a lot of signal, because that
denseness means that it recombines any possible
inputs from the preceding layer. So it sounds like this kind of cross layer, as opposed to being
a general-purpose dense layer that happens to be able to do those kinds of
multi-term interactions, is explicitly designed to do that. Yeah, so basically
it's similar to how a convolutional neural network is inherently less expressive than a feed-forward
neural network but still more performant on image tasks than a raw feed-forward
network is. It's embedding the inductive bias that these particular kinds of multiplications
between your features, to cross them, are a useful thing that you want to do for that category of
feature. And so we utilize it when we have these signals where there's, like, one signal that
tracks the frequency of an event and then another signal that tracks the presence of that event, where
these are two features that really only make sense when they're combined together and are very
difficult to cross with other kinds of signals; their poignancy relies on their
combination. And so building those crosses explicitly through the cross network is useful for that
kind of application. Very cool. This episode is brought to you by Grafbase. Grafbase is
the easiest way to unify, extend, and cache all your data sources via a single GraphQL API
deployed to the edge closest to your web and mobile users. Grafbase also makes it effortless
to turn OpenAPI or MongoDB sources into GraphQL APIs. Not only that, but the Grafbase command-line
interface lets you build locally, and when deployed, each Git branch automatically creates a preview
deployment API for easy testing and collaboration. That sure sounds great to me. Check Grafbase out
yourself by signing up for a free account at Grafbase.com. That's g-r-a-f-b-a-s-e.com.
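To make the cross layer idea from the discussion above concrete, here is a minimal NumPy sketch of a DCN-style cross update, x_next = x0 * (x . w) + b + x. This is a toy illustration of the mechanism, not Abnormal's architecture, and the feature values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x0 * (xl . w) + b + xl.
    The multiplicative term explicitly crosses feature pairs, e.g.
    a frequency feature against a boolean presence feature, while the
    "+ xl" residual keeps the original signals flowing through."""
    return x0 * (xl @ w) + b + xl

# Toy input: [frequency of an event, presence of that event, a continuous signal]
x0 = np.array([0.75, 1.0, 0.2])
w = rng.normal(size=3)   # learned in practice; random here for illustration
b = np.zeros(3)

x1 = cross_layer(x0, x0, w, b)   # first cross layer
x2 = cross_layer(x0, x1, w, b)   # stacking builds higher-order crosses
print(x2.shape)  # (3,)
```

In a full deep and cross network, the stacked cross layers run alongside a standard feed-forward (deep) branch, and their outputs are concatenated before the final prediction head; learning `w` lets the model attend to the most valuable crosses without enumerating every feature pair.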
Well, so we digressed a bit into this. So you were saying that there's three kinds of models
that you use. You described the first, the heuristic models, rules-based ones,
and then you were kind of talking about intermediate-complexity machine learning models,
so things like random forests, logistic regression models, these deep and cross neural networks.
So yeah, I don't know if I missed any there, if you wanted to go any deeper on that one, or if
you wanted to jump down to model type number three. Yeah, so model type number three is large
language models. And so we utilize both the out-of-the-box OpenAI APIs for certain tasks as
well as building our own fine-tuned variants: we've utilized Falcon and Llama and fine-tuned
those for a few different tasks. And when you think about these three different categories,
they kind of grow in a crescendo of cost required to run, with increased latency and
decreased speed, and different characteristics in terms of their ease of use. The first category and
the third category are perhaps the easiest to use and modify, because large language models
you can repurpose with prompt engineering, and rules you can repurpose by tweaking things,
whereas the intermediate category of deep neural networks and such really requires retraining
in order to incorporate new information. And so all three have pros and cons and can be applied
to different types of use cases and challenges within the ecosystem of different kinds of attacks
we're trying to catch for different customers. Nice, that's a really good high-level summary of
the kinds of models that you work with. And yeah, it's interesting to think about how
that third tier, those large language models, they've become so complex now that they're
actually, as you say, I hadn't thought of it this way before, but they become as easy
to use as a simple heuristic model, because you just change your prompt and they're so flexible.
You don't need to be retraining the entire model. You know, maybe, potentially,
in that third category, you could also be inserting some, like, some PEFT layers,
and those are then very fast to fine-tune. You could have this huge architecture, like you mentioned
Falcon, it's a 40-billion-parameter model, but you could use parameter-efficient fine-tuning,
PEFT, to fine-tune to some specific task of yours, maybe just have a few hundred or a few thousand
examples of some task that you'd like it to be able to specialize in. And you can train
that in minutes or hours, even though the architecture is so gigantic, because there might only be
a million or so or, you know, a hundred million parameters that you're training.
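The parameter-efficient fine-tuning idea Jon mentions can be sketched with LoRA-style low-rank adapters. The sizes below are toy values, and this illustrates the technique itself rather than any particular library's API:

```python
import numpy as np

rng = np.random.default_rng(42)
d_out, d_in, r = 64, 64, 4   # toy sizes; real LLM layers are far larger

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-init so the update starts at zero
alpha = 8.0

def lora_forward(x):
    # Frozen path plus low-rank trainable update: W x + (alpha/r) * B(Ax).
    # During fine-tuning, only A and B receive gradients; W never changes.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Before any fine-tuning, B = 0, so the output equals the frozen model's output.
print(np.allclose(lora_forward(x), W @ x))  # True

# Trainable parameter count shrinks from d_out*d_in to r*(d_in + d_out):
print(d_out * d_in, r * (d_in + d_out))  # 4096 512
```

At 40B scale the same ratio is what makes fine-tuning feasible in minutes or hours: only the small A and B matrices per adapted layer are trained, while the giant base weights stay frozen.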
Yeah, this parameter-efficient fine-tuning technique, as opposed to trying to do, yeah, the whole
40 billion. It is definitely the case, we've observed at least, that once you go down the
route of fine-tuning these models, you lose some of their generalizability and ability
to adapt them to different tasks. We view these models as sort of uber-classifiers that can be
applied to classification tasks by taking them and utilizing their size, their really deep
understanding of raw fundamental concepts, and their ability to reason as bases for being
applied to representations of our data in a text form that they can understand, and that
they're then fine-tuned to understand better. And we have a couple of different kinds of
message classification tasks that we operate: both just identifying whether or not something is
an attack, as well as identifying attacker objectives and triaging messages that are submitted
to a phishing mailbox product that we operate as well. And each of these is a slightly different
kind of task that requires slightly different kinds of behavior that has involved some amount of human
interface in the past, and that's where we're trying to incorporate large language
models: to reduce the human burden in the areas that involve it, because the cost characteristics of
these models make them very, very difficult for us to utilize in an application like
scanning every sign-in or every email that we process. It's really cost-prohibitive to do
something like that with models of this size, but something that already involves some human
interaction is much more manageable to incorporate these models in.
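A hedged sketch of the kind of LLM-assisted triage Dan describes: the function below only builds a classification prompt with a fixed label taxonomy, which keeps the model's output easy to parse. The labels and field names are hypothetical, and the actual model call and Abnormal's real taxonomy are not shown:

```python
def build_triage_prompt(subject, sender, body_excerpt, labels):
    """Hypothetical prompt for using an LLM as a message-triage classifier.
    The model is asked to pick exactly one label from a fixed set
    (e.g. attacker objectives), rather than answer free-form."""
    label_list = "\n".join(f"- {label}" for label in labels)
    return (
        "You are triaging a message reported to a phishing mailbox.\n"
        f"Subject: {subject}\n"
        f"Sender: {sender}\n"
        f"Body excerpt: {body_excerpt}\n\n"
        "Classify the most likely objective. Respond with exactly one of:\n"
        f"{label_list}"
    )

prompt = build_triage_prompt(
    subject="Urgent: update payment details",
    sender="billing@vendor-example.com",
    body_excerpt="Please update our bank account information before Friday.",
    labels=["invoice_fraud", "credential_phishing", "malware", "safe"],
)
print("invoice_fraud" in prompt)  # True
```

The prompt would then be sent to either a hosted API or a fine-tuned open model; because each call costs real money and latency, this pattern fits human-in-the-loop triage far better than scanning every inbound email.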
Nice, yeah, that makes a lot of sense. Yeah, large language models being able to
augment or automate something where a human would be required is probably going to be more
cost-effective, whereas, yeah, trying to, like, have huge volumes of emails be processed by an LLM
would be, like, crazy, crazy expensive. So one of the big things about training any machine learning
model, particularly when we're talking about that intermediate tier, your second tier, so,
you know, the random forests, the logistic regressions, the now-classical deep learning architectures,
one of the big things is looking at, you know, in your kind of scenario, you'll have some true
state of the world that you're trying to model. You have correct labels that you're trying to guess
with your machine learning model. And anytime we're trying to do that classification,
we end up in machine learning with some false positives and some false negatives. And obviously,
we want to try to minimize both, but in your context in cybersecurity, is one of those, false positives
or false negatives, kind of worse than the other? And do you try to minimize one in
particular? It's really a balance, to be fair. I mean, I think the worst thing that can happen
is that you miss a really serious attack and it causes a lot of damage to customers. So in that sense,
false negatives are more of a larger existential problem, like the worst kind of false negative
is the worst of all. But a false positive problem, a high rate of false
positives, is equally bad, because a business can't operate if people are being
stopped by their security solutions from engaging in normal business. And so customers will end up
putting in overrides and ignoring remediation criteria, and then they'll expose themselves to
exactly those kinds of really bad false negatives, and we'll have no ability to control for it at all.
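One common way to manage this trade-off is to pick the decision threshold from scores on known-safe traffic so the false-positive rate stays under a budget. A minimal sketch, not Abnormal's actual calibration procedure, with made-up scores:

```python
def threshold_for_fp_rate(safe_scores, max_fp_rate):
    """Pick a decision threshold (flag when score >= threshold) so that
    the false-positive rate on known-safe traffic is at most max_fp_rate."""
    scores = sorted(safe_scores)
    n = len(scores)
    allowed = int(n * max_fp_rate)   # how many safe events we may flag
    if allowed == 0:
        return scores[-1] + 1e-9     # budget of zero: flag nothing
    return scores[n - allowed]

# Toy model scores on messages known to be safe:
safe = [0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.95]
t = threshold_for_fp_rate(safe, max_fp_rate=0.1)
print(t)                               # 0.95
print(sum(s >= t for s in safe) / len(safe))  # 0.1
```

Lowering the threshold catches more attacks (fewer false negatives) at the cost of flagging more safe traffic, which is exactly the balance between missed attacks and the "boy who cried wolf" effect described here.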
So yeah. The boy who cried wolf kind of scenario. Absolutely. Cool. Yeah, that's interesting. I
kind of, in my head, I was expecting you to answer that question and just say that, like, false negatives
are the worst, we've got to make sure we avoid those. But yeah, of course, if your clients are getting
false positives all the time, then they're just going to ignore your tool, and then they're going to
miss the real deal. So in the last few years, I understand that the threat landscape has
changed a fair bit. So how have you had to adapt your models at Abnormal Security to handle those
new challenges? So traditionally, cybersecurity solutions function by identifying indicators of
compromise and stopping threats based on matching indicators of compromise between a particular
threat and a new thing that may or may not be a threat, like a new message or a new kind of
sign-in. And an indicator of compromise in this case is something, it's a smoking gun: a link
that is known to be bad, a domain that is known to be bad, an IP that has poor reputation,
an attachment whose hash matches some known malware. There's many different kinds of indicators
of compromise. But what has happened is that the cost and ease with which attackers can switch
the routes and tools that they're utilizing has simply gone down. Attackers have had better and
better access to systems that have allowed them to evade these types of recognition of indicators of
compromise and send out attacks that don't match the patterns of any previous attacks,
with much, much larger scale and a much, much higher degree of ability to avoid detection.
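The indicator-of-compromise matching described above can be sketched as simple set lookups. The lists here are made up for illustration; real systems consume curated threat-intelligence feeds:

```python
import hashlib

# Hypothetical indicator-of-compromise lists (illustrative only):
KNOWN_BAD_DOMAINS = {"evil-login-page.example", "not-your-bank.example"}
KNOWN_MALWARE_SHA256 = {
    hashlib.sha256(b"malicious payload bytes").hexdigest(),
}

def matches_ioc(link_domain, attachment_bytes):
    """Classic IoC matching: flag if the link's domain or the attachment's
    hash matches something already known to be bad. This is exactly the
    check that cheap infrastructure rotation lets attackers evade: a fresh
    domain or a one-byte change to the payload produces no match."""
    if link_domain in KNOWN_BAD_DOMAINS:
        return True
    digest = hashlib.sha256(attachment_bytes).hexdigest()
    return digest in KNOWN_MALWARE_SHA256

print(matches_ioc("evil-login-page.example", b""))                       # True
print(matches_ioc("fresh-domain.example", b"malicious payload bytes "))  # False
```

The second call shows the brittleness: an unseen domain plus a trivially modified payload sails straight past the lookup, which motivates the anomaly-based strategy Dan describes next.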
And this has certainly gotten substantially worse with the introduction of
generative AI tooling. Generative AI tooling in particular enables the
personalization of attacks to a particular recipient by combining something like somebody's
LinkedIn profile and integrating that seamlessly, and entirely automated, into social engineering
scams that are highly targeted for that person. And this both avoids the indicator-of-compromise-style
checks for the templates that phishing emails would normally match and
decreases the degree to which these kinds of messages and attacks look to the recipients to
be malicious. So our strategy at Abnormal is to avoid an over-focus on indicators of compromise
as the core tenet of our strategy. Our strategy instead is to focus on identifying abnormalities in
individual pieces of communication, in emails and sign-ins, that make them different from
normal business communication, and rather than try to root-cause an attack, instead try to
spot things that don't look like the normal, safe communication. So rather than "is attack," we use
"not safe" as our core strategy and core objective. And so this enables us to be much more resilient to
changes that attackers can make to their attacks to try to avoid indicators of compromise, and also
enables us to play to the greatest advantage that security defenders have over security
attackers, which is knowledge of the targets. The attacker who's attacking somebody doesn't know
what's in that person's inbox. They don't know what emails that person received yesterday. They
may know a little bit about the target if they've utilized open-source intelligence, but they're
unlikely to know nearly
as much as a security solution that's plugged into that person's accounts and has access to that company's
data and information. And by leveraging this advantage, this information asymmetry that defenders
have access to, we're able to most effectively fight back against attackers. This extends very,
very naturally to the growing threat posed by generative AI tech. Fascinating and very well said;
you have such a crisp way of speaking. It's so easy to understand you. Thanks. So yeah, I mean, this
is now getting a little bit, you know, into the future, maybe, although maybe not that far in the
future. Do you ever worry about how generative AI, like, I don't know, some kind of open-source
alternative of, like, GPT-5 or GPT-6, you know, something of that kind of capability that might be here
in a few years, that's, like, open source and so can be used for malicious purposes... do you ever,
you know, worry about LLMs being able to go beyond the kinds of attacks you're describing here? Like,
this personalization, which, you know, allows for the automation of, say, phishing attacks, where,
instead of needing to have a human be researching somebody and coming up with points
for a phishing email that might make them feel like this is a trusted entity, the LLM can now do
that automatically. But yeah, in the future, with, like, some kind of open-source, maliciously usable
GPT-5 or GPT-6 variant, this might be able to do much more. Like, this might be able to, like,
plan attacks. Like, it's not just generating the text, but it's actually, in some ways, it's
like an independent malicious actor that some malicious human can kind of just, like, set
in motion and say, like, you know, here's some money, get as much money back as possible. Yeah, is that
something you ever spend time thinking about, or is that just too far out? I think
multi-stage planning with deep reasoning is very, very difficult. I think it's substantially more
difficult than solving a range of different problems. So I am less concerned about this from a
sort of, like, general existential threat perspective. But that said, I think in cybersecurity there's
a heuristic that you could utilize for identifying what kinds of attacks you'll see in the future
that has proven to have been pretty effective, and this falls closely within that, which is that
cybercriminals are financially motivated, by and large. Not every cybercriminal is
financially motivated; there are state actors that exist as well, but they are a tiny percentage
of the overall set of cyberattacks. The vast majority of cyberattacks are sent by people
who are trying to receive a return on an investment. They've spent some money to invest in
technology to cloak their identity, technology to acquire internet assets that they'll utilize
to send out attacks (these are domains and IPs and types of, like, internet connection, ISP
variants), and they will try to get a return on the money they're spending. And the attack strategies
that enable them to get a return on the money they're spending have become more and more sophisticated.
In the past, if you were going to do something like spear phishing, you know, you needed to spend
a great deal of time investigating your targets, and that time is money, basically, because you are
assuming you're getting paid on some hourly basis. Think of yourself as a cybercriminal
comparing what you'd get paid at McDonald's to looking up someone's LinkedIn and utilizing it to
generate spear phishing emails. If a tool lets you send out 10 spear phishing emails in the time
it previously would have taken you to send one, now you're going to be able to start sending
more of these. And there's certain kinds of attacks that are very sophisticated that exist already,
that we see: these types of vendor fraud attacks, where an attacker will compromise a legitimate vendor's
account, which is a very expensive thing to do. Purchasing an account, an email address of someone
in billing at a Fortune 500 company, on the dark web, that's a very expensive asset
you're unlikely to have for very long, because the company likely has a security team that's
going to find you. And so it's a short-lived, expensive asset that an attacker is acquiring
and attempting to get as much money out of as possible before they lose access to that asset.
And so these kinds of attacks are very sophisticated, very difficult to detect, but we do see some
of them, and we do build models and systems to detect them. And it's a reasonable heuristic that
things that we see a small amount of now, because of their sophistication and because of the amount
of money that attackers need to spend in order to generate them, will become cheaper for attackers to
send in the future. As technology advances, as AI advances, as cybercrime builds a larger ecosystem
of tooling and systems, attackers will be able to send more obfuscated attacks at lower
and lower price points, which will mean that the things that we see at a, you know, maybe
once-a-week basis will become things we see every day, or things we see 10 or 20 of a day, as this
moves closer. And I think that this case that you're describing right now, with an agent that
is operating the planning and prospecting of attacks on a multi-stage basis, where first they
send a series of attacks, phishing emails at that person who's in billing at some vendor,
in order to get access to their account, then they have access to that account, then they're sending
messages from that account to the various customers of that vendor to tell them to update their bank
account info... this kind of complex, sophisticated, multiple-stage attack
reaching lower price points, I think it's feasible to imagine that that could happen,
and the best way to protect against it is to take seriously the attacks that we see rarely today,
with the expectation that they will become more and more common in the future.
Be where our data-centric future comes to life, at ODSC West 2023, from October 30th to November 2nd.
Join thousands of experts and professionals in person or virtually as they all converge and
learn the latest in deep learning, large language models, natural language processing, generative AI,
and other topics driving our dynamic field. Network with fellow AI pros, invest in yourself with their
wide range of training, talks, and workshops, and unleash your potential at the leading machine learning
conference. Open Data Science Conferences are often the highlight of my year; I always have an
incredible time, and we've filmed many Super Data Science episodes there. And now you can use the code
SUPER at checkout and you'll get an additional 15% off your pass at odsc.com.
Nice, yeah, that is a very sensible heuristic; as soon as you started to explain it, I was like,
yeah, that makes a lot of sense. So yeah, that certainly is something to keep an eye on. I guess,
yeah, we don't know how quickly, if ever, machines are going to have that, like, multi-stage
planning capability, but, I don't know, with how blown away I was by the jump from GPT-3.5 to GPT-4,
I'm like, you know, being surprised should be unsurprising. Yeah, okay, so clearly that kind
of heuristic is something that's useful for helping you figure out what kinds of models you might
need to start prototyping now. I understand that you also do head-to-head competitor comparisons on a
weekly basis, so how does that help you with refining your models as well? Totally, so I'll just talk
a little bit about the process. Most companies of decent size need to spend a decent amount of money on email security; email is the primary vector by which large businesses get attacked with malware, phishing, invoice fraud, et cetera, and there are a number of different ways that businesses can try to protect themselves. The most common way is purchasing solutions like Abnormal Security, and we have many competitors that offer similar products that try to protect customers from these kinds of attacks. Because of the sheer volume of these attacks and the length of time that email security has been around, this has been a product category for one of the longest time periods within SaaS; email security is measured in a time frame of decades rather than the years typical of most SaaS products. We have a pretty easy-to-understand way to compare two products: you simply install both, and you see which one catches more attacks and which one generates fewer false positives. Very simple to see, very simple to evaluate. Every week, our sales team works with customers to install Abnormal Security in their environments and compare us against either the customer's current email security solution or competitor email security solutions the customer is also considering. Normally customers will consider a number of different solutions at different price points, observe which ones require the most effort for them to manage, which is basically the same thing as false positives (fewer false positives means less effort to manage), and which ones protect their employees the best, which is the same thing as false negative rate: the lower the false negative rate, the better you're protecting the employees at the business. And if we are able to find attacks that no other solution finds and generate fewer false positives than other solutions, then we'll win the deal and our revenue will increase, and if we're not, then we won't win the deal. This is a very simple and exciting space to be in as a machine learning engineer, because it's relatively rare that you get to build technology that is placed immediately into such a clear-cut competitive environment, where you are immediately tested not only against adversaries but also against other solutions attempting to do the exact same thing that you're doing. You see very, very quickly how good your system is, and you can measure that immediately in terms of the dollar value that businesses will pay to remove their current solution and replace it with Abnormal Security. And so this serves as a strong rallying point and motivation function for the detection team and for Abnormal Security as a whole. Nice, that is a really cool process, and
probably, no, definitely the kind of process you described there is something that's easy for me to imagine for my business, and probably a lot of other people could imagine for theirs. Yeah, comparing false positive and false negative rates against your competitors; probably a lot of clients or prospective clients would be able to estimate, you know, how much each false positive costs them, how much each false negative costs them, and just be able to determine: okay, going with this product at this price point, I'm going to save this much overall, and that's the best one to go with.
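The back-of-the-envelope comparison John describes here can be sketched in a few lines of Python; the vendor names, rates, and per-error costs below are entirely hypothetical, purely for illustration:

```python
# Hypothetical evaluation: pick the email-security product with the
# lowest total expected annual cost (subscription + cost of its errors).
# All numbers below are invented for illustration.

def expected_cost(price, fp_rate, fn_rate, volume,
                  cost_per_fp=5.0, cost_per_fn=500.0):
    """Annual cost = subscription price + triage cost of false positives
    + damage cost of missed attacks (false negatives)."""
    return price + volume * (fp_rate * cost_per_fp + fn_rate * cost_per_fn)

products = {
    "Vendor A": {"price": 50_000, "fp_rate": 0.002, "fn_rate": 0.0005},
    "Vendor B": {"price": 30_000, "fp_rate": 0.010, "fn_rate": 0.0020},
}

volume = 1_000_000  # emails scanned per year
costs = {name: expected_cost(p["price"], p["fp_rate"], p["fn_rate"], volume)
         for name, p in products.items()}
best = min(costs, key=costs.get)
print(best, costs)
```

Under these made-up numbers the cheaper subscription loses, because its higher miss rate dominates the total cost; that's exactly the trade-off a prospective customer would weigh.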
and that's the best one to go with yep okay so tying back to your previous they're your
most recent Super Data Science episode number 630 where we were talking about resilient machine
learning so maybe you could quickly kind of recap what that resilience means like basically
this idea of having a robust machine learning system how is that particularly important in cyber
security so resilient machine learning means as you say a robust machine learning system and
specifically building your engine so that it is unlikely to fail catastrophically. There will always be problems that you face. Sometimes these problems are acute problems, where a single system goes down: perhaps there's an outage in a data service, someone pushes bad code, some type of data gets deleted accidentally. Sometimes it's changes in the underlying data distribution on your side: perhaps you onboard a new customer in a new industry that you've never seen before, or you have some kind of change in the way that you're categorizing the events that you're seeing, such that it changes the underlying data that powers your features, your aggregates, for instance. And sometimes it's adversarial; in cybersecurity this third category is constant. Attackers are changing what they're doing every week and every day, explicitly in order to fight against the system that you're building. And so there are a lot of strategies that you can apply to build this kind of failure resilience into your machine learning systems, to make it so that when things change, your system doesn't break with them. This includes the data distribution shift that is normally thought of as a core problem within all machine learning systems: you train on one set of data, you launch, now there's a new set of data, and you have to deal with that. So that's one part of it, but it also incorporates things like feature dropout, where certain features or signals that you rely on are not available in certain environments or circumstances, and your system needs to be able to operate even when these kinds of outages occur; you still need to be able to provide protection for your customers. Nice, and so that
makes a lot of sense in cybersecurity, but outside of cybersecurity, why might our listeners be interested in the concept of resilient machine learning, related to whatever kind of data science modeling or software engineering they do, or the kinds of systems that they architect? So, building systems that are resilient to changes in your customer distribution is a constant issue that every data scientist faces, especially at a growing business. When you have your initial set of customers, and there's an initial behavior present in the kinds of data that you're seeing, you want to build your systems so that when that distribution changes, when new customers are onboarded, you are able to quickly adapt to these new distributions. There are two main principles that you can apply toward having this kind of quick adaptation to new customers. One is fast retraining: maybe you build a machine learning model, you train it on your data, then you have new data coming in; if you have assembled a concrete, coherent data pipeline and data labeling pipeline, then you'll be able to retrain your model, and sometimes you can even automate the retraining process, depending on the nature of your data and the environment you're operating in. Another approach that we lean into
(we lean into both of these approaches at Abnormal Security), but one more approach that I think is very under-discussed for quick adaptation, and has a lot of applicability in this kind of case, is to utilize features that represent the data distribution itself. To make this clear: rather than have a categorical feature to represent something like a user, i.e. this user's ID, represented as a single value that goes into a one-hot lookup, where you're essentially expecting the model to memorize this user, and if this user changes their behavior in the future you need to retrain the model to update it, an alternative is to utilize features like: what is the number of accounts this user has followed in the last ten days (to give a Twitter example), or what are the topics of tweets this user has liked in the last seven days. These are features, but they're features that represent current data; they represent the recent past. At Abnormal, these are the aggregate features we were talking about earlier: how many emails has this person received from this kind of account at this time of day in the past. This is a feature representation of the current information. So in the case where you have one customer that you're building a model on, and then another customer gets onboarded, even if that second customer has a very different distribution (maybe the first customer only had a ten-person customer service team and this new customer has a 500-person customer service team), if you've represented what it means to be a customer service agent in terms of these kinds of signals, like how frequently this person receives emails from the outside, as opposed to memorizing who these individual people are, or even memorizing a categorical signal like is-customer-service, then you'll be able to better adapt to these kinds of new circumstances, because your features themselves will be modified and will adapt to the new distribution. Deploying machine learning models into production
doesn't need to require hours of engineering effort or complex homegrown solutions
in fact data scientists may now not need engineering help at all with model bit you deploy
ML models into production with one line of code simply call modelbit.deploy in your notebook
and model bit will deploy your model with all its dependencies to production in as little as 10
seconds models can then be called as a rest endpoint in your product or from your warehouse as
a SQL function. Very cool. Try it for free today at modelbit.com, that's m-o-d-e-l-b-i-t.com.
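Dan's distribution-representing features, i.e. sliding-window aggregates keyed on an entity rather than a memorized ID, can be sketched roughly like this; the key structure and window length here are invented for illustration:

```python
from collections import defaultdict, deque
import time

class RollingAggregator:
    """Tracks per-key event counts over a sliding time window, so a model
    consumes 'how many emails did this recipient receive from this kind of
    sender in the last N days' instead of a memorized user ID."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> queue of event timestamps

    def record(self, key, ts=None):
        self.events[key].append(ts if ts is not None else time.time())

    def count(self, key, now=None):
        now = now if now is not None else time.time()
        q = self.events[key]
        while q and q[0] < now - self.window:  # evict expired events
            q.popleft()
        return len(q)

# Hypothetical feature: "emails from external senders to this recipient,
# last 7 days" — the tuple key is an invented example.
agg = RollingAggregator(window_seconds=7 * 24 * 3600)
agg.record(("alice@corp.com", "external"), ts=100.0)
agg.record(("alice@corp.com", "external"), ts=200.0)
feature = agg.count(("alice@corp.com", "external"), now=300.0)
print(feature)  # 2
```

The point of the design is that when a new customer with a very different distribution is onboarded, the feature values shift automatically with the new traffic; no retraining is required just to represent the new entities.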
Nice, nicely said; lots of practical tips there for any of our listeners on resilient ML. When did you start getting into this? Is this something that you started getting into back at Twitter? It's not related to your PhD stuff directly, is it? It was pretty important at Twitter. At Twitter, one of the core issues that I faced within the revenue science organization, which was the organization that operated the machine learning models for ad serving, was fast performance for a new ad campaign. Customers would launch ad campaigns that have a number of different creatives, line items that combine a number of different creatives, and want to target a particular audience, and what we needed to do was very quickly identify the types of users for whom these ads would be most poignant. We don't have substantial categorical information about the ads themselves, and even for the users themselves, their ad interaction behavior can change quite quickly: if they previously were in a situation where they weren't getting any ads they were interested in, and now suddenly, say, they're really into sports, and sports betting advertising has suddenly been legalized, and now we can show sports betting ads, for instance, that changes their behavior. So being able to represent the most recent picture of behavior at each of these categorical signals was very, very critical toward out-of-the-box performance, toward being able to give that kind of quick turnover. Advertisers would generally tolerate worse performance for a couple of days, but not for a couple of weeks, after beginning a campaign, and so you need that fast adaptation; you can't really rely on training a model that's going to have the capacity to capture that, at that kind of scale, that quickly. Nice, yeah, that makes a lot of sense. So
on the note of reaction times and speed, another obviously super critical thing, whether it was Twitter before or Abnormal Security now, is the real-time nature of processing. I mean, it's super critical in both situations; it's hard to say it's more important in one or the other. Obviously on a social media platform people are expecting news in real time, for example; they're expecting updates from people they're following in real time. But with cybersecurity, arguably there's a bigger danger to not being real-time. So, obviously it's super important in cybersecurity to have real-time processing as well; are you able to go into any particular kinds of infrastructure or technologies or techniques that you employ to handle massive traffic in real time? Yeah, so our most
direct approach toward real-time information is aggregates. For model retraining, we utilize Airflow to instrument retraining on a weekly basis, because with retraining we're trying to take advantage more of customer shifts than attacker shifts; attacker shifts can happen much faster than that, and so we utilize our aggregate engine for identifying and adapting to attacker shifts. This operates at the IOC level: when we miss an attack, or if we see a particular IOC within a net new attack that we've caught, being able to ensure that we catch everything else that has that same IOC. So basically we utilize a combination of abnormality to catch the first attack and then the IOC to catch everything else that looks similar to it; we need to very quickly identify, okay, this signal is now something that we've seen in a malicious message, and we need to distribute this out everywhere else. To make this very concrete, I'll give the situation and then talk about the technology. Say the attacker has purchased a domain, and they're now utilizing that domain to send out messages that include a malicious link with that domain in there. Maybe they send out a hundred messages that all include this domain, and maybe we were able to identify that some set of these messages are malicious, by looking at the differences between the way each message was sent and the kinds of messages that the recipient normally receives. But maybe we don't do that for every one of these hundred messages; maybe ten of these messages hit people who receive a lot of messages that look really sketchy but are totally normal, and because of that we're not able to spot, for those ten people, that this message was bad. But we have seen on the other ninety that it was bad, because those were sent to people who mainly receive normal messages. So now we have this new piece of information, which is that this domain is bad, and we have messages that we wouldn't have been able to identify as bad without this piece of information now at risk of hitting those users. This is a case where we need to react very, very quickly to pull those messages and stop them from doing damage, because we've identified that this indicator of compromise is bad by leveraging this information, and now we need to act on it. And so we utilize a Redis-based key-value store to track these types of indicators of compromise; we stratify based on every kind of decision that our system makes and track each of the different types of indicators of compromise you could extract from messages or sign-ins in this system, and we utilize a triggering replay system: based on a last-N aggregate within Redis, when any of these individual counts gets triggered, we then submit from the last-N Redis aggregates back to our core reprocessing system. Very cool, very cool example.
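A toy sketch of the IOC tracking and replay idea Dan describes: in production this would sit on a Redis key-value store (e.g. atomic INCR on per-indicator counters), but here a plain dict stands in, and the class name, method names, and threshold are all hypothetical:

```python
from collections import defaultdict

class IOCTracker:
    """Counts malicious verdicts per indicator of compromise (IOC).
    When an indicator's count crosses a threshold, every previously
    delivered message containing that indicator is queued for replay
    through the detection engine. Dict stands in for Redis here."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.counts = defaultdict(int)   # ioc -> malicious verdicts (Redis INCR)
        self.seen = defaultdict(list)    # ioc -> delivered message ids
        self.replay_queue = []           # messages to re-score

    def deliver(self, msg_id, iocs):
        # Message passed the first-pass model; remember its indicators.
        for ioc in iocs:
            self.seen[ioc].append(msg_id)

    def flag_malicious(self, iocs):
        # An attack was caught; bump each indicator's counter, and if one
        # crosses the threshold, replay every delivered message sharing it.
        for ioc in iocs:
            self.counts[ioc] += 1
            if self.counts[ioc] == self.threshold:
                self.replay_queue.extend(self.seen[ioc])

tracker = IOCTracker(threshold=2)
tracker.deliver("msg-1", ["evil-domain.test"])
tracker.deliver("msg-2", ["evil-domain.test"])
tracker.flag_malicious(["evil-domain.test"])
tracker.flag_malicious(["evil-domain.test"])
print(tracker.replay_queue)  # ['msg-1', 'msg-2']
```

This mirrors the scenario in the conversation: the ninety caught messages push the domain's counter over the threshold, and the ten messages that slipped through on first pass get pulled back for reprocessing.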
You said a term in there which maybe you did define and I just missed, but, "IOC"? Yes, so, indicator of compromise. Oh yeah, you talked about that earlier in the episode, but I wasn't used to it as an acronym yet. Nice. Yes, it's an acronym that I'd never heard of before going into the security world, but it's constantly bandied about. It really just means anything that could indicate that something is bad, and generally it's referring to IP addresses and domains and email addresses and file hashes and things like that, but there are a lot of other things that it could refer to as well. Nice. And so, you talked about this a little bit
earlier, but maybe we can dig into it a bit more. When we talk about real-time processing, you've now covered that things like this Redis key-value store allow you to do that efficiently, and in previous answers you talked about resilient machine learning being adaptable. In practice, how does that mean you need to be updating your models? Is there a routine to updating machine learning models, or is it event-driven? How does that work? So, we've built an auto-retraining framework that enables us to
retrain our models on a regular cadence. We maintain a large number of different machine learning models, which we retrain on different cadences; our auto-retraining pipeline covers our core models, our most important models, that we hook up into it, and it's a series of different steps to do auto-retraining. First we collect all of the data the model is going to utilize; we need to process it and extract features from that data; we need to actually run the training process; and then the most important stage is evaluation: we need to identify that if we take the model that's currently deployed, turn it off, and turn this new one on, we're not going to suddenly flood a bunch of customers with false positives, we're not going to stop catching attacks that we're currently catching, and we're not going to dramatically increase our cost or latency or anything else. So we have a large suite of tests that run simulations with this new model in place of the old model. This is a pretty heavy, expensive process, which is why we don't set it up for every single model we deploy, only our most important, critical models. For faster adaptation, we primarily rely on aggregates for capturing changes in data distribution, and we utilize auto-retraining as a way to re-adapt as customer distributions shift over time and take on new signals.
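The promotion gate at the end of that pipeline, i.e. only swapping in a retrained model if simulations show it won't flood customers with false positives or stop catching current attacks, might look roughly like this; the metric names and thresholds are invented:

```python
def should_promote(incumbent, candidate,
                   max_fp_increase=0.10, max_recall_drop=0.01):
    """Promote the retrained candidate only if simulation metrics show it
    keeps false positives near the incumbent's level and doesn't lose
    meaningful recall. Thresholds here are purely illustrative."""
    fp_ok = candidate["fp_rate"] <= incumbent["fp_rate"] * (1 + max_fp_increase)
    recall_ok = candidate["recall"] >= incumbent["recall"] - max_recall_drop
    return fp_ok and recall_ok

# Hypothetical simulation results for the deployed model and two candidates.
incumbent = {"fp_rate": 0.0010, "recall": 0.985}
good_candidate = {"fp_rate": 0.0009, "recall": 0.990}
bad_candidate = {"fp_rate": 0.0030, "recall": 0.970}

print(should_promote(incumbent, good_candidate))  # True
print(should_promote(incumbent, bad_candidate))   # False
```

A real gate would also compare cost and latency, as Dan mentions, but the shape is the same: the candidate must beat or match the incumbent on every guarded metric before it replaces it.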
One thing that's relatively interesting about Abnormal's process is that we are constantly adding new signals; we're constantly identifying what's a new kind of aggregate to build, what's a new kind of data source to subscribe to, to be able to understand more about the indicators of compromise within emails or sign-in events, and what are new ways that we can transform, apply natural language processing, apply clustering techniques, to better understand each piece of data we process. Each one of these signals is something that could be useful in a model retraining, and we set up our auto-retraining process so that it automatically consumes certain kinds of signals that the team adds. So we're able to operate in a mechanism where one group is building new signals and immediately setting up heuristics around those signals to utilize as heuristic-style models, and the auto-retraining process picks up these signals automatically into the models we regularly retrain. In this way we are able to most efficiently have this feedback loop between the very hands-on work to optimize a signal, so that the signal is powerful enough to work in a heuristic, and that signal then being incorporated into our next automated retraining for our core machine learning. Nice, so in the software engineering
world there is a term, CICD, continuous integration/continuous deployment, that is, you know, a very common practice these days; so for the analog of what you're describing, could we call that CICT, continuous integration/continuous training, for a lot of these core models that are in your auto-retraining framework? So, to be honest, I would say no; I generally think of continuous training
as being a somewhat separate thing, where you're really looking at less than a 10-to-20-minute difference between when a sample shows up and when the weight update has been applied to the model that's deployed in production. At Twitter we had several systems that utilized this framework, where we did have what you would call CICT: we had models deployed where the time between when a person clicked on an ad, or chose not to click on an ad, and when that fact had been propagated into a feature update or a backpropagation gradient step for the model that serves ads was less than 20 minutes. At Abnormal it's a substantially longer period of time because of our auto-retraining, but there is a very fast turnaround toward that information being incorporated; it goes through the aggregate signals, through the fact that, after this message is sent, we'll extract all these signals from it and update the aggregates, so the features, and the next prediction, are different. So you can think of it, if you blur your eyes and take a step back, as all sort of the same thing, whether you're applying the update to the features or applying the update to the weights of the model; but at Abnormal our only real-time updates are being applied to features, whereas I think of continuous training as referring to real-time updates being applied to the weights.
Yeah, yeah, so in CICT, like you were doing at Twitter, you're talking about some actual training of the model weights, like a backpropagation step, whereas the kind of retraining you're doing with your auto-retraining framework is more holistic; it's going all the way back to feature creation and aggregation, and then you can take advantage of the kinds of cross terms that you were describing way back earlier in the episode being recreated afresh. So it's a more comprehensive retraining; it's not just, you know, one step of backprop. Yeah, that's right. Cool, so I don't know how much you can get into this kind of thing, but I can at least ask: are you able to give examples of instances where a cybersecurity system would miss a threat or identify a false positive, and then required you as a human, or your team, to come in and make some changes to address that kind of miss?
So yeah, I can give one example; I'll talk about it a little bit vaguely. There's a type of pattern that we observe in cybersecurity sometimes referred to as the Nigerian prince scam, which is essentially a type of scam where you begin by saying, I want to give you money in some way or another, and when the person engages, they trick them into giving bank account details. Sometimes this is considered to be a less harmful scam, because you're just trying to steal money, you're not necessarily trying to steal credentials that would allow you to advance further into the business, but many, many things that begin with "I want to give you money" may end up with malware, credentials, bank account information, many very valuable things being stolen. So this is a very important type of attack that we need to defend against. However, there was one case where we integrated with a church, and this church received a lot of messages from people in other countries saying something along the lines of, hi, here's a donation of $10,000, of $5,000, I want to give this money to you. These kinds of messages had many of the same attributes you would see in Nigerian prince scams: they were sent from previously unknown senders, from shady parts of the world or shady infrastructure, offering money to the recipients. So this was a clear-cut case of false positives going crazy and being totally unmanageable from the perspective of this customer's security team. For the strategy that we apply to this, we have a few different types of approaches, but the most scalable, best approach is to start by trying to figure out what it is about these messages that makes our models flag them, extract that as a signal, build an aggregate for that signal keyed on the user or keyed on the recipient, so, like, how frequently does this recipient receive messages that have this signal in them, and then retrain the models with that signal. Going through this process enables the models to stop flagging these kinds of messages, because now you've extracted away what is suspicious and taught the model that this type of suspiciousness is not something to block this message for, for this user. So this multi-stage process, extract signal, build aggregates, retrain model with aggregates, is the recipe that we utilize when handling issues like this one. Sweet, yeah, thank you for being
able to get into that example. Maybe I'm just going off on a tangent here, but one of the things that I remember about the Nigerian prince scams, which I don't see as much anymore, but back when I did used to get them, is that, you know, there were lots of spelling mistakes, the grammar was poor, and so I was like, man, these are so bad, how do these ever work? But then I later learned that them being bad is a feature, not a bug, because you're actually trying to find the most gullible people, and the most gullible people will fall for an email that looks terrible.
There are a lot of different attacker philosophies on how to approach this, and certainly there are scam emails structured in a way such that they filter out people who won't end up falling for them, and therefore save the attacker time in the escalation stages, because what will happen is you'll have to talk to the attacker, and they have to spend time and effort. When it all comes back to it, most attacker behaviors can be explained by thinking about this from a simple cost-benefit analysis from the attacker's perspective: they want to maximize the number of dollars they get for every minute they spend operating something, and time they need to spend on the phone with you is time they only want to spend if they think they have a decent chance of convincing you to give them your money. Yeah, so cool that we could go into a specific example like that in a bit more detail.
So one big thing that's changed for you since we did a full-length episode is that you are now Dr. Dan Shiebler. You finished your PhD work at Oxford University, and during that time you were looking into applications of category theory to machine learning, which you defined for us back in episode 451, when we had that most recent full-length episode, several years ago now. My memory from then is that category theory had a lot of applications to clustering in particular, and from everything that you've been saying so far, it seems to me like clustering could be something that's very useful for identifying cybersecurity risks, because there are going to be particular kinds of features, like in the Nigerian prince scam, where, you know, poor grammar and spelling could be a feature that helps a clustering model identify Nigerian prince emails as opposed to emails that are not Nigerian prince emails, and maybe you could even be using clustering to identify new kinds of threats that you haven't seen before, you know, just like, oh, this is an interesting cluster over here, it seems to correlate with this kind of attack. So where I'm getting at with my question is, first of all, congrats on the PhD, and is there any way that category theory applies to the kind of work that you're doing now at Abnormal Security? On a high level, yes, in that there's a great benefit toward being able to
look at the kinds of problems that we face through the template of how they fit into general categories of problems, and then identify what's worked for other types of problems that share these characteristics with cybersecurity. A lot of the challenges we face, as you described, are in the realm of clustering, and one thing that I studied a great deal in my PhD was the relationship between clustering and manifold learning. Manifold learning is embeddings and things like that: vectors and vector databases and embeddings, the types of ways that you can represent some kind of entity as a dense vector related to other entities, query for them, group them together, and understand their behavior and characteristics in this lower-dimensional form; these are all general characteristics that apply to a number of applications. Building a manifold on which you would project your entities basically means building embeddings for the data that you're working with, which in the case we're operating in is things like employees, IPs, domains, links, attachments, devices, vendors, companies; these are the sort of core nouns that we reason about, each of which are things for which we derive embeddings and group them together. When you identify a new domain, you want to understand what other domains have similar characteristics to this one, so we can go through the process of deriving an embedding for it and then feeding it into a model that knows how to process embeddings of domains, or identifying how the structure of this domain enables us to cluster it in a group with other domains. The derivation of these kinds of strategies, and how you would utilize them to build this kind of approach, is something I would say benefited me a great deal as we set out our strategy. Very cool, yeah, it's nice that that academic
stuff can actually be useful in practice, and yeah, amazing to me that you did a PhD while working full-time in really challenging roles, first at Twitter and then at Abnormal Security; it's an amazing accomplishment. I really felt like I had my plate full doing my Oxford PhD all on its own, and to some extent, now having been in industry for over a decade post-PhD, I'd love to be able to have the space to do a PhD full-time; I think I'd really relish that a lot more than I did when I was much younger, because you see all these real-world applications now, and there are so many questions that I have that I'd love to have an infinite amount of time to dig into. Well, we dug up in our research on you that you do still manage to find some time for some other things. So, for example, from your
about me page, it looks like you have an interest in both math and history podcasts, which is kind of interesting; it leads me to some more open-ended questions. Also, something that I know about you, Dan, is that you're really big into fitness. Even right now: if you're watching the video version of this, there's almost no way to tell; the only thing that kind of gives it away is that the camera seems to be on a slightly shaky surface, and the reason why is that Dan is on a treadmill right now. Before we hit the record button, Dan was actually walking as he was talking to me, as we were catching up and getting set up here; you walk five or six miles a day, it sounds like, during a full work day, which is a cool hack. But I also know that you live in New York, like I do, and you like riding around on your bike. So this is all related to the math and history podcasts, and that kind of reflection that you do, thinking about time passing, particularly with the history podcasts: what do you think about how AI might impact things like urban planning or transportation, particularly maybe in the context of climate change? I'm curious if you have any interesting thoughts on how AI and machine learning might transform our urban world over the coming decades. It's a good question, and to be honest, I'm not
tremendously optimistic. I'm very optimistic about a lot of things, but our urban world, I think, is an area where, many times, the largest problems we face are due to people problems rather than technical problems, and I think that AI is a powerful tool for solving technical problems and a less powerful tool for solving people problems. So, one example: I had a professor back when I was at Brown, Philip Klein, who is an incredibly brilliant guy, and one problem that I remember he was working on was applying graph coloring algorithms to gerrymandering problems, so identifying how you would most equitably assign voting districts to a particular region based on where people live and populations. And I remember thinking, this will never happen or ever come into play, because nobody has an incentive to make things the most equitable way; the incentives are to try to benefit whichever policy you're trying to get passed or person you're trying to put into power, and that's always how the decisions will be made for these kinds of things. Perhaps that's a very cynical perspective on this particular area, but I think the human angle on things like city construction is perhaps too dominant for technical approaches to really have the same kind of transformative power that they'll have in other areas, at least in the sort of immediate future. Okay, yeah, that's a good answer. I still, like, I want,
maybe there are tangential ways. I guess, to the extent that AI could be helpful in keeping the plasma contained in a nuclear fusion reaction, we could have a lot more abundant energy. But yeah, in terms of actual urban planning...

There are tons of applications in energy, and even for global warming, things like simulations are possible. Chemical development is something that's tremendously enabled by simulations, as well as all sorts of different areas in engineering and manufacturing. There are many things in the world of atoms where advances in machine learning and AI technology have already shown tremendous progress, and they'll continue to do so. But there's always a tension between what is possible, what technology makes cheap and efficient and effective, and the incentives and structures of the society in which we need to operate.
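The kind of algorithmic redistricting Dan mentions can be framed as a graph-partitioning problem: treat census blocks as nodes of a graph, weight each node by its population, and split the graph into contiguous districts of roughly equal population. The greedy BFS sketch below is purely illustrative, not Philip Klein's actual algorithm, and all names in it are hypothetical.

```python
from collections import deque

def partition_districts(adj, population, k):
    """Greedily grow k contiguous districts of roughly equal population.

    adj: dict mapping node -> list of neighboring nodes (e.g. census blocks)
    population: dict mapping node -> population count
    Returns: dict mapping node -> district index (0..k-1)
    """
    target = sum(population.values()) / k  # ideal population per district
    unassigned = set(adj)
    assignment = {}
    for district in range(k):
        if not unassigned:
            break
        # Seed each district at an arbitrary unassigned node, then BFS outward
        frontier = deque([next(iter(unassigned))])
        pop = 0
        while frontier and (pop < target or district == k - 1):
            node = frontier.popleft()
            if node not in unassigned:
                continue  # already claimed via another frontier entry
            unassigned.discard(node)
            assignment[node] = district
            pop += population[node]
            frontier.extend(n for n in adj[node] if n in unassigned)
    # Any leftover pockets cut off from their seed join the last district
    for node in unassigned:
        assignment[node] = k - 1
    return assignment

# Toy example: a 4x4 grid of equal-population blocks split into 2 districts
nodes = {(r, c) for r in range(4) for c in range(4)}
adj = {n: [(n[0] + dr, n[1] + dc)
           for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
           if (n[0] + dr, n[1] + dc) in nodes]
       for n in nodes}
population = {n: 100 for n in nodes}

assignment = partition_districts(adj, population, 2)
sizes = {d: sum(population[n] for n, i in assignment.items() if i == d)
         for d in range(2)}
print(sizes)  # → {0: 800, 1: 800}
```

Real redistricting research adds many constraints this toy ignores (compactness, legal requirements, competitiveness), which is exactly why, as Dan notes, the hard part is the incentives, not the algorithm.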
Great answer. So, as we start to wrap up here, it's clear to me, and probably to a lot of our listeners, that you have a tremendous breadth of knowledge. Do you have any particular tips on how we can stay updated and ensure we're continuously learning, both inside our field of data science and AI, and maybe outside of it as well?

I think trying things out and exploring new technology when it becomes available: just opening up little projects and challenging yourself to build something. There's really nothing that lets you learn about something better than trying to build it. There's something about putting yourself in a situation where you need to demonstrate the knowledge you've acquired that lets you understand it at a much deeper level. So I like looking at various GitHub repos, cloning them, making small changes, and building little toys for different applications when I want to explore a new kind of technology, and I find that's really the best way to challenge yourself and to grow into new areas technically.

Yeah, great answer, and definitely one that
I agree with as well. That's always the thing: just reading a book, especially a technical one? Reading for pleasure, okay, that's one thing, but when it's about learning some new machine learning approach, I definitely prefer thinking, okay, I want to learn this cool thing, what's something I can do with it? And it could be even as simple as finding someone else's Jupyter notebook, where I can just use that notebook and, like you're saying, make small changes, upload my own dataset or something, and just see how things go.

Yep, yep.

All right, so
Dan, you probably remember from your previous appearances on this show that before I let you go, we like to ask for a book recommendation. You got anything new for us?

I have recently read a few history books. Barbarians, Marauders, and Infidels, I want to say, is the name of the book. Yeah, Barbarians, Marauders, and Infidels, an incredible book on medieval warfare. It covers a number of different battles and the locations where they happened, and it covers the broad themes of how warfare changed from the fall of the Roman Empire to the fall of Constantinople: the introduction of the Magyars, the Vikings, and the Arabs as three different groups that dramatically changed the landscape of the areas they operated in, the different weaponry, the rise of artillery, the rises and falls of different kinds of projectile weaponry, the different roles of the horse and the boat. Really just a fascinating survey of a fascinating and complex time in history, and what it says about the people who lived then and how their lives are similar to and different from those of people today.
And it ties together a lot of your interests there: you've got security, you've got history. Nice, that sounds like a great recommendation, and it's amazing that you can offer such a detailed account of what's covered in the book on a whim. Thank you very much for that suggestion, Dan, and thank you very much for a wonderful episode. Maybe we can check in again in a few years once more on how things are going. With your very articulate way of speaking about such technical concepts, no doubt our audience will be craving that again.

Sounds good. Thanks, Jon. I'm really glad to be here today.

Oh, and before I let you go, in the meantime, between now and that inevitable Super Data Science episode, maybe episode 1,000 or 900 or something, if people want to be following your thoughts before that episode, what's the best way for them to do that?

Probably my Twitter or my LinkedIn, I would say. On Twitter, it's just my first initial, D, and then my last name.

Nice, we'll be sure to include that in the show notes. And yes, now I really will let you go. Thank you very much for being on the show, and we'll catch up with you again soon.

Thanks, Jon.
Such an impressive, confident speaker. Always awesome to catch up with Dan; I hope you enjoyed the conversation. In today's episode, Dan filled us in on the heuristics, intermediate ML models, and large language models that they develop at Abnormal Security to identify cybersecurity risks in messages. He talked about how false negatives are individually the biggest classification error to avoid in cybersecurity, but false positives accumulate to create a dangerous boy-who-cried-wolf situation as well. He talked about how Redis key-value stores and an auto-retraining framework allow for efficient on-the-fly model updates, how the clustering associated with category theory is useful in real-world applications, and how AI is great at solving technical problems but not always human problems, like those associated with urban planning and politics.
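As a rough illustration of the on-the-fly update pattern mentioned above, a detection service can keep frequently changing per-entity features, such as per-sender reputation counters, in a Redis-style key-value store and read them at scoring time, so model inputs stay fresh between retrains. The sketch below is an assumption about how such a system might look, not Abnormal Security's actual architecture; it uses a tiny in-memory stand-in for Redis so it runs without a server, though the `get`/`set`/`incr` method names mirror the real `redis-py` client.

```python
class FakeRedis:
    """Minimal in-memory stand-in for a Redis client (method names match redis-py)."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

    def incr(self, key):
        self._store[key] = int(self._store.get(key, 0)) + 1
        return self._store[key]

r = FakeRedis()  # with a real server this would be redis.Redis(host="localhost")

def record_feedback(sender, was_attack):
    """Update per-sender counters the moment a message is labeled."""
    r.incr(f"sender:{sender}:total")
    if was_attack:
        r.incr(f"sender:{sender}:attacks")

def attack_rate(sender):
    """Feature read at scoring time: fraction of this sender's mail labeled as attacks."""
    total = r.get(f"sender:{sender}:total")
    if not total:
        return 0.0  # unseen sender: neutral prior
    attacks = r.get(f"sender:{sender}:attacks") or 0
    return int(attacks) / int(total)

# Labels stream in; the feature updates immediately, with no retrain needed
for label in [True, True, False, True]:
    record_feedback("mallory@example.com", label)

print(attack_rate("mallory@example.com"))  # → 0.75
print(attack_rate("alice@example.com"))    # → 0.0
```

The hypothetical `sender:{address}:{counter}` key scheme and the `attack_rate` feature are illustrative; the point is that the key-value store lets features change between model retrains, while the slower auto-retraining loop periodically refits the model itself.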
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Dan's social media profiles, as well as my own, at superdatascience.com/717. Beyond social media, another way we can interact is coming up on November 8th, when I'll be hosting a virtual half-day conference on building commercially successful LLM applications. It'll be interactive and practical, and it'll feature some of the most influential people in the large language model space as speakers, including some that have been on the show. It'll be live on the O'Reilly platform, which many employers and universities provide free access to; otherwise, you can grab a free 30-day trial of O'Reilly using our special code SDSPOD23. We've got a link to that code ready for you in the show notes.
All right, thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you, and thanks, of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another absorbing episode for us today. You can support this show by checking out our sponsors' links, by sharing, by reviewing, by subscribing, but most of all, just keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, my friend, keep on rockin' it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.