689: Observing LLMs in Production to Automatically Catch Issues
This is episode number 689 with Amber Robertson and Xander Song of Arize AI.
Today's episode is brought to you by Posit, the open-source data science company,
by AWS Cloud Computing Services, and by Anaconda, the world's most popular Python distribution.
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry.
Each week, we bring you inspiring people and ideas to help you build a successful career
in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the
complex simple. Welcome back to the Super Data Science Podcast today. I'm joined by not one,
but two guests, Amber Robertson and Xander Song. Both are fabulous and both work at Arize, a machine learning observability platform that has raised over $60 million in venture capital. And just so you hear it, Arize is spelled maybe not how you'd expect: it's A-R-I-Z-E, with a 'zed', or 'zee' for the Americans out there. So let me introduce both Amber and Xander. Amber serves as
an ML growth lead at Arize, where she has also been an ML engineer. Prior to Arize, she worked as an AI/ML product manager at Splunk and as the head of AI at Insight Data Science. She holds a master's in astrophysics from the Universidad de Chile in South America.
Xander serves as a developer advocate at Arize, specializing in their open-source projects. Prior to Arize, he spent three years as an ML engineer. He holds a bachelor's in mathematics from UC Santa Barbara, as well as a BA in philosophy from UC Berkeley. Today's episode will
appeal primarily to technical folks like data scientists and ML engineers, but we made an
effort to break down technical concepts so that it's accessible to anyone who'd like to understand
the major issues that AI systems can develop once they're in production, as well as how to overcome
these issues. In this episode, Amber and Xander detail the kinds of drift that can adversely
impact a production AI system with a particular focus on the issues that can affect large language
models, also known as LLMs. They talk about what ML observability is and how it builds upon
ML monitoring to automate the discovery and resolution of production AI issues. They talk about
open-source ML observability options, how frequently production models should be retrained,
and how ML observability relates to discovering model biases against particular demographic groups.
All right, you ready for this important and exceptionally practical episode? Let's go.
Amber and Xander, two guests, double the fun, no doubt, where are both of you calling in from?
So I'm calling in from the Miami area in Florida.
Nice. I'm calling in from the San Francisco Bay area in California.
Oh, nice. All right. Well, great to have you both on the show. I met Amber in person,
a couple of weeks ago at the time of filming. I was at ODSC East. I did a couple of half-day
trainings there. I did a half-day training on deep learning, like an intro to deep learning with
PyTorch and TensorFlow. And then I also had a half-day training that was a huge amount of fun
for me to deliver. A huge amount of work for me to deliver because I hadn't done a talk on this
content before, but it was on large language models. So natural language processing with LLMs,
and specifically focusing on how we can be using both the commercial APIs, like OpenAI's GPT-4,
as well as open-source models through Hugging Face, and then taking advantage of these kinds of tools together with PyTorch Lightning to efficiently train in order to have
your own proprietary LLMs that are very powerful and kind of GPT-4 level quality.
So I was talking about that stuff, and I lamented over lunch to Amber that I wish I had seen her
presentation the day before, or at any time before I was presenting, because I had a slide on issues
related to deploying LLMs. And I listed a whole bunch of problems that can happen in production,
and my final bullet was there are various ML observability platforms out there.
And I didn't list any because I didn't have any like particular one that I thought was like
really good. I was like, I just said to the audience, go ahead and Google it. Oh, and by the way,
all of that talk should now be live on YouTube by the time that you hear this listener.
It isn't at the time of recording, but I'm pretty confident it will be published on YouTube if you want to check it out. So you can hear me lament about that. And yeah, so Amber did a
talk specifically on model observability, machine learning observability with LLMs.
Yes. Yeah, that conference was a lot of fun. But yeah, I did a talk on kind of LLMs in production. So with LLMs, where Arize focuses is the ML observability component.
So ML observability is the software that helps teams automatically monitor AI, understand how
to fix it when it's broken and ultimately learn how to resolve the issues teams are facing in
production. And LLM observability is a similar concept to machine learning observability,
but focused on these large language models. Right. So I guess there are unique challenges related
to LLMs in particular. Maybe we can dig into that in just a little bit. In the meantime,
yeah, you can let us know a bit more about why, whether it's an LLM or not that we have in production, a machine learning observability platform like Arize is useful to a data scientist, a machine learning engineer, or a business that's deploying a machine learning model. That's a great
question. And mostly it comes down to what you do after you develop this model: you know, it goes through all the tests in your notebook, it looks great, and then you put it in production. Because I think I polled the audience during my talk on how many people have
put a model in production on a Friday night and then come back on Monday. And you don't want to
be doing that because these models inevitably drift. So there's issues with the baseline data versus
the incoming data. There's performance degradation issues. These models are always going to change
over time. But it's important to know how much they're changing: is that affecting customers, is that affecting revenue? Are there fairness and bias issues thrown in there? Are there data
quality issues? So there's a lot that comes into play in that post-production workflow. And a lot of what we see comes from surveys: Arize does a lot of surveying of our customers and of ML engineers in the space.
And what we find is that 84% of teams say that it takes at least one week to detect and fix an
issue in production. So sometimes these issues could exist as long as six months. And it's
customers calling to complain like, hey, I'm not happy with the recommendations I'm receiving.
I don't like this plan. I don't like this, you know, this feedback that I'm getting.
That's really the last thing you want: to find out there's a problem because customers complain, or to find out because something's wrong with the revenue, something's wrong with the model.
So just making sure you have the orchestration in place to prevent all these issues
and to catch them early. It's that time to value that really makes teams want to purchase
an observability solution like Arize. Nice. That makes a lot of sense. So you talked about drift there. I know that in particular is something that is an issue with machine learning models
in production that happens all the time. So what is drift? Why should we monitor it? Another great question, John. And Arize does have two free courses; they're virtual certification courses that you can do at your own pace, and drift is covered in both of them. One is kind of an introduction to drift, and one is the advanced drift metrics that are used in production.
And I would say it comes in two categories. I'll let Xander tackle the second category. The first
category are the structured use cases that we see. A lot of times you're looking at distributions
and how distributions change over time. So you always have a reference or baseline distribution.
And then you have a production distribution. So if you have your baseline distribution,
let's say that's your training. And then your production data is going to be your current
distribution. You're looking to see how that data is changing over time. So you can use a
statistical method like PSI, which is the population stability index. You can use KL divergence.
You can use a chi-squared test. There's a lot of different statistical methods that you can use
to see how much one distribution is changing from another distribution. And so there's methods
around that. And then for drift, it's important to set monitors, because you might have, you know, certain features like user IDs. User IDs are going to drift. That's totally fine.
But if you have something like say a number of states, you want 50 states. If that's what's
used in your training model, and you know, we have seen teams that, you know, maybe they get,
oh, now there's 60 states coming in, and that's because we trained our model on uncapitalized
data. And now we're getting capitalized data coming in. So setting those monitors is going to be
key. Xander, do you want to talk about how we measure unstructured drift in production?
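To make the structured case concrete, here is a minimal sketch in Python of the kind of comparison Amber describes, computing KL divergence and PSI between a baseline histogram and a production histogram. This is just the standard formulas, not Arize's implementation, and the bin counts are made up.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-6):
    """KL divergence D(P || Q) between two binned distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()  # normalize counts to probabilities
    return float(np.sum(p * np.log(p / q)))

def psi(baseline, production, eps=1e-6):
    """Population stability index between baseline and production bins."""
    p = np.asarray(baseline, dtype=float) + eps
    q = np.asarray(production, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum((q - p) * np.log(q / p)))

# Example: counts per histogram bin for a single feature.
baseline_counts = [120, 340, 500, 410, 220, 90, 40, 15, 8, 2]
production_counts = [80, 250, 480, 460, 300, 150, 70, 30, 12, 5]

print("KL :", kl_divergence(baseline_counts, production_counts))
print("PSI:", psi(baseline_counts, production_counts))  # ~0.1-0.25 is often read as moderate drift
```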
Yeah, sure. So I can, I can talk about the unstructured case. When you're dealing with unstructured
data, what you're dealing with are embeddings, an embedding just being a vector representation of
a piece of data. So imagine it's a, you know, imagine you're dealing with like an image classification
problem. You've got this image classification model. It's taking, you know, photographs and it's
trying to tell you what's in the photograph. An embedding vector would just be this vector that's
basically encoding the image such that similar images end up nearby in the embedding space and dissimilar images far apart. So you would want to see images of the same class nearby and dissimilar images far apart.
And it's not just like, you know, what kind of, what is in the image? It's like these embedding
vectors actually contain a lot of different kinds of information. Information that you might not
expect, like, is the image grainy? Is it corrupted? Is the semantic content of the image changing?
It can be really subtle things, actually, that are difficult for human beings to detect.
And basically, what we do is we monitor the distribution of the embeddings,
training relative to production. So is that distribution of embeddings in production?
Is it different from the distribution of embeddings that we saw in training? And, you know,
if it's helpful, I can talk about exactly how we do that. But the basic idea would be,
are you getting basically new areas of your embedding space? Are you seeing new parts of the
embedding space light up in production that weren't represented in training? And for traditional
models where you actually, you know, a traditional CV model where you're actually training,
you know, you're training it on some data set, you could really expect that the model is not
going to perform well if it's doing inferences on data, the likes of which it never was trained on.
So that's in a nutshell what we're doing when we detect embedding drift for unstructured cases.
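One simple way to put a number on that embedding drift, sketched below in illustrative Python rather than Arize's actual method, is to measure the Euclidean distance between the centroid of the training embeddings and the centroid of the production embeddings.

```python
import numpy as np

def euclidean_centroid_distance(train_embeddings, prod_embeddings):
    """Distance between the mean (centroid) of two sets of embedding vectors.

    Both inputs have shape (n_examples, embedding_dim). A growing distance
    over time suggests production data is drifting into regions of the
    embedding space that were not represented in training.
    """
    train_centroid = np.mean(train_embeddings, axis=0)
    prod_centroid = np.mean(prod_embeddings, axis=0)
    return float(np.linalg.norm(train_centroid - prod_centroid))

# Toy example with random 768-dimensional embeddings standing in for real ones.
rng = np.random.default_rng(0)
train_emb = rng.normal(loc=0.0, scale=1.0, size=(1000, 768))
prod_emb = rng.normal(loc=0.3, scale=1.0, size=(1000, 768))  # deliberately shifted
print(euclidean_centroid_distance(train_emb, prod_emb))
```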
And I would emphasize, right, like that approach is actually very agnostic with respect to the kind
of data you're representing. I was giving CV as an example, but you could also
represent a piece of text as an embedding vector. You could represent a piece of audio as an
embedding vector. So it's a pretty agnostic approach to detecting drift for unstructured
use cases. Nice. So with structured data with quantitative numbers that you'd find in a table,
we can use what Amber was describing where we just have some baseline distribution that we should
be expecting. And if the distribution of those numbers in the table starts to get away from the baseline distribution, then that could set off, like, an alarm, basically? Like, is that what happens? Like somebody gets, like, a PagerDuty notification kind of thing?
Yes. So you can set your monitors. And that's, that's where a lot of teams have difficulty. It's
not so much understanding how drift works, but understanding how to do it at scale. So what Arize does is set monitors for every feature of every model, of every model version, of every model type. So you're able to handle it at scale. And so teams can work in Arize accounts similar to how you can use role-based access in Google or AWS, and work on different sets of models. And then, yeah, that goes to Slack, email, PagerDuty, anytime a particular feature goes off on performance, drift, or data quality. That would be alerted to a team member.
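As a rough illustration of what such a monitor boils down to (a hypothetical sketch, not the Arize SDK; the send_alert stub stands in for whatever Slack, email, or PagerDuty integration you actually wire up), it is essentially a metric plus a threshold plus a notification.

```python
def send_alert(channel: str, message: str) -> None:
    # Hypothetical stub: in practice this would call your Slack webhook,
    # email service, or PagerDuty API.
    print(f"[{channel}] {message}")

def check_feature_drift(feature_name, baseline_counts, production_counts, psi_threshold=0.25):
    """Fire an alert if a feature's PSI exceeds the configured threshold."""
    score = psi(baseline_counts, production_counts)  # psi() as sketched earlier
    if score > psi_threshold:
        send_alert(
            channel="#ml-alerts",
            message=f"Drift monitor triggered: {feature_name} PSI={score:.3f} > {psi_threshold}",
        )
    return score
```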
This episode is brought to you by Posit, the open source data science company. Posit makes the
best tools for data scientists who love open source period, no matter which language they prefer.
Posit's popular RStudio IDE and enterprise products like Posit Workbench, Connect and Package Manager,
these all help individuals, teams, and organizations scale R and Python development easily and securely, and produce higher-quality analyses faster with great data science tools. Visit Posit.co. That's
POSIT.co to learn more. Nice, makes perfect sense. And yeah, so just kind of circling back to what
Xander was saying there. So with tabular data, just like straight numbers that we can put in a distribution, we compare the baseline distribution to some production distribution. With unstructured data, whether it's images like the computer vision example that you gave, Xander,
or whether it's natural language data, it could be audio waveforms, any of these unstructured
data types, they can be converted into an embedding, which, yeah, is just this vector of some length
that represents numerically kind of an abstraction of all of the various features. So like you said,
with computer vision, it could be: is the image grainy or not, or is the semantic meaning that's represented by the image different? And so there would be some set
of embeddings. So similar to the way that we have like a baseline distribution and we're comparing
that with the production distribution, we have this like baseline set of embeddings. And if we start
to see embeddings that are far away from what we were expecting, then that similarly sets off
an alarm. Exactly, exactly. Nice. All right. I think I'm following along. So I know there are
like lots of different kinds of drift. So I know there's like data drift, model drift,
others. Are you able to, like, break down for us what these different kinds of drift are?
Yes. And also, like, the course gets into each type of drift, what it is, how to monitor it,
and then what metrics you should use. So there's definitely more details if people are interested
in the course. But the different types of drift that we see: well, there's what's known as model drift, or drift in predictions. So if something's drifting in predictions, you don't need the actuals, you don't need those ground truths. But then there's something called concept drift,
that's a drift in essentially the outputs, which means you do need that ground truth. And for
teams that get the ground truth back without a really strong delay, it's better to use a performance
metric, because what teams want to use drift for is a proxy for performance. So if they don't get that performance back, they want to see, like, is anything happening? And so a lot of that comes with covariate drift, feature drift, data drift, metadata drift; all of those are essentially putting
drift monitors on each feature. And you can set those for numeric and categorical features.
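In standard notation (my shorthand, not Amber's wording), with x the features, y the actuals, and y-hat the predictions, comparing a reference time 0 to a later time t, these categories are roughly:

```latex
\begin{aligned}
\text{prediction (model) drift:} &\quad P_t(\hat{y}) \neq P_0(\hat{y})\\
\text{concept drift:} &\quad P_t(y \mid x) \neq P_0(y \mid x) \quad \text{(requires ground truth)}\\
\text{feature / data drift:} &\quad P_t(x) \neq P_0(x)
\end{aligned}
```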
So the different types of drift we see are going to be around the features, the inputs, the outputs,
and then there's also something known as upstream drift, which is essentially like
something going wrong in the upstream process. So it's more likely an engineering issue that's
coming around there. But yeah, I would say that the number one type of drift that teams care about is feature drift, because some of these models have thousands of features. And if you're using things like SHAP values, feature importance values, you might notice that maybe five features have
the most important impact on decisions. And so they really want to monitor the most important
features that are leading to these decisions that are then impacting customers. So feature drift
is, I would say one, the biggest ones. And PSI, which is a population stability index,
which is actually derived from KL divergence, tends to be the most used metric because of the stability it has and a symmetric property that KL divergence doesn't have.
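For reference, that relationship and the symmetry can be written out directly. With baseline bin proportions p_i and production proportions q_i:

```latex
\mathrm{PSI}(P, Q)
  = \sum_i (q_i - p_i)\,\ln\frac{q_i}{p_i}
  = \sum_i q_i \ln\frac{q_i}{p_i} + \sum_i p_i \ln\frac{p_i}{q_i}
  = \mathrm{KL}(Q \,\|\, P) + \mathrm{KL}(P \,\|\, Q)
```

Swapping P and Q leaves PSI unchanged, whereas KL(P||Q) and KL(Q||P) generally differ, which is the symmetric property being referred to.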
Cool. Yeah. And so KL divergence, you mentioned a couple of times there. So Kullback-Leibler, and I'm butchering the pronunciation, but it's the Kullback-Leibler divergence that people can look up in full later. Because I think when we say it really quickly in a podcast, like "KL", people are probably like, is that a word? Yeah. So it's two words, capitalized K and L, with a hyphen in between them; in most cases people just say KL divergence. So folks can look that up. And we have, like I said, I mentioned
the course, but we have blog posts that really go into details. Well, yeah, I have trouble pronouncing
it because there's JS divergence, there's a lot of different abbreviations, and they're not the easiest to say. And they're even harder to calculate sometimes. So having, like, a go-to guide has been
really helpful and has helped, I think, a lot of our customers understand it, because these are nontrivial things to calculate for large models. Okay, cool. So I guess with a
platform like Arize, does it come built in with, like, these kinds of calculations? So yeah. Yes. So you can sign up for the Arize platform; Arize has a free version, and you can upload two models into it. It's based on inferences. So anyone can go and try it out today. And we have automatic schema detection capabilities. You can upload your data from a CSV
or cloud storage or an API. So a lot of different options for uploading your data. And then we'll
automatically set all your monitors around drift, performance, and data quality. You can use custom
metrics. We have a lot of default metrics that tend to be the most useful for teams. And so we
have normally two types of customers. One type of customer that just wants everything done for them.
They want to check on it maybe once a week and make sure that they get alerted when things happen. And then we have another type of team that really wants to dive in to fully
understand all the features of their models. See if they can improve performance by one
percent. Like, what would it take to do that? But yeah, it can be as easily automated as you'd like, or you can make it as configurable as you want. Like, you can take away all the automated monitors and make everything customizable if that's what you want. Nice. All right. I'm starting to get a full
picture of the shape of this platform and how to use it. So you know, we're uploading CSV files
so that the platform can get a sense of what our data are like. That allows it to create these
kind of baseline distributions or baseline embeddings. And then we can be monitoring in production.
And if some production issue happens, then there's alarms going off, emails, texts, and that's where we're pointing the finger. We're saying, this is where it's happening in your data.
And then what we're also able to do with some of our unstructured capabilities is export the data
as well. So I think Xander can talk more about our open-source offering; the capabilities there are also what you get from the Arize platform on the unstructured side. But with all the clustering and the unstructured data, you can actually lasso certain points and export them, and teams can use this for labeling and retraining. But essentially teams want to know where in the data the problem is and be able to see exactly how much that data is influencing performance. So you can
isolate, you can filter on those, whether it's metadata, whether it's a certain class, whether it's
just a few IDs, it's a location, you could filter on each one of those and see how much it's
actually impacting performance. Is performance going down 2% when we isolate this data? So it's that
isolation aspect that is the difference between ML observability and ML monitoring. Because ML
monitoring is the tip of the iceberg. You definitely need that. But if you want to actually avoid
these models going down, you need to have that second set of saying, where is this problem and how do
I solve it? Yeah, so I guess I wasn't aware in my head that those two words weren't synonyms.
So I might have thought that ML monitoring and ML observability were the same thing.
Okay, so ML monitoring is like somebody could have a like we could have a screen in the office
with like a bunch of charts of like performance over time or these distributions over time and how
they're changing. But that requires somebody to be, like, keeping an eye on it. Whereas with ML observability, it's automatically keeping an eye out for you.
Right, right. And when you mention those screens, it makes me think of the stock market and
the stock exchange and you'd have everything going on. And unless you're constantly looking at it all the time, you want something in place like a stop limit, so you know what's happening and you know where your limits are. And so you could set those for your models. Like, any time
this goes down 2%, you know, it affects our KPIs, our profitability goes down $10 million.
And so relating those to business metrics, you can have those in place.
It's interesting that you said stop order because that is like, are you saying that from like
experience and financial markets, or is that actually what you also call it in your platform?
No, that just came from how you were visualizing it for me. Yeah, yeah, no, exactly. But yeah, that was perfect. It was so fluid that I thought it was actually just how you'd configured it. But yeah, so it's like, how do you know that? I didn't notice that you had a trading background from your biography.
I don't. Yeah, yeah. But yeah, that makes perfect sense, so that, like, yeah, if you're tracking some asset, you can set it up on a lot of these trading platforms so that in real time, if something hits a certain price, you'll automatically buy or sell. And yeah, that makes perfect sense here. It's like, if some aspect of your model in production drifts so far that it hits this, like, price that triggers it, you know that change corresponds to, yeah, this business impact of this amount of dollars lost. And so at that point, we've got to be sure that we're on top of it, and that it's worth, like, yeah, that sequence of whoever's PagerDuty things going off in the middle of the night so that someone has to get up and fix it.
Yeah, a lot of times it will trigger a retraining cycle. Nice, which is actually one of the topics
that I wanted to talk about next because when something goes wrong when like an alarm goes off,
when a model is no longer performing as expected. Because, like, fundamentally that's the idea with drift, if we were to, like, summarize it: whether it's concept drift, feature drift, prediction drift, all these different kinds of things, the issue is that the model, for some reason or other, something about it, is no longer relevant to the real-world data that are coming through your
platform. And so then when this arises, we need to retrain our model, I guess is the most common
solution. Retraining is a very common solution and I'll actually let Xander chime in on this
because he works a lot with community members that are leveraging Arize for their use cases,
but retraining is very common, but a lot of times, you know, it's it's a bigger conversation of
what's going on with the data. I never realized how strange it was to retrain a model on an arbitrary schedule until I started working at an ML observability company. And then, you know, oh, we're going to retrain every two weeks, or every day, or every six months; you realize that really doesn't make sense. It should be on an as-needed basis. But Xander, what are you seeing in the community?
interested in with the community? Yeah, just really quickly Xander before you go, that is a really
interesting point that you made and something that I had never thought of. So for me, it's exactly this: I've only ever thought of model retraining. Indeed, in my talk at ODSC East in Boston, I specifically said, yeah, you need to retrain at regular intervals; that's the exact guidance that I provided on my monitoring ML models in production slide, which, I now realize, says monitoring, not observing. And so, yeah, I hadn't gone that step further into thinking about, like, observing. But so I guess as part of this, with observing,
you can have these triggers automatically retrain your model, which makes more sense for so many reasons. Because if, like my suggestion, it's just daily, weekly, monthly, whatever cadence to retrain the model, that actually doesn't make any sense unless you're aware of how much drift is happening. Because if I'm doing it every day, but I only need to be doing it once a year, I'm wasting a crazy amount of resources. Whereas if I'm only doing it once a year, and I should be doing it every day, my platform is going to be terrible. Yes, exactly.
If it isn't broken, you know, as the saying goes, you don't need to fix it. And then you do have teams on the other side that are afraid to retrain a very large model that cascades into maybe 10 other models that affect 10 other teams, and if they retrain it, something might go wrong.
So you have people on both ends. They're like, let's retrain every time we get any new data.
And then teams that are, you know, let's be very, very cautious and not retrain unless it's
absolutely necessary. And Arize helps folks understand when is the right time to retrain. Awesome. All right. So Xander, I didn't let you speak earlier. You got some examples for us? Well, yeah, I think just one thing I wanted to add on to what Amber has already said, just around
the difference between ML monitoring and ML observability. I usually try and break down ML
observability into this simple equation, which is it's monitoring plus the ability to identify the
root cause of the issue. So just to know that something's gone wrong is not always enough. Like,
you need to actually have some visibility into what's causing the problem. And figuring that out
for machine learning systems can actually be like this really devilish, devilishly tricky problem
to do. And so I think that's also one of the distinctions I draw there between those two concepts.
Part of what we're trying to do is not only just detect and give you a little alarm bell,
but also to help you really quickly, to give you these like opinionated workflows that are
going to really help you quickly identify exactly what the issue is. So that's another
differentiator, I would say like between monitoring versus ML observability.
This episode of Super Data Science is brought to you by AWS Trainium and Inferentia, the ideal accelerators for generative AI. AWS Trainium and Inferentia chips are purpose-built by AWS to train
and deploy large scale models. Whether you are building with large language models or latent
diffusion models, you no longer have to choose between optimizing performance or lowering costs.
Learn more about how you can save up to 50% on training costs and up to 40% on inference costs
with these high performance accelerators. We have all the links for getting started right away
in the show notes. Awesome. Now back to our show.
Cool. All right. Yeah. Thanks for that additional insight, Xander, on ML monitoring versus observability.
And so thus far, we've been detailing the arise commercial platform and how it can be useful.
And you've mentioned how people can be having up to two models uploaded for free, which is super
cool. So you can check it out. And maybe for like, you know, smaller companies or use cases,
that'll be enough for them actually, without needing to go to a commercial option. But you also
have a brand new open source product, which is called Phoenix. So tell us about Phoenix. Can
it do everything that the Arize enterprise product can do, for example? Yeah. So Phoenix is bringing part of the functionality of the Arize enterprise platform, which is the SaaS platform, into a notebook environment. So what it actually is is an application that runs alongside your Jupyter notebook; it runs on the same server that's running your notebook server, whether that's your local computer or whether that is your Colab server. And it actually gives you this interactive experience that's more immediate than
uploading data to a SaaS platform. And right now, at the moment, it's really focused on unstructured
data on the unstructured offering that I talked about a moment ago. But I think in terms of the
long-term scope, we're really trying to be responsive to the community. What does the community
want? What is the community hungry for? And right now, the answer is they're really hungry for
LLMs. So that's really like where the push is happening for us on the Phoenix front right now.
Awesome. Yeah. So I mean, tying back in perfectly to a lot of podcast episodes that we've had
recently on the show, we've had a lot on generative AI, on large language models. Obviously, like, my talk at ODSC East also focused on that. It was a super popular one. And I knew it would be, because this is just, like, what everyone's talking about, whether you're in data science or not, and whether you know to call it large language models or you know to call it generative AI, everybody's talking about ChatGPT, GPT-4, Midjourney, and how these large language models are impacting the world,
and you know is my job safe, or like how is this going to change things in the future,
what are the policy implications? What does this mean for misinformation? So yeah, so like one way or
another, in the data science world or not, these are major topics. And so it's really cool that you
have decided to go to be kind of LLM first, unstructured data first, with your Phoenix open source product.
So yeah, so I mean, is this something like people just go to the GitHub repo, and it's like
straightforward how to be using this with your LLMs? Yeah, yeah, so you can just pip install it, it's pip install arize-phoenix, you can check out our GitHub repo, you can check out our docs,
we got a bunch of tutorial notebooks up. Yeah, so it's you know free to get started, just put it
in your notebook, and run it. Nice. And then like are there additional kind of resource
requirements? Like, I guess, typically to be running an LLM, we'd need to have some pretty big infrastructure running, probably with at least one GPU. And then, so I guess in addition to that, there's probably no additional, like, infrastructure requirements; it'll be relatively lightweight relative to the model? Yeah, there's no
infrastructure requirements at all. You can run it along, you know, if you're running in a notebook
environment that has like a GPU, you can run it there, but you don't need a GPU in order to run the app.
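For anyone who wants to try it, a rough quickstart sketch based on the Phoenix docs around the time of this episode is below; the exact class and argument names (Schema, Dataset, EmbeddingColumnNames, launch_app) should be checked against the current docs, since the API has continued to evolve, and the toy DataFrames are just placeholders.

```python
# Rough quickstart sketch, assuming the Phoenix API as documented around mid-2023;
# verify names against the current docs before relying on this.
import numpy as np
import pandas as pd
import phoenix as px

def toy_frame(n_rows: int, shift: float = 0.0) -> pd.DataFrame:
    """Stand-in for real inference data: an ID column plus an embedding column."""
    return pd.DataFrame({
        "prediction_id": [f"id_{i}" for i in range(n_rows)],
        "embedding": [np.random.normal(shift, 1.0, 32).tolist() for _ in range(n_rows)],
    })

train_df = toy_frame(500)              # reference / training inferences
prod_df = toy_frame(500, shift=0.5)    # production inferences, deliberately shifted

schema = px.Schema(
    prediction_id_column_name="prediction_id",
    embedding_feature_column_names={
        "text_embedding": px.EmbeddingColumnNames(vector_column_name="embedding"),
    },
)

# Launch the Phoenix UI; it opens in a browser tab or inline in the notebook.
session = px.launch_app(
    primary=px.Dataset(dataframe=prod_df, schema=schema, name="production"),
    reference=px.Dataset(dataframe=train_df, schema=schema, name="training"),
)
```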
Of course. The app is just this pretty lightweight application, and it actually is going to be a UI
that you can actually use either in a separate browser tab, or literally you can open up the UI and
use the application literally inside of the Jupyter notebook. Gotcha, gotcha, gotcha. So once you
get Phoenix running, that, yeah, it allows you, similarly to the way that you might have TensorBoard running to watch your machine learning model train. And yeah, TensorBoard comes originally from TensorFlow, but your model could be in PyTorch; it doesn't matter. With TensorBoard, you can open up another browser tab, or again, you could have it running just there in the notebook, and you can watch your model's cost hopefully go down as your model trains. And so similarly, you could have another tab open with Phoenix running, and so you could be watching in real time as your model's running in production, just have this extra tab open. And then so then
that sounds like ML monitoring to me so far. So, but Phoenix also has built in like the observability
component, that if we hit that, to go back to the financial analogy, if we hit that stop order
price, we'll end up triggering alerts. Oh yeah, oh yeah. So one thing I'll say is like Phoenix at the
moment is not real time. If we find that the community wants that, we could put that in, but at the
moment that's not, it's not a real time application. That would be the SaaS platform, but imagine that
you've got some, you know, some inference data, right? And imagine that we're in this situation where
we want to detect drift. And before I mentioned, you know, you have your monitoring your embedding
distribution, right? And we're able to actually measure quantitatively how far has the production
distribution shifted away, drifted away from the training distribution of embeddings. And so you
can imagine you've got like this graph showing you drift of your production data relative to your
training data over time. Again, all this is the drift of the embedding distribution, right?
And then what we can do is like we can take those embeddings. And this is where I wish I could
show the people who are listening, but imagine this, like imagine you take the embeddings, which are
these high-dimensional vectors, right? It could be a thousand dimensions or more, right? And you take
those embeddings and you do some dimensionality reduction to view them in three dimensions. So now
you're actually literally able to see your two embedding distributions in three dimensions,
your production and your training distributions. And again, like I mentioned, if the production
distribution has shifted away, has drifted away from that training distribution, again,
you're going to be seeing pockets of production data that you didn't see during training. So it's
really cool. You can literally see the exact data points in production where you didn't have
training data. And then what we do is we provide this, we go ahead, go ahead.
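For listeners who want to picture that step, here is a minimal sketch of the dimensionality reduction using UMAP, which is one common choice for this; Phoenix's exact internals may differ, and the random embeddings are stand-ins for real ones.

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(42)
train_embeddings = rng.normal(0.0, 1.0, size=(2000, 768))  # e.g., training-set embeddings
prod_embeddings = rng.normal(0.4, 1.0, size=(2000, 768))   # production embeddings, slightly shifted

# Fit one reducer on both sets so they share the same 3D coordinate system.
combined = np.vstack([train_embeddings, prod_embeddings])
coords_3d = umap.UMAP(n_components=3, random_state=42).fit_transform(combined)

train_coords = coords_3d[: len(train_embeddings)]  # plot these in one color
prod_coords = coords_3d[len(train_embeddings):]    # and these in another; production-only
                                                   # pockets of points indicate drift
```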
But this is all Phoenix that you're still describing, right? So this is Phoenix, but this is also
in the SaaS platform. But so, like, this is going to sound like I must have missed something, but you said Phoenix doesn't work on production, yet everything that you were just describing was, like, production embeddings and comparing those to training. So, like, when we're watching Phoenix, when we have this extra tab open as we're training, what are we watching? You were talking about it as comparing production embeddings versus training embeddings, but if we're just training a model, what do production embeddings mean in that context? Yeah, yeah. So I think I was just trying to clarify that Phoenix isn't an app that you're, like, logging real-time data to, if that makes sense. It would be like, imagine you've got batches of
production data. So you could have batches. So okay, okay, so I completely misunderstood. I was thinking that Phoenix was, like, okay, I have two tabs open, I've got my TensorBoard running, and maybe my mistake is that I was just kind of running with this idea. But, like, I'm training a model in my Jupyter notebook, and I've got a tab open that I'm watching my loss function on, and then I've got another one where I'm watching Phoenix. But that isn't what you're doing. You'd be taking batches of production data and, ad hoc, looking at them to check for drift. So Phoenix allows you to do ad hoc ML monitoring, to identify your own issues and to give you a better understanding of maybe where your model is starting to fall down in production or where it has limitations
and where you might want to be retraining or adjusting aspects of your model. That is an
accurate description. Yeah. And John, you can also use like a validation set versus a training set.
You can use it pre-production. But most folks are concerned with that post-production workflow.
Even if the inferences are kind of like mock data, most of the time it's like you get a little bit of data back, you know, you're kind of A/B testing and you're getting some information back
just to see how well your model is doing. Right, we're not just like validating the model. We're
validating how well our model is doing with our customers. I got you. I got you. And so
yeah. And so you made a really interesting example there, Amber, that I hadn't thought of, which is splitting our training and validation sets. So I'm guessing, and you can correct me where I'm wrong here, but it sounds to me like when we are training our models and we want to make sure
that our model works well on data that it hasn't seen before. We have this validation data set
that we set aside. But that validation data set is only useful if it matches our training data
in terms of obviously not being identical data points, but having the same kind of distribution
in the case of structured numeric data or the same kind of embeddings in the case of unstructured
data. Yeah. I actually got some head nods there, which only our video version will actually show, and my head would have been on screen. But Xander and Amber both nodded their heads in agreement at the same time. So, okay, cool. Yeah, that's another use case that I hadn't thought of.
And I would just add on to that. I think I started out with this idea of training versus production
because I think that's a lot of the time the easiest one to understand the very first time. But
in general, Phoenix is this pretty general tool for being able to diff embedding distributions. As you mentioned, you could be diffing the distributions between training and validation data. You could be diffing the two distributions between a fine-tuned and a pre-trained model. You could be, um, you know, we're getting into, like, actually really interesting applications right now. So actually, what I've been working on the past couple of weeks is diffing the distribution of embeddings for a context retrieval knowledge base versus the distribution of user queries, to understand: are users asking my LlamaIndex service or my LangChain semantic retrieval service questions of my database that are answered in the database, or that aren't answered, right? So it's a pretty general idea. It's really, like, you know, champion-challenger. Anytime you want to diff two embedding distributions, I would say.
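As an illustrative sketch of that last use case (not the actual Phoenix workflow; the threshold and array shapes are made up), you can flag user queries whose embeddings have no close neighbor among the knowledge-base embeddings.

```python
import numpy as np

def flag_uncovered_queries(query_embeddings, kb_embeddings, similarity_threshold=0.75):
    """Return indices of queries whose best cosine similarity against the
    knowledge base falls below the threshold, i.e. questions the corpus
    probably cannot answer. The threshold is an illustrative choice."""
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    kb = kb_embeddings / np.linalg.norm(kb_embeddings, axis=1, keepdims=True)
    best_similarity = (q @ kb.T).max(axis=1)  # best match in the knowledge base per query
    return np.where(best_similarity < similarity_threshold)[0], best_similarity

rng = np.random.default_rng(7)
kb_emb = rng.normal(size=(5000, 384))    # e.g., embedded document chunks
query_emb = rng.normal(size=(200, 384))  # e.g., embedded user questions
uncovered, scores = flag_uncovered_queries(query_emb, kb_emb)
print(f"{len(uncovered)} of {len(query_emb)} queries look unanswered by the corpus")
```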
Nice. Great. Thank you for like generalizing this concept as well as then giving specific
examples. Very cool. So yeah, so the canonical thing that we're thinking of with ML observability
in general, because this is the thing that, you know, we're worried about most in production is
comparing training data distributions versus production. But, um, of course, a tool like Phoenix,
which is open source, which, you know, we can use for comparing any kinds of embeddings,
we can be using that for comparing our training data versus validation data, fine tuned model versus
not. Yeah, particular user use cases, making sure, trying to see whether users are kind of using our platform in the way that we anticipated, the way that we trained it for. Super cool.
All right. Yeah. I'm starting to see a huge amount of value here in Phoenix. Um, yeah. And so
I guess maybe, so you mentioned how you developed Phoenix on the one hand because LLMs are so
popular today, but I also, I just have a bit of a brain wave and you can correct me if I'm wrong
on this, but it seems to me like it might also be particularly useful because comparing embeddings is so much more complicated than just comparing distributions. So it also kind of seems like you've created a product that solves the more complex problem. You know, with respect to, you know, we talked at the beginning of the episode about structured data, where we're just comparing distributions, versus unstructured data, where we're comparing embeddings. It seems to me like that latter problem is more complicated, and so this Phoenix product, like, scratches that complex itch.
I think that's it. Yeah, that's a good description. Um, and then I think one last thing I would tack on there too is, again, like
the thing that we're really aiming to do is not only detect like that something's changed,
but it's really to immediately drill down into exactly what has changed. So like what that looks
like in the product is we're actually like pointing out the exact portions, the exact embeddings,
the exact data points that are causing the drift, that are causing this change in the distribution,
and then surfacing those up to the user and telling you this is what you need. This is the data
that you need to look at in order to understand why you're experiencing this drift issue or in order
to understand why are the users who are asking questions of my, uh, you know, my semantic retrieval
chain service. Why are they not getting answered, right? Um, that's the idea.
And the whole of Arize's products are there to solve the pain points our customers are facing.
So the problems that customers have with using traditional drift methods aren't so much being
able to, like, understand how it works, but being able to do it at scale, being able to do very
high volume, being able to select which metric is best for that use case. And that's what we help
solve for more traditional drift metrics, like setting them up for scale. And then for unstructured,
it's, is this even possible? Like a lot of teams are coming to us that have more traditional models.
And they're like, I'm thinking about using these LLMs now. Like, I don't want to be behind.
How, how would that even look? Um, can I track that? Can I see how that's doing in production?
Um, and so those are just some of the ways we look at it, and some of what sets Arize apart. Nice. Very cool.
Did you know that Anaconda is the world's most popular platform for developing and deploying
secure Python solutions faster? Anaconda solutions enable practitioners and institutions
around the world to securely harness the power of open source. And their cloud platform is a
place where you can learn and share within the Python community. Master your Python skills with
on-demand courses, cloud hosted notebooks, webinars, and so much more. See why over 35 million
users trust Anaconda by heading to superdatascience.com slash Anaconda. You'll find the page pre-populated
with our special code SDS so you'll get your first 30 days free. Yep. That's 30 days of free
Python training at superdatascience.com slash Anaconda. So when we're dealing with LLMs,
the scale of these models can be very large, obviously. Like billions of parameters.
And, you know, then when we think about a user of your products maybe having a very large LLM that also has a lot of users, they're needing to scale up from, you know, already big servers to very large numbers of these servers running to handle all the users. So are these kinds of scale challenges something that the Arize team had to deal with as, yeah, as you guys decided that LLMs were something you wanted to focus on? Or is there something about the way that Arize was architected such that this scaling just kind of happened automatically? Xander, you want to take a crack at that first? Yeah, I think I want to
also maybe clarify one thing that I think is a common question we get in the beginning, which
is, like, what data is Arize actually taking in? So we're not actually taking in the model. We never, you know, we're not an inference platform. We don't take in the model. We're not performing inferences. It is our customers who are actually handling that responsibility, and our customers are logging to us basically all of the data around the inferences. In the case of an LLM, you're going to have the actual embedding itself. So you as the customer would, at inference time, grab that last hidden layer or, you know, however you want to construct that embedding, and you would log it to us, right? And then we're taking in that embedding, and that is the information that the Arize platform is actually responsible for.
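For example, with a Hugging Face model, grabbing a last-hidden-layer embedding at inference time might look roughly like the sketch below; how you pool and which layer you use is up to you, and the commented-out logging call at the end is just a placeholder, not the Arize SDK.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

def embed(text: str) -> list[float]:
    """Mean-pool the last hidden layer into a single embedding vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    vector = outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape: (hidden_size,)
    return vector.tolist()

embedding = embed("Example production prompt from a user")
# log_to_observability_platform(prediction_id="abc-123", embedding=embedding)  # placeholder
```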
Gotcha. So it sounds like the answer to my question is that, yeah, Arize scales very easily. Like, yeah, got it, got it. Oh, I just was going to add
on, like, we're very much built for scale. That's where we win a lot of bake-offs, because anything can look good if you're just experimenting with a very small set of data, but being able to scale it to, like, you know, some of our customers have billions of inferences daily, like ad tech and e-commerce. They have thousands of models. Each one of these models has thousands of features and metadata and SHAP values. And then with embeddings too, they can choose to sample them or they can choose to upload all the data, because with a lot of the unstructured use cases, they're looking for trends, they're looking for major aspects of the data and not just a single anomaly. They're looking for patterns. They're looking for kind of what clusters are emerging, what clusters are new. And so for that, sometimes they will sample the data or use all of it, because, as we know, these can be very, very large. Gotcha. Gotcha. Gotcha. There's a, you said,
did you say bake-off or big-off? A bake-off. A bake-off. A little accent coming out there. So I assumed that you said bake-off, but it would be kind of perfect if you said big-off, because it's like, yeah, it really was specifically to deal with, like, who can handle the scale well. Yeah, the bake-off at scale, the scale challenge. Who can make 10 times as much cake? 100 times as much cake?
Cool. All right. So, all right, I think I've got a pretty good grasp on the Phoenix product, on Arize's commercial offering, and on these kinds of drift issues, model observability issues in general. A related problem, particularly with unstructured data being used in production, is bias. So I know that Arize has a bias tracing tool. Can you dig into, like, why this is important, how it relates to, like, model explainability, and how practitioners could be leveraging this to debug their models or maybe have models that are safer in production? Right. I'll talk about the
bias tracing tool. And then Xander, maybe you can give like an unstructured example of where
you can start seeing, like, poor actors or bad data coming through unstructured models by isolating data. But with our bias tracing tool: so when teams say, you know, do you have explainability, what they normally want is bias detection. Explainability is great, SHAP values are great, knowing what features lead to model decisions is great, but it doesn't tell you anything about what the final decision was or the impact it has for that user. So looking
at bias tracing and looking at it on different levels. So with bias, you're going to want to look at
parity. So is my model making better decisions for this base group as compared to this sensitive group?
Because your model could just be, you know, poor at making decisions regardless of the group; then you know your model is just bad, but it's not biased. And then sometimes your model's performance
overall is really good, but not when you compare, maybe you compare a certain demographic,
a certain cohort to a base cohort, or a cohort that you have more data for, because you might have
a minority and majority class. So by doing this parity comparison and using something like a
recall parity or false positive rate parity, seeing the decisions you're making between groups
is really key. And we actually have more information on that in the course and blog posts about how to
measure and detect fairness in production. Because a lot of times it comes down to the fact that there are a lot of things that actually end up causing bias in production. Sometimes you say, oh, I'm removing all the data that relates to a protected class, but you could have proxies. You could have just not enough data on certain groups, you could have class imbalance issues. You can have, you know,
you could have certain biases in the data itself. And if you're training on a certain set,
you know, just being able to isolate where these biases are taking place is key. Because for parity scores, if you're looking at, say, the decisions you're making on a loan for women
versus men. You want that parity score to be as close to one as possible, because if it's close to one, your false positive rate is similar for one group and for another group. But if you see that parity score is below 0.8 or above 1.25, you're outside that range, outside of the four-fifths rule, which is, you know, like I said, this is all newer stuff, but that has been adopted in US equal employment guidelines as a way to measure equal opportunity, you know, to tell if teams are being essentially biased in, say, hiring.
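To make that parity arithmetic concrete, here is a small illustrative sketch (not Arize's implementation; the toy labels are made up) that computes false positive rate parity between a sensitive group and a base group and checks it against the 0.8 to 1.25 range.

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    """FPR = false positives / all actual negatives."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    negatives = (y_true == 0)
    return float(((y_pred == 1) & negatives).sum() / max(negatives.sum(), 1))

def fpr_parity(y_true, y_pred, group, sensitive, base):
    """Ratio of the sensitive group's FPR to the base group's FPR."""
    group = np.asarray(group)
    fpr_sensitive = false_positive_rate(y_true[group == sensitive], y_pred[group == sensitive])
    fpr_base = false_positive_rate(y_true[group == base], y_pred[group == base])
    return fpr_sensitive / fpr_base

# Toy loan-decision example: label 1 = deny; predicting deny when the true label
# is approve counts as a false positive.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 1, 1, 0, 1, 0, 1])
group = np.array(["w", "w", "w", "w", "w", "m", "m", "m", "m", "m"])

parity = fpr_parity(y_true, y_pred, group, sensitive="w", base="m")
in_range = 0.8 <= parity <= 1.25
print(f"FPR parity: {parity:.2f}", "(within the four-fifths range)" if in_range else "(outside it)")
```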
So that's one way we measure it. So if you're way off that parity of one, if it's like a parity of
0.1 or a parity of six or seven, looking more into the data is going to be key. And, I know, that's a little bit of a run-on; it's just a very big topic. And it's not just measuring
drift and it's not just measuring performance. Like there are special metrics in place around
bias tracing. Right, right, right. In addition to those special metrics, can these kinds of distribution monitoring or embedding monitoring help too? So, like, even going back to the conversation that we were having earlier, where I had this light bulb go off around what you were saying, where we could be using Phoenix for comparing training versus validation, and then you went into other examples of fine-tuned model versus not. Could we similarly, in addition to those special metrics that you just mentioned, Amber, also be, like, comparing the embeddings of a sensitive group versus not? Or am I off base? Yeah, sure. So one thing that you can do,
I mean, there's a lot that you can do here. One thing that you can do: imagine that we're dealing with, like, text output from a model and you want to know, is the text
output offensive or biased? Or is it, you know, is it actually producing some kind of bias against
a certain kind of prompt, for example? Maybe it's a prompt about that has some kind of gender
component to it. Is it producing output that's different based on gender? Right, maybe that's a
concrete example we could look at, right? You're dealing with some kind of chatbot, right?
Your input prompt has some kind of gender feature or gender component to the prompt.
And you're looking at the responses. And if you actually embed the responses, what you can do is
see if there's certain clusters, literally like clusters of your embedding space, right? Like
clusters of your responses that are offensive or biased or have some other kind of negative
outcome there. And you can literally like look at those data points and color them by
the gender of the input prompt, right? And visually see in the embedding space what's going on,
right? So that's one idea. Like, the embeddings, because they contain this really rich information, are going to be very useful for understanding that kind of a
situation and use case. Nice, super cool. All right, nice. So we've got a clear idea of, I think, a lot of the key offerings that Arize has now, and this has given me, and presumably also our listeners, a huge amount of context around why ML observability is important as well as
how it relates to adjacent issues like what we were just talking about with bias and model
explainability. So you guys have a lot of high profile customers, companies like Uber, Spotify,
eBay, Etsy. Are you able to share, without, like, obviously giving away anything proprietary about you or your clients that you shouldn't be sharing on air, are there aspects of, like, your relationships with these big clients that you can, yeah, that you can dig into, some case studies of how you were able to help with your solutions? Yes, that's a great question, John. We see a lot of teams coming to us, well, we see a lot of teams coming to us either because something went
wrong in production, something went really bad. The models were down for a while. These are
models that control websites, revenue, profitability forecasting. They control how much product you're
going to buy and when those models go down or they drift or they decrease in performance, it has
a cascading effect. So there's a lot of teams that come to us because something went wrong and they
didn't catch it. You know, it's at least a week to detect and fix an issue for most teams and most
of the time it's longer than that. Or we have teams that really want to be preventative. So some cases we see, going back to that retraining discussion we were having, are about when to retrain your model, and not everyone agrees at a company on when they should be retraining their model and how they
should be retraining their model. So having ML observability as a tool to justify retraining is going to be important. Some teams are using a lot of resources retraining their model more frequently than needed, and it's not always the best case to just automatically retrain your model on data that you haven't really analyzed. And so being able to see the data, see what's going on, and say, hey, we don't have this: especially in large language models, you can see these new clusters forming, you can see them in prompts and responses. You know, so being able to retrain models based on what you're seeing and making a justification for that. ML engineers can use our dashboards, they can use our visualizations, put those in kind of a PDF file, and send that to certain managers, and we have seen that help teams a lot in creating
a specialized training cadence. Another thing is for deprecated models. So essentially these
models that you take out to pasture, you know, what models are actually making a difference,
what models aren't. And you can see those performances, you can see, you can track them over time,
and that helps a lot of teams. The more models teams have, the more complicated their process is,
and that tends to be where Arize really shines: for teams that have, you know, these models that are three or four years old, a lot of times they don't have the original members on the team who built these models. So helping to maintain them without, you know, knowing the source code, without having to change a lot of things up, you can tell is this model working or not, just by implementing it and having it in Arize, knowing when models are failing, and version
controls. So that's a big thing that Arize offers. Like when you're coding, you want to
version control your code, teams that we work with often version control their models,
and then they can compare versions of their models. Just like we were saying, you could test
production and training, you could test validation sets, you could test any two data sets against
each other, you could also test models, model versions against each other and see which ones are
performing better. So a lot of times what we see is a tool for the machine learning engineer
to assist with their job, their workflow, and to catch, obviously to catch issues ahead of time,
but they can make justifications and tie that back to business profitability and business metrics.
Another big part of our survey was teams saying, you know, business executives have a hard time quantifying the return on investment they're getting on AI. One of our customers said they spent 10
billion on AI solutions. They don't really know how it works or if it's even helping the company.
All these companies will remain nameless, but, you know, we are seeing this happen where they don't
want to not implement the latest technology, but they don't really understand if it's working or
if it's actually, you know, improving their profitability for the cost that involves implementing it.
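One way teams put numbers behind that conversation, as described a moment ago, is to evaluate competing model versions on the exact same held-out slice rather than on whatever data each version happened to ship with. Below is a rough, generic sketch with scikit-learn; the models, synthetic data, and metric choices are placeholders rather than anything Arize prescribes.

```python
# Illustrative comparison of two model "versions" on one shared evaluation slice.
# Everything here (data, models, metric choice) is a placeholder for the sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.3, random_state=42)

versions = {
    "v1_logreg": LogisticRegression(max_iter=1_000),
    "v2_gbm": GradientBoostingClassifier(),
}

for name, model in versions.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_eval)[:, 1]
    preds = (proba >= 0.5).astype(int)
    # Reporting the same metrics on the same slice makes versions directly comparable.
    print(f"{name}: AUC={roc_auc_score(y_eval, proba):.3f}  F1={f1_score(y_eval, preds):.3f}")
```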
And then, like I said, the other big part is scale. So Spotify recently gave a talk on how they are trying to manage their massive amount of embeddings in production with Arize. As you can guess, Spotify has all kinds of embeddings: these are audio, these are text, you know, they're for search and retrieval. There are a lot of algorithms going on and a lot of personalization and real-time aspects happening. So the integration of all those models, the embeddings, the scale, I think those are the most important aspects when thinking of an ML observability solution, because trying to do all that in production while trying to build new models and do everything else a machine learning engineer is supposed to be doing would be very difficult. So having something in place where you set it up, you can check on it, and you know if it's doing fine, is really key for a lot of teams.
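The point about new clusters forming in prompt and response embeddings can be sanity-checked without any particular platform. Here is a rough sketch under the assumption that you already have embedding vectors for a training baseline and for recent production prompts: score each production embedding by its distance from the baseline centroid and surface the most out-of-distribution examples for review. The vectors below are random stand-ins.

```python
# Rough sketch of spotting "new clusters" in production prompt embeddings:
# score each production embedding by its distance to the training centroid
# and surface the farthest examples for manual review. Vectors are random stand-ins.
import numpy as np

rng = np.random.default_rng(7)
train_emb = rng.normal(size=(5_000, 384))          # baseline prompt embeddings
prod_emb = np.vstack([
    rng.normal(size=(900, 384)),                   # traffic that resembles training
    rng.normal(loc=3.0, size=(100, 384)),          # an emerging, previously unseen cluster
])

centroid = train_emb.mean(axis=0)
# Use the training set's own distances to calibrate what "normal" looks like.
train_dist = np.linalg.norm(train_emb - centroid, axis=1)
threshold = np.percentile(train_dist, 99)

prod_dist = np.linalg.norm(prod_emb - centroid, axis=1)
outliers = np.where(prod_dist > threshold)[0]
print(f"{len(outliers)} of {len(prod_emb)} production prompts fall outside the baseline")
print("indices worth inspecting first:", outliers[np.argsort(prod_dist[outliers])[::-1][:5]])
```

In practice you would then cluster and actually read the flagged prompts (observability tooling typically layers a projection and clustering view on top) rather than eyeballing indices, but the thresholding idea is the same.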
Awesome. Yeah, those were great examples. Even the ones where you couldn't name the company specifically really illustrated how useful a platform like this is, especially at the scale that these companies are dealing with. And then it's super cool that you're able to give that Spotify example in particular. Yeah, it's crazy to think how many embeddings they must have; like you say, there's something like 100,000 tracks uploaded to Spotify a day. And apparently a lot of that's actually AI generated, like it's AI feeding AI, which is pretty crazy. Nice. All right. So we've talked a lot about Arize and machine learning observability in this episode. But I haven't really let the audience get to know either of you as individuals. So I've got kind of a last topic here for them to learn a little bit about you. Amber, you studied astrophysics; Xander, you studied math. And you both had different roles before you got into what you're doing today. So maybe just give us a little bit of a taste of how you transitioned into what you're doing today.
So Xander, you're a developer advocate, which is a kind of role that I think we're seeing more and more in startups. It's related to working directly with developers, answering their questions, developing community, coming onto podcasts. We see these kinds of developer advocate roles in cool, fast-growing startups like Arize more and more. Amber, your role I had never heard of before: ML growth lead, machine learning growth lead. So you're going to have to dig into exactly what that means. But yeah, for each of you in turn, maybe we can start with Xander: let us know how you became a developer advocate and why a listener out there might be a perfect fit for a developer advocate role themselves. Yeah, that's a good question. So let me backtrack a bit, because actually this is my very first time ever doing this kind of role. I've been doing this dev advocacy stuff for like eight months, and actually, before I joined Arize, I did not know what a dev advocate was. So for anybody out there who doesn't know what it is, I think of it as having kind of two components. Part of it is evangelism. So, as you said, going on podcasts. You know, I spoke to a really well-respected dev advocate in the field, and he basically told me you need to live and breathe it. You have to convey the passion for what you're building, what the team is building, to the audience and convey the need. So evangelize the product.
That's part of it. And then part of it is being a pair of boots on the ground that's in touch with the community, kind of like, you know, the first person a user of an open-source product is going to talk to, a point of contact for the community. So if those kinds of things appeal to you, it's definitely, I think, a good career to consider. And in terms of how I got into it: previously, I was working as a machine learning engineer at an early-stage company, and we actually died. It was a smaller startup than Arize; I worked there for like two years. We fought super hard and we died, we couldn't raise. And when I took a moment to understand why we died, really the thing for me was that we were pretty heads-down building stuff, and it turned out that the stuff that we built didn't have a strong market. We didn't achieve product-market fit. And that really became this moment for me where I wanted to evaluate, career-wise, how could I prevent that from happening again? Because it was very painful. And for me, the answer was, oh, I just need to be really engaged with the community and really in tune with the community. And for me, part of it is just being a conduit
between the company and the community as a whole. Very cool. Nice explanation there. And yeah,
you've certainly got an ear to the ground now and can keep an eye on the trends. And it also
seems like you've identified a company that has a great product market fit. So yeah, yeah.
Nice. Amber, the floor is yours.
Awesome. And I will just say, my role changes all the time. I went from academia to industry in 2018, and since then I've been an AI program director, an AI PM, a head of AI, an AI sales engineer, a machine learning engineer, and an ML growth lead. And I think that's one of my favorite parts of being in tech: you can try out these roles, you can see what you like and what you don't like. And I'm always curious about different roles and what folks are doing. And that's really what led me to growth. Because I think it's one thing to be kind of heads-down building the product, which I really do enjoy. But, you know, like Xander said, finding that product-market fit and helping folks solve problems is really where I want to be. I want to solve those pain points we see for customers. And the growth aspect, being part of the growth team at a startup, is incredibly important for word of mouth, for funding, for people getting their hands on the product, and for helping solve real-world problems. So the growth aspect is keeping in mind: are people using our product? Are they using open source? Are they using the platform? Are they staying engaged with it? Do we have activation? Do we have retention? Do we have people in our open-source community talking about the issues in MLOps that they face? You know, for me, I really like connecting with folks. I like giving talks, doing workshops, doing events, and having conversations. Like, you know, John, we met at an event. And that, to me, is all part of growth: getting people to understand what Arize is, how to pronounce it, you know, what ML observability is.
Ah-rize.
Making sure that people understand that they're even having issues in the first place. Like, you know, what we talked about earlier about monitoring versus observability, about retraining on an arbitrary schedule. They don't realize these things aren't as they seem until you kind of pull the veil up and say, are you struggling? Do you know if your users are churning? Do you know if you're maximizing the profitability of your models? And that's what I find really cool. Because for a lot of teams, a few years ago when I started, ML observability was a luxury. Like, oh, you know, that's nice to have, but we're focused on other areas. And now a lot of teams don't know how their models ever got by without observability, because they're thriving way more now. And they feel safer about their models. Machine learning engineers feel better about putting their models in production, knowing that if something goes wrong, there are guardrails. Nice. Yeah. It's an increasingly important area
that everybody in this space, I think, needs to be aware of. It's amazing to me that we haven't done an episode focused exclusively on ML observability before today. It was very obvious to me as soon as I met you that we needed to do this episode, and I'm glad you were willing to come on so quickly. Because, you know, when you describe situations like the one at Spotify, but this happens even in much smaller companies, machine learning models depend on machine learning models, which depend on machine learning models in production. Like, there are these cascades. And so just one model drifting and having issues could mean that a user's experience
really takes a nose dive. Yeah. So one other interesting thing: at ODSC, I judged a hackathon. And these were great presentations; people built amazing products. And at the end, when I asked, does it work? Do you know that it's working? Because they would show one or two examples, and I'm like, how do you know it's working? How do you validate it? Do you have any KPIs? And the answer was always, oh, those will be our next steps. It's interesting, because people are very focused on getting this AI to work, and they think, oh, you know, later on we'll figure out if this actually adds value. We can find value right away for our customers, and I think that is the area of ML observability that Arize is really focused on.
Nice, very cool. And yeah, I guess I should have done this right at the beginning, now that you mention the pronunciation thing, but Arize is spelled A-R-I-Z-E, or "zee-E" if you're in the United States. Awesome. Amber, Xander, this has been fabulous. I've learned a ton, and now our audience is very much aware of ML observability and its importance, if they didn't know about it before. Before I let you go, I ask all of my guests for a book recommendation. So Amber, maybe if you want to go first. Yes. I recommend Designing Machine Learning Systems. It's an O'Reilly book by Chip Huyen, and I believe this is the one where our co-founder, Aparna, wrote either a chapter or part of a chapter on ML monitoring and how that can be put into place. Nice. Yeah. Chip was actually
on the show recently. Episode number 661. And the title of that book was the title of her episode.
Yeah, really important topic. Super popular book and super popular woman.
Oh yeah. Chip's great. She's also super nice. She'll be at a lot of these events.
I recommend everyone says hi to her because she's a really great person to know in the space.
For sure. All right. Xander, what have you got for us? I'm currently working through a book that was actually recommended to me by the tech lead of Phoenix, called Measure What Matters. It's actually a book about KPIs and OKRs. We use OKRs pretty aggressively on the Phoenix team, and it's been really interesting to hear about how some of the largest companies in the world and some of the most successful startups in the world drove their growth and honed their priorities using that particular system. So I'd recommend that one: Measure What Matters.
Cool. Yeah. Great recommendation. All right. Thanks to both of you, very knowledgeable speakers who clearly know what you're doing. And I'm sure there are going to be lots of listeners who would like to be able to continue to learn from you after the episode. Xander, what's the best way that people can follow you afterward? I'm not on Twitter at the moment, although I probably should get on Twitter. But for right now, I would say LinkedIn me up. I do have a Twitter, Astronomer Amber, but I think LinkedIn is better. But if you have direct questions, you can join the Arize community Slack and just Slack in that community. I'm talking with community members every day. Me too, me too, so I would second that point. Yeah. Perfect. We'll be sure to include all of those links, your social media profiles as well as the Slack channel, the Arize community Slack, in the show notes. All right. Thanks very much, guys. And yeah, we'll have to catch up again some time with you to see how your ML observability and job-title journeys are progressing. Awesome. Thanks, John.
Thank you, John.
Nice. Thanks to Amber and Xander for that highly informative discussion. In today's episode, they filled us in on how ML observability builds on ML monitoring to automatically catch and fix production issues before they become a big deal; how the various types of drift, for example feature drift, label drift, and model drift, can be tracked by comparing baseline probability distributions with production distributions, or, in the case of unstructured data such as with LLMs, by comparing embeddings. They talked about how Arize's open-source Phoenix library provides sophisticated
tools for comparing embeddings, allowing us to compare LLM performance during training versus
production. This allows us to monitor for drift as well as many other comparative use cases.
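For listeners who want to try that training-versus-production embedding comparison themselves, the sketch below loosely follows the open-source Phoenix quickstart as it looked around the time of this episode; the specific names (px.Schema, px.EmbeddingColumnNames, px.Dataset, px.launch_app) and the parquet file paths are assumptions that may differ across Phoenix versions, so treat the official Phoenix docs as the source of truth.

```python
# A sketch of comparing training vs. production prompt embeddings in Arize's
# open-source Phoenix library, loosely based on its quickstart around the time
# of this episode. Class and argument names are assumptions and may have changed
# in newer releases; the parquet file paths are hypothetical.
import pandas as pd
import phoenix as px

# Each dataframe is assumed to hold raw prompt text plus a precomputed embedding vector.
schema = px.Schema(
    embedding_feature_column_names={
        "prompt": px.EmbeddingColumnNames(
            vector_column_name="prompt_vector",
            raw_data_column_name="prompt_text",
        ),
    },
)

train_df = pd.read_parquet("train_prompts.parquet")  # hypothetical file paths
prod_df = pd.read_parquet("prod_prompts.parquet")

train_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")
prod_ds = px.Dataset(dataframe=prod_df, schema=schema, name="production")

# Launch the Phoenix UI with production as primary and training as the reference,
# which is what surfaces drift and emerging clusters in the embedding view.
session = px.launch_app(primary=prod_ds, reference=train_ds)
print(session.url)
```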
And they talked about how the comparison of natural language embeddings allows us to compare
whether sensitive groups are being treated differently by a model, thereby flagging where unwanted
bias may be occurring. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Amber's and Xander's social media profiles, as well as my own, at superdatascience.com slash 689. That's superdatascience.com slash 689. If you live in the New York area and would like to engage with me in person, on July 14th I'll be filming a Super Data Science episode live on stage at the
New York R conference. My guest will be Chris Wiggins, who's chief data scientist at the New York
Times, as well as a faculty member at Columbia University. So, not only can we meet and enjoy a
beer together, but you can also participate in a live episode of this podcast by asking Chris
Wiggins your burning questions. Alright, thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you. And thanks, of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing
another exceptionally practical episode for us today. For enabling that super team to create
this free podcast for you, we are deeply grateful to our sponsors. Please consider supporting
the show by checking out our sponsor's links, which you can find in the show notes. Finally,
thanks of course to you for listening all the way to the very end of the show. I hope I can
continue to make episodes you enjoy for many years to come. Well, until next time, my friend, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science podcast with you very soon.