706: Large Language Model Leaderboards and Benchmarks
This is episode number 706 with Katarina Konstantinescu, Principal Data Consultant at GlobalLogic.
Welcome back to the Super Data Science Podcast, today I'm joined by the insightful Katarina
Konstantinescu.
Katarina is a Principal Data Consultant at GlobalLogic, a huge, full-lifecycle software development services provider with over 25,000 employees worldwide.
Previously she worked as a data scientist for financial services and marketing firms, she's a key player in data science conferences and meetups in Scotland, and she holds a PhD from the University of Edinburgh.
In this episode, Katarina details the best leaderboards for comparing the quality of both
open source and commercial large language models, and the advantages and issues associated
with LLM evaluation benchmarks.
All right, let's jump right into our conversation.
Katarina, welcome to the Super Data Science Podcast.
It's nice to see you again, so where are you calling in from today?
Edinburgh, Scotland, actually, I'm delighted to be here, by the way.
Nice.
Edinburgh is a place where, as you know from the time that we met at the New York R Conference, I spent a lot of my time during my PhD, through a research collaboration there that led to my only really top machine learning paper.
I had a paper in NeurIPS from my collaboration at the University of Edinburgh, so there's a lot of amazing computer science faculty at Edinburgh, in particular in AI, and there have been for decades.
It's a powerhouse school for AI.
It might be one of the oldest AI schools around; I mean, I don't know what stretches back further.
That's so interesting.
Yeah, that's definitely a draw to Edinburgh, which, I feel like, doesn't really even need it.
It's such a gorgeous, gothic-looking place, but for me, my trajectory has been quite different.
I actually came here to study psychology and then sort of seamlessly segued into data science through, I don't know, some discoveries along the way: namely that, during my PhD, I was becoming more and more interested in the data and design aspects of the experiments I was running, and in the data analysis, as opposed to the psychological theory per se.
But then also some accidents happened along the way.
I found myself running the R meetup here in Edinburgh, met up with a lot of people who were doing data science, and slowly but surely I ended up working for The Data Lab for a couple of years.
That was my first proper data science gig, and I've just stuck with it ever since, and I'm also still in Edinburgh.
This is maybe 10 years later after having appeared on the scene here, so yeah, here we are.
It's a beautiful city, very dark in the winter, but it's a beautiful city.
That's for sure.
That is the tough thing about Edinburgh, I think: in winter, the sun sets around 3 p.m., which is a bit grim, to be fair.
But yeah, your affiliation with that R meetup in Edinburgh is, I guess, what ultimately brought us together, because that's how you ended up having a connection into the New York R meetup that Jared Lander runs, and so, yeah, you had a talk at the R conference.
We filmed a Super Data Science episode live at the New York R Conference, and that was recently released as episode number 703 with Chris Wiggins.
That was an awesome episode, and you had a great talk there as well on benchmarking large language models.
So I wanted to have an episode focused specifically on that today.
So big news, at least at the time of recording, and hopefully still quite relevant at the time that this episode is published, because this space moves so quickly.
Very recently, at the time of recording, Llama 2 was released, and Llama 2 was published by Meta with 11 benchmarks.
There are three Llama 2 models that were publicly released: a 7 billion, a 13 billion, and a 70 billion parameter model.
And even the 13-billion-parameter model, on these 11 benchmarks that Meta published, is comparable to what I would have said was previously the top open-source large language model for chat applications, which was the Falcon 40-billion-parameter model.
So all of a sudden you have this Llama 2 architecture that's a third of the size with comparable performance on these benchmarks.
But then when you jump to Llama 2, the 70-billion-parameter model, it blows all of the pre-existing open-source LLMs out of the water.
And so, yeah, so should we believe this?
Can we trust these kinds of benchmarks?
I mean, yeah, dig in for us into why these benchmarks are useful, and also what the issues are.
Cool.
Yeah, so this is a really good starting point for our entire conversation, because this
example I think pulls in various aspects I really wanted to talk about.
And I think the first one I'm going to dive into is: what does all of this mean?
How can you, in a way that really does justice to all the effort that's been ongoing for the last few years in this LLM space, unpack this idea of performance?
What does it even mean, and what are all the facets that are involved?
And at the end of the day, once you do start to dive into all of this detail, with all the benchmarks, all the metrics, all the particular domains that are involved in a particular dataset used within these test suites, if you want, how do you kind of drill back up again to come up with some conclusions that actually make sense across this entire field, especially as it's moving so fast?
So I guess something that I would probably point towards as a risk first and foremost is
we're immediately placed within this arena of academic research.
And it's obviously an extremely well-developed area already.
We are talking about all of these benchmarks as you mentioned, but what I wanted to kind
of flag beforehand as well is at the end of the day, the idea is that these models are
going to be exposed to some layperson, some user, and their idea of performance may not
really overlap particularly with what's in all of these benchmarks.
I think a good example to really drive this message home would be something like: maybe, as a random average person, I might be looking to interrogate ChatGPT, as an example, on what a suitable present would be for my niece, and my entire experience and my idea of performance might rather have to do with, are the answers creative enough?
Creativity is not something you typically see in these benchmarks, and how would you even begin to measure creativity?
So that's one aspect.
It might also have to do with the interface that surrounds these models, and whether it makes it easy enough for users to interact with the models per se.
So yeah, I think that's something that's definitely worth pursuing a lot more in conversations, especially as the area develops further.
But to kind of return to the more academic research angle as well, what I'd probably dive into at this point, because it's a really good, solid effort at trying to incorporate a lot of facets of measurement, metrics, and datasets, is the whole effort surrounding the HELM paper.
So rather than immediately talking about whether this model is better than that model, on this task or that task, or this metric or that metric, in HELM...
Sorry to interrupt you, Katarina, but quickly, let's define what HELM is, at least the acronym, for our listeners.
So it's the Holistic Evaluation of Language Models, which, yeah, I'm sure you're going to go into as this comprehensive benchmark, but just before we get there, there was another aspect that you mentioned to me before we started recording, related to issues with any of these tests, and maybe you're going to get into it with HELM anyway, but it's this issue of contamination?
Yes.
So one aspect that I think isn't maybe as obvious, first and foremost, whenever we talk about evaluation risks, is this idea that, especially, models that are considered to be state of the art and have, broadly speaking, good performance, air quotes, tend to be closed source.
So what happens there is we don't have a very good grasp of all the types of data that went into these models in the first place, and therefore the outcome is that we have some degree of uncertainty in terms of whether we are actually exposing these models to test data they've actually already seen before.
And if that's the case, then obviously any performance we see might end up being inflated.
This relates to it.
So if we're using GPT-4, and we're blown away that it gets amazing results on these kinds of metrics, well, it's been trained on all of the internet, and so these test questions, the test answers, they're all in there.
And so it's a classic situation where, when we're creating our machine learning model, we want to make sure that our training data don't contain the evaluation data.
But if the algorithm has been trained on everything on the internet, probably the questions on any evaluation, and the answers, are already in there.
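To make that contamination worry concrete, here's a minimal, hypothetical sketch of the kind of n-gram overlap check that is often used to flag benchmark items appearing verbatim in a training corpus. The corpus, the question, and the 8-gram threshold below are illustrative assumptions, not any particular lab's actual decontamination pipeline.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lower-cased word n-grams of a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list, n: int = 8) -> bool:
    """Flag a benchmark item if any of its n-grams appears verbatim in the training data."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

# Toy usage: a test question that was pasted onto a web page scraped for training gets flagged.
training_docs = ["forum post: an LSAT practice question and its worked answer pasted here for discussion"]
question = "an LSAT practice question and its worked answer pasted here for discussion"
print(is_contaminated(question, training_docs))  # True -> drop or down-weight this item
```

In practice a check like this only catches verbatim overlap; paraphrased leakage is much harder to detect, which is part of why contamination remains an open problem.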
Even more so, it's interesting because there's this huge jump from GPT-3.5 to GPT-4 with respect to performance on things like the LSAT.
Or, I don't know if it's specifically the LSAT, actually; it was some kind of general bar exam.
So the LSAT, I guess, is to get into law school in the US, and the bar exam is once you have your law degree and you want to qualify in a whole bunch of different states in the US, there's this general test.
And I can't remember the exact numbers, but GPT-3.5 was such that, you know, nine out of 10 humans would outperform it, and then with GPT-4 it was the other way around: only one out of 10 humans would outperform it on this bar exam.
Yeah, so that's actually a really good example because LSAT is definitely part of these
benchmarks.
So if something like GPT-4 was trained to actually perform well on that, then if you
come in and try to test it again on that same sort of benchmark, then that's slightly
pointless because you're not going to really find out anything new about its performance.
And that kind of brings us to a different point that I'm glad we're able to make at
this point.
There's this whole idea that there's probably never going to be a particular point in time where we can stop refining and updating these benchmarks, because, well, first and foremost, we don't know exactly what's been incorporated in the training sets in the first place, so the only real way around that is to find cleverer and cleverer ways to test the performance of models and keep updating the benchmarks themselves.
But separately as well, as performance evolves, benchmarks actually might become obsolete and, relatively speaking, too easy.
So from these two points of view, there's been this effort to keep adding new tests; for example, BIG-bench, I think, started off with around 200 tasks or something of that nature, but now has 214 for this exact reason.
So that's why there's probably going to be a lot of movement also from the perspective
of any type of standardization that might increase over time because currently performance
can mean a vast number of things.
It could mean accuracy, it could mean fairness, it could mean lack of toxicity.
So a big measurement problem is how do you incorporate all of these different aspects and
do you even need to because there is some indication, there are some pieces of research
that would suggest actually despite being substantively quite different things, all of
these facets end up being very highly correlated, which is also an interesting idea.
So yeah, for all of these reasons, I don't think the research in this entire area is going
to stop anytime soon, so another big problem is how do you even keep yourself up to date
and digest everything that's been happening in this field?
Yeah, this does seem really tricky, this problem of constantly having to come up with
new benchmarks to evaluate, and that's going to become a bigger and bigger problem because
presumably in the same way that when you do a Google search today, you of course are
getting information that's minutes or hours old from across the internet, and it seems
conceivable that, in the not-too-distant future, while models like GPT-4 today are trained on data that stops a couple of years back, presumably people are working on ways of constantly updating these model weights, so that you have the LLMs right there, in the model weights, using up-to-date information about what's going on in the world.
And so somebody could publish a benchmark and then, minutes later, an LLM has already memorized the solutions, so the goalposts keep moving, I guess.
Now, on the other hand, we can certainly say that these models are getting better.
So despite all these issues, I feel very confident that when I'm using GPT-4 relative to GPT-3.5, I am getting way better answers than before, and I'm much less likely to get hallucinations than before.
And so these tests should measure something.
Like, these tests, I think, do have value, they have tremendous value, and they should correlate, I would hope that they would correlate.
Or at least it seems like, when these papers come out, and Llama 2 comes out, and I see that, wow, the 70-billion-parameter Llama model outperforms Falcon and Vicuna and all these other previous models, and then I go and use the 70-billion Llama 2 in the Hugging Face chat interface, I'm like, wow, this is actually pretty close to GPT-4 on some of these questions that I'm asking, questions that I feel like it hasn't encountered before.
So there is this underlying real improvement happening, and it does seem to correlate with these quantitative metrics.
But yeah, there are thorny problems, lots of thorny problems. I don't know,
do you think that HELM could be a solution? It seemed like you felt that way when you started talking about it earlier.
Um, I think the way they went about trying to systematically unpack performance, and trying to cross various factors, is probably the way I would have ended up organizing
this research, so that's why it really stuck out to me, but yeah, the sheer scale of effort
that went into it does make it very difficult to really, at some point, see the forest for the trees,
and I want to dive into this idea a little bit more, but yeah, we're talking about, for example,
I think five or six core types of tasks, from things like summarization to information retrieval.
Oh, and sentiment. I've got the page open in front of me, so again, HELM, the Holistic Evaluation of Language Models, is a Stanford University effort from the Center for Research on Foundation Models, CRFM, and there are 42 total scenarios that they evaluate over a bunch of categories, like you were describing: summarization, question answering, sentiment analysis, toxicity detection, it goes on and on and on, knowledge, reasoning, harms, efficiency, calibration.
And I'm not listing all the individual tests, I'm listing the categories, and exactly, in each there could be half a dozen to a dozen different tests.
Yes, and multiply all that by the tens of models they're considering, so very quickly,
you arrive at this wealth of information, and if you take a step back, you naturally ask yourself,
what does all of this mean? Now, the authors helpfully try to sift through this volume of
information by creating a leaderboard on the website, and this is another really interesting
tool, because it's not a unique concept. We have a leaderboard on Chatbot Arena, and we also have one on Hugging Face.
But here's the thing: my initial thought process was, oh great, I don't have to keep
up with individual models necessarily, I can just take a glance at these leaderboards, get the
gist of what's been happening in the area, and then anything that kind of leaps out at me,
that's what I'll dive into a bit deeper. But I kind of started to realize that it's not quite
so simple, because even with these three leaderboards, the reality is their evaluation criteria,
the models themselves that are included, don't overlap, so looking in three different places is
already kind of creating a hazy picture of what's really going on. So connected to this idea,
I kind of realized that actually, papers as vast as the HELM one kind of subtly introduce this concept of the time horizon you're interested in, because if you're interested
in models in the here and now, because maybe you want to pick one for a particular
application that you want to create, then sure you're going to dive into these and think, okay,
for this task, this metric, I want to see which one does best and I'll just go with that one and
test it further myself or whatever. But maybe there's more to the story, if we have a longer term
view, then maybe what we're going to be interested in is nothing to do with the particulars of this
model versus that one, but rather issues like, what is a good standardized way that we can even
think about measurement of these things, because it's so vast, because it involves so many different
aspects, maybe at some point in the future, rather than checking tens of different benchmarks,
multiple leaderboards, maybe there's going to be a distillation of fewer places to actually check,
or at least we can hope. And there's also an extra longer term focus, because at the end of the
day, once we get all of these metrics, like accuracy in terms of, I don't know, information retrieval
or Q&A and any associated metrics that get computed for tens of models, what we can do with those
is start to frame everything as a prediction problem, which is where things get really interesting,
because if we keep collecting these types of metrics, we're finally going to get closer to this
point in time where we get to say, okay, what are the ingredients from various models that actually
go into this observed level of performance? Is it the fact that they have this many parameters?
Is it the fact that they had this training objective, or like generally speaking, is there some
sort of recipe of success that tends to lead to better performance? And if so, what is it? And we
won't really know the answers to these types of questions, unless we do all of these evaluations,
but look at them from this much broader perspective of not this model or that model, but general
laws that somehow govern how LLMs operate on a general level.
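To make the "prediction problem" framing she describes concrete, here's a hypothetical sketch that regresses an aggregate benchmark score on a few model "ingredients" such as parameter count, training tokens, and instruction tuning. Every number below is invented purely for illustration; it is not real leaderboard data.

```python
import numpy as np

# Each row is one hypothetical model: [log10(parameters), log10(training tokens), instruction-tuned flag]
X = np.array([
    [np.log10(7e9),  np.log10(1.0e12), 1.0],
    [np.log10(13e9), np.log10(1.4e12), 1.0],
    [np.log10(40e9), np.log10(1.0e12), 0.0],
    [np.log10(70e9), np.log10(2.0e12), 1.0],
])
y = np.array([0.55, 0.61, 0.60, 0.69])  # invented aggregate benchmark scores

# Fit score ~ intercept + weights . features with ordinary least squares
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(dict(zip(["intercept", "log_params", "log_tokens", "instruction_tuned"], coef.round(3))))
```

With enough models and enough consistently collected metrics, the fitted weights start to hint at the kind of "recipe for success" she mentions, though in reality you would want far more data points and a richer model than this sketch.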
Yeah, all really, really great points, and very thoughtful to think that we could eventually converge and have kind of one source of truth for you to go to.
It is interesting going to the Open LLM Leaderboard from Hugging Face: at the time of recording, we do have various variants of Llama 2 that are generally near the top; it looks like some groups have kind of retrained it with more instruction tuning.
And yeah, Hugging Face is trying to do an average over some different evaluations, like HellaSwag, MMLU, and TruthfulQA, but those tests are just three of the 40-odd scenarios that HELM ran, for example.
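As a toy illustration of the averaging the Open LLM Leaderboard does, here's a minimal sketch that combines per-benchmark scores into a single ranking. The model names and scores are invented placeholders, not real leaderboard numbers.

```python
# Invented placeholder scores for two hypothetical models on three benchmarks
scores = {
    "model_a": {"HellaSwag": 0.84, "MMLU": 0.63, "TruthfulQA": 0.45},
    "model_b": {"HellaSwag": 0.80, "MMLU": 0.68, "TruthfulQA": 0.52},
}

# Average the per-benchmark scores for each model and sort, highest average first
leaderboard = sorted(
    ((name, sum(vals.values()) / len(vals)) for name, vals in scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for rank, (name, avg) in enumerate(leaderboard, start=1):
    print(rank, name, round(avg, 3))
```

A simple average like this implicitly weights every benchmark equally, which is itself an evaluation choice worth questioning.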
Yeah, so I guess, I mean, it's nice to think that we could maybe go and have one absolute answer, but I think, on the other hand, depending on the specific use cases that you or your users are going to have, maybe the level of granularity in these different kinds of benchmarks is useful.
So with Llama 2, for example, I've actually not tested this myself, but I've read that Llama 2 doesn't perform as well on code tasks or math tasks as something like GPT-4, even though it can be comparable in a lot of
just plain natural language situations, where it's just human language. So yeah, so that kind of
distinction could end up being important, depending on your use case.
Like, you wouldn't want to, I guess, take Llama 2 and make something that's kind of like a GitHub Copilot with it.
Yeah, you might want to start with something else.
That's fair. Yeah, I do agree, actually,
with that point, and it does bring to mind all sorts of really interesting tests that are part of BIG-bench, where we're dealing with things like finding anachronisms and anagrams and stuff like that, which, depending on the application of a model, might really be completely irrelevant.
So yeah.
Yeah, and so in addition to HELM and the Hugging Face Open LLM Leaderboard, which I'll be sure to include in the show notes, you also briefly mentioned the Chatbot Arena, which in some ways collects more valuable, more expensive data, because instead of having evaluations be done on these benchmarks,
there's head-to-head comparisons, and then human users select whether they like the output
from Model A or Model B, and they can be blinded as to what those models are.
And in the very next episode, coming up, episode number 707, we've got Professor Joey Gonzalez of UC Berkeley, who is one of the key people behind that Chatbot Arena.
So he's going to go into a lot more detail, and he'll also disclose for us why it isn't as perfect an evaluation as it seems.
There are still
some issues. There always are, I guess.
You know, like many things in science and technology, we are making errors, but hopefully smaller errors all the time, and moving in the direction of progress.
Which is to say, despite all these kinds of criticisms that we can have of these particular evaluation benchmarks and leaderboards, ultimately we know, qualitatively, that this is a very fast-moving space, and it's crazy what these models have been doing recently, in the past year.
And, you know, what I mentioned towards the beginning of the podcast, having to do with end users: what do they actually think of as good performance?
What does that even mean?
And I think Chatbot Arena actually gets
quite close to this idea with their system of incorporating Elo ratings.
So that's something I really enjoyed playing around with earlier today myself.
Broadly speaking, this is an approach that's been adopted from chess: in terms of what happens in larger tournaments, you might have two players opposing each other, and depending on who wins, they either get a boost in points or, if they lose, they actually get points deducted.
And the same sort of approach is used on these LLMs.
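Here is a minimal sketch of the Elo update she's describing, as it might be applied to head-to-head LLM comparisons. The K-factor of 32 and the starting ratings are illustrative assumptions, not Chatbot Arena's exact parameters.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one human-judged comparison."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return rating_a + k * (s_a - e_a), rating_b + k * ((1 - s_a) - (1 - e_a))

# Toy usage: an upset win by the lower-rated model shifts both ratings by the same amount.
ratings = {"model_a": 1000.0, "model_b": 1100.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], a_won=True)
print(ratings)  # model_a gains points, model_b loses the same number of points
```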
But just as a regular user, you might have some prompt in mind, like, please generate text as though Elon Musk had written it, or something like that, or like the text of a tweet.
And I tried this earlier myself, and to be fair, both answers I got from the competing models sounded quite legitimately Musk-like, if you will.
So yeah,
that's a lot of fun to play around with. And it's definitely a highlight in terms of what
Chatbot Arena contributes, as opposed to, say, HELM, although even in that case there is an attempt made to incorporate some human feedback into the loop as well.
But I don't think it's anywhere near being the focus of that body of work.
Nice, yeah.
But a good mention there of this kind of human feedback as being a great way of moving forward. And
the Chatbot Arena, I think everything is made available. All the data are made available
for people to use and make models better. So very cool space to be in, very exciting times to be
in AI in general, as I'm sure all of our listeners are already aware, and maybe part of why they're
listening to the show.
So, Katarina, before I let you go, I ask our guests for a book recommendation.
Do you have one for us?
I do. It's something that sprang to mind, although actually my first encounter
with this book was a very, very long time ago when I was still doing my psychology degree,
and I actually have it right here with me. It's The Illusion of Conscious Will by Daniel Wegner.
And when I came across this, I was actually studying in France on an Erasmus grant, and I remember
being stunned at this concept that Conscious Will can actually be manipulated experimentally.
And honestly, it's a joy to read.
The level of intellectual ingenuity in how these experiments are devised, so that people's subjective feeling of having wanted to do something ends up being manipulated, is just, to me, at this point, unique.
So, if anybody has any curiosity about this, I highly, highly recommend it, and who knows?
Maybe these notions of conscious will will come into the conversation with LLMs, and kind of already have.
So, there you go.
Yeah, that is certainly something.
The relationship between conscious experience and artificial general intelligence is something that we dove into with Ben Goertzel in episode number 697.
And it is something that, as somebody with a neuroscience PhD, I'm really fascinated by.
As I mentioned to you, Katarina, before we started recording, I was once offered a PhD scholarship to do a PhD in consciousness.
So, the neural correlates of consciousness: trying to identify, using brain scans, or probably some of the kinds of experiments outlined in your book, The Illusion of Conscious Will, where we use things like intracranial stimulation... So, you...
Transcranial magnetic stimulation, exactly.
Yeah, TMS, thank you.
Which allows you to have
a magnetic signal, and you may remember from physics that magnetism and electricity are directly
intertwined. And so, you can send these magnetic signals through the skull and then impact the way that
your brain cells work, which involves some electrical conductivity. And yeah, you can influence
people's conscious perceptions, like you're saying.
And so, in some ways, it's kind of an obvious thing to say to probably scientifically minded people, like a lot of our
listeners, that because we live in a system of cause and effect, you can't possibly have some
little person in your brain that is separate from all that, and somehow is making decisions
in some way that's beyond just physical processes, like, you know, cause and effect collisions
of molecules, yet we very compellingly have this illusion of free will, and to some extent, yeah,
I mean, if you come to grips with that, if you really accept that free will is an illusion,
then, I don't know, it can be tough. It's like, it's like, real tough. So, it is a terrifying idea.
Yeah. So, yeah, I didn't end up taking up that PhD scholarship, because I was like, this might
really do my head in, and got into machine learning instead. Yeah. Well, I'm pleased you did,
because now here we are, luckily. Yeah. Well, anyway, thank you very much,
Katarina. This was a really interesting episode, a really nice dive into evaluating large language models.
Very last thing: if people want to follow you after this show and hear your latest thoughts, what's the best way to do that?
Probably on Twitter, so you can find me at C double underscore constant time.
Nice. We'll be sure to include that in the show notes.
Katarina, thank you so much, and catch you again in a bit.
Awesome. Thank you. Bye.
Super, what an informative discussion.
In today's episode, Katarina covered how ordinary users of LLMs may have qualitative evaluations that diverge from benchmark evaluations; how evaluation dataset contamination is an enormous issue, given that the top-performing LLMs are often trained on all the publicly available data they can find, including benchmark evaluation datasets; and finally, she talked about the pros and cons of the top LLM leaderboards, namely HELM, Chatbot Arena, and the Hugging Face Open LLM Leaderboard.
If you liked today's episode, be sure to tune into the next one, number 707, when we have Professor Joey Gonzalez, a co-creator of the Chatbot Arena, as well as of seminal open-source LLMs like Vicuna and Gorilla.
Yeah, he'll be on the show next week. All right, that's it for today's episode.
Support this show by sharing, reviewing, or subscribing, but most importantly, just keep listening.
Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round
of the Super Data Science podcast with you very soon.