707: Vicuña, Gorilla, Chatbot Arena and Socially Beneficial LLMs, with Prof. Joey Gonzalez
This is episode number 707 with Joey Gonzalez, Associate Professor at Berkeley and co-founder
of Aqueduct. Today's episode is brought to you by the AWS Insiders Podcast.
By Modelbit, for deploying models in seconds. And by Grafbase, the unified data layer.
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry.
Each week, we bring you inspiring people and ideas to help you build a successful
career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now,
let's make the complex simple.
Welcome back to the Super Data Science Podcast. Today, we've got the fast talking,
extremely knowledgeable and extremely innovative professor and entrepreneur, Dr. Joey Gonzalez.
Joey is an associate professor of electrical engineering and computer science at Berkeley.
He co-directs the Berkeley RISE Lab, which studies Real-time, Intelligent, Secure, and Explainable
systems. He previously co-founded Turi, which was acquired by Apple for $200 million,
and more recently, he founded Aqueduct. His research is integral to major software
systems, including Apache Spark, Ray for scaling Python machine learning, GraphLab,
a high-level interface for distributed machine learning, and Clipper for low-latency machine
learning serving. His papers, published in top ML journals, have been cited over 24,000 times.
Today's episode will probably appeal primarily to hands-on data science practitioners,
but we made an effort to break down technical terms so that anyone who's interested in staying
on top of the latest in open-source, generative AI can enjoy the episode.
In this episode, Professor Gonzalez details how his headline-grabbing LLM, Vicuña, came to be
and how it arose as one of the leading open-source alternatives to ChatGPT.
He talks about how his Chatbot Arena became the leading proving ground for commercial and open-source
LLMs alike, how his Gorilla project enables open-source LLMs to call APIs, making it an open-source
alternative to ChatGPT's powerful plug-in functionality. He talks about the race for longer LLM
context windows, how both proprietary and open-source LLMs will thrive alongside each other in the
coming years, and he provides his vision for how AI will have a massive positive societal impact
over the coming decades. All right, you're ready for this phenomenal episode. Let's go.
Joey, welcome to the Super Data Science Podcast. It's awesome to have you here. Where are you
calling in from? I'm calling in from Berkeley. Nice. And so we know each other through
Raluca Ada Popa, whose episode was number 701, and that was an extraordinary episode. At the end of
it, I asked her if she had any recommendations of people to speak to, and she said,
Joey Gonzalez. And I already knew who you were from your amazing work on Vicuña,
and so I was delighted and amazed to have you here. Let's start right away with Vicuña. So for
our listeners who aren't aware of it, it is a model that I've been talking about on Air for a
while. In fact, we had an episode dedicated to these kinds of open-source, single-GPU, ChatGPT-
like models. So in that episode, we talked about Alpaca, Vicuña, GPT4All-J,
Dolly 2.0. That was back in episode number 672. And yeah, so Vicuña, I guess I could try to
introduce it again, but you might as well do it. And you can also tell us how it all came about.
Vicuña is a fun story. So again, I want to thank you for having me. It's exciting to be on the
podcast. The story of Vicuña actually began over a break period, sort of after another project,
Alpaca. So maybe I should go way back to the beginning of this year with the release of LLaMA.
The LLaMA model, developed by Meta, is a core foundation model. It embodies a lot of knowledge,
but it doesn't really speak, it doesn't chat. And so some colleagues, actually led by Carlos
Guestrin, who was my advisor when I was a grad student and is now at Stanford, led a project called Alpaca.
And the idea was to use a self-instruct method to train the LLaMA model to behave more like a chat
bot, using something like ChatGPT as a guidance mechanism. And so they built this data set,
which is pretty clever, actually. And they created a nice fine tuning script that they released
to the world that allows someone to take that data that they built and fine-tune LLaMA to speak
more like a person to have a conversation, to follow instructions. My students at Berkeley were
like, we could do better. And one of the things that's important to know in this entire kind of
revolution of large language models is that data is critical to success. And so the students looking
at this project said, there's a better data set. There's this website called ShareGPT,
which is sort of actually a demo of some web technologies, but that website did something
pretty neat. It allowed people to have fun conversations on ChatGPT and then share those with
their friends. These are the conversations they thought were funny, insightful, amazing,
hilarious. I don't know. But important, these are the conversations they wanted to share.
And so these are high quality conversations. We downloaded, I think, 800 megabytes of data
using the public APIs for the ShareGPT website. And then the students took that data,
the Alpaca training scripts. And they basically put the two together. There's a little bit of work
the students did that will blow your mind: they removed the HTML tags from the data. They
did a little bit of additional cleaning. And they fed that data into, again, the Alpaca training
scripts, fine-tuned the model, and out came Vicuña. And this was done, I think, over a break.
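To make that data-cleaning step concrete, here is a minimal sketch, not the actual Vicuña pipeline, of stripping HTML from ShareGPT-style conversations and turning them into instruction/response pairs for an Alpaca-style fine-tuning script. The JSON field names (conversations, from, value) and the file name are assumptions for illustration only.

```python
# Hypothetical sketch: clean ShareGPT-style conversation dumps for instruction fine-tuning.
# Field names ("conversations", "from", "value") are assumed for illustration only.
import json
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper

def clean_text(text: str) -> str:
    """Remove HTML tags and decode entities left over from the web UI."""
    return unescape(TAG_RE.sub("", text)).strip()

def to_training_pairs(path: str):
    """Yield (instruction, response) pairs suitable for an Alpaca-style training script."""
    with open(path) as f:
        records = json.load(f)
    for record in records:
        turns = record.get("conversations", [])
        # Pair up alternating human/assistant turns.
        for user_turn, bot_turn in zip(turns[::2], turns[1::2]):
            if user_turn.get("from") == "human" and bot_turn.get("from") == "gpt":
                yield clean_text(user_turn["value"]), clean_text(bot_turn["value"])

if __name__ == "__main__":
    pairs = list(to_training_pairs("sharegpt_dump.json"))  # hypothetical file name
    print(f"{len(pairs)} cleaned instruction/response pairs")
```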
Was it spring break? Maybe it was spring break. This was done on vacation. It's kind of a
hack over a few days. And they got a model they were excited about. And then we need to figure out
if it's any good. So the Stanford team had invested in actually having some benchmarks to run against.
We wanted to use standard benchmarks like MMLU. There are a lot of benchmarks in the NLP
community for evaluating these models. Unfortunately, none of them are very good.
They're not good because they don't measure chat behavior in complex, creative settings; they're more
oriented toward retrieval, answering simple facts. So to really assess a chatbot, we needed something
stronger. And so students had this clever idea. Why don't we just ask GPT? So they created a set of
basic questions. They asked the model to answer those questions. We also asked Alpaca to answer
those questions. And GPT-3.5. And then we asked GPT-4 whose answers were better and to score them.
And then score them on various metrics. And in the process of doing that, we found out that our model
was actually better than Alpaca (go Berkeley!) and actually pretty close to GPT-3.5, which is
really pretty exciting. And so we posted this online and there was a lot of interest in this model.
In a few weeks' period, we went from LLaMA to Alpaca to Vicuña, each making big strides in performance
as assessed by the benchmark that we had created. So Alpaca was certainly better than LLaMA, and
Vicuña was better than Alpaca. And so yeah, this generated a ton of interest. There's a blog going
around about a discussion inside of Google. Some people in Google were also pretty concerned about
this because we also compared it against PaLM, I guess Bard at that point. And it was comparable
to Bard. And in fact, it was a little better than Bard on some of the other benchmarks we started
running. So it was a big step forward for these open source research models and something we
were pretty excited about. That was the "We Have No Moat" memo that went around Google, right?
Yeah, the "We Have No Moat" story. I will say, in defense of Google, they do have a moat,
as illustrated by Vicuña. It is about the data, and using your data, using it intelligently,
can make all the difference. So building a big model is important. We took off-the-shelf big models.
We didn't pre-train our model, and the fine-tuning we did was fairly cheap, but we did it on
really good data. And so I think maybe the punchline, if you were to take one thing away
from this conversation so far, at least, is that the data matters. And we saw that with
ShareGPT and Vicuña. For sure. And Google certainly has some of the best data. I love the way
that you evaluated with GPT-4. So this is something that we use in my company Nebula now internally
for evaluating our models as well. It is brilliant, because it means you don't need to do what
we were always trying to figure out: okay, we have this complex task, it's difficult to
evaluate. We could do something like semantic similarity, you know, convert the response of
our model and the response of some benchmark model into embeddings and compare them, like take a
cosine similarity score or something, and try to say, okay,
well, the semantic meaning is similar, therefore maybe our model is doing all right. This idea of
asking GPT-4 which is better and rating it on a score out of 10, we now do that internally,
inspired exactly by you, inspired by the Vicuña project, because it means not only can we now say,
okay, this fine-tuned model that we've created for the specific task
is definitely better than the open-source LLM out of the box that we started with,
not only is it comparable to, say, GPT-4, but on top of all that, we can compare,
epoch over epoch or whatever kind of interval you want to evaluate on as your model's training,
on held-out evaluation data, how our model is doing. Like, are we starting to overfit?
Is it continuing to improve? Is it improving at all? And so thank you so much for this idea,
which is, like, so simple in a way, but so widely usable.
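As a rough illustration of the LLM-as-judge idea being discussed, here is a minimal sketch, not the actual Vicuña evaluation prompt, of asking GPT-4 to score a pair of answers. It assumes the pre-1.0 openai Python SDK and an OPENAI_API_KEY in the environment; the prompt wording is invented for illustration.

```python
# Minimal sketch of LLM-as-judge pairwise scoring (not the exact Vicuña eval prompt).
# Assumes the pre-1.0 openai Python SDK and OPENAI_API_KEY set in the environment.
import openai

JUDGE_PROMPT = """You are judging two AI assistants.
Question: {question}

Assistant A: {answer_a}

Assistant B: {answer_b}

Rate each assistant's answer on a scale of 1-10 and explain briefly.
Reply in the form: A: <score>, B: <score>, then your reasoning."""

def judge(question: str, answer_a: str, answer_b: str, model: str = "gpt-4") -> str:
    """Return the judge model's raw verdict text for one question."""
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  answer_a=answer_a,
                                                  answer_b=answer_b)}],
    )
    return response["choices"][0]["message"]["content"]
```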
Yeah, well, I should fill you in on some details. There's some follow-on work that we discovered in
this story. Good news is it was ultimately a good idea, but there's some caveats. So I should say,
once we launched this benchmark, my first response was, we should probably check to see if this actually
compares to a human judge. And we didn't have a lot of budget for doing human evaluation at that
point. And so we instead decided to try to run an arena. And so we wanted to build a website where
people would actually, in the wild, have conversations with the bots. The hope here is that we'd get
not just our prompted discussion points, but what humans would say, what humans would ask,
the crazy stuff that people might come up with, and then let them judge which models perform better.
And that is why we launched the Chatbot Arena. Nice, yeah, let's talk about that more.
Yeah, so the Chatbot Arena was a fun project. It started actually with, how do we make this a game
to get people to participate, make it fun? We took the Vicuña model, several of the other open-source
models that had emerged at that point in time. We actually made API calls to the commercial vendors
to get the state of the art models as well. And then, yeah, we put a website together where anyone
can go. It's still there if you go to arena.lmsys.org. Right now, you can chat with any one of the bots.
You can chat with them directly in a setting where you know which bot you're chatting with,
or, in the more fun part, you chat with them blinded. So you chat with a pair of bots. You don't
know which ones you're chatting with. You start a conversation and both bots respond to you,
and you continue that conversation speaking to both bots at once. And at any point, you can say,
you know, A is better or B is better, or tie, or they're both terrible. And we take that signal,
and we use that to create a ranking. And so we've ranked all the bots. We ended up using a ranking
system called the Elo rating system, which was developed for the chess community, has been
adopted by the gamer community, has been, you know, incorporated in sports betting. It's a really
cool mechanism. And that gives us an overall ordering of the AIs of the bots. And maybe not
surprisingly, at the top of that ordering are things like GPT-4, with Claude right there behind it.
And then as we go down, we see GPT-3.5, and Vicuña stays near the top. In the beginning,
we were worried: why is Vicuña so good? No one's going to believe our leaderboard if
Vicuña's up there. Maybe we should run a little bit longer. Let's get Koala, one of the other
models developed at Berkeley. I should say Koala was developed at Berkeley at the same time, in collaboration
with the Vicuña team. Koala was a little bit below Vicuña. Rolling back to the story, it's kind of
funny: Koala was lower than Vicuña because they didn't remove HTML tags, to the best of our
knowledge. So just a little bit of data cleaning. Again, punchline: think about your data. So we did
some better data cleaning, and Vicuña was better. But if we look at the overall leaderboard,
Koala is below Vicuña. And then since then, a bunch of other LLMs have kind of emerged in between.
I think it's like maybe over 20 models now. There's a lot of models on the board.
I have some stats on that. So you collected over 53,000 votes regarding 33,000 conversations
for 22 models. And all of that was released, at the time of recording this is very fresh,
on July 20th: you released that conversation dataset to the world, so that people can take
advantage of all of those tens of thousands of conversations across those nearly two dozen models.
Yep. Yeah. So we released the data. That was a nerve-wracking point in the research progress.
You know, releasing data is something you should do with care. We had been hoping to release data
a lot sooner. We removed PII, so we wanted to release data that didn't have any PII, and that took
some work. And then we ended up deciding to release all the data, including the conversations
that we wouldn't have continued in the actual bot, in the arena itself.
So we have offensive content filters. We actually kept the offensive content, and where it
triggered the filters as well, with the hope that the research community can start to study how these bots
respond to offensive content when it's present. Yeah. So that was a big release. My hope is that
it'll help shape research in RLHF and the design of evaluation
functions used to train models in the future. So yeah, it's one of the, as an academic
at Berkeley, one of the exciting things that we get to do is focus on just general impact and
building data, building models that will hopefully shape research in the future. Even when we look
at Vicuña, I frame it as a battle with our colleagues at Stanford. But realistically,
we looked at it as a chance actually to test some of the training tools we'd been developing.
We have some projects to enable sky computing. We have some projects to enable distributed
training, distributed serving. And so Vicuña was a very natural kind of extension of, how do we
test those tools? It was actually led by the systems students who were developing those tools
that kind of picked up that effort. And so, you know, for research, it's helped shape a lot of
what we're doing now: more efficient serving technologies, better use of GPUs for
statistical multiplexing. All that was kind of driven by the work with Vicuña. And in fact,
even FastChat, the arena, the place where people can chat with our bots, gives us a mechanism to
evaluate the underlying systems and how they can serve these models. So it's been a big research
effort. And the release of data sets is one of the important steps in that effort.
This episode is supported by the AWS Insiders podcast, a fast-paced, entertaining and insightful
look behind the scenes of cloud computing, particularly Amazon web services. I checked out the AWS
Insiders show myself and enjoyed the animated interactions between seasoned AWS expert Raul.
He's managed over 45,000 AWS instances in his career and his counterpart Hilary, a charismatic
journalist turned entrepreneur. Their episodes highlight the stories of challenges, breakthroughs
and cloud computing's vast potential that are shared by their remarkable guests, resulting in both
a captivating and informative experience. To check them out yourself, search for AWS Insiders in
your podcast player. We'll also include a link in the show notes. My thanks to AWS Insiders for their
support. Yeah, and we're going to get back to talking about open source, closed source,
with respect to both models, model weights, model architectures, data sources, we'll get to all
that shortly. Before we move away from this Chatbot Arena: these models are being released all the
time. Whenever somebody releases their big new LLM, they're like,
look at these benchmarks, we're definitely the best. So I don't know, kind of like in my mind,
I felt like up until this time of recording for a month or so, it was like this Falcon 40 billion
parameter model that was the kind of model in my mind that I was like, this seems to be kind of
generally the leader. At the time of recording, Llama 2 came out a week ago, and in the Llama 2
paper, as well as on the main webpage for the model, they have 11 benchmarks, and you already
mentioned the top benchmark, like the first one that they listed, MMLU, because this is one of the
benchmarks that you hear about the most with natural language generation tasks. And yeah, I mean,
so to what extent should we, as somebody reviewing this table, when I'm thinking to myself, okay,
Meta published this, should I, I guess I should probably trust, like, their big organization,
I should probably trust like the numbers that they put out, but also, but I wonder when I,
whenever I see these, I wonder what tests they're holding back. Like how many evaluations did they
carry out, and is 11 just a subset of all the ones that they did, and they're now showing to us the
ones that had the best kind of results. Because when you look at this table, you're like, wow,
the 13-billion-parameter Llama 2 is performing comparably to that Falcon model, the 40-billion one,
and the 70-billion-parameter Llama 2 seems to be, absolutely, for the most part across these 11,
it's crushing the results. Like, it's setting completely new kinds of standards for
open-source LLMs. So yeah, I mean, to what extent do you trust these results when you see them, or
do you think this is a scenario where you're like, put it in the Chatbot Arena, and that's the best
place to evaluate it? So, one, I was really excited about the Llama release, the Llama 2 release. It's a
big deal, and they put a lot of effort into building a better model, into evaluating the model,
the paper is well written. Like, I was excited about this release. We immediately put it in the Chatbot
Arena and have started to get some signal on it, and the scores are not amazing, which is
sort of disappointing. I was expecting Llama 2 to do better. It's still preliminary, we need to get
more data, and the Elo rating system is not particularly robust, so it takes some time to get a
good estimate of what its ranking would be. But in fact, the large instruction-tuned
versions aren't as amazing as I had hoped. We actually did release a benchmark, so one of the
fun kind of anecdotes of this whole journey is, when we released Vicuña, we used this
GPT-4 idea, a lot of other people have picked that up, we went and collected some data with the
Chatbot Arena, which was pretty consistent with our original GPT-4 results, but we then started
digging into the GPT-4 study that we did, and it was fun to discover that GPT-4 has some very
peculiar biases, and so how you do that GPT-4 experiment takes some care, and we've since fixed that,
so we have this MT-Bench, it's a multi-turn benchmark, and I bring this up because when Llama 2,
when Meta did the evaluation of Llama 2, they didn't look at these more kind of complex, discussion-
oriented scenarios, and we wanted, with the MT-Bench benchmark, to really test that setup,
and so we created a battery of 80 questions based on the kinds of conversations people are having
on the Chatbot Arena, but we then made these multiple rounds of questions, follow-up questions,
and then we had grad students carefully assess the pairs of models, and then we also had GPT-4
assess the pairs of models, and so on that benchmark, for which we published results, again, the Llama 2 models
are not as good as we'd hoped. Actually, I was very excited about Llama 2. Now I imagine, now that
they're kind of releasing and making more of a public commitment that it'll improve with time,
there is a caveat that I need to raise, and this is an important one, and it comes back to using
GPT-4, so if you're using GPT-4 to evaluate things, GPT-4 has two or three important biases.
First, GPT-4 prefers whatever it reads first, so there's a bias to the order in which
you present things. When we ran our experiments, we put our competitors first,
in the original Vicuña evaluation, and our responses second. Had we reversed it,
we would have found that we were better than GPT-3.5. Now that's a bias, and if you randomly
sample the order, it becomes much more neutral. So it's important to deal with the ordering bias.
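One simple way to handle the ordering bias Joey describes is to run the judge in both orderings and only credit consistent wins. Here is a hedged sketch; the judge and parse_scores helpers are hypothetical (the latter is assumed to pull the "A: x, B: y" scores out of the judge's reply), and this is not presented as the LMSYS implementation.

```python
# Sketch: mitigate the judge's position bias by scoring both orderings
# and aggregating across the swap. `judge` and `parse_scores` are hypothetical helpers:
# judge(question, first_answer, second_answer) -> verdict text
# parse_scores(verdict_text) -> (score_for_first, score_for_second)

def debiased_winner(question, answer_1, answer_2, judge, parse_scores):
    a1, b1 = parse_scores(judge(question, answer_1, answer_2))  # model 1 shown first
    a2, b2 = parse_scores(judge(question, answer_2, answer_1))  # model 2 shown first
    score_1 = a1 + b2  # model 1's total across both orderings
    score_2 = b1 + a2  # model 2's total across both orderings
    if score_1 > score_2:
        return "model_1"
    if score_2 > score_1:
        return "model_2"
    return "tie"
```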
The next bias is kind of funny. Also, humans share the same bias. GPT-4 prefers longer responses,
and we've noticed this across all the LLMs that they keep making their responses longer,
because humans tend to prefer more words in their response, which is a little surprising.
I guess I speak a lot, so maybe I'm used to saying too much, but I'd have to imagine if you're
asking questions, you want a short answer, and in many cases you do, but when you judge these
models and you use something like GPT-4 as a judge, it prefers longer responses. And then finally,
GPT-4 prefers itself in general; it has a self-preference bias, which humans also have.
And I guess, presumably, something obvious that you're already doing when you say that
is you're blinding it to which response is GPT-4's. That's correct. You're not saying, this is you and this isn't,
but it prefers itself. It prefers its own writing. Yeah, so it would be almost like
if you heard somebody speaking in the same accent as you, you might think... yeah, right, right.
Yeah, so it's like when I go back and read my own paper, I'm like, that was pretty good, right?
But I read someone else's paper that's pretty much the same thing: no, I don't like this.
And so that stylistic preference is important. It's important because a lot of these open-source
models are using things like GPT-4 as a judge, or using data that was generated from GPT-4 conversations,
like ShareGPT. And so that makes their behavior, their style of speaking, closer to something
like GPT-4. And Llama 2, I don't believe, did that. So it's possible that one of the reasons
that we see this kind of divergence of the two numbers could be something related to that.
Yeah, except that you said it's in the Chatbot Arena as well, although early stages.
But the Chatbot Arena, that's human evals, right?
Yeah, and it also gives a lower score. Let me dive into that too. So we've started to look at
why some of the really good or at least in principle should be good models aren't doing as well.
And this is maybe the fourth interesting takeaway. Models like PaLM and Llama, or Llama 2,
refuse to answer things. You ask them tough questions, questions they shouldn't answer, and they'll go,
I don't know, or, I don't want to have an opinion on that, or, I won't explain how to build that.
And so that abstention behavior, which is actually maybe the focus
of the Llama 2 paper and something that they did right, actually causes human interaction
scores to go down. And we also saw this with PaLM. In a lot of cases, PaLM will actually lose to a
really weak model like Dolly. And it'll lose because Dolly will answer any question you ask.
It doesn't have to be the correct answer; it'll say something. And PaLM will go, no, I don't know
the answer, or, you know, that's not a question I have an opinion on. And so this kind of
abstention behavior, which is again, something that we actually should be aiming more towards,
these benchmarks don't pick up. That is very interesting indeed. So it goes to show that,
even when you think you've developed this kind of foolproof-seeming approach, where you're like,
we got people in the arena, we got human evaluations. This is like the most expensive and valuable
way to be doing this assessment. And even then, you're running into the issue that, yeah,
some of these models that are really good that are designed to prevent misuse can be getting
downrated because people are like, ah, this is a terrible answer, you won't tell me how to
build a bomb? Yeah, yeah. So it's a neat observation that we've had. It's something that, as we look
at the arena and aim to the future, we're thinking more about how to incorporate that as kind of a goal,
maybe having situations where we're guiding the user: would it be more appropriate for
the model to abstain? Consider that in your evaluation. And also pushing the arena in a more
kind of vertical orientation. So like, you know, looking at code: ask code questions
in this specific part of the arena, so we can kind of isolate some of these abstention
behaviors. You mentioned to me before we started recording that the Elo rating system that you
decided on for the Chatbot Arena is also maybe not all that great. Yeah, so let's talk about Elo.
Elo is a pretty cool rating system. It has sort of a few design principles. So it was created
for chess. It was created for people who are playing in a decentralized fashion. People play chess
all over the world all the time. And, you know, I can play you and I can win and you can win.
And we need a way to update our scores. And so there are two goals. One is that I need a decentralized
way to compute someone's score, and Elo provides that. And then also, I might get better or worse; one of
us can improve over time. And so Elo allows for that change in scores. In the Chatbot Arena,
we have all the data in one place. And for the most part, we are freezing the model versions.
So the models don't change. And so there are other methods that are related to Elo. This is a
geeky tangent, but Elo has kind of a close connection to logistic regression, and you can
actually analytically solve for what would be the fixed point of an Elo score. I believe it's
called the Bradley-Terry model. So there are other ways to rate things, but they're not as cool. And so
the Elo scores have stuck. They've stuck so much that I'm starting to see clones of our benchmark
with Elo scores in other places too. Something you should know: if you're reading an Elo score,
if everyone's close to a thousand, don't trust the Elo scores, because the rankings don't mean much.
It takes time for these models to emerge. There are parameters that need to be tuned to make
sure that the Elo scores give some separation. Gotcha. So the Elo ratings are ideally suited
to an application like chess, where those kinds of hyperparameters have been figured out over time.
Over time, yeah. Yeah. And where people are changing their capabilities. And also, again,
when you really need this to be done in a decentralized fashion, where updates can be done, you know,
you and I can play and then we can recompute our scores without having to check with all the other
chess players in the world. That's cool. But yeah, very interesting, yeah, limitation there to cover.
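For reference, the standard Elo update Joey alludes to is only a few lines; the K-factor is the tuning knob he mentions, since too small a value leaves every model pinned near the 1,000 starting score. This is a generic sketch of the chess-style update, not a claim about the exact constants LMSYS uses.

```python
# Standard Elo update, as used in chess; K is the tunable step size.
def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B implied by the current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(r_a, r_b)
    new_a = r_a + k * (outcome - e_a)
    new_b = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return new_a, new_b

# Every model starts at 1000; arena votes are replayed one at a time.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], 1.0)
print(ratings)
```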
One thing that must be flattering for you, or I don't even know if you think about this, but
Llama 2, when they released it, it's obvious to me that they did now also release a version
that is fine-tuned to be great at chat. So they released the pre-trained model just like they did
with the original LLaMA. But then these kinds of things that you were doing with Vicuña,
this fine-tuning on human conversations, allowing it to be great at chat, that is something that,
it's like, it's super obvious that they did as a part of Llama 2. If they hadn't done that,
you'd really feel like they'd miss something. And so, I don't know, do you think about that,
or do you think this is something the space would have gone in this direction inevitably anyway,
or that you played like a key role? Yeah. So Llama, in fact the whole Meta participation in this
space, is, to me, really interesting. It's exciting. In fact, with this Llama 2 release,
Berkeley is going to start collaborating more closely with Meta on kind of the LLM development,
which I'm thrilled about. Yeah, they listed BAIR as one of the key partners. Yeah, so I'm super excited
about it. I think they probably would have headed in a chat direction just because the whole open
source community kind of moved that way. Now, did we start it? I don't know. Alpaca certainly did
the first such fine-tuning. If we hadn't done it, I'm sure others would have at some point
taken an open-source foundation model and tried to run the instruction fine-tuning that OpenAI
described. And so I think it would have happened. I'm thrilled to see what Meta is doing,
kind of their commitment to the open source, to developing these models in an ethical fashion,
writing about how they did that. Building this foundation model is expensive. I think they
said it was 25 million in just training data alone, and I can't even fathom the amount of engineering
and compute hours that went into that. Yeah. So yeah, I'm excited about where they're headed.
I was again, a little surprised that it wasn't performing as well, but there are these
caveats that I already listed. There are things we need to think through as we start to evaluate
these more in the future. And I think we're still actually at the beginning because these models,
in fact we're doing this work, are being integrated into language or into visual
reasoning tasks. We have work on program synthesis, all sorts of ways in which these models will be
used. And there they don't need to chat well. They just need to have a good understanding of
language, and we can, you know, adjust them to these specific behaviors that we need.
Nice. Yeah. Really cool. So awesome to be able to hear the Vicuña story from one of the
Vicuña developers. I didn't know any of that about kind of the genesis of the project. I just saw
the timelines and like, wow, this is happening really fast. Yeah. Yeah. And can you really quickly,
I mean, so I know these are names of South American... I guess, are they dromedaries? I don't know,
they're kind of this class of animals; they're all related to the llama. So when LLaMA came out
and then Stanford came out with Alpaca, is there any story around you deciding to have it be the
vicuña specifically? Were there some other options? I don't think so. We were coming up with names
of this general species of animals. I think alpaca is a nicer, in the fur world, it's a better fur.
I don't know this, but yeah, alpaca is an interesting story; vicuña, maybe a little
bit less. So I was kind of disappointed that we spelled it wrong in our first releases of it.
Getting the ñ on some keyboards is tough. But yeah, it's just vicuña, another one of these
kinds of llama animals. Also, I think people mispronounce it a lot. My favorite podcast
to listen to, myself, is Last Week in AI, and the hosts of that were calling it, how did they
even... I can't even remember. My students initially were saying "vikunia" when we were kind of
building it. So yeah, "vicuña" is the correct pronunciation.
Deploying machine learning models into production doesn't need to require hours of engineering
effort or complex homegrown solutions. In fact, data scientists may now not need engineering
help at all. With Modelbit, you deploy ML models into production with one line of code.
Simply call modelbit.deploy in your notebook, and Modelbit will deploy your model with all
its dependencies to production in as little as 10 seconds. Models can then be called as a REST
endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today
at modelbit.com. That's M-O-D-E-L-B-I-T dot com. And there's a funny... so, a week before
we recorded today, I asked listeners via my LinkedIn page as well as my Twitter
page whether they had questions for you. And one of our listeners, Wes McDermott, he provided
a YouTube link to a video from a film called Sunset Boulevard. I guess it's from around the '50s.
And there's a funny line. I'll try to make sure to remember to include it in the show notes
about like, he says this is, it's one of his favorite lines from one of his favorite movies.
And it's this thing about how you should have got her to buy the vicuña.
That's great. Yeah, maybe we've been striving for fashion with all these projects. Yeah,
Vicuña, I mean the whole llama series of names, comes from LLaMA.
Yeah. But we have a project, Gorilla, which we can talk about a little bit too.
Yeah, that's exactly what I wanted to cover next. So let's move to a completely different
part of the animal kingdom. Yeah. And talk about gorillas. So the context here is that
I've used, and probably a lot of our listeners have used, ChatGPT plugins, which are a really
cool way of interfacing with real time information in ways that are designed to be really smooth.
And so you can, if you're a ChatGPT Plus subscriber, you can go to the settings and you can
choose to be involved in the beta for these plugins. And then you get this like plug-in store.
And so you can choose, say, Mathematica from the plugin store. And then when you provide some kind of
math problem or equation-related problem to ChatGPT, instead of trying to use next-
token prediction, which is not a mathematically sound way to be making predictions, though it
does do it unbelievably, surprisingly well in many circumstances, it should use
Mathematica, which is a language designed for doing math and should be better at
solving that problem. So the ChatGPT plugin should recognize, oh, here's some math, Mathematica
would be better for that. And there are lots of different kinds of applications out there, like real-time
web search, or even very specific searches. Like, Kayak is one of the most popular plugins, so
for booking a car or hotel room or whatever, you could go into ChatGPT and say, I would like to
go to Los Angeles, can you book me a car, and it'll come back with some suggestions right
in there in the chat interface. So that's all really cool, but it's not open source.
And so, yeah, you and your colleagues at Berkeley, as well as, I guess, Microsoft Research,
have been working on an open source kind of variant of this.
Yeah, so the Gorilla project was named that because gorillas use tools, which was simple.
I've since backronymed the "LLA" to stand for large language APIs, which is a refinement of the original
intention of the name. It basically gets at this idea of how an LLM can interact with web services,
with technologies outside of itself to gain knowledge and to affect the world. And I think this is
where a lot of this technology will head: these AIs will make it so that we interface not
with the browser, but through text or through voice with an AI that interfaces with the web that
can find and use services to achieve tasks. And this is kind of the bigger vision of the Gorilla
project. Gorilla started, I guess, in the early, early winter. It started as a discussion,
even before Vicuña was taking off. And then, as Vicuña took off, we were certainly
pushing more on, you know, having better open-source models; we were doing more stuff.
I think with the Gorilla project today, it's become an open source effort to target a wide range
of different APIs. The way it works is you ask Gorilla, you know, what you want to do. In fact,
we have Gorilla for the terminal. So you can go in your terminal and you can install the Gorilla
command-line tools and say, I want to list all my files in order of, you know, size, followed
by date. And it'll tell you what commands to run to do that. And the way this works, and I think
this kind of the exciting part of Gorilla that may be a little different than what even OpenAI
is doing, is we combine retrieval-augmented generation, it's called RAG, with fine-tuning. And
the reason to do this is to make it so that the model can discover APIs. So we should be able to
add new APIs by providing documentation to the model. And then we fine-tune it to be able to read these APIs,
to be more effective at reading these docs, and then generating results in a manner that, you know,
is consistent with the request. Remarkably, fine-tuning is pretty critical. And one of the
surprising findings for me in this work is that fine-tuning on the APIs goes a very long way,
and that retrieval helps a little bit, a little bit more, but fine-tuning the model to understand
the APIs seems to be pretty critical. And this creates problems for the entire field. If the future
is to be fine-tuning models on your data, that means we're going to have a lot of very expensive
to run models to be able to do a lot of things. We should come back to, you know, what that means
for kind of research and for industry. But for the Gorilla project, it's meant that we've had to
find a lot of resources to host these models. We try to make them open to the world, you
can download the models yourself, but we also host them in the cloud. And our hope, as we
pushed the project forward, is to kind of further extend this idea of incorporating retrieval
with fine-tuning to be able to support not just calling a single API, but actually chaining
calls. I'd like to be able to say to my computer: I need to book flights for my upcoming conference, and it can go look at my
calendar and figure out, oh yeah, I think this conference is then. It can figure out when I might
want to be there, you know, whether there's a weekend discount, and come back and say, well, the cheapest
flights for your trip to VLDB would be the following. And so I go, yeah, I like those, can you book
those flights and find hotels for me as well? And that kind of interaction with the chatbot,
and then the chatbot taking action on the world, is I think where we're all headed with a lot of
this technology. Yeah, it's not that much of a stretch anymore to imagine.
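To make the retrieve-then-generate pattern Joey describes a little more concrete, here is a hedged sketch, not the actual Gorilla implementation. The retriever and llm objects are assumed callables; a Gorilla-style model would additionally be fine-tuned on many (docs, request, call) examples so it learns to read documentation formatted this way.

```python
# Sketch of a RAG-plus-fine-tuning flow for API calling (not the actual Gorilla code).
# Assumed callables: retriever(query, k) -> list of API-doc strings; llm(prompt) -> generated text.

def call_api_assistant(user_request: str, retriever, llm, k: int = 3) -> str:
    docs = retriever(user_request, k)  # pull the k most relevant API docs
    prompt = (
        "You write a single API call that satisfies the user's request.\n"
        "Relevant API documentation:\n"
        + "\n---\n".join(docs)
        + f"\n\nUser request: {user_request}\nAPI call:"
    )
    # A fine-tuned model has practiced reading prompts in exactly this shape;
    # a generic model sees the same retrieved docs but hasn't learned the format.
    return llm(prompt)
```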
Well, so let me back up a second. A year ago, I gave a TEDx talk where I went through the life
of this woman named Jeanne Calment. She was a French woman. She's the oldest person
ever to have lived, according to, like, documentation. And so she lived like 121 or 122 years.
And so I followed her life from when she was born in the 1870s to when she died in the 1990s.
And over that time span, everything was invented like pretty much. It's like from light bulbs
to transistors. I could go over the long list, but it's wild, the changes that happened in the 120
years that she lived. And so in my TEDx talk, I was like, well, now try to project forward
from today and think about, you know, a baby born today, given medical advances. The
average lifespan in the West doubled in Jeanne Calment's lifetime. Maybe we're not going to
be able to double, but it seems safe to say that some child born around today is going to live
at least as long as her. And so what kinds of changes will this child bear witness to? And
even in Jeanne Calment's lifetime, like, the change was so rapid that there's no way that Jeanne,
when she was a kid, would have been thinking about cell phones and the internet.
And my argument that I make in the TED talk is that because of AI in particular,
and because we have more human brains than ever before, that don't need to be doing physical
labor for the most part, there's all this human ingenuity combined with AI, things are going
faster than ever, and that's going to increase. It's going to increase, increase, increase, and increase.
And so it was recently the one year anniversary of me giving that talk. And so I reposted the talk,
and I said, when I was getting this talk, if you had asked me if we would ever in our lifetime,
have something with the capabilities of GPT-4, I would have said maybe.
Yeah. And now we have it. And so it isn't a big stretch of the imagination at all,
and in terms of, technically, it's kind of just a matter of putting pieces together:
there's no reason today why I couldn't have all of my email inbox history, all of my historical
calendar events be processed by some kind of LLM like this. And there's some cleverness,
like you're saying, like getting the API things right, but it could absolutely do everything
you just described: based on my history of the kinds of flights that I tend to pick,
you're probably going to want to get there two days before the conference, like you usually do,
because, you know, you mentioned why three years ago in an email, and it's got that right
on cue. Yeah. So yeah, I can't remember where we were in the conversation, but...
So I think this kind of rapid progress in AI, it's taken a lot of us, even those of us doing the
research by surprise, you know, a year ago, in fact a year ago when I was working on my company,
maybe we'll come back to, we were seeing a lot of people using Scikit-learn and doing
basic machine learning, and I was kind of sad actually because we had done all this really
cool deep learning stuff and built new systems to support it, but that was kind of, you know,
a lot of Scikit-learn and basic machine learning, and then, you know, fast forward to today,
and everyone's like, actually, I think I want to run large language models; deep learning is now
mainstream enough that, you know, the basic things that I would do, I should be doing with deep learning
today. So we've moved fast in the technology, we've moved fast in the adoption of the technology,
I've heard stories, like, people's grandparents are using LLMs to cheat on their book clubs.
That's awesome, but that's a big shift: a technology that was, you know,
kind of deep in research is now so mainstream that it's, you know,
in discussions around contract negotiations of unions, it's shaping how,
you know, how people cheat on their book clubs. This is a big shift in technology, and AI has,
you know, in the past been a source of hype and a source of failure. I think we're at a point
where the hype might have met reality, or reality might even be exceeding the hype that we had.
And that's exciting, it's a little bit scary too, what it means for research, what it means for
industry, it's harder and harder for me to know what tomorrow or what six months from now will
look like. Yeah, for sure. So yeah, Gorilla is another step in this, open-sourcing this capability
where you can support, like, an effectively unlimited number of different kinds of APIs with
this, right? And so, going back to kind of the nuts and bolts of this: retrieval-augmented
generation, RAG, what would happen if Gorilla only had that? So if Gorilla didn't have the fine-
tuning to understand the APIs, and it just had RAG, what would that look like? What would we be missing?
Yeah, so we did some studies of this. I was keenly interested in kind of what is the, you know,
the best we could achieve with RAG. Using the LLaMA or Vicuña models, we didn't get very far, so
we switched to Claude. Claude is actually remarkably good at long contexts, and we can stuff a lot
of documentation into that context. Yeah, and, did you see it? So just today at the time of
recording, Anthropic announced that they have expanded the context window on Claude from 9,000-
ish tokens to 100,000 tokens. Yeah, so yeah, this race for large contexts is exciting. Yeah, we can
talk about that. Yeah, let's talk about that next. Yeah, but just looking at Claude as our baseline,
this is a pretty, you know, it's a good model with long context support. I think we might have
had beta access to some of the earlier larger-context APIs. We stuck a lot of text in, and we got
pretty close to what Gorilla achieves with just fine-tuning. And so fine-tuning pushed Gorilla a long way.
Now there are some caveats, and, you know, I want to be careful, this is research. So we were looking
at a specific class of benchmarks that are focused on calling Hugging Face and PyTorch APIs,
because it has lots of documentation, lots of usage. And so it's a smaller set of APIs, so we
could perhaps be fine-tuning effectively to memorize large fractions of the APIs. Regardless,
that fine-tuning, again, even if it's memorizing APIs, is making a big difference. And that was
something that I was again surprised about: even with Claude, with good prompting, you know,
my students have become pretty good at prompt engineering, kind of hacking the inputs, it's
still not enough to get to, you know, where we were with fine-tuning. So yeah, it's an open discussion,
I think, where RAG, where this retrieval-augmented generation and fine-tuning will come together.
And I think, you know, another point for me, for the whole podcast, I guess, is one of the big
questions I think of 2024, and, or maybe the end of 2023, I can't see that far ahead, is kind of,
what is the balance of how we will use RAG, how we'll use essentially in-context learning,
stuffing in examples, relevant data, and how we'll mix that with fine-tuning. And, you know, there are
a lot of reasons from the systems perspective to push for something like RAG, for, you know,
using in-context learning. But there also seems to be strong evidence that fine-tuning can take
models a long way. Very cool. This episode is brought to you by Grafbase. Grafbase is the
easiest way to unify, extend, and cache all your data sources via a single GraphQL API,
deployed to the edge closest to your web and mobile users. Grafbase also makes it effortless
to turn OpenAPI or MongoDB sources into GraphQL APIs. Not only that, but the Grafbase command-line
interface lets you build locally, and when deployed, each Git branch automatically creates a
preview deployment API for easy testing and collaboration. That sure sounds great to me.
Check Grafbase out yourself by signing up for a free account at grafbase.com. That's g-r-a-f-b-a-s-e dot com.
I wanted to take a really brief pause before we move to the next topic to make sure that I've
defined some of the terms that we've used in this episode. So I very quickly mentioned how
BAIR is involved with Llama 2. And so I just want to quickly say that that stands
for Berkeley AI Research, I guess. Yes. And, yeah, the logo is a bear,
so it's just like, you know, Go Bears. Yeah. So yeah, it's cute in a number of different ways.
And that lab has been around forever. Yeah, the BAIR lab, I guess, when did it start?
I want to say like 2015, 2016. It's been around for a while, but the group of people in bear,
the team, before we had a very clever naming activity, had been working together for,
you know, since I did my PhD actually, and perhaps before that. So it's weirdly been a powerhouse
in AI. And today is very much a powerhouse in AI. I think this is maybe an embarrassing fact,
but as you rank the papers at NeurIPS, it's like Google Brain, maybe Microsoft, and then BAIR,
in terms of kind of the overall number of publications, not necessarily a good metric of research,
but maybe also impact. Like, you know, at OpenAI, a lot of the core technologies that are being
used, that they write about, were developed by students at BAIR. So it's a pretty outsized
impact for a research group. And I think we'll probably end up talking about this again as we talk
about your commercial ventures that you've started. But there's this thing where Berkeley is amazing, we
talked about this in Raluca's episode, number 701, a lot as well, where Berkeley is amazing at coming up
with big challenges to tackle over many years of research, putting the right researchers together,
tackling those, and coming up with, often, open-source standards that become the standard solution
to that problem globally. So yeah, we'll talk about that more in the context of
the entrepreneurial stuff that you've done, Joey. Before we get to the other
acronym, I guess, although this is maybe just an abbreviation, not an acronym, like BAIR is,
but you've also mentioned LMSYS, the Large Model Systems Organization. So yeah, how does that
fit in? Like, that's also a Berkeley organization, right? Yeah, so LMSYS was created as we were
launching Vicuña. We wanted an umbrella thing to support the research. We have some
collaborators at Stanford, UCSD, and other places, I think CMU as well, maybe, that were involved in
the kind of early creation of the research. And rather than tying that to a BAIR activity,
which, you know, many of the students are in BAIR as well, or the Sky Lab, which I also run,
you know, many of the students are involved in Sky, we wanted to create an entity that would sort of
embody that research agenda, that would, you know, be a little bit isolated from some of the
specific labs that we have at Berkeley. So we created the LMSYS org and put a lot of the work
under that banner. Very cool. Yeah, so with that behind us, we can now move on to something that I'm
sure LMSYS is on top of, which is these long context windows. So I just mentioned how
Claude, Anthropic's LLM, has now, you know, 10x'd its context window. And it says something
that they've gone commercial with that; it probably says something about the
reliability. But, you know, that's something that's difficult to assess, and you can probably
speak to that better than me. But, you know, we've had papers come out in recent weeks around
models that can supposedly handle, like, millions of tokens, or there have been papers
that are literally like, it doesn't matter, make it as long as you want, it's fine, put all
of the internet in. And it's like, well, obviously there's a degradation. Right, right.
Yeah, so the race for long context is, it's an important one. It fits into this question of like,
how do we balance in context learning and fine tuning? But I mean, fundamentally, it's about
how much preceding text will my model read before it goes to answer my question? Or if I'm writing a
really long essay, how long can my essay be? You know, if I'm trying to get 100,000 word essay
together for my class project, I want to use a model that can do that for me.
There are tricks. And in fact, you know, the race for long context has kind of a parallel effort,
which is how to use smaller contexts to address some of the challenges that, you know,
long context try to address. You know, as a human, I don't have a long memory.
So I take notes; taking notes is a way for me to capture the context of what I've read. So I don't
have to remember exactly what happened in the first half of the book. I go look at my notes.
And I can look at notes that are relevant to what I'm reading now. So this brings in kind of
retrieval: how do I go back to my notes? Nonetheless, having a long memory of what's happened in the
book, when I look at this word, being able to have remembered the past 100,000 words could make a
big difference in how I understand its meaning. Who is this person? What's this person's story?
So there's been a big effort to deal with long context. The challenges are many.
Computationally, as the context grows, the computation increases quadratically. It also increases the
amount of memory required quadratically. I have students working on various aspects of those
problems at Berkeley. And, you know, there are tricks that one can play. There's research on making
the attention over that long context sparse. So when I'm looking at this one word,
do I really need to look at all of the last 100,000 words? Or maybe I can look at,
you know, just a few of the important sections. And so there's work on sparsification.
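As a back-of-the-envelope illustration of the quadratic growth Joey is describing, and of why sparsification helps: full self-attention scores every token against every other token, while a sliding window caps each token at a fixed number of neighbours. The window size below is an arbitrary illustrative choice, not a claim about any particular model.

```python
# Dense attention stores roughly n x n score entries per head, so memory and compute
# grow quadratically with context length n; a sliding window of width w caps it at ~n x w.
from typing import Optional

def attention_entries(n_tokens: int, window: Optional[int] = None) -> int:
    if window is None:
        return n_tokens * n_tokens            # dense attention
    return n_tokens * min(window, n_tokens)   # sparse / sliding-window attention

for n in (2_000, 32_000, 100_000):
    print(f"{n:>7} tokens: dense={attention_entries(n):,}  windowed(4k)={attention_entries(n, 4_096):,}")
```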
In my group, we think a lot about systems, you know: if I'm going to use that full context,
can I split it over GPUs in interesting ways? Second big problem is you need training data
with long context because it's important to have that extra signal about how to use the full context.
There are tricks for extending your training data. There are tricks for changing the way we
embed the positions of text so that we can maybe get away with small amounts of training data and
try to extend it to longer contexts. And we've been doing some work with that. In fact,
the LMSYS blog has one of these recent tricks; it's something someone else actually developed that we sort of
implemented, tuned up, and fine-tuned our models to run against. So, you know, there are challenges
around the data. And then there's a third issue, which is, so you can read 100,000 words,
but do you remember what you read? And does it all, you know, do you look at it equally?
One of my other students, in collaboration with a group at Stanford, started looking at this: it
turns out models don't actually care about what happens in the middle of that context, just the beginning,
just the end. That's what, you know, state-of-the-art models seem to do. Which could be fine,
but if I'm trying to summarize everything that happened in that context and it goes, yeah,
the middle is not that important, that could be a problem. This also shows up if I'm doing retrieval;
you know, maybe the answer to the question that I ask is somewhere in the middle. And this happens
when I'm looking through lots of, you know, reading lots of notes, maybe the answer's in the middle of
my notes. Again, I'm lost. So fixing these problems is something that we're also thinking about
at Berkeley. Part of it could be training models to be more sensitive to the middle by putting the
answers to the questions more, you know, uniformly in the context. Yeah, it's an area of interest for us.
We've actually started a collaboration with Anthropic to start to build benchmarks to evaluate
these things more effectively. So, we can understand when you say 100,000 tokens at Anthropic,
does that really mean you're using all those tokens equally? And how do you make use of that
full context? Very cool. Yeah, tons of things to tackle here. I think one of the things,
as you were talking about catching things in the middle: I can't remember who was telling me this, it
might have been a data scientist on my team, it could have been Grant Beyleveld. He was describing
that one of the things that is done to evaluate whether these work is, you can have Easter eggs
like hidden at random points in the full context. And you can test very specifically on those.
Yeah, so that's what the benchmark that my student put together does: they put a JSON key and
value anywhere in the context. That's exactly what it was, having to find that. It's a good
micro-benchmark test: can you find this? If I tell you exactly the thing that you're looking for,
and you should just do direct attention, can you attend to the middle? And already, it's not as good as
one would like. Where I'm going with this bigger retrieval-augmented generation, this RAG and fine-
tuning story, one that I'm really deeply interested in, is: you've bought Pinecone, perhaps. You have
a big vector store. You've retrieved the top 1,000 relevant pieces of documentation for this
coding task, but it turns out the answer is somewhere in the middle and the other
documentation is not only wrong, it's perhaps distracting. It's similar APIs, but the wrong thing,
don't call that. How do you deal with that? And how good is the model at removing the things that
are wrong from its attention and attending to the right stuff, when that right stuff might be
anywhere? And so, in that actual benchmark from Stanford, they tested this and it's kind of
neat to see that these retrieval methods using vector inner products get pretty good recall at
1,000 documents. They most likely will cover the answer to your question, but the models don't get
better; they stay flat. Models don't improve their performance when doing RAG. And that's a big deal.
If you're going to pay for Pinecone to do this cool retrieval, you want to make sure that you get
the results from the LLM at the end as well. Yeah, all great points, very exciting space.
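Here is a hedged sketch of the "hide a key in the middle" probe described above, not the actual Stanford or Berkeley harness. The llm object is an assumed callable, and the filler text and key name are invented for illustration; sweeping the depth fraction from 0.0 to 1.0 shows whether accuracy dips in the middle.

```python
# Sketch of a needle-in-the-haystack probe: bury a JSON key/value at a chosen depth
# and check whether the model retrieves it. `llm` is an assumed callable: llm(prompt) -> text.
import json
import random

def needle_probe(llm, filler_sentences, depth_fraction: float = 0.5) -> bool:
    """Return True if the model recovers the hidden value at the given depth."""
    key, value = "secret_code", str(random.randint(10_000, 99_999))
    needle = json.dumps({key: value})
    position = int(len(filler_sentences) * depth_fraction)
    haystack = filler_sentences[:position] + [needle] + filler_sentences[position:]
    prompt = "\n".join(haystack) + f"\n\nWhat is the value of '{key}'? Answer with the value only."
    return value in llm(prompt)

# Usage idea: run needle_probe at depths 0.0, 0.25, 0.5, 0.75, 1.0 and plot accuracy per depth.
```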
One of the points that you mentioned, as you were talking about this, is, you said, like, if I was
a student that wanted to write a 100,000-word essay, these increasingly capable models could
maybe do that now, where until recently they couldn't. It just kind of got me thinking about this question.
So, you do a lot of teaching. You teach, and you devised, the upper-level data science course,
and there's over 1,000 students a term in that course. So, how do you, as an instructor,
have you changed the way that you evaluate students in the last year? Tough question.
So, sadly, we have not. It's something that's become top of mind at Berkeley: thinking about the impact of LLMs on how we teach and how students learn, the potential opportunities and the downsides. Currently, we haven't vastly changed our curriculum. I now teach both the Data 8 class, which is the intro to data science class and has almost 2,000 students a semester, and Data 100, which is our next-level class that brings in pandas, scikit-learn, a little bit of gradient descent, all the more advanced stuff. Again, nearly 2,000 students a semester. So, big classes. One of the things that we've found in
teaching big classes is that the interaction with TAs, providing guidance, is critical to learning.
And students that get that support, you know, I'm writing something and someone explains, oh yeah, you've got a bug there, think that one through again, it can make a big difference. Having that feedback in the thought process itself shapes our learning abilities. And there's actually an effort now at Berkeley to bring Vicuña and some of the commercial
models in to see if they can be used to help guide students as they're doing exercises.
So, that's the plus side. There's a chance that LLMs can help provide the additional,
you know, immediate feedback as you're doing something like writing a program,
or maybe solving a math problem, or maybe someday even writing your English paper,
that will guide you and allow you to learn more effectively. The flip side of that is, of course,
you know, I have a 100,000-word essay, and I would like Claude to please write that for me.
And so, there are efforts to do cheat detection, which we've had for a long time, and to try to extend those to pick up these models; it's something that we need to think about. Personally, I'm more interested in encouraging students to figure out how to use
these to augment their own abilities. So, use it to brainstorm. My grad students, their writing
has improved significantly in the past half year. And partly it's because, and they've told me this, they've started iterating with ChatGPT, saying, you know, provide a critical review of my introduction for this paper, and then they adjust what they wrote based on that feedback, and so it helps them improve their writing. So, it is,
I think, if used correctly, a learning tool. I'm maybe a little less worried about the destructive kind of implications for learning and more focused on what we can do to use it to make learning
easier and better. Yeah, you and I see this exactly the same way. For me personally, and I'm sure it's the same with you, our space is extremely fast moving, and so I'm constantly needing to be learning. And these tools: I love using the GPT-4 ChatGPT interface. I find it super convenient for copying and pasting code that I have problems with. It obviously has limitations: if I want to be using some cutting-edge Hugging Face library that's only come out in the last week, that's not going to be covered right now, at least without any plugins, by GPT-4. But when I run into a scikit-learn error or pandas error, most of that API is static year over year, and so it's amazingly accurate. And the way that it talks me through it, it'll say such encouraging things, like, "I can see why you did it that way, it really makes a lot of sense that you would do it that way, but it's going to throw an error because of this. And that's a little tricky, but just keep going." And so there's actually a bigger point there, which I think I've talked about on air before, which is that interacting with ChatGPT, and maybe interacting with Vicuña as well, which admittedly I haven't interacted with directly very much myself (I have used the chat on your website a little bit, just to go, okay, it works, cool), this friendliness actually, I think, makes me nicer in the things that I write and the things that I say, because I'm getting that kind of positive reinforcement. But yeah, it's been a really amazing tool for me to be able to learn from coding mistakes, maybe writing mistakes that I make. I think that education needs
to change, to embrace this in the same way that it might have with the calculator or the computer. And it's not surprising to me to hear that from somebody who is surrounded by so many clever people at one of the top universities in the world: the students and even the postdocs are people who are going to be able to take a tool like this and use it to augment themselves and be better. I think the people that worry the most about these tools are thinking about those who aren't in that very top drawer, because a lot of our education system is based around really dull and often not useful memorization and regurgitation. And in those settings, unless the person has paper and pencil and proctored exams, yeah, I don't know, it's kind of obvious to me that education needs to change and go in the kind of direction that you're describing.
Anyway, that's probably enough on that topic. So, Joey, a number of times through this episode,
you've, we've, alluded to this idea of open source versus closed source and the pros and cons of each. It seems clear from initiatives like Vicuña and Gorilla that you are a proponent
of open source. So what are the kinds of pros and cons of these two different approaches?
Yeah, it's a great question, and it's one that we've been grappling with at Berkeley. You know, all the labs that we've built have been around doing great research and making that research accessible, not just in papers but in open source projects, from Apache Spark, Clipper, and Ray to Vicuña and the whole LMSYS effort. Openness is critical to advancing the field,
to advancing research, but there's a problem. And that problem is that these models are
expensive. They're expensive to train, certainly building that foundation model is expensive.
Even this instruct fine-tuning, if you do it properly, you do RLHF, this, you know, reinforcement learning with human feedback, and that requires data annotation throughout the training process. You know, the fact that GPT-4 is so nice is probably because they had experts write how to respond to tough questions. So there's that emphasis on good data. Then there's the need, once you've trained this model, to serve the model, requiring, let's just say, a few A100s, which is pretty expensive today. And to serve a large model, or an ensemble of large models, which is kind of the alleged GPT-4 setup at 175 billion parameters, that's incredibly expensive. So there's a lot of costs associated
with these. So I want the open source community to succeed, but if I had to bet, the analogy that I
would draw for where these technologies will go is search. Take web search: there are a few major search engines, and there are regionalized search engines as well. Web search, much like these models, requires a very large amount of data and a large amount of compute,
both to build the data and then to maintain it, and then to serve it. It takes engineering skills,
it takes a lot of safety systems, making one of these technologies at its peak, you know,
one of the best in the world is expensive. And so I think we will see something that more resembles
search. Just like with search, you use open source search probably all the time in the tools on your computer, and if you're an enterprise, you might be using one of these open source search platforms that's hosted in the cloud. So there are large, major search engines that are closed and will probably continue to be closed, just like OpenAI and Anthropic's Claude and the other big LLM companies, and then there are smaller open source search efforts. What I hope to see is
that the research community will continue to advance the open source technology. So they're good
enough that, you know, if I'm trying to teach students, there might be a specialized model that's
good at giving feedback on Python data science exercises. We might still host it. We might actually
pay someone else to host it, but that model being something that we can control and innovate on will
be critical. I think when I look at something like Gorilla, I think it'll actually be an interesting
mix where you might ask one of these major commercial technologies to break down the task of booking
my flight into important steps. And then you might call out to more specialized variations of Gorilla
for any one of those steps, you know, to do that more narrow task. Yeah, it would be wonderful if the future were one where the GPTs of the world were purely open source, where the research community develops them and anyone has access to them. But I do think that just the cost of running them is so high that these large models will probably, more and more, be dominated by major organizations pushing them. That was probably the best explanation that I have heard
of why closed source will continue to dominate, at least at the very cutting edge. That analogy to search, I hadn't heard somebody make that before, but the way that you described it: this expensive human-annotated data, huge GPU clusters, engineering ingenuity, and then lots of these safety checks in order for Google search to be able to operate effectively. You need all of those things, and it's hugely expensive. And so, yeah, I think you're absolutely right that we're headed to a world where a relatively small number of big tech firms are able to make these hundreds-of-millions-of-dollars investments continuously. Like, it's not as if we get to GPT-4 and we're like, okay, it's done, we've got this; it's this constant, very expensive race to stay at the cutting edge. It'll be interesting to see if that holds, because, I mean,
I guess I was a lot younger then, so I don't know how much I was thinking about it critically or competitively, but when different search options were emerging, like when I was in high school or elementary school, there were things like AltaVista that I guess I would have used. But I don't remember the stakes being so high, or there being so many competitors, like there are right now. Right. Yeah, that's a good point. So there are places where
this analogy breaks. The amount of energy, the kind of realization: the impact came sooner here. I mean, search was pretty exciting when it was taking off, but in terms of capturing the imagination of the world, this technology has done that faster. I don't know if that favors commercial entities or not. Yeah, I don't know. You know, here are things that might break my
prediction. And I will say I would love it if I was wrong. It would be great if these things
become vastly cheaper to run, to maintain, to develop. Here's what could break it: one of the things that works well in open source is that if I build something, you can make it better. So I can release it, you can make it better, I can take your thing and make it better. With Vicuña, there was a little bit of that: Facebook released LLaMA, we made it better. But it's not clear to me that you can keep fine-tuning the fine-tuned model and get a better and better model. And in fact, this is one of the big questions I have for my students, even in Gorilla: can I fine-tune on one API and then fine-tune on another? What's the cost or advantage of doing that? So far it doesn't help; in fact, it hurts to try to fine-tune on too many things. There's this problem of catastrophic forgetting. So as I keep fine-tuning on new data, I probably need to go back and fine-tune on old data as well, which really means I need to do more training. In fact, I don't know what OpenAI and others are doing today as they get tons of new data, whether they're restarting from earlier checkpoints or starting from scratch. So basically, making this more cumulative, where the open source community can work together, is something we need, and it's hard to do.
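One common mitigation for that forgetting problem is simple rehearsal: mix a sample of the earlier fine-tuning data back into each new round. The sketch below is schematic, with made-up example data, and is not how Vicuña or Gorilla were actually trained:

```python
import random

def build_finetune_mix(new_examples, old_examples, replay_fraction=0.3, seed=0):
    """Blend a sample of earlier training data into the new fine-tuning set
    so the model keeps seeing old tasks while it learns new ones."""
    random.seed(seed)
    n_replay = int(len(new_examples) * replay_fraction)
    replayed = random.sample(old_examples, min(n_replay, len(old_examples)))
    mixed = new_examples + replayed
    random.shuffle(mixed)
    return mixed

# Round 1 fine-tuned on API set A; round 2 trains on API set B plus a replay of A.
api_a = [{"prompt": f"call API A, case {i}", "completion": "..."} for i in range(1000)]
api_b = [{"prompt": f"call API B, case {i}", "completion": "..."} for i in range(1000)]
round_two = build_finetune_mix(api_b, api_a, replay_fraction=0.3)
print(len(round_two), "examples in the second fine-tuning round")
```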
Just sharing GPUs is probably not enough. Making the model smaller is also something that we're excited about, but it's hard to do: you're trying to compress human knowledge, and at some point that gets hard. And if you can't compress it, it takes a lot of resources to use it, which makes it harder and harder for the open source community to do it without
capital investment. Yeah, I think another project area that we could think about is something like the Unix operating system, which is now the foundation for, like, every server in the world and all Mac computers. And so it's interesting to think about how that came about, but it doesn't require all these things that you mentioned around search. I think search is more like these conversational models, where it's just this constant updating of data. Like, today with GPT-4 or other cutting-edge commercial models, we don't have, at least embedded within the model, ways of continually folding in new data, because of these accumulation problems that you're describing, and also safety things, like being sure that you're still safe even though you've added in some new information that just came out an hour ago. Those kinds of problems, I think, make it a lot closer to the search problem. Whereas with the Unix operating system, it is so easy to accumulate year after year: okay, you have this base code, and humans can look at the code and understand a little piece of it and make it better.
Yeah, it's a tough future. Maybe one more thought on the open source space: you know, we can still make progress. And in fact, I plan to continue to make progress in the open source space even knowing that the best models in the world probably won't be mine. But what I can do, something I'm trying to do right now, is explore what the trade-offs are between, you know, RAG and fine-tuning, with the hope that maybe the insights we have at smaller scales will translate. And I think actually if you look at OpenAI's success, that's one thing they nailed. They took this hypothesis that if you scale machine learning, you scale the data, you scale the model complexity, you'll get better results. But they didn't do it the way other companies like Google did. They didn't just, you know, dial it to 11 right away. They started small and they got signal. We can continue to stay small and get signal and understand how these different things
mix, and maybe influence where these big technology giants will go. And you know, just like search, there will be smaller entities in other countries, in other regions of the world, that serve smaller markets, that serve specialized languages. And I think we can have impact there as well. And then finally, the open source models we build might be good enough for a lot of basic tasks. Say you want to read through all your emails and figure out what the conversation was about; maybe you can get away with a simpler model for that task. There's still a cost associated with it, and it still might be a commercial activity that does this, but the models themselves we can continue to develop, and hopefully provide insights that, again, will shape the big players as well. Yeah, I am so grateful for the work that you personally do
on Vicuña, as well as everyone else doing this amazing open source work. My business depends on it. I know there must be many thousands of other businesses out there that do as well. For us, being able to take an open-source model as a starting point is key. Vicuña actually isn't commercially licensable because it's based on LLaMA 1, so for a lot of the specific tasks in our platform we had been using Dolly 2.0 as our starting-point open-source LLM; it had a commercial-use license, and Databricks provided it. But now we're switching over to Llama 2 as our starting point. And it's super, super inexpensive to fine-tune it to our tasks, because with a thousand examples of some specific task, at least in our kind of application, we are able to fine-tune using a parameter-efficient approach like LoRA, which I can't remember if I already talked about in this episode. Yeah, we haven't talked about LoRA.
It's an interesting method, so yeah, it'd be fun to talk more about it here. Yeah, so really fast: I had a whole episode on it, episode number 674. And the reason I was wondering whether I'd talked about it is that I did mention it in the episode I recorded immediately before starting this conversation with you, but you and I have not talked about it on air yet today. So I can really briefly introduce the value of this at a high level, which is that it allows me to take something like Llama 2, and we're literally doing this right now as a team: taking Llama 2 and using PEFT LoRA to train it for typically hundreds of dollars' worth of compute, mostly using servers that I built by hand a couple of years ago. Those are still good enough: if you want to take a 7-billion-parameter model or a 13-billion-parameter model, I can still run it on GPUs that I bought years ago, on a server that I built myself. It's efficient enough that you can do that, and so there's no extra cost to me beyond electricity.
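For readers who want to see roughly what that parameter-efficient setup looks like, here is a hedged sketch using Hugging Face's transformers and peft libraries. The model name, target modules, and hyperparameters are illustrative choices, not the configuration used on the host's platform, and exact APIs can shift between library versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_name = "meta-llama/Llama-2-7b-hf"   # illustrative; Llama 2 requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name)

# LoRA adds small low-rank update matrices to selected layers; only these are trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by architecture
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the base weights

# From here you train as usual (e.g. with transformers.Trainer) on the roughly 1,000
# task-specific examples, then save just the small adapter:
# model.save_pretrained("adapters/my-task-adapter")
```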
And, you know, we have training examples from our platform. That's the key. I guess that's maybe the key point of this whole conversation: it's about having great-quality data. So we're able to create these very narrow models that do very specific tasks. And it might be the case that we're able to train it to do a few of the generative tasks that we need on our platform. But another really cool thing you can do is switch out just those LoRA weights. And you can do that on the fly, in real time, with your users, so that you're not hogging lots of infrastructure: you could have one GPU running a 13-billion-parameter Llama 2 model with just this small number of LoRA weights, which, depending on exactly what hyperparameters you pick, could be something like half a percent of all of your model parameters, and then you can swap out those LoRA weights instantly, in real time, for the different generative tasks that your platform has. And yeah,
so anyway, I've talked a lot. So LoRA is pretty exciting. My students so far have not found LoRA to be as good, and that was one of the downsides. But maybe to highlight something you said in there: the training cost in fine-tuning, even if you don't use LoRA, isn't so bad. Where you get burned using the fine-tuning approach is if you have a separate model for every single user and then you go to serve it. With these models, depending on what GPU you're using, you may fit one or two 7-billion-parameter models in your GPU, and then you start to run out of GPU memory for lots of models. LoRA changes that narrative, because I have a base model and then some low-rank LoRA update that I can apply. I actually haven't seen good results on not materializing that LoRA update, but it sounds like you guys have found ways to do that, to be able to serve and switch out the LoRA update quickly. So yeah, that makes a big difference. It allows you to have lots of fine-tuned models without actually having lots of fine-tuned models: you have one base model and an additive component that you can swap quickly. And that additive component is low-rank, so it's small, which means you can fit lots of users' fine-tuned versions of that component in a single GPU. And that, again, I can't stress enough, tends to be the bigger cost in life: not the training of these things, but the using of them. Modulo the foundation training, which is still very expensive, fine-tuning is often cheap; it's the use that's expensive, and LoRA can make a difference there. That's neat. Yeah, it's really neat.
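The adapter-swapping pattern described above looks roughly like the following with peft. The adapter paths and task names are hypothetical, and the load_adapter and set_adapter calls reflect recent peft versions, so treat this as a sketch rather than a drop-in recipe:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")  # illustrative

# Each adapter is only a small set of low-rank weights, so many tasks can share
# the single copy of the 13B base weights loaded on one GPU.
model = PeftModel.from_pretrained(base, "adapters/summarize-profile", adapter_name="summarize")
model.load_adapter("adapters/draft-outreach-email", adapter_name="outreach")
model.load_adapter("adapters/classify-job-req", adapter_name="classify")

def generate_for_task(task_name, model_inputs):
    """Switch to the task's adapter, then generate; the base weights never move."""
    model.set_adapter(task_name)
    return model.generate(**model_inputs)
```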
All right, so this has been an absolutely incredible discussion so far, Joey, around all of the academic things you've done, but that just scratches the surface of what you've done in your career. So starting with what you're doing right now: in addition to all the academic stuff that you do at Berkeley, all the research, all the teaching, you're co-founder and vice president of product for a startup called Aqueduct. Aqueduct has developed an MLOps framework; you kind of alluded to this earlier when you talked about scikit-learn and the kind of learning people do today. This MLOps framework allows you to define and deploy machine learning and LLM workloads on any cloud infrastructure. So yeah, fill us in on why you founded Aqueduct, what the gap was that you saw, and how it simplifies prediction infrastructure for data science teams. Absolutely, yeah. So Aqueduct is a fun
story. You know, I was going up for tenure a few years back, and some of my really strong students working on serverless computing were graduating. We had just finished some work on a system, Clipper, and follow-on work building tools for real-time prediction serving. Those technologies actually influenced everything from TensorFlow Serving to, like, the current Hugging Face tiered architectures. So we were excited to take this technology and bring it to the world. I was excited to take a break from teaching and go out and do a start-up for a little while, take a leave of absence. And so we launched a company. I was VP of product. My CEO, Vikram, was my former student.
So I now reported to my student. So we launched the company to bring this serverless technology,
this prediction serving technology kind of fused together with the hypothesis that, you know,
data scientists, the thousands of students I was teaching would become the future of kind of
engineering of solving problems. And they would be great at machine learning, great at data,
great at kind of thinking through the connection between data, machine learning, and business.
But not so great at infrastructure, not so great at, you know, running cloud tools,
running machines. And we knew that because we don't teach them how to run cloud infrastructure
in our current data science program. And many students come out, you know, expecting a Jupyter
notebook to exist in the world and expecting data tools to sort of just connect to it magically.
Something we should probably fix about the program. But yeah, that was our read on where we see people headed. And in some sense, serverless computing, all this technology that's been developed to make engineering simpler, really would make data science a lot simpler. And so
we launched a company to bring those, those ideas, those technologies we've been developing,
partly in open source, into a commercial service where any data scientist or any person working with data and machine learning could go to that service, connect, launch models; we would manage the machines for them, connect those models to various data sources, and manage all the kind of data plumbing, to make it really easy to take ideas, develop models, and then put those in production to solve problems. We launched it, and I'm trying to remember when exactly we launched it; I want to say the end of 2020, like November 2020, at the very beginning, and then the company really took off in 2021. We talked to a lot
of data scientists around that point in time. And the first kind of disappointing discovery I had in my research career is that people weren't building real-time prediction serving systems. They were still plumbing basic data through scikit-learn pipelines, which in retrospect made a lot of sense. They realized that without, you know, having a company built around machine learning, it's better to take your ideas, your predictions, and dump them back in a data warehouse where anyone can consume them, where the visualization tools already speak. And so they had built pipelines that bring machine learning in, in, you know, somewhat cumbersome ways using Airflow, and dump it back into the data warehouse. So the early incarnation of our product was really around trying to simplify that process of integrating with the data technology people are using and with the compute infrastructure, to support interesting pipelines, monitor them, provide visibility, to, you know, make it easier to be an effective data scientist. And then we started getting some excitement around the project, we got some early users. And then LLMs took off. And it forced us to sort of reassess where we were. A lot of people were like, well, I do have my pipeline working. It's a total mess, but my team has asked me to look at how LLMs will change everything. Or, you know, my interest now, if I had spare cycles in my life, would be not to make what I do better or simpler or faster, but to bring LLM technology in to change the company. And I think actually that's wise.
A lot of people are thinking about what is really going to be a monumental change in technology and what it means for their business, and the right people to do that, the people at the front line, would be the data science team, the people that we were talking to all along. And so we also went back and started asking, what is challenging about LLMs? How does that change the narrative?
Good news. In some sense, it brings us back to what we wanted to do in the first place. So
people want to serve lots of models. Now these models are expensive again. They need GPUs,
they need interesting resources. They need cloud infrastructure. It's hard to find because
everyone's already bought all the A100s. Can I make this run on a different kind of hardware?
So it was good news. It was a reverse pivot, which is a weird experience in startup land, where we realized that what we were doing in the very beginning was kind of the right way to go. So we brought some of the technology we had been working on back into the core product and started focusing more not on the open source, but on our hosted option, and we have a release of the updated version coming out soon to make it easy again to do this stuff with LLM technology. What LLMs do that's kind of fun is that I have a lot more to manage: I have prompts, I have resources, I have spend that I need to keep an eye on. There are different kinds of models for different sorts of tasks. There's fine-tuning. So a lot of new challenges, things to be done in the data science and machine learning life cycle. And the other thing that's kind of fun is the people. It's no longer just data scientists and engineers; it's even people who are like, I don't really do that stuff, but I've been using ChatGPT for my book club, and I think I'd like to do something interesting for my business. So it's a bigger market of people, with different skills and expectations, which, as a product person, makes the problem a little bit harder. But I think ultimately it's an opportunity
for us and kind of a pivot to our roots, which is something I was excited about.
Yeah, so I think I can see what you're doing with Aqueduct. Let me try to understand it through maybe a use case; you can kind of walk me through a user story. So I deploy LLMs as part of my business. And it sounds like with Aqueduct, I can upload my model weights, I guess, and then I can configure it so that I'm keeping track of my spend. And I guess I don't need to necessarily be maintaining my model on my own servers. You would handle conversations: my users come into my platform, they type, you know, we're an HR tech platform, so it's like, "find me data scientists in New York." My user types that in, that input gets sent to Aqueduct, Aqueduct runs my model and brings the result, the generative result, back to my user. Is that the kind of thing?
Kind of. So what you described, we can do. But a lot of our users, and in fact it's something I would even tell people if you're first experimenting: use GPT-4 or Claude. So in a lot of cases, people are just calling out to external models to begin with. What we're trying to do is make it easy to put those pieces together. I don't love this analogy, but, you know, people are familiar with LangChain; we're trying to make a more simplified, hosted LangChain. And in fact, the use case we're focused on right now is more in the RAG context. I think that's where a lot of people will probably start. I have a question, maybe it's a support ticket question. I, as a developer, want to take that support ticket question, maybe look up related questions, ask an LLM, you know, who should handle this question based on those related questions, then maybe direct the person, and maybe do all of that in real time. To do that, I need to maintain a database; perhaps I'm using Pinecone. So I need to put my customer conversations in Pinecone. Shockingly, Pinecone doesn't do its own vectorization or its own maintenance, so we need a system that sits there, watches conversations, and keeps that index up to date. That's something we do. And then, as this conversation runs, you need compute to go off and hit, say, GPT-4, pull whichever things I need from Pinecone, put that back in, and make another query to GPT-4. So there's a small amount of compute that needs to exist and be managed somewhere. What we make easy is this: in Python, on your machine, you write that workflow, and then when you deploy that workflow with us, we can run all the steps for you in the cloud. You can turn your laptop off, and we connect to the various technologies. And then there's also all the tooling around that. Like, what's the spend? We can break down the spend for each of the steps in your workflow, just to make it easier to experiment and show value with these technologies, and then, as you start to deploy, to use the things that we have already been building to make that kind of management process easier. Nice, I got you.
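To picture the support-ticket workflow Joey sketches, here is a minimal self-contained RAG loop. The embed and call_llm functions are hypothetical stand-ins (a real deployment would use an embedding model, a managed vector store such as Pinecone, and a hosted LLM like GPT-4 or Claude), and none of this is Aqueduct's actual API:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; a real system would call an embedding model or API."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def call_llm(prompt: str) -> str:
    """Hypothetical hosted-LLM call, standing in for GPT-4 or Claude."""
    return "Route to: billing team"

# Step 1: keep an index of past support conversations up to date.
past_tickets = ["How do I update my credit card?",
                "My API key stopped working",
                "The invoice looks wrong"]
index = np.stack([embed(t) for t in past_tickets])

def handle_ticket(question: str, k: int = 2) -> str:
    # Step 2: retrieve the k most similar past conversations by inner product.
    scores = index @ embed(question)
    related = [past_tickets[i] for i in np.argsort(-scores)[:k]]
    # Step 3: ask the LLM who should handle it, given the related tickets.
    prompt = ("Related past tickets:\n- " + "\n- ".join(related) +
              f"\n\nNew ticket: {question}\nWhich team should handle this?")
    return call_llm(prompt)

print(handle_ticket("I was charged twice this month"))
```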
And so yeah, then in terms of the name: I guess with Aqueduct, the idea here is that you have liquid flowing very easily between different systems? Like, I don't know. Yes, so let's talk about the name. As I've said multiple times, names have not been my strong suit in life. I could even reflect back on my early startups. So we launched a company called GraphLab based on my thesis work. Oh yeah, that one changed names so many times while I watched. Yeah, definitely. And then Turi. And the story of each is interesting; we can come back to that. With Aqueduct, we actually started as Spiral Labs, because we wanted something that was kind of interesting and researchy, thinking of how things converge to an idea. Aqueduct fit what we were doing in the beginning more closely: bringing the high-value stuff from the lakes, the data lakes of the world, to where it has impact. It's an essential piece of infrastructure, and aqueducts were about simplicity and elegance. So it was a very heady name that unfortunately is not easy to spell. We spell it correctly, but most people do not; it's the "que" that hurts our SEO. I think in the long run we'll probably change the name to something more in the LLM space, thinking of the technology people are using. But yeah, Aqueduct had a meaningful name, which is still appropriate today, but it's also probably a little bit harder to find, and the connection is a little bit less obvious. Cool. Let's actually talk about Dato and GraphLab. So Aqueduct, I think, based on, you know, what we could find online, is your
second startup. That's correct. Yep. So previously you co-founded Turi, and this is something that I am familiar with, because I remember I was using GraphLab. Although, I mean, this was 10 years ago, so I don't now remember exactly what I was doing with GraphLab, but I remember that I was using it. I remember the name changed to Dato, and the main thing that came to my mind was, wow, these guys must be doing so well to be able to buy a four-letter domain name like Dato. Yeah. That's wild. But I don't know, I personally thought it was kind of a silly-sounding name. Yeah. So let's talk about the story of GraphLab. GraphLab was,
you know, my first real company, the first company I launched. I launched it after finishing my PhD on a project called GraphLab. GraphLab itself was actually sort of a joke that was launched out of, you know, an insightful workshop. GraphLab was a system for graphs. It was designed for graph computation, designed to support the trending model of the time, which was graphical models. Many of your viewers may not know of graphical models, because they've since been replaced by neural networks. When I was doing my PhD, we made fun of neural networks: neural networks, that's silly technology, it's not principled, it's not statistically sound; graphical models are a much more principled way of approaching problems. And then they came along with this branding, deep learning. They just rebranded neural networks, those silly guys. They were right. We can come back to how they got that right. But GraphLab was designed for graphs, not neural networks.
It was designed for graphical models. It was actually a very successful open source project from my group at CMU. Carnegie Mellon University is not on the West Coast; it doesn't generally enjoy as much of the kind of publicity that something like Berkeley or Stanford gets. Yet GraphLab became pretty popular. It was popular because it helped solve problems in matrix factorization and content recommendation, which were trending at that point in time. Maybe it was the graph analog to something like Apache Spark, something that I would later go on to connect; I would connect the two. But yeah, we launched a company, GraphLab.
First thing you should know for those building companies, don't name your company after your
first product. That was silly. One of the first discoveries we had with GraphLab is that a lot of people were excited about content recommendation, but: what's the graph? "I have a lot of user clicks, I don't understand." Well, it's obvious: you've got a connection between users and products via clicks, a bipartite graph, and you can do clever tricks because you know it's a bipartite graph. And they're like, oh, cool. Do you guys do something more around data? We've got a lot of data we need to process. So one of the big observations at that point in time was that while machine learning was exciting, it was the forefront of what people wanted to do, the aspiration, the reality was that data was a mess. And products that made sense of data, that made it easy for me as a data scientist to work, maybe on my laptop, on large data sets, without having to set up these silly Hadoop or Spark clusters, were
actually more appealing. And the core technology in GraphLab, connected to another project in our group, PowerGraph, had this way of partitioning problems so they could be distributed. That same innovation lets you optimize things for out-of-core computation, so that you could operate on your laptop. And so we very quickly expanded the scope of what we were doing to a pandas-like technology that works on, you know, terabyte files, if you have that much disk space, on the eight-gigs-of-RAM machine that you had at that point in time. And it ran fast enough that you could actually do pretty interesting analytics without a big Spark cluster.
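The out-of-core idea, operating on files bigger than memory by streaming them in chunks, can be illustrated with plain pandas. This is only a stand-in for the kind of on-disk engine the GraphLab and Dato products built, and the file name and columns are made up:

```python
import pandas as pd

# Stream a file that may be far larger than RAM in fixed-size chunks instead of
# loading it all at once; each chunk is an ordinary in-memory DataFrame.
totals = {}
for chunk in pd.read_csv("clicks.csv", usecols=["user_id", "product_id"],
                         chunksize=1_000_000):
    counts = chunk.groupby("user_id").size()
    for user, n in counts.items():
        totals[user] = totals.get(user, 0) + n

print("users seen:", len(totals))
```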
You could do graph stuff, but you could also do basic data visualization, basic streaming
online machine learning. Things that were actually pretty useful. And so we actually started
getting a fair amount of users, and the GraphLab name was confusing. So, Dato: "dato" is Portuguese, my advisor Carlos is Portuguese, so he came up with a name that would reflect a single piece of data. It seemed reasonable, and, you know, GraphLab was confusing people, so we changed it. Turns out there's a data backup company in the Midwest that also has that name, but with two T's, pronounced differently. But it was close enough that it created some conflicts, and so we ultimately decided to change the name again, to Turi, which is actually a pretty neat name in the end. We had some help coming up with it, because I'm bad at coming up with names. As are most academics, I guess. Turi embodied this idea of, you know, a general platform for doing interesting computation on data at any scale, small and large, that supports machine learning and visualization. It was being used by a company, Apple, and they were pretty excited about it and working closely with us on the development of new features. And in the end, they were pretty excited and have a lot of money, and so they were able to acquire our company. That same technology went on to be parts of the Apple Watch, parts of the iPhone; it's all over Core ML. So, a lot of really neat impact from the team
and the project. Absolutely, yeah, congrats on it. $200 million acquisition,
fantastic success there. And, yeah, I guess with the Turi name, is that a reference to Alan Turing? Yeah, perfect.
Nice. And so, very cool to get a sense of what you've been doing commercially, the amazing
success you've had there as well. What are you really excited about right now, Joey,
that you're tackling? You know, with everything you're doing, it probably blends academia and eventually the commercial world as well. Yeah, it's a good question. So, as I said already,
thinking a lot about this kind of connection between fine tuning, retrieval, a little bit more
on context, I'm excited about kind of visual language understanding, and I think, you know,
computer vision is about to get radically transformed by these visual language models,
the ability to, you know, take my entire photo album from my family vacation and then have it
read the album, look at the album, convert it to words, tell a story, maybe put, you know,
Ken Burns effects and a narrative around it, provide highlights. I think we're about to see kind of big changes in vision. My research group has been working on that. Autonomous
driving is another area; it kind of went quiet as attention shifted to LLMs and there was a renewed focus on electric vehicles, but I think we're going to see it emerge again, and I'm
kind of excited to see what these advances will do for autonomous driving. Probably the hardest
part of autonomous driving is actually understanding what other people will do, kind of the signals
around you. It's this prediction problem, and having kind of general foundation models might
change that. We could train on, you know, endless amounts of webcam data, or dash cam data, or, you know, street view data. So lots of opportunities to change autonomous
driving, so that'll be exciting. And then my group has been thinking a lot about how to make
neural networks treat data more dynamically. Right now, LLMs are kind of neat: depending on your question, you'll get different kinds of answers, a different amount of computation, but still, fundamentally, each token is treated the same. It runs through the 175 billion parameters to predict, you know, that the next word is "that". Thinking more dynamically: if the next word is obviously "that", you don't really need to run 175 billion parameters through hundreds of GPUs to answer that question. And so being more intelligent about how we switch between models, between levels of complexity, to support interesting conversations, to use advanced models when they're needed and not when they're not, is a big opportunity. There are a lot of challenges in making that opportunity real,
from, you know, how we use key value caches and kind of the underlying mechanics of these models.
So yeah, I'm excited to see where the world is headed, looking in different directions for these kinds of models and at new applications as well.
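One simple way to act on "use the advanced model only when it's needed" is a cascade: try a cheap model first and escalate when its own confidence is low. The models, the confidence signal, and the threshold below are placeholders, not the approach Joey's group is building:

```python
def small_model(prompt):
    """Hypothetical cheap model: returns (answer, mean token log-probability)."""
    return "Paris", -0.2

def large_model(prompt):
    """Hypothetical expensive model, e.g. a 175B-parameter API call."""
    return "Paris"

def cascade(prompt, logprob_threshold=-1.0):
    # Try the cheap model first.
    answer, mean_logprob = small_model(prompt)
    # Escalate only when the cheap model is unsure about its own output.
    if mean_logprob < logprob_threshold:
        return large_model(prompt), "large"
    return answer, "small"

print(cascade("The capital of France is"))
```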
Nice. Very cool. I have no doubt that you'll continue to make an enormous impact as you have already. Frankly, you're only relatively early in your career; you just got tenure a couple of years ago. It's wild to think what you'll do in the rest of your career, which actually brings me to a question that I didn't prepare you for. But, you know, kind of going back to the comment I was describing earlier about how quickly things change over a lifespan: we're watching things change unbelievably quickly, and you're playing a huge role in that personally. What kinds of things do you hope you might see in your lifetime that are, you know, far beyond what we have today? Trying to project ahead is obviously going to be really hard, and you're going to have huge error bars on whatever you say next. But what's an amazing vision
for how things could be a few decades from now? That's a great question. You know, when I was
going up for tenure, I started asking myself, am I doing research that's forward-thinking enough? Because it's having impact, but it's having impact now; it should have impact in the future. One of the things that got me started, which is something I'm excited to see succeed (it hasn't yet), is the use of these kinds of technologies to really tackle climate change. I asked one of my grad students, how should we work on this? Should we use fewer data centers? Or, you know, is there something more profound? And they came across some really cool work
going on at Berkeley to design new materials to pull carbon out of the air and to build better batteries. And these materials, these metal-organic frameworks, require chemists to try millions of combinations of things to figure out, you know, whether or not they work. The science of predicting the capabilities of these materials, and whether or not they're synthetically accessible, is not there. And so, maybe more broadly, will our advances in AI allow us to really push science, fundamental science, forward and tackle what are probably the biggest challenges of our time, something like climate change? That I want to see; that's really hard to do. I started working on it. We've failed so far, but hopefully we'll get better at it in time.
Maybe the early hope I've seen, in some of the work where we've started to see a little success, is things like Stable Diffusion. It turns out you can use that kind of approach to create molecules too, and design the underlying structure with certain goals in mind. And maybe we can start to build foundation models of chemistry, of molecular design; it would change how we do that. Same with medicine: CRISPR opens up new capabilities for what we can do, but now we have to design the things that will solve the problems we need to solve. And again, maybe some of these AI technologies will help advance that, so that we can tackle some of the biggest medical problems of our time. Autonomous driving: people shouldn't be dying. I think people die every few minutes in automobile accidents in the United States. If we can make cars safer using these technologies, I'd be excited. And, you know, there's a big race to make, you know, smart taxis; I'm more excited about cars that don't crash. Just making the roads safer also reduces emissions, because we have fewer accidents and less traffic, so a lot of potential impact there as well. Fantastic. Those were all great points: using AI to help us tackle climate change, to have a big impact in medicine as well, and, yeah, people shouldn't be dying on the roads. I agree with that 100%. It's still the most dangerous thing that you can do, and most people are doing it on a daily basis. Yep. So something really
interesting related to that, this idea of generative chemistry: we actually talked about that a little bit just a couple of episodes ago. In episode number 705, we had three really senior people, like three of the most senior people at Syngenta, on the show. And they talked about exactly this, about using generative models to predict agricultural compounds that could help feed the world. Pretty cool stuff. All right, nice. So as I mentioned earlier in the episode, when I brought up Wes McDermott's quote from Sunset Boulevard, that Vicuña quote, I did ask our audience if they had questions for you. We had some great questions; I think some of them we already answered in our conversation. So we had a great one from Ed and Winpy about what the future of LLM training will look like with respect to smaller versus larger models and open versus closed source. I think we've covered that pretty comprehensively on the show. But here's one from Michael Lockhart. He's a senior engineer at Roke, and he's wondering
if we can take Llama 2 and do the same kind of clever iterating that you did on the original LLaMA. Just as there was this huge opportunity with the original LLaMA to fine-tune it and have it be able to hold a more conversational style of conversation, is there a big opportunity to do some kind of fine-tuning to get a Vicuña 2 as well? Yes. So, can we fine-tune Llama 2? We are fine-tuning Llama 2. Yeah, we're going to do that. Maybe a variation of that question that bothers me is: should I fine-tune the already instruct-fine-tuned Llama 2, or should I fine-tune the Llama 2 that hasn't been instruct-fine-tuned? Don't know. It's something we're trying to figure out. Yeah, I don't understand fine-tuning, I will say that, I will confess it. I don't understand it, but I don't think anyone really understands fine-tuning yet. It's more training: it's more training with a funny loss. There's a funny learning rate schedule that goes up and down, and, you know, how you set that has a significant impact. Yeah, we're going to do that. I think lots of people will take Llama 2 and fine-tune it to do all sorts of things, and we'll hopefully get clarity on whether you should fine-tune the fine-tuned version or start from scratch. Yeah, so certainly, yes, we'll see lots of updates to Llama 2, and we're definitely working on it. Nice. I can't wait to see
what comes out of that. Who knows, by the time this episode is released, it might already have
been published. Yep, that's how quickly these things move. All right, fantastic. Joey, before I
let my guests go, I ask them if they have a book recommendation for the audience. Yeah, so book
recommendations. Sadly, I don't have a lot of book recommendations. I have two kids, so my nighttime reading that's fiction typically revolves around princesses and dragons. But I will say, one thing that I've found helpful is I try to follow the space: certainly these podcasts have been great, and having access to my students pointing me at, you know, the latest paper every single day has been helpful. One of the neat things happening right now is ICML, and I suspect the proceedings of ICML will probably have some pretty exciting stuff in them. I haven't personally found great points of aggregation for what's going on, and maybe that's one of the bigger frustrations. As academics, something that we should be doing on the LMSYS site is, like, here are the papers you should have read; it's something we could do in the future. So maybe check the LMSYS site, and we'll try to put something like that together. Nice. The space is moving very, very fast.
And I like that. I don't think we had ever had a guest recommend the proceedings of a conference until your colleague Raluca Ada Popa in episode 701, and now another Berkeley faculty member has done the same. And there is a huge amount of value in those proceedings. Actually, there's one more place I'll recommend. You know, it's good, and it's only minorly self-promoting. BAIR, the Berkeley AI Research group, has a blog, and the students put a lot of effort into making their blog posts. They have, like, a whole review process, and they put a lot of effort into each blog post to make it as accessible as possible while also staying as technical as possible, usually with videos, descriptions, animations of the math and the ideas. It's not a bad place to look at what's coming out of Berkeley's AI group, certainly. So check out the Berkeley AI blog.
Nice, fantastic. And beyond the Berkeley AI blog, how can people keep up with what you're doing? Do
you use social media at all? I have been trying to learn how to use Twitter, or X, or whatever it's
called now. I will typically post to highlight my students' work: when students have something, I will try to help repost it and provide some commentary. So certainly check out my Twitter, Professor Joey G. And then LinkedIn is something that I'm still also learning how to do; my CEO has asked me to become better at LinkedIn, so I will post there too. Nice, your student CEO. Very cool. Yes, that's correct. Nice. Yeah, I mean, for us with the show, it's interesting. I know in the academic world, typically people are more on what was Twitter. But it's interesting: at least with this podcast, we get easily 10 times more, often 100 times more, engagement on LinkedIn with literally the exact same content, for whatever reason. So that is, yeah, primarily where I am.
Nice. Alright, Joey, this has been amazing. I had really high expectations for this conversation,
and you really exceeded them. This was the highlight of my week for sure. Yeah, so thank you so
much for being on the show. And maybe in a few years you can come back and let us know how Vicuña 5 is coming along. Yeah, sounds good. Thank you. Thank you for having me.
What an experience that was. In today's episode, Joey filled us in on how Berkeley students spotted the opportunity to use ShareGPT as an outstanding data set for fine-tuning LLaMA, approaching GPT-3.5-level quality with their resulting Vicuña model. He talked about how leveraging GPT-4 for evaluating generative LLM outputs has improved with the MT-Bench benchmark, but how the OpenAI model nevertheless has biases to be aware of when you do this kind of evaluation, such as preferring the response presented first, preferring longer responses, and preferring responses that are closer to its own language style. Similarly, he talked about how Gorilla leverages both RAG, retrieval augmented generation, and fine-tuning to interact well with APIs, providing a ChatGPT-plugin-like open-source alternative. He talked about how his Aqueduct startup enables LLM workloads to be defined and deployed on any cloud infrastructure, and he provided us with his vision of how, over the coming decades, AI could help tackle climate change by helping design new compounds that fix carbon dioxide from the air, make an enormous impact in pharmaceutical design, and prevent tragic road deaths through autonomous driving. As always, you can get all the show notes
including the transcript for this episode, the video recording, any materials mentioned on the show,
the URLs for Joey's social media profiles as well as my own at superdatascience.com slash 707.
If you too would like to ask questions of future guests of the show, like several audience members
did during today's episode, then consider following me on LinkedIn or Twitter, as that's where I post who upcoming guests are and ask you to provide your questions for them. All right, thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you, and thanks of course to Ivana, Mario, Natalie, Serge, Sylvia, Zara, and Kirill on the Super Data Science
team for producing another phenomenal episode for us today. For enabling that super team to create
this free podcast for you, we are deeply grateful to our sponsors. You can support this show by
checking out our sponsor's links which are in the show notes, or you could rate or review the show
on your favorite podcasting platform, you could like or comment on the episode on YouTube,
or you could recommend the show to a friend or colleague whom you think would love it. But most
importantly, I hope you just keep listening. If you'd like, you can subscribe to be sure not to miss
any awesome upcoming episodes. All right, thank you, cheers, I'm so grateful to have you tuning in,
and I hope I can continue to make episodes you love for years and years to come. Until next time,
my friend, keep on rocking it out there, and I'm looking forward to enjoying another round of the
superdatascience podcast with you very soon.