713: Llama 2, Toolformer and BLOOM: Open-Source LLMs with Meta's Dr. Thomas Scialom
This is episode number 713 with Dr. Thomas Scialom, AI Research Scientist at Meta.
Today's episode is brought to you by AWS Cloud Computing Services,
by Grafbase, the Unified Data Layer,
and by ModelBit for Deploying Models in Seconds.
Welcome to the Super Data Science Podcast,
the most listened to podcast in the data science industry.
Each week, we bring you inspiring people and ideas
to help you build a successful career in data science.
I'm your host, Jon Krohn.
Thanks for joining me today.
And now, let's make the complex simple.
Welcome back to the Super Data Science Podcast today.
We've got the trailblazing AI researcher,
Dr. Thomas Scialom, on the show.
Thomas is an AI Research Scientist at Meta.
He's behind some of the world's best-known
generative AI projects, including Llama 2,
BLOOM, Toolformer, and Galactica.
He's contributing to the development
of Artificial General Intelligence, AGI.
He's lectured at many of the top AI labs,
such as Google, Stanford, and Mila in Montreal.
He holds a PhD from Sorbonne University in France,
where he specialized in natural language generation
with reinforcement learning.
Today's episode should be equally appealing
to hands-on machine learning practitioners
and to folks who may not be hands-on,
but are nevertheless keen to understand
the state of the art in AI from someone
who's right on the cutting edge of it all.
In this episode, Thomas details Llama 2,
today's top open-source LLM, including
what it was like behind the scenes developing it,
and what we can expect from the eventual Llama 3
and related open source projects.
He talks about the Toolformer LLM
that learns how to use external tools,
the Galactica Science-specific LLM,
why it was brought down after just a few days,
and how it might eventually re-emerge in a new form.
He talks about RLHF, reinforcement learning from human feedback,
which shifts the distribution of generative AI outputs
from approximating the average of human responses
to approximating excellent, often superhuman quality.
He talks about how soon he thinks AGI,
artificial general intelligence, will be realized,
and how to make the most of the generative AI boom
as an entrepreneur.
All right, you're ready for this tremendous episode?
Let's go.
Thomas, welcome to the Super Data Science podcast.
It blows my mind that you're here on the show
that we get to have you here.
I'm so excited for this interview.
Where in the world are you calling in from today?
From Paris.
Nice.
It's been a while since I've been to Paris,
but I've never had a bad time there.
Yeah.
Me too.
Nice.
So we know each other, I'd say almost serendipitously.
This is, I did an episode a couple of weeks ago on Llama 2.
So episode 702 is, I don't know, like a 15-minute,
maybe 20-minute episode with just me describing,
from my understanding, all the new capabilities with Llama 2,
and how the model came about a little bit.
And yeah, as I was opening up the technical paper,
there's like, I don't know how many,
there's probably like 50 authors,
and they're in this big long list listed vertically
on the side of the technical paper page.
But somehow, my brain noticed that I recognized one of them.
I was like, Anthony Hartshorn.
I know Anthony Hartshorn.
There can't be two people named Anthony Hartshorn.
And so I sent him a message, and I said,
do you want to be on a podcast?
We're the most listened to podcast in the data science industry.
And he suggested you, Thomas, as the guest instead,
which is amazing because you're
the final author on the paper, which in the academic world,
it might sound to a normal listener like being
the final author means that, of the 50 people,
we have the person who made the smallest possible contribution.
But in fact, on academic papers, that isn't how it works.
Very often the first author is the person
who actually wrote it, put everything together,
but traditionally, in academic work,
the last author will be the head of the lab
who brought in the funding and was overseeing
the project.
So yeah, truly, it's an honor to have you here with us.
Thanks for having me.
So at the time of recording this episode,
it's only been a few weeks since Meta released
the open-source large language model Llama 2.
You were a science and engineering leader
for this groundbreaking development.
Can you explain the significance of Llama 2
in the context of other recent advancements
in AI and generative models?
Maybe fill us in on how the Llama projects came about in general,
how Meta was like, what, we're going to invest in this?
Obviously, you're not going to divulge on air,
but there are rumors that eight-figure sums
have been invested in creating Llama 2.
And so it's interesting, even from the very beginning,
what was it like maybe to get this kind of buy-in
from the organization to be doing this open sourcing?
Yeah, I think so.
No doubt, large language models are a big deal.
They have made some breakthroughs in research.
I think we also had a ChatGPT moment at the end of last year,
and most people realized the potential of this technology.
And so I think we did mainly two things with Llama 2.
One, we aligned the model,
with techniques called RLHF, for instance.
I can dig more in depth later if you want.
But basically, the idea is you have what we call a pre-trained model,
which has kind of read all the internet on a next-token prediction objective.
So it tries to predict the next token.
And this is what we call self-supervision.
It's supervision because we have a target,
but it's "self" because text on the web is vastly accessible,
just like that.
And so just with that, you have a pre-trained language model,
which we had with Llama 1, and we did again with Llama 2
and extended it a bit incrementally.
And that's where all the knowledge is;
all the capabilities kind of emerge from there.
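For listeners who want to see what that self-supervised, next-token objective looks like in practice, here is a minimal sketch using the Hugging Face transformers library. The model name is just a small stand-in, not Llama itself.

```python
# A minimal sketch of the self-supervised next-token objective described above,
# using Hugging Face transformers. "gpt2" is just a small stand-in model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models are trained to predict the next token."
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the library shift targets internally, so the
# loss is cross-entropy on predicting each token from the tokens before it.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))  # lower loss = better next-token prediction
```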
But then it's hard to access.
And the magic behind ChatGPT is that it's kind of an interface,
a chat, which is very natural, and it follows your instructions.
You can say, oh, talk like this person, or do these kinds of things,
or no, make it more like markdown, or bullet points,
or change that, make it shorter.
And it understands your instructions and does it precisely.
And this happens at fine-tuning.
It's kind of refining, educating a pre-trained language model,
which we did also with Llama 2.
And that was one of the main innovations,
because no one had done that at this scale
and open-sourced the model, explaining
all the research behind it in a research paper, as we did.
So before Llama 2, basically, the only aligned large language models
that were available, like OpenAI's, Anthropic's, Google's,
were closed behind an API.
So I would say that's the main innovation
in terms of science, and in terms of impact
for the community, the research community, and business.
I think you mentioned, and you're not the only one,
that your company now uses Llama 2.
This is also possible because we changed the license
to something friendly for commercial applications.
Yeah, exactly.
I don't have 700 million users at my machine learning company
yet, so we're still allowed under this commercial license:
as long as you don't have more than 700 million active users,
it's OK to use Llama 2.
And yeah, so for us, it's brilliant.
So previously, we had been using Dolly 2.0 as our base model.
So we have a number of different kinds of generative AI
capabilities in our platform for our users.
And so something like Llama 1, which was pre-trained
but not fine-tuned, would actually have been fine
for us as a starting point, except for the commercial use limitation.
So we never could use the original Llama in production,
because obviously there was this commercial use restriction;
it was for academic purposes only.
And that also meant the same for some of the initial fine-tuned
architectures that came off the back of Llama, like Alpaca
out of Stanford, and like Vicuna, which Joey Gonzalez, who
was on episode number 707 of this show, developed at Berkeley.
And so all of those, that whole family of models,
we were like, man, we're going to be left out.
But then luckily, some groups did come along.
So Databricks released Dolly 2.0, for example,
and there were some others, and I've done episodes
on these open-source alternatives that
are commercially licensable.
So in episode 672, I talk about different open-source
options that are available, where you not only have that
pre-training with the self-supervision
that you were describing, but also the fine-tuning based
on human feedback, which means that the responses are going
to be deliberately helpful and more conversational,
like a chat.
So we had been using Dolly 2.0 from Databricks as
our starting point for the last couple of months.
When Llama 2 came out, there was something about the scale,
you described this already, the unprecedented scale
in terms of the number of tokens: two trillion tokens
for pre-training, and over a million data points for the fine-tuning.
This kind of scale is orders of magnitude more.
Dolly 2.0, for comparison, had 10,000 instructions
that it was fine-tuned on.
So you're talking 100 times more.
And with these large language models,
the scaling laws that we've seen come out,
like the Chinchilla scaling laws,
they show that you kind of have three levers
to getting a great model.
So the number of parameters, the training data set size
and training time.
And it seems like with Llama 2, you and your team
have tried to max out all of those things,
especially with the 70-billion-parameter Llama 2 model.
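As a rough back-of-envelope illustration of those three levers, here's a tiny calculation using the common approximations that training compute is about 6 times parameters times tokens, and the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. These are approximations for intuition, not figures from the Llama 2 paper.

```python
# Back-of-envelope for the three levers, assuming the common approximations
# C ~ 6 * params * tokens (training FLOPs) and a Chinchilla-style rule of
# thumb of roughly 20 training tokens per parameter.
def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

def chinchilla_optimal_tokens(params: float) -> float:
    return 20 * params

n_params = 70e9   # Llama 2 70B
n_tokens = 2e12   # ~2 trillion pre-training tokens
print(f"approx training compute: {training_flops(n_params, n_tokens):.1e} FLOPs")
print(f"Chinchilla-style token budget: {chinchilla_optimal_tokens(n_params):.1e} tokens")
```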
So that's, I guess, something that's also worth noting:
if people haven't listened to my Llama 2 episode already,
then you may not be aware that it isn't just one model
that was released here.
We're talking about a model family.
So there's a 7 billion, a 13 billion,
and a 70 billion parameter model.
And those two smaller ones, they'll
be able to fit on a single GPU.
And so this means that you can run them relatively
inexpensively.
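To make the "fits on a single GPU" claim concrete, here's a rough weights-only memory estimate; real usage also needs room for activations and the KV cache, and exact numbers depend on the serving setup.

```python
# Rough, weights-only memory estimate behind "fits on a single GPU".
# Activations and the KV cache add real overhead on top of this.
def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

for n_params in (7e9, 13e9, 70e9):
    fp16 = weight_memory_gb(n_params, 2.0)   # 16-bit weights
    int4 = weight_memory_gb(n_params, 0.5)   # 4-bit quantized weights
    print(f"{n_params / 1e9:.0f}B params: ~{fp16:.0f} GB fp16, ~{int4:.1f} GB 4-bit")
```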
And so with applications like with my company,
where we have a relatively discrete number
of generative tasks that we need the model to perform,
we can take that 7 billion or that 13 billion.
And we can fine-tune it to our tasks.
And so for listeners who aren't aware,
you can do this yourself at home using a parameter-efficient
fine-tuning technique like LoRA, low-rank adaptation,
which I talk about in episode number 674.
So you can take a model like Llama 2, the 7 billion or the 13 billion,
and typically very inexpensively, for tens or hundreds of dollars,
you can fine-tune it to your own specific tasks.
And for us, that's perfect.
It means we now have this amazing large language model
that's as good as GPT-4 or better in our own tests,
when we start with Llama 2 and fine-tune with our own data
on the narrow range of tasks that we have.
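For anyone who wants to try that kind of parameter-efficient fine-tuning themselves, here is a minimal sketch using the Hugging Face PEFT library. The model name, target modules, and hyperparameters are illustrative choices, not a recipe from Meta or from this episode.

```python
# A minimal LoRA fine-tuning sketch with the Hugging Face PEFT library.
# The model name, target modules, and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # assumes you have access to the weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full weights
# From here, train with your usual Trainer or training loop on task-specific data.
```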
And then if you're a listener out there
and you're like, well, I want the absolute state of the art,
then you can use Llama 2,
and at least in terms of open source,
this is going to be the state of the art.
So yeah, I've just talked a lot.
But the point is that, yeah, Thomas, what you've done,
and what this means for us as a community
to have access to something like Llama 2, it's a game changer.
It was obvious that it was a game changer
within minutes of starting to read the Llama 2 materials
online, and the data science team at my company immediately
started retraining our models with Llama 2.
It's always good to hear.
Thanks.
Maybe worth mentioning, what we released also has
an extended context length, from 2,000 to 4,000 tokens, et cetera.
It's text only for now.
But I think that's also the magic of open-sourcing:
we don't want to push for aspects
where the community will deal with them easily.
And we know that extending the context length
after training is possible.
We know that connecting multimodal inputs
is straightforward.
And what was magic is that, within weeks of the release,
people have done that efficiently.
And so that's also one of the strengths,
in my opinion, of open-sourcing these kinds of models:
we'll see much more innovation,
with shorter cycles of innovation, thanks to it.
So that was one of the philosophies also.
We went, as you said, all in on the scale
of the things that we can do at Meta,
to make it as good as we can so that everyone could use it
and then adapt it for their use cases.
Amazing.
Are you stuck between optimizing latency
and lowering your inference costs
as you build your generative AI applications?
Find out why more ML developers
are moving toward AWS Trainium and Inferentia
to build and serve their large language models.
You can save up to 50% on training costs
with AWS Trainium chips and up to 40% on inference costs
with AWS Inferentia chips.
Trainium and Inferentia will help you achieve
higher performance, lower costs, and be more sustainable.
Check out the links in the show notes to learn more.
All right, now back to our show.
And another thing that you did with Llama 2
is that there's extensive thought around ethics,
responsible use, acceptable use.
So for example, there were red teaming exercises
where you simulate internally
that you have these malicious actors.
And so yeah, can you dive into why this was so important?
I think this was unprecedented also.
So not only was the amount of data
for both the pre-training and the fine tuning steps
unprecedented, but for an open source model,
I think that the level of concern
that went into the ethics and responsible use was
also unprecedented.
So yes, maybe let's give a bit of context.
The strongest LLMs so far were, as we said,
accessible only on an API.
I think that was problematic in several aspects.
This doesn't research.
It prevents academia to explore,
industrial to have commercial use cases.
And to be honest, we would be nowhere
without open sourcing, like think about
birth, transformers, and even GPT-1.
That being said, the risks,
present and future with respect to Lama
have been arguably discussed by some of the researchers.
I think open area and ontropic did an extremely
great and important, invaluable job
at raising the bar for safety.
And I'm glad they did.
So the thing is, when you have an API like them,
it's easy to control.
You can put classifiers on top of it.
You restrict access somehow.
And there's clearly a very hard challenge
when it comes to open source, because you release the weights
and you enable everyone to fine-tune, to do whatever,
so it's much harder to control any of that.
So while I feel it is very important to do it,
and I think we're not yet at the stage
where LLMs are so dangerous that we should not do it,
it was important to do it in a responsible way,
to raise the bar even higher than what has been done
for competitor models behind an API,
because the risks are bigger when you open source it.
And so we took a lot of inspiration from the work
that was done at those companies,
at OpenAI and Anthropic.
And we applied all the methods we could,
and some new methods we discuss in the paper,
to make the model as safe as we could.
It's not perfect.
There are still some jailbreaks.
But, and maybe we can discuss that later,
I feel that we had two main complaints
that followed the release.
And one of them was: it's too safe.
You know, there's an example, for instance,
and I don't remember exactly, like,
how can you kill a process, or something like that?
And the model says, no, it's not good to kill.
Right, right, right.
So, I mean, there was a system prompt on top of it.
If you remove it, the model is actually better
on false refusals. But to me, this was a success,
given that this was the first time we released
an open-source model of this scale,
and so we had the responsibility
to raise the bar for safety.
Given that, I prefer to be on the side of it being too safe
and progressively decrease the level of safety
if needed for future releases, than the opposite.
Yeah, yeah, yeah.
And so actually your discussion of that reminds me
that when I was doing my research
for my solo episode about Llama 2, episode number 702,
when I was digging into your technical paper,
it actually talks about four models.
So the three models that were released were the 7 billion,
13 billion, and 70 billion parameter models,
and then, off the top of my head,
I think there was a 34 billion, another model that you trained.
But I noticed that, for whatever reason,
there was a chart with some metric of safety,
and that model, the 34 billion one, for some reason
seemed to be more like the existing open-source LLMs
in terms of safety.
So it was kind of more like Falcon,
or more like Dolly 2.0.
And so it seems like you've even held back a model,
I'm guessing, and you don't need to confirm on air,
because it didn't meet the safety standards of the other three,
which is an interesting thing to have happened,
because presumably the same process was followed
for all of them.
Yeah, that's a good decoding of what you said,
and that's one of the main reasons we didn't release it.
One thing also is that we don't know;
we didn't have the time to investigate.
What people have to understand is that the whole process,
from starting with the pre-trained model, to fine-tuning it,
to applying RLHF with reinforcement learning,
to then evaluating it automatically,
then evaluating it with human annotators,
and with red teamers, who are experts at finding failures,
at trying to make the model say something bad,
pushing the model in the hardest possible ways
to make it say something like that:
all this process takes a lot of time.
And so we just decided based on this data point,
which we don't yet know the reason for,
because we didn't have the time to investigate.
Maybe it's an error in the evaluation.
Maybe it's a model that was not well fine-tuned.
I don't know exactly yet.
But we just said, okay, why waste
one, two, three more weeks just for that?
We can already release the smaller models
and the biggest, most capable model.
Let's not wait to let everyone use it.
That makes perfect sense.
And so it's actually, it's kind of nice to have that confirmed,
because that's actually what I speculated on here earlier.
So great.
So you mentioned that there were two main complaints.
One of them was that it was too safe.
So people were complaining that Llama 2 is too safe.
So things like somebody saying,
I want to kill this process,
lead to it saying, I can't kill; killing is bad.
What was the other big complaint
people have had since the release?
Tell me if you heard the same,
but from my perspective, it was safety, being too safe,
and code, bad coding abilities.
Oh, yeah, yeah.
That actually, so I do say that in my episode 702 as well:
when I say that Llama 2 performs
at the state of the art relative to any other open-source
model, that's on natural language tests,
where it's natural language in and out.
My understanding, and I haven't even tested this extensively
myself, is that where there's code being generated,
or where you're asking it to do mathematical problems,
it doesn't seem to perform as well as some other options
out there.
Yeah, so to that: we actually rushed so fast
from Llama 1 to Llama 2 to get these abilities
that we focused mainly on natural language and not code.
I agree the model is not that good at code
and math for now, but we are working on that.
And well, by the time the podcast is released,
I hope that some Code Llama will also be released.
Very cool.
All right, that's awesome.
That's exciting to hear.
Yeah, so I mean, that gives us a really tantalizing glimpse.
It's possible that by the time this episode is out,
that will be old news.
But yeah, a Code Llama, that sounds very cool.
Is there anything else that you can tell us about where
this research might be going?
I understand, I don't want to be extracting information
under duress.
But yeah.
I mean, we are the open guys, so to that...
No, I mean, in general, there's no clear secret.
We'll try to improve the models in general, which
means scaling them, keep training them on more tokens,
increasing the abilities, maybe tackling
more math and coding, because that's reasoning.
We'll also try to improve the RLHF stage and its capabilities.
One of the directions is obviously
tools: teaching this model to use some tools
in a zero-shot fashion, maybe to access the web more easily.
But those directions seem quite reasonable and expected,
so there's no big secret.
Now the question is more like, how will we do that?
Will we make some breakthrough discovery along the way
that will enable a larger improvement? Hopefully, yes.
Nice, yeah.
And you mentioned there being able to handle tools, which
is something that you have a lot of experience with,
because you've also been involved with the Toolformer LLM.
So this is an LLM that came out earlier.
And Toolformer is specialized
to decide which API to call in a given circumstance,
when to call the API and what arguments to pass,
and how best to incorporate the results
into the next-token prediction of the generative model.
So maybe this is a good time to switch over
and talk about the Toolformer project,
since it sounds like future Llama iterations
might incorporate some of that kind of capability.
Yeah.
I mean, so Toolformer was about connecting large
language models with tools.
It was an idea I had last summer, a year ago.
It felt like kind of a natural extension
of all these models like RETRO, Atlas, and RAG,
where you augment a language model with a retriever.
And the intuition is very easy.
So the idea was to train a dense retriever
and a language model together so that you augment the context.
And so when you ask a question, you search
over all the training data for some relevant passages.
And so if the model didn't memorize something well,
this boosts the capabilities, which was very efficient,
as shown in all those papers.
So this is what we call a non-parametric framework,
because you rely not only on the parameters,
or weights, of the model, but also on an external source of knowledge
that could possibly grow over time to, for instance,
incorporate fresh information
without necessarily retraining the model.
That being said, my idea was to extend this
to a general non-parametric framework,
and there was some work at the time
that was doing that, where you could see it using a calculator,
or a Python executor, or different search engines.
Maybe I use Google for some searches
and Google Scholar for the specific search on papers.
And so the idea was to just give a list,
a set of tools, to the model and, in a much more human-like way,
teach it to use them given the context.
Not hard-coded at each inference; the model now has to know
when to use a tool and how to use it to benefit
its performance.
And so, Toolformer: Timo Schick led this work
and we published it in February.
And I think it was also very pleasant timing.
It was two months after ChatGPT,
and everyone was kind of, well, the game is over,
ChatGPT is there, what's next?
But ChatGPT at the time was just limited to a window,
like you're chatting with an agent
that has no access to the world.
And that changes a lot the perception that you can have:
once you can give the LLM access to the world,
to some knowledge, it makes the experience
for the user completely different.
It extends the capabilities dramatically.
And so that's what we have done with Toolformer,
with some self-supervised techniques,
so the model basically learned by itself
when using a tool decreases the perplexity.
So yeah, that was the main idea.
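To make that self-supervised criterion concrete, here is a toy sketch of the idea: a candidate tool call is kept only if inserting its result lowers the model's loss, its perplexity, on the tokens that follow. The model, helper function, example strings, and tool result are all illustrative; the real Toolformer pipeline is more involved.

```python
# Toy sketch of the filtering idea: keep a candidate tool call only if inserting
# its result lowers the model's loss (perplexity) on the tokens that follow.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def continuation_loss(prefix: str, continuation: str) -> float:
    """Average next-token loss on `continuation`, conditioned on `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    full_ids = tokenizer(prefix + continuation, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # ignore the loss on the prefix itself
    with torch.no_grad():
        return float(model(full_ids, labels=labels).loss)

prefix = "The Eiffel Tower is"
tool_result = " [Search -> height: 330 metres]"   # hypothetical inserted API result
continuation = " about 330 metres tall."

loss_without_tool = continuation_loss(prefix, continuation)
loss_with_tool = continuation_loss(prefix + tool_result, continuation)
print(loss_without_tool, loss_with_tool)
print("keep the call:", loss_with_tool < loss_without_tool)
```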
Yeah, and so this problem, this may be familiar
in an analogous way, and you can tell me
where maybe the analogy breaks down.
But having not used Toolformer myself yet,
it seems to me to be similar to what later happened
with ChatGPT with the plugins, so that now with ChatGPT,
you can turn on third-party plugins.
So if you turn on the Wolfram Alpha plugin,
then when you ask ChatGPT to do a calculus problem,
it's going to bring in Wolfram Alpha to use that API
as opposed to trying to use next-token prediction
to do math, which works surprisingly well
in a lot of circumstances.
It's mind-boggling that next-token prediction
can often do math correctly, but you're basically guaranteed
a correct answer, a correct differentiation, for example,
if you use Wolfram Alpha to do it, so ChatGPT
will automatically detect, okay, this is a circumstance
where I should be using Wolfram Alpha.
Let's do some math with that.
Or, yeah, it can access the web.
Like you said, it can do a web search,
or it can plug into websites like Kayak
to book your trip, to find you the car rental
and book the hotel.
So is that kind of what happened:
ChatGPT now essentially gets to use a Toolformer of its own,
but Toolformer is obviously open source?
Yeah, I mean, I think it was there.
I saw a lot on Twitter when, one month after Toolformer,
OpenAI released plugins.
They actually cite Toolformer on the plugins page,
and some people said OpenAI
implemented Toolformer in one month.
Honestly and humbly, I think the idea was in the air
and we had good timing to plant the flag.
I think also the method used by OpenAI
was quite different from Toolformer.
So that's what's interesting.
In Toolformer, the idea was,
so we had access to, I mean, bad
language models at the time, compared to GPT-3 at least.
It was before Llama.
And so what we did was
the self-supervised method,
which worked kind of well.
But my conclusion at the end of the work was that
we need a more capable base model
and a fine-tuned, aligned model,
such that it learns to use tools
with some instruction-following scheme,
which is also why I stepped back from Toolformer at the time
and didn't extend the project, to work on Llama 2 instead
and make it work with instruction tuning,
to follow the instructions of the user.
And actually, there's a paragraph
in the discussion analysis of the Llama 2 paper
showing a kind of emergence of tool use.
You just describe it with a prompt; you tell the model,
basically in natural language:
you can use a calculator, use this format;
or, to use a search engine, use this format.
Then, and I don't remember which one it was in the paper,
you ask something like, what's the difference in height
between the Eiffel Tower and the Empire State Building?
And the model naturally says: step one, search the height
of the Empire State Building, search the height of the Eiffel Tower,
and then calculate the difference between the two.
So you can see how, beyond Toolformer,
there's this capacity for using the tools emerging.
The Toolformer method is pretty efficient, but
I would say it's obsolete with a better aligned model.
We moved to Llama 2; now maybe we'll come back to Toolformer.
All right, right, right, right, right.
Makes perfect sense.
This episode is brought to you by Grafbase.
Grafbase is the easiest way to unify, extend,
and cache all your data sources via a single GraphQL API,
deployed to the edge closest to your web and mobile users.
Grafbase also makes it effortless to turn OpenAPI
or MongoDB sources into GraphQL APIs.
Not only that, but the Grafbase command line interface
lets you build locally.
And when deployed, each Git branch automatically creates
a preview deployment API for easy testing and collaboration.
That sure sounds great to me.
Check Grafbase out yourself by signing up for a free account
at grafbase.com.
That's g-r-a-f-b-a-s-e.com.
Yeah, it's exciting how these different research threads
converge, and it kind of sounds like you had
that vision all along: okay, cool,
Toolformer works really well, but it could be better
if the base model that was calling it was better.
So let's focus on this Llama 2 project for a while,
and then come back and worry about this API calling
from Llama 2 later on. Very cool, looking forward to that.
And similarly, for the kinds of things we do,
again, at my own machine learning company,
having these open-source,
really powerful models like Llama 2,
with open-source API-calling abilities built in.
This is huge for us as well,
because it means that there's all kinds of cool things
that we can do internally.
Like a lot of companies, we use APIs,
these kinds of microservices to make it easy
to have these different compartmentalized services
within the platform.
And so with something like Toolformer,
we can then be able to say,
our users could provide natural language instructions,
just have a natural language chat with our platform,
and all of the capabilities of the platform,
the large language model behind the scenes can say,
okay, I think that they're asking for this particular kind
of data or this particular kind of task to be done,
and we have an API for that.
So let's go use it,
and then the results are returned back
in exactly the kind of format,
like a JSON format that our platform is expecting.
It could make the API call successfully,
it can return information from that call
and present it to the user.
Yeah, it's a very cool thing to be able to do.
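As a minimal illustration of that pattern, here's a sketch in which the model's output is parsed as a JSON tool call and routed to an internal API. Every name here, the fake model, the tool, the fields, is hypothetical; it's just the shape of the loop, not actual platform code.

```python
# Minimal, illustrative sketch of the tool-dispatch loop described above: the
# model emits a JSON "tool call", the application routes it to an internal API,
# and the result comes back in a structured format. All names are hypothetical.
import json

def fake_llm(user_message: str) -> str:
    # Stand-in for a real model call (imagine Llama 2 fine-tuned to emit JSON).
    return '{"tool": "get_sales", "arguments": {"region": "EMEA", "quarter": "Q2"}}'

def get_sales(region: str, quarter: str) -> dict:
    # Stand-in for an internal microservice behind an API.
    return {"region": region, "quarter": quarter, "revenue": 1234567}

TOOLS = {"get_sales": get_sales}

def answer(user_message: str) -> dict:
    call = json.loads(fake_llm(user_message))           # model decides which API to use
    result = TOOLS[call["tool"]](**call["arguments"])   # application executes the call
    return result                                       # pass back to the model / user

print(answer("How did we do in Europe last quarter?"))
```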
Do you worry, I mean,
it sounds like with the level of worry,
the level of concern that went into making sure
that Llama 2 is used ethically,
something like Toolformer, maybe this kind of,
kind of ties into even AGI concerns,
because people say, oh, you know,
AGI won't be that dangerous
because it's not going to be connected to the world.
But that's obviously not true,
because with projects like Toolformer,
we see that, no, they will be connected to the world.
They could be, you know, in my company,
we're using something like Toolformer
to be able to query software APIs and get information back,
but there's no reason why these couldn't be connected to hardware,
why these couldn't impact the real world.
So yeah, I just wonder if you have any thoughts on that
and maybe we can have a bigger AGI discussion
later in the episode, but yeah.
Sure. But, no, maybe just quickly for now.
I think those are very good points.
And actually, we take safety
for the tools direction very seriously.
That makes the thing quite different
from a kind of closed LLM in a window
that you just chat with as a demo.
There are real risks at another order of magnitude.
So for sure, there are new concerns,
new research questions and problems along the way,
that make this very serious.
Nice, okay.
Well, yeah, that's a clear answer.
So.
Actually, there's a survey
on augmented large language models.
We published it in February, just after Toolformer.
We have a section at the end of that
saying that with augmented language models,
augmentation notably with tools,
a model can now take actions in the world.
This is a different story than before.
Nice, yeah, yeah, no doubt.
So in addition to Toolformer,
another LLM project that you were working on
before Llama 2 was Galactica.
And Galactica was a large language model
that is, I suppose, specifically designed
for handling academic research, scientific papers,
and these kinds of scientific questions.
The Galactica demo was only live for a few days, I guess.
It seemed like a really big deal,
and then it was taken offline.
So maybe tell us a bit about the project,
the thinking behind bringing it down,
and whether it will be back in the future.
Yeah, so, you know, there's this website,
which is one of the most well known for researchers,
called Papers with Code,
a company that was acquired by Meta.
The project of the team, which was kind of visionary
about large language models, was that they wanted
a large language model for science
that would help us to access information for science,
to help with creative writing for science,
maybe connect different ideas for science,
find some papers that you would never find
on Google Scholar just based on the idea.
That's what that large language model was capable of,
and that's what Galactica was about.
And actually, that was one of the first open
large language models that worked pretty well.
I think it was in some aspects far ahead of its time,
and in some aspects we probably made some mistakes along the way.
It was only a pre-trained model, not an instructed model.
And so maybe we presented it way too much
as something that can answer questions, do things,
and it would have worked so much better
after an instruction-tuning phase.
The second thing we probably did not do well
was to over-claim a bit on the web page,
saying it can write a paper.
And I can understand how, for a scientist,
a person working in science,
this would feel like over-claiming.
That was not our purpose, but anyway,
because of all the noise, and there was quite some noise
at the time on Twitter, we decided to remove it.
It was also a weird time, because at that time
there were a lot of people still criticizing
large language models who were quite noisy on Twitter.
And on top of that,
some people from the scientific community
were saying large language models are dangerous for science,
et cetera.
And it was just two weeks before ChatGPT arrived.
So, that was an interesting timing.
I think, for instance, people don't realize
how good it was at citations.
I used it myself; to give you an example
of following instructions,
when you say, find me a paper to cite
about bias, it will find the papers.
Or, to give an example that maybe speaks more:
Chinchilla, the scaling laws.
I think Chinchilla doesn't appear in the title of the paper,
or scaling laws doesn't; one or the other, I don't remember.
And so, just asking the model,
what's the citation for Chinchilla,
which is not in the title,
it will find you the right one,
and you could just click and add it as a reference
when you're writing something.
So it was kind of connecting things like that,
and from the tests we did,
it was outperforming Google Scholar
or Elasticsearch-style engines.
And I think, as search engines,
LLMs have not yet been well explored,
but that's something bigger.
Yeah, for sure.
Deploying machine learning models into production
doesn't need to require hours of engineering effort
or complex homegrown solutions.
In fact, data scientists may now
not need engineering help at all.
With Modelbit, you deploy ML models into production
with one line of code.
Simply call modelbit.deploy in your notebook
and Modelbit will deploy your model
with all its dependencies to production
in as little as 10 seconds.
Models can then be called as a rest endpoint in your product
or from your warehouse as a SQL function.
Very cool.
Try it for free today at modelbit.com.
That's m-o-d-e-l-b-i-t.com.
And it's interesting how well Galactica was doing
at being able to accurately do citations
when something like ChatGPT,
especially with the GPT-3.5 API running in the back end,
was famously creating citations that sound plausible
but aren't real, even creating URLs that are made up.
Which is, yeah, probably what you and I would expect,
given the way the models are trained,
but ordinary users, lay people,
think, what is this, why would this happen?
And then you even end up with lawyers presenting cases
that never existed to a judge as a result of this kind of thing.
So it's cool that Galactica was able to do those citations
even before the ChatGPT release last year.
Nice. And so speaking of the kinds of issues
with large language models,
another big issue with LLMs has historically been the expense
associated with all the human labor to create a curated dataset.
So you mentioned right at the beginning of the episode
how there's this pre-training step
that is self-supervised,
where you can just use natural language,
it doesn't require any labeling,
and that gets us to model weights
that have this rich understanding of the world,
but the model isn't calibrated to optimally answer
questions from people
and perform tasks based on instructions.
And so there's this second step after the pre-training,
where we do this fine-tuning.
For that fine-tuning step, historically,
we've wanted a high-quality dataset.
So the Vicuna people, for example,
Joey Gonzalez's team at Berkeley,
took the original LLaMA, which was just pre-trained,
and then they used hundreds of thousands of conversations
that people had shared.
I'm forgetting the name of it off the top of my head,
but there was a browser plug-in
that lots of people were using to save
and share interesting conversations
that they'd had with ChatGPT.
This was in the public domain,
and so the Vicuna people at Berkeley took that dataset
and used it to fine-tune LLaMA
and create this Vicuna LLM,
which still today has remarkably good performance
for a relatively small open-source LLM,
compared to other kinds of open-source LLMs,
even compared to many proprietary options out there.
But this kind of trick can only get you so far;
ultimately, you might want lots more
of these instruction pairs, this labeled data,
to be able to create a powerful fine-tuned LLM.
And so my understanding is that the Unnatural Instructions
project that you were a part of
at Meta was designed to help alleviate this issue.
Yeah, that's interesting, because at the time of Unnatural Instructions,
there was not even ShareGPT or whatever.
And so at that time you had, on one side,
OpenAI with GPT DaVinci, davinci-003, davinci-001,
which was a good instruction model, very capable.
And on the other hand, you had pre-trained models
that were kind of not that good, not that bad,
and instruction datasets that were very academically oriented,
I would say, built from standard tasks,
like summarization, question answering, and so on.
But you clearly didn't have the diversity of instructions
that people would actually ask,
and that DaVinci-instruct was good at answering.
How to collect this diversity of instructions
is actually extremely challenging.
Even for humans: try to think of 10 different tasks
as instructions right now;
it would be hard for you to come up with that level of diversity.
And so at the scale of 1,000 or 1 million,
that's pretty hard.
Somehow, OpenAI managed to do that with DaVinci.
Maybe they collected some data from the API,
and they had some annotators, which is well known;
they had experience from years before.
Now, with something like ShareGPT,
people type some instructions
and you have the output of the model.
But the question is, when you don't even have that,
how can you generate not only the answer from the model,
but the instruction?
And what we found out is that somehow you can ask
DaVinci, GPT-3.5, I think,
or the version before, to generate those instructions.
So you can say: generate me an instruction
and output for code, for this topic, for some reasoning,
or just without specifying any topic.
And it will generate a lot of samples
and examples, with not only the answer
but also the instruction, so that you can create
an "unnatural" dataset.
than some of natural data sets at the time.
The reason was that natural data sets published
by some researchers at the LNRI, using actual humans
to create the data, was kind of lacking of diversity
and was academically oriented.
While somehow the model, from the LNRI,
managed to generate a large diversity
much more close to actual use cases.
I think we can see that kind of a distillation process
of a more capable model that was fine-tuned on this data.
And that was kind of a temporary solution for people
that had not access to instructed models.
Which is also one of the reasons we moved to LAMATU,
and did the process from scratch to create our own data.
Indeed, we paid a quite a lot for that.
We'll take more than millions of the annotations
to do the whole LLHF stage.
And so now we have such capability.
At the time, no one had the models at risk yet.
Nice, yeah, that's a great overview of the project.
Let's dive into that.
You mentioned near the beginning of the episode,
this RLHF reinforcement learning from human feedback.
This is a key part of the fine-tuning process.
And with Llama 2,
you introduced a new, unique two-stage RLHF process,
which evidently has led to even better results.
So not only do you have this large annotated dataset
of more than a million training data points,
but you also used this new methodology,
this two-stage RLHF.
So, yeah, do you want to explain RLHF
and particularly this two-stage process to us?
Yeah, so RLHF stands for reinforcement learning
with human preferences.
And the idea is to fine-tune the model.
You type a prompt, a question, to the model,
and you sample different outputs.
And instead of asking a human to write
the perfect solution and fine-tuning the model
on what the human would have written,
you try to train the model to go in the direction
of what the human prefers among its samples.
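To make that concrete, here is a minimal sketch of the pairwise-preference objective typically used to train the reward model in RLHF: the preferred ("chosen") answer should score higher than the rejected one. The tiny linear model and random features are stand-ins for a real LLM with a scalar reward head; this is the general technique, not the specific Llama 2 recipe.

```python
# Minimal sketch of the pairwise-preference objective behind the reward model
# in RLHF: the human-preferred ("chosen") answer should score higher than the
# rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # stand-in for a scalar head on an LLM

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Pretend embeddings of (prompt, chosen answer) and (prompt, rejected answer).
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

# Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
print(float(loss))
```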
And at the beginning of the project,
we knew that was kind of the backbone
of some of the instructed models
from companies like Anthropic and OpenAI.
But if you had asked me at the beginning of the project,
and I think most researchers would have said the same,
supervised data, where I ask annotators to write the answer,
is kind of gold data.
That is what is considered gold by the community in general.
Granted, you need good annotators,
high-quality annotators, sure.
But having a human write the answer itself
is very expensive.
In comparison, generating two answers
and asking a human which one they prefer
takes way less time,
and so you can scale it way more.
And so, if you had asked me,
I would have said, okay,
if I have an infinite budget,
maybe I prefer supervised learning
and asking humans to write the answers;
but it's not scalable,
so sure, we will do RLHF.
And the thing is, I realized after some time
that there's some magic,
which is not well understood
by the community yet, I feel,
where we already have some
superhuman performance
on some creative writing tasks.
An example I always give is:
write a haiku, or a poem,
about large language models, and the like.
And the model will come up with something.
I don't know about you,
but if you ask me,
I will take an hour
and come up with nothing.
And models are good at that.
And the reason is,
the model is super capable
and has seen the whole distribution
of humans on the internet.
Think about an example with coding.
It knows the distribution of average coders,
the distribution of good coders,
excellent coders, and bad coders.
And so, if you ask annotators
to write code and fine-tune on that,
the model will probably imitate
this distribution.
And by imitation,
you will have a distribution where
5% of the time it's great,
50% of the time it's in the middle,
and sometimes there are some mistakes.
And every human makes some mistakes.
Now, if you apply RLHF,
this is kind of different,
and that's where the magic is.
You will shift the distribution
toward excellence,
toward even better than the best annotator you have.
Because the thing is,
even if you are the best annotator,
you will write at your best capabilities,
and you will still make some mistakes,
and the model will imitate you.
But now, if the model imitates you
and you sample 10 times,
among those 10 samples there will be
some examples that are really good,
your best examples,
and sometimes your worst examples.
And so you can tell it:
no, this is the example I wanted.
And sometimes it will also explore a bit beyond
and do something that even you wouldn't have done.
And so, because it's easier for humans to compare
than to write (on my side,
I can tell you which poem I prefer,
even if I couldn't write them),
because of that,
you can have some emergence of
superhuman capabilities on some tasks,
thanks to RLHF.
Yeah, yeah.
And you're actually,
you're touching on something
that blows my mind all the time,
what we already have today.
And this is why the release of GPT-4
in March was such a big deal for me,
and shifted my own perspective on
the realization of artificial general intelligence,
AGI, an algorithm that has
all of the learning capabilities of a human, in our lifetime.
Because already with GPT-4,
because of this magic of RLHF that you're describing,
the shifting of the distribution:
intuitively, I imagine a normal distribution in my head,
where the outputs are going to be exactly as you described,
middling most of the time,
sometimes excellent, sometimes poor,
but with RLHF, we shift everything
so that it's excellent all the time.
And so the haiku example that you gave is great,
because a lot of people have the experience of using GPT-4,
and probably the experience of using Llama 2
is similar.
And by the way,
any of you listening right now can go,
at least at the time of recording,
and it's probably still the same at the time of this episode's release,
to Hugging Face Chat,
and the default model now for Hugging Face Chat
is the 70-billion-parameter Llama 2
chat fine-tuned model.
So you can experience it yourself,
and the queries that I've done in that Hugging Face Chat
have been comparable to what I'd expect with GPT-4.
Either way, one of these state-of-the-art open-source LLMs
is capable of doing so many more things
than I could as an individual.
Like, obviously, you expect to be able to come to this interface
and ask a question about anything in the world.
And it knows the answer,
and it can articulate it well,
and it can dive deeper,
and it can explain why it answered a certain way,
and when you argue with it,
when you disagree and you say,
no, no, I thought it was this other way,
it often knows,
oh, yes, that's a common misconception.
And so it's interesting that we're asking,
oh, how far away is artificial general intelligence,
this thing that's capable of learning everything
that we can learn,
when already today, what we have,
well, maybe it isn't as good as humans on some tasks,
but it is so much better than an individual human
at so many things
that in some ways
we've already attained this really crazy superpower
here on this planet.
So, yeah, I don't know,
I've kind of just gone off on a tangent there,
and there wasn't really a question.
But yeah, our researcher,
Serg Masís,
often digs up
the most incredible things on people,
on our guests,
and one of the things that he dug up on you
was that five years ago,
in 2018,
and I don't even know,
he might have translated this
because you were saying it to French children,
you said that there's evidence
that we are not at all close
to achieving general intelligence,
and that it's a fantasy.
But, yeah, my perception has shifted.
An example that I've given on air before, I think:
a year ago,
I was giving a TEDx talk in Philadelphia,
and my whole point of the TEDx talk
was that, because of AI,
technological progress
is moving so rapidly
that we can't predict accurately,
even a few years out,
what kinds of capabilities we'll have.
And if somebody had asked me,
at the time of the talk a year ago,
whether we would have
an algorithm that could do the things
that GPT-4 can do,
or that Llama 2 can do,
I would have said,
I don't know if we'll have that in our lifetime.
And now, a year later,
we have it,
we have it,
and people like you are making it
so that anybody can access it,
open source,
it's wild,
like that shift is unreal,
and it has me now,
like I went from being a skeptic
about what can happen
with AI in our lifetimes
to believing that,
yeah, some really crazy things
are probably going to happen
in our lifetimes.
So, yeah, I don't know if you have
any more thoughts on that.
I know that you've been interested
in AGI for a long time,
and yeah, what are kind of your thoughts
on when we might realize AGI
or artificial super intelligence beyond it?
Yeah.
I mean, let me share my thoughts at the moment.
But as a preliminary,
let me say that
it probably depends on the mood I'm in.
I often change my mind.
Five years ago I would have said one thing,
and today I might say another;
I'm always balanced.
But,
also, I'm bad at predictions there.
I think the only thing I'm sure of
is that the unexpected is expected.
Actually,
five years ago,
I had kind of just started my PhD.
It was 2017, 2018;
the Transformer was there,
GPT-1 was there.
I was working on summarization
with reinforcement learning,
and I remember some slides
where three meaningless words
were kind of the summary
I could obtain.
So,
again,
if you had asked me the same question then,
will we be where we are now,
I would have said clearly no.
Actually, I was even
late to the party
on all the scaling things;
I realized late
how big it could be.
And related to AGI,
I think there's
one question, which is:
do we already have
all we need to get AGI?
And is it just a question
of compute,
FLOPs, and scale?
And will we get there,
in the decade,
with more investments,
which we will have?
Or not?
And I don't have a strong conviction there,
but I can tell you that,
well,
first, I was bad at predicting
the impact of scaling.
Then I just watched
a talk on YouTube
where the speaker clearly explains
how, for him,
scaling
has a very important foundation,
even in the brain,
in human cognition,
and that could be it.
And then there's a very profound question
I always ask
when doing deep learning:
is it just statistical
correlation,
or is it more?
And I'm always balanced on that.
Sometimes it seems so good
at reasoning,
and sometimes
the mistakes are so silly.
And so, actually, there's a paper that makes me
tend to be on the side
that we could get AGI this decade
with scaling only.
There's a paper from Harvard
called "Emergent World Representations:
Exploring a Sequence Model Trained on a Synthetic Task,"
published at ICLR.
So this paper notably uses Othello-GPT,
where the idea has the flavor of
AlphaGo and all these things,
but the idea here is not to get
a state-of-the-art result;
it's to train a model
to predict the next token,
which is the next move
from human players.
And that's it,
just like a language model.
And then the question is,
at the end of that,
did it learn the distribution
of the moves
like a stochastic parrot,
or did it learn
a more profound understanding
of the world?
Here, the world is the game.
And they clearly found
that the transformer,
trained on just that,
kind of learned the world:
the rules,
the game,
what it is,
how it is,
from seeing just a sequence of actions.
And that is a clear signal
that there's a more profound understanding,
and that maybe just
from scale this intelligence
can emerge.
Yeah, yeah, yeah.
That is fascinating.
Yeah, so I guess that's kind of
the answer I'd expect,
a balanced answer.
Maybe we will.
Maybe we won't.
But yeah, incredible.
But we're working on that.
Yeah, yeah, yeah.
And so then,
I could probably guess
that you're on the side
that we should be trying to open source these.
If we can have AGI,
I expect, based on what you're doing
with open sourcing Llama 2,
Toolformer, and Galactica,
that you would like AGI
to be open source as well.
Yeah, I mean,
I'm pro open source.
I'm against having, somewhere,
a very capable model
controlled by a few people.
But at the same time,
it doesn't mean we should rush
to open source
such a big technology.
The efforts at the other labs
to put the bar very high
and think ahead about this,
what it means,
and how we could prevent harms,
are very important.
And we should learn from that.
And eventually,
we will have some regulations
and governance.
And yeah,
we will have an open AGI.
It's better than a closed AGI.
Historically speaking,
open always has been better
and always will be.
But that doesn't mean we need
to do it
irresponsibly.
Nice, yeah.
And that kind of
responsible and open development
of large language models
is something that goes back a while
for you.
We've talked
in this episode about
the stuff you've been working
on in the last few years,
but that's all at Meta,
like Llama 2, Toolformer,
Galactica,
and Unnatural Instructions.
This is something
that goes back further
for you.
So you worked on BLOOM
several years ago,
which,
in the GPT-2,
GPT-3 era,
was, I think, the leading
open-source
analog
to those kinds of models.
And yeah,
in your whole PhD
was based on
this kind of,
well, so I mean,
the title of your thesis
was natural language generation
with reinforcement learning.
And you developed a method
called QuestEVAL.
Was that,
is there any relationship
between that quest of Val
and the RLHF
that you were talking
about earlier?
Or is it,
is the reinforcement learning
that you were focused
on in your PhD
different from RLHF?
So somehow it has
the same foundation,
in the sense that you want
to maximize a reward.
And at that time,
maximizing a reward
in natural language generation
was based on automatic metrics
called BLEU or ROUGE,
and people who
know these metrics
know how bad they are.
So basically,
the thing was,
you improve the score,
but you reduce the quality
of the output.
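For a quick feel for the kind of metric being criticized here, this snippet computes ROUGE with the rouge_score package; a degenerate output can still pick up n-gram overlap, which is exactly the "score goes up, quality goes down" failure mode being described.

```python
# Quick illustration of the n-gram metrics being discussed, via the
# `rouge_score` package: a degenerate output can still pick up word overlap.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The cat sat on the mat."
decent_summary = "A cat was sitting on the mat."
degenerate_summary = "the the cat cat mat mat"  # overlaps words, says nothing

print(scorer.score(reference, decent_summary))
print(scorer.score(reference, degenerate_summary))
```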
So how can you
develop new metrics
that actually capture more of
what we want?
So maybe we can apply
reinforcement learning
on that,
which was working pretty well.
I developed reinforcement
learning techniques on one side,
and metrics
like QuestEval on the other.
There's a paper from IBM
that did reinforcement learning
with QuestEval
one or two years ago.
And they reduced
like hallucination by 40%.
So it was working.
Now,
the algorithmics
and the foundations
with respect to RLHF
are very close,
in terms of
architecture,
implementation,
math.
But the philosophy
of RLHF,
which I discussed before,
about improving
beyond the max of the human annotator,
is something
that is quite different.
Yeah.
Yeah.
Yeah.
Very cool.
And,
prior to your PhD,
you were involved
in quantitative trading.
You were at
SocGen,
Société Générale,
which is something,
I mean,
I wasn't at SocGen,
but something that
you and I have in common
is that I was also in quantitative finance
before becoming a data scientist.
In my case,
between my PhD
and becoming a data scientist,
I worked for a few years
as a quantitative trader,
working on algorithmic trading,
and I don't know.
I don't know how interesting it is
to go into
financial applications
or algorithmic trading applications
with AI and LLMs.
You're welcome to talk
about that if you want to,
but I think something
that might be more interesting
for our listeners
is that you advise on
and you invest in early stage companies
that are focused on generative AI
and LLMs.
So,
we probably have a lot of listeners
out there who
would like to start a company
or scale one up.
So what kind of advice
do you have
for people who are
looking to start up
or scale up a generative AI company?
What kinds of problems
should they be solving?
What should they do?
That's a tough question.
I mean,
I'm very good at advising them
on the side of the research.
It's like,
what is a trend?
What will be in one,
two years?
Is this technology far ahead
or the remainder?
And so,
that helps kind of transition
from research labs
to applications quickly.
I feel I have some
ability to help them
in this regard.
Now, it's especially difficult
to predict where to invest right now
in generative AI.
There's kind of a paradox
with this technology,
because we discussed the scale
and the velocity of the technology;
you said that a few minutes ago.
Think about it like this:
when I started
data science and deep learning,
it was like, data is the new oil.
And so then you had
companies like Grammarly
that captured and
annotated a lot of data
and trained deep learning models
on this proprietary data,
some very strong models
to correct
grammatical errors.
And this is kind of a
very strong technical barrier,
because to beat them,
to outperform them,
with deep learning,
you need to annotate the same volume
at the same quality.
So, they were the leaders.
And now, with the same kind of
technology deep learning,
what, one, two, three years later,
you have a model,
a plug-and-play chat GPT
that you can just create a website
in one minute or plug-in
on a Google Chrome
that is even better
when Grammarly to correct
and much more general.
And so all the technological barriers
vanish in one second.
And so, the paradox with this technology
is that everything that we're saying now could vanish in one year.
With what I said before,
it's expected that the unexpected will happen.
And so, I guess the main question for entrepreneurs is,
what can you build that will be robust in these conditions?
Yeah, yeah, yeah.
What can you build that will resist the unexpected,
or even be reinforced by it?
Yeah.
Nice.
Yeah, so I guess that's the kind of thing
people need to be thinking about with their moats.
Like, what is it?
Is there some kind of data or some kind of, you know,
market access that is unique,
that means that even if much better generative AI models
are open-sourced or, you know,
could eat your lunch kind of thing,
you still have this opportunity.
So maybe, yeah, if you can get some kind of edge somewhere,
then when these kinds of unexpected new things come out,
these new AI capabilities,
you can be integrating them into your tech
as opposed to being eaten by them.
Yeah. And again, like,
I don't want to make entrepreneurs worried.
This is a very risky and challenging environment,
but at the same time,
it's one of the greatest moments for entrepreneurs
to create, to make some products.
That's where the paradox comes from.
Like, it's one of the best times to create,
but also very risky.
Nice, very well said.
All right, awesome.
So that is the end of my questions
for you, Thomas,
and the end of Serg's
questions for you.
So let's turn to
audience questions.
I made a post
a week before recording
on social media,
on both LinkedIn and Twitter,
and the LinkedIn post,
in particular,
got a crazy amount of reactions,
250 reactions,
over 70,000 impressions,
just at the time of recording here,
which is definitely at the top end
of the distribution of a post that I make.
And we had a really cool one from Alice,
who used to work with me at Nebula.
She was an amazing
product manager responsible
for our data products
and AI products.
But I think, Alice,
I think your questions
on natural instructions
have already been answered
earlier in the episode
by Thomas.
So hopefully,
that answer was to your satisfaction.
So let's move on
to a question from Adityan.
So,
Adityan is interested
in generally rough rules of thumb
for how you choose
what kind of open-source
LLM to start with
and how to fine-tune it.
So if he's building
a startup for a niche use case
using a large language model,
some of his questions
are around things like,
how do you decide what size
to go with?
So I think I kind of,
I already actually answered
this question earlier in the episode.
So with Llama 2,
for example,
the released model sizes
are 7 billion,
13 billion,
and 70 billion.
And I talked already earlier
about how the 7 and 13 billion
can often fit on a single GPU.
And so, you know,
a small model with that
could be good enough
for a niche task.
You'd only need the 70 billion
if you wanted the model
to be able to do a very broad
range of possible answers,
a very broad range of possible tasks.
So, yeah.
So I think in your case, Adityan,
with a niche use case,
7 billion is probably going to be fine.
You can start there, and if it doesn't do the trick,
try 13 billion.
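For listeners who want to try exactly that, here is a minimal sketch of starting with the 7-billion-parameter chat model on a single GPU. It assumes the Hugging Face transformers and accelerate libraries, and that you have accepted Meta's license for the gated "meta-llama/Llama-2-7b-chat-hf" checkpoint on the Hub; the prompt is just an illustrative placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # ~14 GB in float16, fits on a 24 GB GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps the 7B model on one GPU
    device_map="auto",          # places the weights on the available GPU(s)
)

# Zero-shot: just ask the chat model to do the niche task directly.
prompt = "Summarize this support ticket in one sentence: my order #123 arrived damaged."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```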
But the question then for you,
Thomas,
is how many data points
do you think that he needs
to collect or somehow
synthesize in order
to be able to make use
of fine tuning?
So, you know,
the implication here is that
there's some niche use case
that he would like the model
to be able to handle.
How many data points
does he need to have
in order to make use of a, you know,
parameter-efficient fine-tuning approach
on top of Llama 2
and excel in that task?
Right.
It's an interesting question.
I was about to say,
maybe you can start even without fine-tuning,
just off the shelf, with zero-shot.
But also with few-shot:
one, two, three, five examples you created yourself.
I think it's not like a few-shot pre-trained model,
like it used to be before.
It's a chat model.
So maybe you need to do a bit of prompt engineering,
in the sense that you create a dialogue:
like, example one with your input,
you make the model kind of generate your gold output.
And then, when you ask your question,
the model is kind of biased toward the format,
the kind of template you want it to be answered in.
That would be the first thing I would try.
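Here is a rough sketch of that dialogue-style prompting idea, reusing the tokenizer and model from the earlier snippet. The classification task, labels, and example emails are purely hypothetical, and it assumes a transformers version recent enough to provide apply_chat_template for the Llama 2 chat format.

```python
few_shot_dialogue = [
    {"role": "system", "content": "You label customer emails as BILLING, SHIPPING, or OTHER."},
    # Example 1: your input, followed by the gold output you want the model to imitate.
    {"role": "user", "content": "Email: 'I was charged twice this month.'"},
    {"role": "assistant", "content": "BILLING"},
    # Example 2.
    {"role": "user", "content": "Email: 'My package has been stuck in transit for a week.'"},
    {"role": "assistant", "content": "SHIPPING"},
    # The real query: the model is now biased toward the same one-word template.
    {"role": "user", "content": "Email: 'Can you update the card on file for my account?'"},
]

prompt_ids = tokenizer.apply_chat_template(
    few_shot_dialogue, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
reply = model.generate(prompt_ids, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(reply[0][prompt_ids.shape[1]:], skip_special_tokens=True))
```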
If it's not enough,
I would say that generally,
and it's very hard to answer systematically
because it depends on each use case,
the task difficulty, et cetera.
But in general,
what I have seen is that
with very few examples,
sometimes a hundred,
a thousand at max,
you can have like dramatic
improvements on some tasks.
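And if few-shot prompting isn't enough, a parameter-efficient fine-tune on those hundred-to-a-thousand examples might look roughly like this. The peft library's LoRA is one common way to do it; the rank, target modules, and other hyperparameters below are illustrative guesses rather than anything Thomas recommended.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the 7B weights

# From here, train peft_model with your usual Trainer / SFT loop on the ~100-1,000
# niche-task examples; only the small LoRA adapter weights are updated and saved.
```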
Very nice.
Yeah.
That's a really great answer.
Very practical answer, Thomas.
Thank you very much.
All right.
Our next question is from
Svetlana Hansen.
She's a senior software engineer.
I believe she works on like
a lot of outer space projects,
with NASA folks, that kind of thing.
So, Svetlana has been following
the Super Data Science podcast,
I think, for as long as
I've been hosting it.
So, several years now,
and she's had some great
guest suggestions in the past.
And she had a series of great
questions for Thomas.
One that I really liked was
about the lessons that you've
learned, Thomas,
from developing and managing
these large-scale AI projects.
So, being involved with Bloom years ago,
Galactica, Toolformer, or Llama 2,
these have huge team sizes,
huge models, very long timelines,
and you kind of gave us a bit of an insight into this.
There's this pressure,
this race, especially in open
source to get out there before
other people.
And so, for example,
you made the decision,
when the 34-billion-parameter model
wasn't meeting the same safety standards
as the 7, the 13, and the 70 billion parameter
Llama 2 models.
You said, you know,
let's just go ahead and
publish what we have,
because we've got the state
of the art at the 70 billion.
We've got the smaller models
that you can fit on a single GPU.
So, we've had some kind of
insight into your thinking
on these kinds of large scale
projects, but yeah,
I don't know.
Her question here is:
what other key lessons have you learned
about developing and managing large-scale AI projects?
Yeah, it's a very interesting
question.
Let me try to say something smart on that one.
Maybe the main difference with these big projects,
with respect to, like, when I was in academia,
with some small papers, with very few people,
is that because of the size,
a lot of people are impacted.
There's a lot of budget around,
and you have the potential to reach so many more people.
The project is at another
scale of impact.
And because of all those ingredients,
well, it was the case for probably Bloom,
and even more Galactica,
where I was more involved in the training as a project:
you have a lot of GPUs running,
and you have to make some decisions.
And the thing is, the main difference is that,
let's say, in a perfect world for researchers, as I am,
you want to understand everything, all the phenomena.
And so,
you want to do all the
ablations,
you want to do all the
experiments to see
what's the impact
of this factor,
and this one,
and what if we had done that?
The thing is,
there's so many possibilities,
and every experiment
costs so much,
and takes so much
resources,
that you cannot do that
anymore.
And so, one of the main challenges is that
you're responsible for making some decisions,
as I was in Llama 2, of like,
okay, we need to choose between that and that.
The thing is, and even more because no one is publishing
the secret sauce anymore, maybe just we did,
you're like, okay, I don't know, what's my intuition,
but how can we quickly verify, and change if needed?
And you're playing with actually a lot of resources,
like millions of dollars, as I mentioned, for the annotation,
a lot of thousands of GPUs,
many authors that are involved in the project.
And time is also a constrained resource,
you cannot spend one year to explore.
And so, how to deal with this changing environment
is what I thought was the main challenge on my side.
And like, when you're at night, before sleeping,
like, you took a decision,
is it the correct one or not?
And you don't know,
and this uncertainty, for researchers,
is something hard to deal with.
Nice, so, I guess your answer, your key lesson,
is that there are trade-offs,
and you don't know whether you're making the right call,
and maybe it's these decisions on how quickly
do we rush this or spend some more time on it.
Well, it seems like with Llama 2,
you certainly got it right.
It made an enormous splash and had a huge impact,
so you seem to be getting it right.
We've got a comment
here from
Laurens van der Maaten,
who was recently
on the show,
episode 709,
a colleague of yours
at Meta,
and he doesn't have a question,
but I just wanted to highlight
that he said,
Thomas is awesome.
I'm looking forward
to hearing your conversation
with him.
Yeah.
So, Lawrence,
I hope that you enjoyed
this conversation
as much as you were hoping to.
And then,
last question here is from
SM.
So,
SM has asked questions
on the show before,
but SM has a,
I assume,
very deliberately
sparse LinkedIn profile,
which is unusual.
Most people on LinkedIn,
it's like real names
and that kind of thing.
But the SM account, seemingly, exists solely
to ask questions
on the Super Data Science
Podcast,
because there are,
like, no other connections.
So,
I appreciate that compliment.
So,
SM's question is,
it's a long question,
but it's,
I think it's basically
getting at this idea of,
you know,
LLMs can be wrong.
They can make mistakes.
They can give unhelpful answers.
But nevertheless,
they are often
very useful,
and they're becoming
more and more useful all the time.
So,
I guess this,
this question is,
like,
I think we touched on this
a little bit early in the episode,
as well,
when you were talking about
the research at Harvard,
and the ability for transformers to seemingly understand,
though maybe "understand" is a bad word.
But I think you have a good sense
of where this question is going,
so you can answer it.
Yeah, I think, like, the question,
if I understood it correctly, is about,
like, can we one day in the future rely on these models,
and why is it not yet the case?
Isn't it that humans, on some very simple tasks,
obtain a 100% score,
while models will sometimes do so impressive things,
and then, when it's not expected,
will fail on silly things?
And so, that's very, very weird.
My understanding, and I'm not saying I got it right,
but just my intuition and what I'm studying at the moment,
is that, as we discussed before,
with scale we might have an emergence of more general reasoning,
and my instinct, my understanding,
is that those algorithms kind of learn
a compression of the data.
Maybe, let me give you an example.
If I give you, I can print you,
an infinite number of tokens to train the model,
of numbers and calculus:
one plus two equals three, and so on.
Now, if I give you that,
there's two ways to predict the next token after "equals".
You can memorize everything,
but then, if it gets to an infinite vocabulary,
you will need a lot of weights to memorize it.
Or, you can compress the information,
such that you learn, you internalize,
the algorithm behind it.
And so, you can predict accurately the next token,
whatever it is,
and that requires many fewer weights,
to learn calculus,
than to memorize an infinite number of facts.
Now, at the moment, it seems that models
are very good at doing calculations on two or three digits,
and when it goes beyond that, they fail more and more.
My understanding is that they kind of internalize a generalization,
in terms of calculus, for one digit,
but somehow, in the large vector space,
they kind of see it as different objects:
calculus for two digits versus for four or five digits,
maybe because it appeared less.
And so, they don't yet have this generalization of,
all of this, one, two, three, up to seven, eight digits,
is calculus,
and nine and ten digits, that I didn't see in the training,
is calculus as well.
And so, there's one dimension along which they didn't generalize,
but there are some along which they already generalized.
And I feel like a true AGI,
if we get there, with scaling or in any other way,
could emerge from this generalization of compression
at another scale.
When the generalization becomes complete, somehow,
if we get there with scaling.
That was an amazing answer,
that made it so crystal clear.
And yeah,
really built nicely on what you said
earlier in the episode around,
you know,
representing these complex concepts.
Very, very cool.
All right, Thomas,
it has been an amazing episode.
I've learned so much.
Truly, it's been an honor
to have you on the show.
Before I let you go,
do you have a book recommendation for us?
Um,
let me,
The Black Swan, by Nassim Nicholas Taleb.
Yeah, yeah, yeah, yeah,
great choice.
And, uh,
yeah, how should people follow you?
What's the best, you know,
after this episode,
if people want to keep up
with the latest on your work
or your thoughts,
how should they do that?
Oh, sure.
Again, follow me on LinkedIn, Thomas Scialom,
or on Twitter as well,
I'm really easy to find there.
Nice. All right.
We'll be sure to include
those links in the show notes.
Thomas, thanks again.
And yeah, best of luck.
We can't wait to see what you release next.
Uh, some stuff, it sounds like,
probably even before this episode is live.
Uh,
and yeah, truly, uh,
on, on behalf of my listeners
and tons of other early stage startups,
like mine,
we are so grateful to have people
uh, like you and Meta
being willing to open source
these incredible technologies.
It's making such a huge impact,
uh, commercially,
and also big social impact.
So, thank you very much.
Thank you, John, for having me with you all this time.
It was my pleasure.
Thomas is already a legend,
but it seems he's only
just hitting his stride
and his biggest,
most mind-blowing,
potentially,
AGI summoning projects
are yet to come.
In today's episode,
Thomas filled us in on how
pre-training and fine-tuning
an LLM on an as-yet unprecedented scale
for an open-source LLM
led to the big Llama 2 splash.
He talked about how handling code,
tools,
web search,
and even better performance
are up next for the Llama project.
How Toolformer calls an appropriate API
and incorporates the output
into its next-token predictions.
How RLHF shifts the distribution
of a pre-trained LLM's outputs
from a normal distribution
of human-generated quality
to outstanding,
often superhuman quality.
And how, with AI developments,
the unexpected is expected,
and so AGI may be just
around the corner.
As always,
you can get all the show notes
including the transcript
for this episode,
the video recording,
and materials mentioned on the show,
the URLs for Thomas'
social media profiles,
as well as my own,
at SuperDataScience.com
slash 713.
Thanks to my colleagues
at Nebula
for supporting me
while I create content
like this Super Data Science episode for you,
and thanks, of course,
to Ivana, Mario, Natalie, Serg,
Sylvia, Zara, and Kirill
on the Super Data Science team
for producing another tremendous episode for us today.
You can support this show in so many ways:
you could check out our sponsors' links,
you could share it
with a friend or colleague,
you could review an episode,
you could subscribe,
but most of all,
just keep on tuning in.
I'm so grateful to have you listening,
and I hope I can continue
to make episodes
you love for years and years
to come.
Until next time,
my friend,
keep on rockin' it out there,
and I'm looking forward to
enjoying another round
of the SuperDataScience
podcast with you very soon.