713: Llama 2, Toolformer and BLOOM: Open-Source LLMs with Meta's Dr. Thomas Scialom

This is episode number 713 with Dr. Thomas Scialom, AI Research Scientist at Meta. Today's episode is brought to you by AWS Cloud Computing Services, by Grafbase, the unified data layer, and by Modelbit, for deploying models in seconds. Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple.

Welcome back to the Super Data Science Podcast. Today we've got the trailblazing AI researcher Dr. Thomas Scialom on the show. Thomas is an AI Research Scientist at Meta. He's behind some of the world's best-known generative AI projects, including Llama 2, BLOOM, Toolformer, and Galactica. He's contributing to the development of artificial general intelligence, AGI. He's lectured at many of the top AI labs, such as Google, Stanford, and Mila in Montreal. He holds a PhD from Sorbonne University in France, where he specialized in natural language generation with reinforcement learning. Today's episode should be equally appealing to hands-on machine learning practitioners as well as folks who may not be hands-on but are nevertheless keen to understand the state of the art in AI from someone who's right on the cutting edge of it all. In this episode, Thomas details Llama 2, today's top open-source LLM, including what it was like behind the scenes developing it and what we can expect from the eventual Llama 3 and related open-source projects. He talks about the Toolformer LLM that learns how to use external tools; the Galactica science-specific LLM, why it was brought down after just a few days, and how it might eventually re-emerge in a new form; RLHF, reinforcement learning from human feedback, which shifts the distribution of generative AI outputs from approximating the average of human responses to approximating excellent, often superhuman quality; how soon he thinks AGI, artificial general intelligence, will be realized and how; and how to make the most of the generative AI boom as an entrepreneur. All right, you ready for this tremendous episode? Let's go.

Thomas, welcome to the Super Data Science Podcast. It blows my mind that you're here, that we get to have you on the show. I'm so excited for this interview. Where in the world are you calling in from today?

From Paris.

Nice. It's been a while since I've been to Paris, but I've never had a bad time there.

Yeah. Me too.

Nice. So we know each other, I'd say, almost serendipitously. I did an episode a couple of weeks ago on Llama 2 — episode 702 — it's a 15-minute, maybe 20-minute episode with just me describing, from my understanding, all the new capabilities of Llama 2 and a bit of how the model came about. And as I was opening up the technical paper, there are, I don't know how many, probably 50 authors, listed vertically in this big long list on the side of the technical paper page. But somehow my brain noticed that I recognized one of them. I was like, Anthony Hartshorn. I know Anthony Hartshorn. There can't be two people named Anthony Hartshorn. And so I sent him a message and said, do you want to be on a podcast? We're the most listened-to podcast in the data science industry.
And he suggested you as the guest instead, which is amazing because you're the final author on the paper. In the academic world, it might sound to a casual listener like being the final author means that, of the 50 people, we have the person who made the smallest possible contribution. But in fact, on academic papers, that isn't how it works. Very often, the first author is the person who actually wrote things up and put everything together, but traditionally, in an academic work, the last author will be someone like the head of the lab who brought in the funding and oversaw the project. So yeah, truly, it's an honor to have you here with us.

Thanks for having me.

So at the time of recording this episode, it's only been a few weeks since Meta released the open-source large language model Llama 2. You were a science and engineering leader for this groundbreaking development. Can you explain the significance of Llama 2 in the context of other recent advancements in AI and generative models? Maybe fill us in on how the Llama projects in general came about at Meta — like, what made Meta decide to invest in this? Obviously, you're not going to divulge figures on air, but there are rumors that eight-figure sums have been invested in creating Llama 2. So it's interesting, even from the very beginning, what was it like to get this kind of buy-in from the organization to be doing this open sourcing?

Yeah. So, no doubt, large language models are a big deal. They have made some breakthroughs in research. I think also we had the ChatGPT moment at the end of last year, and most people realized the potential of this technology. And so I think we did mainly two things with Llama 2. One, we did what we call aligning the model, with techniques like RLHF, for instance — I can dig more in depth later if you want. But basically, the idea is that you have what we call a pre-trained model, which has kind of read the internet with next-token prediction. It tries to predict the next token, and this is what we call self-supervision. It's supervision because we have a target, but it's "self" because text on the web is vastly accessible as it is. And so just with that, you have a pre-trained language model, which we had with Llama 1, and we did it again with Llama 2 and extended it a bit incrementally. And that's where all the knowledge is, where all the capabilities kind of emerge. But then it's hard to access. And the magic behind ChatGPT is that it's kind of an interface, a chat, which is very natural, and it follows your instructions — you say, oh, talk like this person, or do this kind of thing, or make it more like markdown, or bullet points, or change that, make it shorter — and it understands your instructions and does it precisely. And this happens at fine-tuning. It's kind of refining, educating a pre-trained language model, which we did also with Llama 2. And that was one of the main innovations, because no one had done that at this scale and open-sourced the model, explaining all the research behind it in a research paper, as we did. Before Llama 2, basically, the only aligned large language models that were available — from OpenAI, Anthropic, Google — were closed behind an API. So I would say that's the main innovation, in terms of science and in terms of impact for the community — the research community, business. I think you mentioned, and you're not the only one, that your company now uses Llama 2.
This is also possible because we also changed the license to something friendly for commercial applications.

Yeah, exactly. I don't have 700 million users at my machine learning company yet, so we're still allowed. This commercial license says that as long as you don't have more than 700 million active users, it's OK to use Llama 2. And for us, it's brilliant. So previously, we had been using Dolly 2.0 as our base model. We have a number of different kinds of generative AI capabilities in our platform for our users, and something like Llama 1, which was pre-trained but not fine-tuned, would actually have been fine for us as a starting point, except for the commercial-use limitation. We never could use the original Llama in production, because obviously there was this commercial-use restriction — it was for academic purposes only. And that also meant that some of the initial fine-tuned models that came off the back of Llama — like Alpaca, out of Stanford, and like Vicuna, which Joey Gonzalez, who was in episode number 707 of this show, helped develop — that whole family of models, we were like, man, we're going to be left out. But then, luckily, some groups did come along with open-source alternatives that are commercially licensable — Databricks released Dolly 2.0, for example, and there were others; in episode 672, I talk about the different open-source options that are available, where you not only have that pre-training with the self-supervision you were describing, but also the fine-tuning based on human feedback, which means the responses are going to be deliberately helpful and more conversational, like a chat. So we had been using Dolly 2.0 from Databricks as our starting point for the last couple of months.

When Llama 2 came out, there was the scale — you described this already — the unprecedented scale in terms of the number of tokens: two trillion tokens for pre-training, and over a million data points for the fine-tuning. This kind of scale is orders of magnitude more. Dolly 2.0, for comparison, had about 10,000 instructions that it was fine-tuned on, so you're talking a hundred times more. And with these large language models, the scaling laws that we've seen come out, like the Chinchilla scaling laws, show that you kind of have three levers for getting a great model: the number of parameters, the training dataset size, and the training time. And it seems like with Llama 2, you and your team have tried to max out all of those things, especially with the 70-billion-parameter Llama 2 model. That's, I guess, also worth mentioning: if people haven't listened to my Llama 2 episode already, you may not be aware that it isn't just one model that was released here — we're talking about a model family. There's a 7-billion, a 13-billion, and a 70-billion-parameter model. And those two smaller ones will be able to fit on a single GPU, which means you can run them relatively inexpensively. And so with applications like my company's, where we have a relatively discrete number of generative tasks that we need the model to perform, we can take that 7 billion or that 13 billion and fine-tune it to our tasks. And for listeners who aren't aware, you can do this yourself at home using a parameter-efficient fine-tuning technique like LoRA, low-rank adaptation, which I talk about in episode number 674.
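For listeners who want to try this at home, here is a minimal sketch of what parameter-efficient fine-tuning with LoRA can look like using the Hugging Face transformers and peft libraries. The model checkpoint, data file, prompt template, and hyperparameters below are illustrative assumptions, not the exact setup discussed on the show.

```python
# Minimal LoRA fine-tuning sketch (illustrative; model, data, and hyperparameters are assumptions)
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base = "meta-llama/Llama-2-7b-hf"  # hypothetical base checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA: train small low-rank adapter matrices instead of all 7B weights
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the full parameter count

# Toy instruction dataset; "my_instructions.jsonl" with instruction/response fields is assumed
data = load_dataset("json", data_files="my_instructions.jsonl")["train"]

def tokenize(example):
    # Simple assumed prompt template for causal-LM fine-tuning
    text = f"### Instruction:\n{example['instruction']}\n### Response:\n{example['response']}"
    out = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    out["labels"] = out["input_ids"].copy()
    return out

data = data.map(tokenize)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-lora", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4, fp16=True),
    train_dataset=data,
)
trainer.train()
```

Only the small adapter weights are trained, which is why this kind of fine-tuning can fit on a single GPU and cost tens rather than thousands of dollars.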
So you can take a model like Llama 2 — the 7 billion or the 13 billion — and typically very inexpensively, for tens or hundreds of dollars, you can fine-tune it to your own specific tasks. And for us, that's perfect. It means we now have this amazing large language model that is as good as GPT-4 or better in our own tests, when we start with Llama 2 and fine-tune it with our own data on the narrow range of tasks that we have. And if you're a listener out there and you're like, well, I want the absolute state of the art, then you can use Llama 2, and at least in terms of open source, this is going to be the state of the art. So yeah, I've just talked a lot. But the point is that, Thomas, what you've done, and what this means for us as a community to have access to something like Llama 2 — it's a game changer. It was obvious that it was a game changer within minutes of starting to read the Llama 2 materials online, and the data science team at my company immediately started retraining our models with Llama 2.

That's always good to hear. Thanks. Maybe worth mentioning, what we released also includes an extended context length, from 2,000 to 4,000 tokens, et cetera. It's text-only for now. But I think that's also the magic of open sourcing: we don't need to push for aspects that the community will deal with easily. We know that extending the context length further with fine-tuning is possible. We know that connecting multimodal inputs is straightforward. And what was magic is that, within weeks of the release, people had already done that efficiently. So that's also one of the strengths, in my opinion, of open sourcing these kinds of models: we'll see much more innovation, with shorter cycles of innovation, thanks to it. So that was part of the philosophy too. We went, as you said, all in on the scale of things that we can do at Meta, to make it as good as we can so that everyone could use it and then adapt it for their use cases.

Amazing. Are you stuck between optimizing latency and lowering your inference costs as you build your generative AI applications? Find out why more ML developers are moving toward AWS Trainium and Inferentia to build and serve their large language models. You can save up to 50% on training costs with AWS Trainium chips and up to 40% on inference costs with AWS Inferentia chips. Trainium and Inferentia will help you achieve higher performance, lower costs, and be more sustainable. Check out the links in the show notes to learn more. All right, now back to our show.

And another thing that you did with Llama 2 is that there was extensive thought around ethics, responsible use, acceptable use. For example, there were red-teaming exercises where you simulate internally that you have malicious actors. So can you dive into why this was so important? I think this was unprecedented also: not only was the amount of data for both the pre-training and the fine-tuning steps unprecedented, but for an open-source model, I think the level of care that went into the ethics and responsible use was also unprecedented.

So yes, maybe let's give a bit of context. The strongest LLMs so far were, as we said, accessible only through an API. I think that is problematic in several respects. It hinders research; it prevents academia from exploring, and industry from having commercial use cases. And to be honest, we would be nowhere without open sourcing — think about BERT, transformers, and even GPT-1.
That being said, the risks, present and future, with respect to LLMs have been discussed at length by some researchers. I think OpenAI and Anthropic did an extremely important, invaluable job at raising the bar for safety, and I'm glad they did. The thing is, when you have an API like theirs, it's easy to control: you can put classifiers on top of it, you restrict access somehow. And there's clearly a very hard challenge when it comes to open source, because you release the weights and you enable everyone to fine-tune and do whatever, so it's harder to control some of that. So while I feel it is very important to do it — and I don't think we're at the stage where these models are so dangerous that we should not do it — it was important to do it in a responsible way, to raise the bar even higher than what has been done for competitor models behind an API, because the risks are bigger when you open source. And so we took a lot of inspiration from the work that was done at those companies, OpenAI and Anthropic, and we applied all the methods we could, plus some new methods we discuss in the paper, to make the model as safe as we could. It's not perfect — there are still some jailbreaks — but, and maybe we can discuss this later, I feel we had two main complaints that followed the release. And one of them was: it's too safe. There's an example, for instance — I don't remember exactly, "how can I kill a process" or something like that — and the model says, no, it's not good to kill.

Right, right, right.

Well, there was a system prompt on top of it. If you remove it, the model actually produces far fewer false refusals. But to me, this was a success, given that this was the first time we released an open-source model at this scale, and so we had the responsibility to raise the bar for safety. Because it was unprecedented, I prefer to be on the side of it being too safe, and to progressively decrease the level of safety if needed for future releases, rather than the opposite.

Yeah, yeah. And actually, your discussion of that reminds me that when I was doing the research for my solo episode about Llama 2, episode number 702, and digging into your technical paper, it actually talks about four models. The three models that were released were the 7-billion, 13-billion, and 70-billion-parameter models, and then, off the top of my head, I think it was a 34-billion-parameter model that you also trained. But I noticed that, for whatever reason, there was a chart with some metric of safety, and that model, the 34 billion, seemed to be more like the existing open-source LLMs in terms of safety — kind of more like Falcon, or more like Dolly 2.0. So it seems like you've even held back a model, I'm guessing — and you don't need to confirm on air — because it didn't meet the safety standards of the other three, which is an interesting thing to have happened, because presumably the same process was followed for all of them.

Yeah, that's a good decoding of what happened. And that's one of the main reasons we didn't release it.
One thing also — and we don't know, we didn't have the time to investigate — what people have to understand is that the whole process, starting from the pre-trained model, fine-tuning it, applying RLHF with reinforcement learning, then evaluating it automatically, then evaluating it with human annotators and with red teamers, who are experts at finding failures and trying to make the model say something bad — they push the model in the hardest possible ways to make it say something like that — all of this takes a lot of time. And so we just decided, based on this data point — which we don't yet know the reason for, because we didn't have the time to investigate; maybe it's an error in the evaluation, maybe it's a model that was not fine-tuned well, I don't know exactly yet — we just said, okay, why waste one, two, three more weeks just for that? We can already release the smaller models and the biggest, most capable model. Let's not wait to let everyone use it.

That makes perfect sense. And it's actually kind of nice to have that confirmed, because that's what I speculated on air earlier. So, great. So you mentioned that there were two main complaints. One of them was that it was too safe — people were complaining that Llama 2 is too safe, so things like somebody saying, "I want to kill this process," leads to it saying, I can't help with killing, killing is bad. What was the other big complaint people have had since the release?

Tell me if you heard the same, but from my perspective, it was safety — too safe — and code: bad coding abilities.

Oh, yeah, yeah. I actually do say that in my episode 702 as well. When I say that Llama 2 performs at the state of the art relative to any other open-source model, that's on natural language tests, where it's natural language in and out. My understanding — and I haven't tested this extensively myself — is that where there's code being generated, or where you're asking it to do mathematical problems, it doesn't seem to perform as well as some other options out there.

Yeah, so on that: we actually rushed so fast from Llama 1 to Llama 2 to get these abilities that we focused mainly on natural language and not code. I agree the model is not that good at code for now, but we are working on that. And, well, by the time the podcast is released, I hope that some Code Llama will also be released.

Very cool. All right, that's awesome. That's exciting to hear. That gives us a really tantalizing glimpse. It's possible that by the time this episode is out, that will be old news, but a Code Llama, that sounds very cool. Is there anything else that you can tell us about where this research might be going? I understand — I don't want to be extracting information from you under duress. But yeah.

I mean, we are the open guys, so... No, in general, there's no real secret. We'll try to improve the models in general, which means scaling them, keeping training them on more tokens, increasing the abilities, maybe tackling more math and code, which is interesting because that's reasoning. We'll also try to improve the RLHF stage and its capabilities. And one of the directions is obviously tools: teaching the model to use some tools in a zero-shot fashion, maybe to access the web more easily. But those directions seem quite reasonable and expected, so there's no big surprise.
Now the question is more like, how will we do that? Will we make some breakthrough discovery along the way that enables a larger improvement? Hopefully, yes.

Nice, yeah. And you mentioned being able to handle tools, which is something you have a lot of experience with, because you've also been involved with the Toolformer LLM. This is an LLM that came out earlier, and Toolformer is specialized to decide which API to call in a given circumstance, when to call the API, what arguments to pass, and how best to incorporate the results into the next-token prediction of the generative model. So maybe this is a good time to switch over and talk about the Toolformer project, since it sounds like future Llama iterations might incorporate some of that kind of capability.

Yeah. So Toolformer was about connecting large language models with tools. It was an idea I had last summer, a year ago. It felt like a natural extension of all these models — RETRO, Atlas, RAG — where you augment a language model with a retriever. And the intuition is very easy: the idea was to train a dense retriever and a language model together so that you augment the context. When you ask a question, you search over all the training data for relevant passages, and so if the model didn't memorize something well, this boosts the capabilities, which was very effective, as shown in all those papers. This is what we call a non-parametric framework, because you rely not only on the parameters or weights of the model, but also on an external source of knowledge that could possibly grow over time to, for instance, incorporate fresh information without necessarily retraining the model. That being said, my idea was to extend this to a general non-parametric framework — and there was some work at the time doing that — where you could use a calculator, a Python executor, or different search engines; maybe I use Google for some searches and Google Scholar for specific searches on papers. And so the idea was to just give a list, a set of tools, to the model and, in a much more human-like way, teach it to use them given the context. Not at every inference — the model now has to know when to use a tool and how to use it to benefit from it. And so, for Toolformer, Timo Schick led this work and we published it in February. And I think it was also very good timing. It was two months after ChatGPT, and everyone was kind of, well, the game is over, ChatGPT is there, what's next? But ChatGPT at the time was just limited to a window — you're chatting with an agent that has no access to the world. And that changes a lot the perception you can have: once you give LLMs access to the world, to some knowledge, it makes the experience for the user completely different. It extends the capabilities dramatically. And that's what we did with Toolformer, with some self-supervised techniques, so that the model basically learned by itself whether introducing a tool call decreased the perplexity. So yeah, that was the main idea.

Yeah, and so this may be familiar in an analogous way — and you can tell me where the analogy breaks down — but having not used Toolformer myself yet, it seems to me to be similar to what later happened with ChatGPT with the plugins, so that now, with ChatGPT, you can turn on third-party plugins.
So if you turn on the Wolfram Alpha plugin, then when you ask ChatGPT to do a calculus problem, it's going to bring in Wolfram Alpha and use that API, as opposed to trying to use next-token prediction to do math — which works surprisingly well in a lot of circumstances; it's mind-boggling that next-token prediction can often do math correctly — but you're basically guaranteed a correct answer, a correct differentiation, for example, if you use Wolfram Alpha to do it. So ChatGPT will automatically detect, okay, this is a circumstance where I should be using Wolfram Alpha, let's do some math with that. Or it can access the web — like you said, it can do a web search — or it can plug into websites like Kayak to book your trip and find you the car rental and book the hotel. So is that kind of what Toolformer enables, except that Toolformer is obviously open source?

Yeah, I mean, I think it was there. I saw a lot on Twitter when, one month after Toolformer, OpenAI released the plugins — they actually cite Toolformer on the plugins page — and some people said OpenAI implemented Toolformer in one month. Honestly and humbly, I think the idea was in the air and we had good timing to plant the flag. I think also the method used by OpenAI was quite different from Toolformer. So that's what's interesting. In Toolformer, the idea was — well, we had access to bad language models at the time, compared to GPT-3 at least; it was before Llama. And so what we did was this self-supervised method, which works kind of well. But my conclusion at the end of the work was that we need more capable base models and fine-tuned, aligned models, such that they learn to use tools with some instruction-following scheme. Which is also why I stepped back from Toolformer at the time and didn't extend the project, and worked instead on Llama 2 and making it work with instruction tuning, to follow the instructions of the user. And actually, there's one paragraph in the discussion and analysis of the paper showing a kind of emergence of tool use, where you just describe it with a prompt — you tell the model, basically in natural language: you can use a calculator, use this format; you can use a search engine, use this format. Then — I don't remember which one it was in the paper, but something like — what's the difference in height between the Eiffel Tower and the Empire State Building? And then, naturally, it says: step one, search the height of the Empire State Building, search the height of the Eiffel Tower, and then calculate the difference between the two. So you can see the capacity for using the tools emerging. The method is pretty efficient, but, I would say, it becomes obsolete with a better aligned model. We moved to Llama 2; now maybe we'll come back to Toolformer.

Right, right. Makes perfect sense. This episode is brought to you by Grafbase. Grafbase is the easiest way to unify, extend, and cache all your data sources via a single GraphQL API, deployed to the edge closest to your web and mobile users. Grafbase also makes it effortless to turn OpenAPI or MongoDB sources into GraphQL APIs. Not only that, but the Grafbase command-line interface lets you build locally, and when deployed, each Git branch automatically creates a preview deployment API for easy testing and collaboration. That sure sounds great to me. Check Grafbase out yourself by signing up for a free account at grafbase.com. That's g-r-a-f-b-a-s-e.com.
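Circling back to Toolformer for a moment: as a rough illustration of the self-supervised filtering idea Thomas described — keep an inserted tool call only if it makes the following text easier for the model to predict — here is a toy sketch. The scoring function below is a stand-in for a real language model's loss, and the bracketed API-call syntax is just one plausible format, not the paper's exact implementation.

```python
# Toy sketch of Toolformer-style filtering: keep an inserted API call only if it
# reduces the loss (perplexity) on the continuation. loss_fn is a stand-in for a
# real language model's negative log-likelihood.
from typing import Callable

def calculator(expr: str) -> str:
    # Trivial stand-in "tool": evaluate a basic arithmetic expression
    # (never use eval on untrusted input in real code).
    return str(eval(expr, {"__builtins__": {}}))

def augment_if_helpful(prefix: str, continuation: str, api_call: str,
                       loss_fn: Callable[[str, str], float],
                       min_gain: float = 0.1) -> str:
    """Insert '[Calculator(expr) -> result]' before the continuation
    only if it makes the continuation easier to predict."""
    result = calculator(api_call)
    augmented_prefix = f"{prefix} [Calculator({api_call}) -> {result}]"
    baseline = loss_fn(prefix, continuation)
    with_tool = loss_fn(augmented_prefix, continuation)
    return augmented_prefix if baseline - with_tool >= min_gain else prefix

# Dummy loss: pretend the continuation is easier to predict when its tokens
# already appear in the prefix (a real implementation would query an LM).
def dummy_loss(prefix: str, continuation: str) -> float:
    return sum(1.0 for tok in continuation.split() if tok not in prefix)

text_prefix = "The company sold 1400 units at 3 dollars each, for a total of"
print(augment_if_helpful(text_prefix, "4200 dollars.", "1400*3", dummy_loss))
```

Examples that pass this filter become self-supervised training data that teaches the model when a tool call is actually worth making.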
Yeah, so it's exciting how these different research threads converge, and it kind of sounds like you had that vision all along: okay, cool, Toolformer works really well, but it could be better if the base model that was calling it was better, so let's focus on this Llama 2 project for a while and then come back and worry about API calling from Llama 2 later on. Very cool; looking forward to that. And similarly, again with my own machine learning company, that kind of ability — having these open-source, really powerful models like Llama 2, with open-source API-calling abilities built in — is huge for us as well, because it means there are all kinds of cool things that we can do internally. Like a lot of companies, we use APIs, these kinds of microservices, to make it easy to have different compartmentalized services within the platform. And so with something like Toolformer, our users could provide natural language instructions, just have a natural language chat with our platform, and the large language model behind the scenes can say, okay, I think they're asking for this particular kind of data or this particular kind of task to be done, and we have an API for that, so let's go use it. And then the results are returned in exactly the kind of format, like a JSON format, that our platform is expecting. It can make the API call successfully, return information from that call, and present it to the user. Yeah, it's a very cool thing to be able to do.

Do you worry — I mean, it sounds like a lot of concern went into making sure that Llama 2 is used ethically — that something like Toolformer maybe ties into even AGI concerns? Because people say, oh, AGI won't be that dangerous because it's not going to be connected to the world. But that's obviously not true, because with projects like Toolformer, we see that, no, they will be connected to the world. In my company, we're using something like Toolformer to query software APIs and get information back, but there's no reason why these couldn't be connected to hardware, why these couldn't impact the real world. So I just wonder if you have any thoughts on that, and maybe we can have a bigger AGI discussion later in the episode.

Sure — no, maybe that went by quickly, but I think those are very good points. And actually, we take safety for the tools direction very seriously. That makes things quite different from a kind of closed LLM in a chat window that we just demo. There are real risks at another order of magnitude. So for sure, there are new concerns, new research questions and problems along the way, which make this very serious.

Nice, okay. Well, yeah, that's a clear answer.

Actually, there's a survey on augmented large language models that we published in February, just after Toolformer. We have a section at the end of it about augmented language models — augmentation not only with tools — where a model can now connect to and act in the world. This is a different story than before.

Nice, yeah, no doubt. So in addition to Toolformer, another LLM project that you were working on before Llama 2 was Galactica.
And Galactica was a large language model that was, I suppose, specifically designed for handling academic research, scientific papers, and these kinds of scientific questions. The Galactica model was only live for a few days, I guess. It seemed like a really big deal, and then it was taken offline. So maybe tell us a bit about the project, the thinking behind bringing it down, and whether it will be back in the future.

Yeah. So, you know, there's this website, one of the most well known for researchers, called Papers with Code — a company that was acquired by Meta. The project of that team, which was kind of visionary about LLMs, was that they wanted an LLM for science that would help us access information for science, help with creative writing for science, maybe connect different ideas for science, find papers that you would never find on Google Scholar just based on the idea. That's what Galactica was about. And actually, it was one of the first open large language models that worked pretty well. In some aspects it was far ahead of its time, and in some aspects we probably made some mistakes along the way. It was only a pre-trained model, not an instructed model, and so maybe we presented it too much as something that can answer questions and do things — it would have worked so much better after an instruction-tuning phase. The second thing we probably did not do well was to over-claim a bit on the web page, saying it can write a paper. And I can understand how, for a person working in science, this would feel like over-claiming. That was not our purpose, but anyway, because of all the noise — and there was quite some noise at the time on Twitter — we decided to remove it. It was also a weird time, because at that point there were a lot of people still criticizing large language models who were quite noisy on Twitter, and on top of that, some people from the scientific community were saying large language models are dangerous for science, et cetera. And it was just two weeks before ChatGPT, actually. So that was interesting timing.

I think, for instance, people don't realize how good it was at citations. I used it myself — to give you an example: when you say, find me a paper to cite about bias, it will find the papers. Or, to give an example that maybe speaks more: Chinchilla, the scaling laws. I think Chinchilla doesn't appear in the title of the paper — or scaling laws doesn't, one or the other, I don't remember. And so, just asking the model, what's the citation for Chinchilla, which is not in the title, it will find the right one, and you could just click and add it to your references when you're writing something. It was kind of connecting things like that, and from the tests we did, it was outperforming some of the Scholar- or Elasticsearch-style engines. And I think LLMs as search engines have not yet been well explored; but that's something bigger.

Yeah, for sure. Deploying machine learning models into production doesn't need to require hours of engineering effort or complex homegrown solutions. In fact, data scientists may now not need engineering help at all. With Modelbit, you deploy ML models into production with one line of code.
Simply call modelbit.deploy in your notebook, and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com. That's m-o-d-e-l-b-i-t.com.

And it's interesting how well Galactica was doing at accurately producing citations when something like ChatGPT — especially with the GPT-3.5 API running in the back end — was famously creating citations that sound plausible but aren't real, even creating URLs that are made up. Which is probably what you and I would expect, given how the models are trained, but ordinary users, lay people, think, what is this, why would this happen? And then you even end up with lawyers presenting cases that never existed to a judge as a result of this kind of thing. So it's cool that Galactica was able to do those citations even before the ChatGPT release last year. Nice.

And so, speaking of issues with large language models, another big one has historically been the expense associated with all the human labor to create a curated dataset. You mentioned right at the beginning of the episode how there's this pre-training step that is self-supervised, where you can just use natural language — it doesn't require any labeling — and that gets us to model weights that have this rich understanding of the world, but the model isn't calibrated to optimally answer questions from people and perform tasks based on instructions. And so there's this second step after the pre-training, the fine-tuning. For that fine-tuning step, historically, we've wanted a high-quality dataset. So the Vicuna people, for example — Joey Gonzalez's team at Berkeley — took the original Llama, which was just pre-trained, and then used hundreds of thousands of conversations that people had shared. I'm forgetting the name of it off the top of my head, but there was a browser plug-in that lots of people were using to save and share interesting conversations they'd had with ChatGPT. This was in the public domain, so the Vicuna people at Berkeley took that dataset and used it to fine-tune Llama and create this Vicuna LLM, which still today has remarkably good performance for a relatively small open-source LLM, compared to other open-source LLMs and even to many proprietary options out there. But this kind of trick can only get you so far, and ultimately you might want lots more of these instruction pairs, this kind of labeled data, to be able to create a powerful fine-tuned LLM. And so my understanding is that the Unnatural Instructions project that you were a part of at Meta was designed to help alleviate this issue.

Yeah, that's interesting, because at the time of Unnatural Instructions there wasn't even a ShareGPT or whatever. So at that time you had, on one hand, OpenAI with the GPT davinci instruction models, which were good instruction models, very capable. And on the other hand, you had pre-trained models that were, remarkably, not that good, not that bad, and instruction datasets that were very academically oriented, I would say — standard tasks like summarization, question answering, and so on.
But you clearly didn't have the diversity of instructions that people would actually ask, and that davinci-instruct was good at answering. How to collect this diversity of instructions is actually extremely challenging. Even for humans: think of 10 different tasks or instructions right now — it would be hard for you to come up with that level of diversity. And at the scale of a thousand, a million, that's pretty hard. Somehow, OpenAI managed to do that with davinci. Maybe they collected some data from the API; they had some annotators, which is well known, from years ago. Now, with ChatGPT, people type some instructions and you have the output of the model. But the question is, when you don't even have that, how can you generate not only the answer from the model, but the instruction itself? And what we found out is that somehow you can ask davinci — GPT-3.5, I think, or the version before — to generate those instructions. So you can say: generate me an instruction and output for code, for this topic, for some reasoning task, or just without specifying any topic. And it will generate a lot of samples and examples, with not only the answer but also the instruction, so that you can create an "unnatural" dataset. And that, we actually found, was more natural than some of the "natural" datasets at the time. The reason was that the natural datasets published by researchers at Allen AI, using actual humans to create the data, were kind of lacking in diversity and were academically oriented, while somehow the model from OpenAI managed to generate a large diversity much closer to actual use cases. You can see that as kind of a distillation process from a more capable model that was fine-tuned on this kind of data. And that was kind of a temporary solution for people who didn't have access to instructed models. Which is also one of the reasons we moved to Llama 2 and did the process from scratch to create our own data. Indeed, we paid quite a lot for that: it took more than a million annotations to do the whole RLHF stage. And so now we have that capability. At the time, no one had released models like that yet.

Nice, yeah, that's a great overview of the project. Let's dive into that. You mentioned near the beginning of the episode this RLHF, reinforcement learning from human feedback. This is a key part of the fine-tuning process, and with Llama 2 you introduced a new, unique two-stage RLHF process, which evidently has led to even better results. So not only did you have this large annotated dataset of more than a million training data points, but you also used this new methodology, this two-stage RLHF. So do you want to explain RLHF, and particularly this two-stage process, to us?

Yeah, so it stands for reinforcement learning from human feedback — human preferences. And the idea is to fine-tune the model: you type a prompt, a question, to the model; you sample different outputs; and instead of asking a human to write the perfect answer and fine-tuning the model on what the human would have written, you try to train the model to go in the direction of what humans prefer among its samples. At the beginning of the project, we knew that was kind of the backbone of some of the instructed models from Anthropic and OpenAI.
But if you had asked me at the beginning of the project — and most researchers, I think, would have said the same — supervised data, where I ask annotators to write the answer, is kind of gold data. That is what is considered gold by the community in general. You have to get good annotators, high-quality annotators, sure. But writing the answer yourself is very expensive in comparison to comparing two outputs: generate two answers and ask a human which one they prefer — this takes way less time, and so you can scale it way more. And so, if you had asked me, I would have said, okay, if I have an infinite budget, maybe I prefer supervised learning and asking humans to write the answers, but it's not scalable, so sure, we will do RLHF. And the thing is, I realized after some time that there's some magic, which is not well understood by the community yet, I feel, where we already have some superhuman performance on some creative-writing tasks. An example I always give is: write a haiku, or a poem, about large language models, or something like that. And the model will come up with something. I mean, I don't know about you, but if you asked me, I would take an hour and come up with nothing. And the models are good at that. And the reason is, the model is super capable and has seen the whole distribution of humans on the internet. Think about an example with coding: it knows the distribution of average coders, the distribution of good coders, excellent coders, and bad coders. And so, if you ask annotators to write code, the model would just imitate that distribution, and by imitation you will have a distribution where 5% of the time it's great, 50% of the time it's in the middle, and sometimes there are some mistakes — and every human makes some mistakes. Now, if you apply RLHF, this is kind of different, and that's where the magic is: you shift the distribution toward excellence, toward even better than the best annotator you have. Because the thing is, even if you are the best annotator, you will write at your best capability and still make some mistakes, and the model will imitate you. But now, if the model imitates you and you sample 10 times, among those 10 samples there will be some examples that are really good — your best examples — and sometimes your worst examples. And so you can tell it: no, this is the best one, this is what I wanted. And sometimes it will also explore a bit beyond and do something that even you wouldn't have done. And it's easier for humans to compare — I can tell you which poem I prefer much more easily than I can write one. And so, because of that, you can have some emergence of superhuman capabilities on some tasks, thanks to RLHF.

Yeah, yeah. You're touching on something that blows my mind all the time about what we already have today. This is why the release of GPT-4 in March was such a big deal for me and shifted my own perspective on the realization of artificial general intelligence, AGI — an algorithm that has all the learning capabilities of a human — in our lifetime. Because already with GPT-4, because of this magic of RLHF that you're describing, the shifting of the distribution — intuitively I imagine a normal distribution in my head, where the outputs are going to be exactly as you described, kind of middling most of the time, sometimes excellent, sometimes poor — with RLHF, we shift everything so that it's excellent all the time.
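To make the preference idea concrete, here is a tiny sketch of the reward-modeling step at the heart of RLHF: a model scores two candidate answers, and it's trained so that the human-preferred one gets the higher score. This is an illustrative toy in PyTorch, not Llama 2's actual training code; the text "featurizer" and the example preference pairs are deliberately simplistic stand-ins.

```python
# Toy reward-model sketch for RLHF: learn to score the human-preferred answer higher.
# Real systems score text with a transformer; a bag-of-characters feature vector
# stands in here so the example stays self-contained.
import torch
import torch.nn as nn
import torch.nn.functional as F

def featurize(text: str) -> torch.Tensor:
    # Simplistic stand-in for a language-model encoder: 128-dim character histogram.
    vec = torch.zeros(128)
    for ch in text:
        vec[min(ord(ch), 127)] += 1.0
    return vec

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical preference pairs: (prompt, preferred answer, rejected answer)
pairs = [
    ("Write a haiku about LLMs",
     "Tokens drift like snow / a model dreams in layers / meaning crystallizes",
     "LLMs are neural networks that generate text."),
    ("Explain overfitting",
     "Overfitting is when a model memorizes training noise instead of learning the signal.",
     "idk"),
]

for epoch in range(200):
    for prompt, chosen, rejected in pairs:
        r_chosen = reward_model(featurize(prompt + chosen))
        r_rejected = reward_model(featurize(prompt + rejected))
        # Pairwise (Bradley-Terry style) loss: push the preferred answer's score above the other.
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# The trained reward model can then steer generation (e.g. via PPO or rejection sampling)
# toward the "excellent" end of the distribution described above.
```

Because comparing two answers is much cheaper for annotators than writing one, this preference data is what lets the process scale — and it's what shifts the output distribution toward the best of what the model has seen.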
And so the haiku example that you gave is great, because a lot of people have the experience of using GPT-4, and the experience of using Llama 2 is probably similar. And by the way, for any of you listening right now — at least at the time of recording, and it's probably still the same at the time of this episode's release — you can go to Hugging Face's HuggingChat, and the default model there now is the 70-billion-parameter, chat-fine-tuned Llama 2 model. So you can experience it yourself, and the queries that I've done in HuggingChat have been comparable to what I'd expect with GPT-4. Either way, with one of these state-of-the-art open-source LLMs, it's capable of doing so many more things than I could as an individual. Obviously, you expect to be able to come to this interface and ask a question about anything in the world, and it knows the answer, and it can articulate it well, and it can dive deeper, and it can explain why things are a certain way, and when you argue with it, when you disagree and say, no, no, I thought it was this other way, it often knows: oh, yes, that's a common misconception. It's interesting that we ask, oh, how far away is artificial general intelligence, this thing that's capable of learning everything that we can learn — when already today, what we have, well, maybe it isn't as good as humans on some tasks, but it is so much better than an individual human at so many things that, in some ways, we've already attained a really crazy superpower here on this planet.

So I've kind of just gone off on a tangent there; there wasn't really a question. But our researcher, Serg Masís, often digs up the most incredible things on our guests, and one of the things he dug up on you was that five years ago, in 2018 — and he might have translated this, because you were saying it to French children — you said that there's evidence that we are not at all close to achieving general intelligence, and that it's a fantasy. But my own perception has shifted. An example that I've given on air before: a year ago, I was giving a TEDx talk in Philadelphia, and my whole point was that, because of AI, technological progress is moving so rapidly that we can't accurately predict, even a few years out, what kinds of capabilities we'll have. And if somebody had asked me, at the time of that talk a year ago, whether we would have an algorithm that could do the things that GPT-4 can do, or that Llama 2 can do, I would have said, I don't know if we'll have that in our lifetime. And now, a year later, we have it, and people like you are making it so that anybody can access it, open source. It's wild — that shift is unreal — and I went from being a skeptic about what can happen with AI in our lifetimes to believing that some really crazy things are probably going to happen in our lifetimes. So I don't know if you have any more thoughts on that. I know that you've been interested in AGI for a long time, and what are your thoughts on when we might realize AGI, or artificial superintelligence beyond it?

Yeah. Let me share my thoughts at the moment. But as a preliminary, let me say that it probably depends on the mood I'm in. I often change my mind.
One day I would say yes, another day I would say no — I'm always balanced. But also, I'm bad at predictions there. I think the only thing I'm sure of is that the unexpected is to be expected. Actually, five years ago I had kind of just started my PhD. It was 2017, 2018 — the transformer was there, GPT-1 was there. I was working on summarization with reinforcement learning, and I remember some slides where three meaningless words were kind of the summary I could obtain. So again, if you had asked me the same question then — would we be where we are now? — I would have said clearly no. Actually, I was even late to the party on all the scaling things; I realized late how big it could be. And related to AGI, I think there's one question, which is: do we already have all we need to get AGI? Is it just a question of compute, FLOPs, and scale? And will we get there within the decade, with more investment, which we will have? Or not? I don't have a strong conviction there, but I can tell you that, well, first, I was bad at predicting the impact of scaling, and then I just watched a talk on YouTube where the speaker clearly explains how, for him, scaling has a very important foundation, even in the brain, in the human condition — and that could be it. And then there's a very profound question I always ask when doing deep learning: is it just statistical correlation, or is it more? And I'm always balanced on that. Sometimes it seems so good at reasoning, and sometimes the mistakes are so silly. Actually, there's a paper that makes me tend to be on the side that we could get AGI this decade with scaling only. It's a paper from Harvard called Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task, published at ICLR. This paper notably used Othello-GPT. You might think about AlphaGo and all those things, but the idea here is not to get state-of-the-art results; it's to train a model to predict the next token, which is the next move from human players. And that's it — just like a language model. And then the question is, at the end of that, did it learn the distribution of the moves like a stochastic parrot, or did it learn a more profound understanding of the world? Here, the world is the game. And they clearly found that the model, the transformer, trained on just that, kind of learned the world — the rules of the game, what it is, how it works — from seeing just a sequence of actions. And that is a clear signal that there's a more profound understanding, and that maybe, just from scale, this intelligence can emerge.

Yeah, yeah. That is fascinating. So I guess that's kind of the answer I'd expect — a balanced answer. Maybe we will, maybe we won't. But yeah, incredible.

But we're working on that.

Yeah, yeah. And so then, I could probably guess that you're on the side that we should be trying to open source these — you know, if we can have AGI, I expect, based on what you're doing with open sourcing Llama 2, Toolformer, Galactica, that you would like AGI to be open source as well.

Yeah, I mean, I'm pro open source. I'm pro not having a very capable model somewhere, controlled by a few people.
But at the same time, it doesn't mean we should rush into open sourcing such a big technology, and the efforts at the other labs to put the bar very high, to think ahead about this and what it means and how we could prevent harms, are very important. And we should learn from that. And eventually we will have some regulations and governance. And yeah, an open AGI is better than a closed AGI. Historically speaking, open always has been better and always will be. But that doesn't mean we should do it irresponsibly.

Nice, yeah. And that kind of responsible development of huge large language models is something that goes back a while for you. We've talked in this episode about the things you've been working on in the last few years at Meta — Llama 2, Toolformer, Galactica, and Unnatural Instructions — but this goes back further. You worked on BLOOM several years ago, which, in the GPT-2, GPT-3 era, was the leading open-source analog to those kinds of models, I think. And your whole PhD was based on this kind of thing — the title of your thesis was natural language generation with reinforcement learning, and you developed a method called QuestEval. Is there any relationship between QuestEval and the RLHF you were talking about earlier? Or is the reinforcement learning that you focused on in your PhD different from RLHF?

Somehow it has the same foundation, in the sense that you want to maximize a reward. At that time, reinforcement learning for natural language generation was based on automatic metrics called BLEU or ROUGE, and people who know these metrics know how bad they are. So basically, the thing was, you would improve the score but reduce the quality of the output. So how can you develop new metrics that actually capture more of what we want, so that we can apply reinforcement learning on them? That was working pretty well: I developed reinforcement learning techniques on one side, and metrics like QuestEval on the other. There's a paper from IBM, one or two years ago, that did reinforcement learning with QuestEval, and they reduced hallucination by something like 40%. So it was working. Now, the algorithms and the foundations, with respect to RLHF, are very close in terms of architecture, implementation, math. But the philosophy of RLHF, which I discussed before, about improving beyond the best of the best humans, is something quite different.

Yeah. Very cool. And prior to your PhD, you were involved in quantitative trading — you were at Société Générale. That's something you and I have in common: I wasn't at SocGen, but before becoming a data scientist — in my case, between my PhD and becoming a data scientist — I worked for a few years as a quantitative trader, working on algorithmic trading. I don't know how interesting it is to go into financial or algorithmic trading applications of AI and LLMs — you're welcome to talk about that if you want to — but I think something that might be more interesting for our listeners is that you advise on, and invest in, early-stage companies that are focused on generative AI and LLMs. We probably have a lot of listeners out there who would like to start up or scale up such a company.
So what kinds of advice do you have for people who are looking to start up or scale up a generative AI company? What kinds of problems should they be solving? What should they do?

That's a tough question. I mean, I'm mostly good at advising them on the research side: what is the trend, what will exist in one or two years, is this technology far ahead of the rest or not. That helps them transition from research labs to applications quickly; I feel I have some ability to help them in this regard. Now, it's especially difficult to predict where to invest right now in generative AI. There's kind of a paradox with this technology, because of the scale and the velocity of the technology — you said that a few minutes ago. Think about it like this: when I started data science and deep learning, it was kind of all about the data. So then you had companies like Grammarly that captured and annotated a lot of data and, with deep learning models trained on this proprietary data, created some very strong models to correct grammatical errors. And this was a very strong technical barrier, because to beat them, to outperform them with deep learning, you would need to annotate the same volume at the same quality. So they were the leaders. And now, with the same kind of deep learning technology, one, two, three years later, you have a plug-and-play model, ChatGPT, where you can just create a website in one minute, or a Google Chrome plug-in, that is even better than Grammarly at correcting and much more general. And so all the technological barriers vanished in a second. The paradox with this technology is that everything we're saying now could vanish in one year — as I said before, it's to be expected that the unexpected will happen. And so I guess the main question for entrepreneurs is: what can you build that will be robust in these conditions?

Yeah, yeah. What can you build that will be robust to the unexpected — that might even be reinforced when the unexpected happens? Nice. So I guess that's the kind of thing people need to be thinking about with their moats. Like, what is it? Is there some kind of data, or some kind of market access, that is unique, that means that even if much better generative AI models are open-sourced — models that could eat your lunch, kind of thing — you still have this opportunity? If you can get some kind of edge somewhere, then when these unexpected new things come out, these new AI capabilities, you can be integrating them into your tech as opposed to being eaten by them.

Yeah. And again, I don't want to make entrepreneurs worried. This is a very risky and challenging environment, but at the same time, it's one of the greatest moments for entrepreneurs to create, to make products. That's where the paradox comes from: it's one of the best times to create, but it's also very risky.

Nice, very well said. All right, awesome. So that is the end of my questions for you, Thomas, and the end of Serg's questions for you. So let's turn to audience questions. I made a post a week before recording on social media, on both LinkedIn and Twitter, and the LinkedIn post in particular got a crazy amount of reactions — 250 reactions, over 70,000 impressions, just at the time of recording here — which is definitely at the top end of the distribution of the posts that I make.
And we had a really cool one from Alice, who used to work with me at Nebula. She was an amazing product manager responsible for our data products and AI products. But Alice, I think your questions on Natural Instructions have already been answered earlier in the episode by Thomas, so hopefully that answer was to your satisfaction.

So let's move on to a question from Adityan. Adityan is interested in rough rules of thumb for how to choose which open-source LLM to start with and how to fine-tune it. He's building a startup for a niche use case using a large language model, and some of his questions are around how to decide what model size to go with. I actually already answered this earlier in the episode: with Llama 2, for example, the released model sizes are 7 billion, 13 billion, and 70 billion parameters, and I talked about how the 7- and 13-billion models can often fit on a single GPU. A small model like that can be good enough for a niche task; you'd only need the 70-billion model if you wanted it to handle a very broad range of possible tasks. So in your case, Adityan, with a niche use case, 7 billion is probably going to be fine. Start there, and if it doesn't do the trick, try 13 billion. But the question for you, Thomas, is how many data points he would need to collect or somehow synthesize to make fine-tuning worthwhile. The implication is that there's some niche use case he would like the model to handle: how many data points does he need to make use of, say, a parameter-efficient fine-tuning approach on top of Llama 2 and excel at that task?

Right, it's an interesting question. I was about to say that maybe you can start even without fine-tuning, just off the shelf, zero-shot. But also few-shot, with one, two, three, five examples you created yourself. It's not a raw few-shot pre-trained model like it used to be; it's a chat model. So maybe you need to do a bit of prompt engineering, in the sense of creating a dialogue: for example one, given your input, you make the model produce your gold output. Then, when you ask your real question, the model is biased toward the format, the template, you want the answer in. That's the first thing I would try. If that's not enough, it's very hard to answer systematically, because it depends on the use case, the task difficulty, and so on. But in general, what I have seen is that with very few examples, sometimes a hundred, a thousand at most, you can get dramatic improvements on some tasks.

Very nice. That's a really great, very practical answer, Thomas. Thank you very much. All right, our next question is from Svetlana Hansen. She's a senior software engineer; I believe she works on outer-space projects with folks like NASA. Svetlana has been following the Super Data Science Podcast, I think, for as long as I've been hosting it, so several years now, and she's had some great guest suggestions in the past. She had a series of great questions for Thomas, and one that I really liked was about the lessons you've learned, Thomas, from developing and managing these large-scale AI projects.
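As a concrete illustration of the few-shot prompting Thomas recommends for Adityan's niche use case, here is a minimal sketch. The support-ticket labelling task is an invented example, and while the model name matches the public Llama-2-7B-chat release on Hugging Face, the exact prompt format shown is an assumption rather than anything discussed in the episode.

```python
# A hypothetical few-shot prompt for a niche task (support-ticket labelling),
# in the spirit of Thomas's advice: show the chat model worked examples with
# your gold outputs so it answers your real query in the same format.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

examples = [
    ("Ticket: 'App crashes when I upload a photo.'", "Category: bug"),
    ("Ticket: 'Can you add dark mode?'", "Category: feature-request"),
]
query = "Ticket: 'I was charged twice this month.'"

# Pack the worked examples and the new query into a single instruction.
demos = "\n\n".join(f"{inp}\n{out}" for inp, out in examples)
prompt = f"[INST] Label each support ticket.\n\n{demos}\n\n{query} [/INST]"

print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```

If zero- or few-shot prompting like this isn't enough, Thomas's rough rule of thumb above, on the order of a hundred to a thousand examples, is the point at which fine-tuning tends to pay off.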
So, being involved with BLOOM years ago, then Galactica, Toolformer, and Llama 2: these projects have huge team sizes and huge models, and you've already given us a bit of insight into this. There's this pressure, this race, especially in open source, to get out there before other people. For example, when the 34-billion-parameter model wasn't meeting the same safety standards as the 7-, 13-, and 70-billion-parameter Llama 2 models, you made the decision to go ahead and publish what you had, because you had the state of the art at 70 billion and smaller models that fit on a single GPU. So we've had some insight into your thinking on these kinds of large-scale projects, but to put her question directly: what other key lessons have you learned about developing and managing large-scale AI projects?

Yeah, it's a very interesting question. Let me try to say something smart on that one. Maybe the main difference with these big projects, compared to when I was in academia working on small papers with very few people, is that because of the size, a lot more people are impacted. There's a lot of budget involved, and you have the potential to reach so many more people; the project is at another scale of impact. That was the case for BLOOM, and even more for Galactica, where I was more involved in the training itself: you have a lot of GPUs running, and you have to make decisions. And the thing is, in a perfect world, as a researcher, you want to understand everything, all the phenomena. You want to do all the ablations, all the experiments, to see the impact of this factor and that one, and what would have happened if you had done things differently. But there are so many possibilities, and every experiment costs so much and takes so many resources, that you cannot do that anymore. So one of the main challenges is that you're responsible for making decisions, as I was on Llama 2: okay, we need to choose between this and that. It's even harder because no one publishes the secret sauce anymore; maybe we were the only ones who did. So you're left asking, what's my intuition, and how can we quickly verify it and change course if needed? And you're playing with a lot of resources: millions of dollars for annotation, as mentioned, thousands of GPUs, many authors involved in the project. Time is also a constrained resource; you cannot spend a year exploring. How to deal with that environment was, on my side, the main challenge. At night, before sleeping, you wonder whether the decision you took was the correct one, and you don't know. That uncertainty is something hard for researchers to deal with.

Nice. So I guess your key lesson is that there are trade-offs, and you don't know whether you're making the right call, including these decisions about how quickly to rush something out versus spending more time on it. Well, it seems like with Llama 2 you certainly got it right: it made an enormous splash and a huge impact, so you seem to be getting it right.
We've got a comment here from Laurens van der Maaten, who was recently on the show, episode 709, and is a colleague of yours at Meta. He doesn't have a question, but I just wanted to highlight that he said, "Thomas is awesome. I'm looking forward to hearing your conversation with him." So, Laurens, I hope that you enjoyed this conversation as much as you were hoping to.

And then, our last question here is from SM. SM has asked questions on the show before, but SM has what I assume is a very deliberately sparse LinkedIn profile, which is unusual; most people on LinkedIn use real names and that kind of thing. SM's account seemingly exists solely to ask questions on the Super Data Science Podcast, because there are no other connections on it, so I appreciate the compliment. SM's question is a long one, but I think it's basically getting at this idea that LLMs can be wrong. They can make mistakes; they can give unhelpful answers. But nevertheless, they are often very useful, and they're becoming more useful all the time. I think we touched on this a little earlier in the episode as well, when you were talking about the research at Harvard and the ability of transformers to seemingly understand things ("understand" being such a loaded word). But I think you have a good sense of where the question is going, so you can answer it.

Yeah, I think the question, if I understand it correctly, is about whether we can one day rely on these models, and why that is not yet the case. Why is it that humans obtain a 100% score on some very simple tasks, while models will sometimes do such impressive things and then, when you don't expect it, fail on silly things? That is very, very weird. My understanding, and I'm not saying I've got it right, it's just my intuition and my thinking at the moment, is that, as we discussed before, with scale we might see the emergence of more general reasoning, and my understanding is that these algorithms learn a compression of the data. Let me give you an example. Imagine I can print an infinite number of tokens of arithmetic to train the model: one plus two equals three, and so on. There are two ways to predict the token after the equals sign. You can memorize everything, but as the set of numbers grows toward infinity, you need an enormous number of weights to memorize it. Or you can compress the information, such that you internalize the algorithm behind it. Then you can accurately predict the next token whatever the numbers are, and it requires far fewer weights to learn arithmetic than to memorize an infinite table. Now, at the moment, it seems that models are very good at doing calculations with two or three digits, and beyond that they fail more and more. My understanding is that they have internalized a generalization for arithmetic on small numbers of digits, but somehow, in that large vector space, they treat arithmetic on two digits and arithmetic on four or five digits as different objects, maybe because the latter appeared less often in training. They don't yet have the generalization that one-, two-, three-digit arithmetic and nine- or ten-digit arithmetic, which they never saw in training, are all the same thing. So there's one dimension along which they haven't generalized, even though there are others along which they already have.
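A small, hypothetical probe makes Thomas's point about partial generalization easy to check for yourself: measure addition accuracy as the number of digits grows. The `ask_model` hook and the stub below are illustrative assumptions; in practice you would point the hook at whichever LLM you want to test.

```python
import random
from typing import Callable


def addition_accuracy(ask_model: Callable[[str], str], digits: int, trials: int = 50) -> float:
    """Fraction of random n-digit addition problems the model answers exactly."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        answer = ask_model(f"What is {a} + {b}? Answer with the number only.")
        correct += answer.strip() == str(a + b)
    return correct / trials


def stub(prompt: str) -> str:
    # Stand-in that computes the true sum, just to exercise the harness;
    # swap in a real LLM call to look for the digit-length drop-off Thomas describes.
    a, b = (int(tok) for tok in prompt.split("?")[0].split() if tok.isdigit())
    return str(a + b)


for d in (1, 2, 3, 5, 8):
    print(f"{d}-digit addition accuracy: {addition_accuracy(stub, d):.2f}")
```

A model that had truly internalized the algorithm would stay near 1.00 across the board; the pattern Thomas describes is high accuracy at one or two digits that falls away as the operands get longer.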
And I feel like a true AGI, if we get there, with scaling or in any other way, could emerge from this generalization of compression at another scale: when the generalization becomes complete, somehow, if we get there.

That was an amazing answer; it made it crystal clear, and it built nicely on what you said earlier in the episode about representing these complex concepts. Very, very cool. All right, Thomas, it has been an amazing episode. I've learned so much, and truly it's been an honor to have you on the show. Before I let you go, do you have a book recommendation for us?

Let me think... The Black Swan, by Nassim Nicholas Taleb.

Great choice. And how should people follow you after this episode, if they want to keep up with the latest on your work or your thoughts?

Oh, sure. Follow me on LinkedIn, Thomas Scialom, or on Twitter as well; I'm really easy to find.

Nice. All right, we'll be sure to include those links in the show notes. Thomas, thanks again, and best of luck. We can't wait to see what you release next, some of which, it sounds like, may land even before this episode goes live. And truly, on behalf of my listeners and tons of other early-stage startups like mine, we are so grateful to have people like you, and Meta, willing to open source these incredible technologies. It's making a huge commercial impact and a big social impact too. So, thank you very much.

Thank you, John, for having me. It was my pleasure.

Thomas is already a legend, but it seems he's only just hitting his stride, and his biggest, most mind-blowing, potentially AGI-summoning projects are yet to come. In today's episode, Thomas filled us in on how pre-training and fine-tuning an LLM at a scale unprecedented for an open-source model led to the big Llama 2 splash. He talked about how handling code, tools, web search, and even better performance are up next for the Llama project; how Toolformer calls an appropriate API and incorporates the output into its next-token predictions; how RLHF shifts the distribution of a pre-trained LLM's outputs from a normal distribution of human-generated quality to outstanding, often superhuman quality; and how, with AI developments, the unexpected is expected, so AGI may be just around the corner. As always, you can get all the show notes, including the transcript for this episode, the video recording, materials mentioned on the show, and the URLs for Thomas's social media profiles, as well as my own, at SuperDataScience.com/713. Thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you, and thanks, of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another tremendous episode for us today. You can support this show in so many ways: you could check out our sponsors' links, share it with a friend or colleague, review an episode, or subscribe, but most of all, just keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come.
Until next time, my friend, keep on rockin' it out there, and I'm looking forward to enjoying another round of the Super Data Science podcast with you very soon.