695: NLP with Transformers, feat. Hugging Face's Lewis Tunstall

This is episode number 695 with Dr. Lewis Tunstall, machine learning engineer at Hugging Face. Today's episode is brought to you by the AWS Insiders Podcast, by WithFeeling.AI, the company bringing humanity into AI, and by Modelbit for deploying models in seconds. Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple. Welcome back to the Super Data Science Podcast. Today, I have the great honor of being joined by the brilliant Lewis Tunstall. Dr. Tunstall is an ML engineer at Hugging Face, one of the most important companies in data science today, because they provide much of the most critical infrastructure for AI through open-source projects, such as their ubiquitous Transformers library, which has a staggering 100,000 stars on GitHub. Lewis is a member of Hugging Face's prestigious research team, where he is currently focused on bringing us closer to having an open-source equivalent of ChatGPT by building tools that support RLHF, which is reinforcement learning from human feedback. And he's also big into large-scale model evaluation. On top of all that, Lewis was the first author of the book Natural Language Processing with Transformers, an exceptional best-selling book that was published by O'Reilly last year and covers how to train and deploy large language models using open-source libraries. Prior to Hugging Face, he was an academic at the University of Bern in Switzerland and held data science roles at several Swiss firms. He holds a PhD in theoretical and mathematical physics from the University of Adelaide in Australia. Today's episode is definitely on the technical side, so it will appeal most to folks like data scientists and ML engineers. But as usual, I made an effort to break down the technical concepts Lewis covered so that anyone who's keen to be aware of the cutting edge in natural language processing can follow along. In this episode, Lewis details what transformers are, why transformers have become the default model architecture in NLP in just a few years, how to train NLP models when you have few to no labeled data available, how to optimize LLMs for speed when deploying them into production, how you can optimally leverage the open-source Hugging Face ecosystem, including their Transformers library and their hub for ML models and data, how RLHF aligns LLMs with the output users would like, and how open-source efforts could soon meet or surpass the capabilities of commercial LLMs like ChatGPT. Exciting. Alright, you ready for this freaking fantastic episode? Let's go. Lewis, welcome to the Super Data Science Podcast, delightful to have you here. Where are you calling in from? Thanks for having me, Jon. I'm calling from Switzerland. Nice. Yes. And by coincidence, the way that I managed to wrangle you into coming on this podcast was on a recent trip that I had: while I was flying to Switzerland, I was on the plane reading a book called Natural Language Processing with Transformers. And the first author on that book is you, Lewis Tunstall. And so I was reading it on the plane, and shortly after I landed, I was filming a podcast episode with a guest, Richmond Alake, who was on episode number 685, and he has a podcast himself. At the end of the episode, I said, Richmond, do you have any great podcast guests that you would recommend?
And his first recommendation was you. And I was like, that's crazy, because I'm currently reading his book. I absolutely love it. It's obviously super topical. Everyone wants to hear about NLP with Transformers these days. So I'd love to have him on air. Richmond made an introduction, and now you're here. Thank you so much. Yeah, thanks a lot. It's a small world, really. I also met Richmond very randomly. I think one day he just messaged me saying, hey, I have a podcast, do you want to come on? And it's just these things in life, connections that happen kind of very organically. Yeah, well, thank you for taking the time for me and for him, and for the listeners of all the podcasts out there that you educate. We've got a super educational outline planned for today, and we're going to start right with transformers. So Lewis, what is a transformer, and why is it such a big deal for natural language processing in particular? Great question. So maybe we can break it down into a couple of steps. So at a high level, the transformer is just a neural network, and in particular, it's a deep neural network. So you've probably heard of deep learning kind of taking over software in the world in the last few years. And it was developed by researchers at Google around 2017, who were trying to find a more efficient way to do machine translation. And up until that moment, the sort of standard way of doing any sort of machine translation task was basically using a type of network called an LSTM. And these LSTMs have this kind of recurrent structure, which means that when you want to convert one sentence into another, you feed in the words from, say, the English sentence, and then this network would kind of iteratively process those words to then generate the translation. And these LSTMs worked quite well, but they had a few issues, and the major issue was that no one at the time could figure out how to kind of scale them, which means: how could you increase the size of the neural network in terms of parameters, and also, how could you train on massive corpora? And so there were a few ideas floating around in the literature at the time, and probably the most prominent one was something called attention mechanisms. And these attention mechanisms were also designed for machine translation, where the idea was: when we're trying to process language, how can we encode some of the context that is surrounding the words in some phrase? So an example might be, if I say my name is Lewis and I come from Australia, then we can imagine there's some kind of correlation or some relationship between the word Lewis and the word Australia. There's some sort of connection between those two words in that sentence. And what this attention mechanism did, essentially it's a layer in your network, is it provided a way to essentially teach your networks how to model those relationships in a fairly efficient way. And so what the researchers at Google said was, okay, maybe we can just take this attention idea and just train a network based entirely on this, with a few other tricks and things that were common in the literature. And the result they found was, first of all, a machine translation system that was state of the art at the time. But more importantly, it was something that could basically be parallelized on GPUs.
So instead of having to use a kind of recurrent structure where you have to process sequences kind of word by word, you could basically feed in the full sequence, and then this attention mechanism would compute all these correlations, which would then allow the models to be scaled up. And at the time, this was already like a big deal for machine translation. But then researchers at OpenAI took this idea one step further and they said, well, maybe we can actually just do this for sort of general text generation. So instead of just having a single task like machine translation, what if we just train a transformer that's just very, very good at modeling the next word in a sequence? And this was the start of what was called GPT, or the generative pre-trained transformer. And in many ways, that marks the kind of start of this revolution in transformers, where people eventually saw, okay, this is very good at generating text, and then as you scale up to the size of the internet and also to hundreds of billions of parameters, now you get these kinds of models today like GPT-4 and ChatGPT. Yeah, really a cool explanation there. I liked how you transitioned from LSTMs handling sequential data to these transformers that have this attention mechanism and are able to take in the entire sequence at once. It's interesting how today I'm aware of a few different research strands that are now trying to blend these two kinds of approaches, because one of the big downsides of the transformer approach has to do with the size of the context window. If you want to handle twice as many tokens, which you can roughly think of as words, in your input, then because the transformer needs to attend to that entire sequence, it vastly increases the amount of compute. So whereas with LSTMs, because they work sequentially, as your sequence gets longer the compute scales linearly, with transformers it scales quadratically. So if you expand your context window by x, the amount of compute required goes up by x squared. So very, very quickly, way, way more compute is required if you have these bigger context windows. And so I'm aware of these research threads where people are trying to find some way of kind of blending the way that LSTMs worked with transformers, so that we can get the attention on the full context window without necessarily that quadratic explosion in compute. Yeah, that's totally right. And I think a lot of people kind of declared LSTMs were dead, you know, with transformers. And I'm trying to remember the name of this latest model. It's got a funny acronym, I think it's RWKV or something like this. But it basically does exactly this. It tries to blend the kind of best of both worlds. So how do you have essentially infinite context length but also have the parallelizability? And from at least the demos that this set of researchers have provided, you can see that they are quite competitive with, you know, standard transformers today, so things like, you know, LLaMA and stuff. You see some of these LSTM hybrids at the smaller scales are quite competitive. And yeah, who knows? Maybe we'll see in a few years that, you know, it's not just the transformer that is the kind of key ingredient. But I think today most people kind of default to transformers just because the ecosystem has kind of become fairly commoditized. So it's now relatively easy to fine-tune transformers, and it's getting easier now to pre-train transformers.
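To make the quadratic scaling that Jon describes concrete, here is a minimal sketch of scaled dot-product attention, the core layer Lewis refers to, written in PyTorch. The n-by-n attention matrix is where the x-squared compute and memory cost comes from; the shapes and sequence lengths below are arbitrary illustrative choices, not anything from the episode.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d_model = q.size(-1)
    # scores has shape (batch, seq_len, seq_len): this n x n attention
    # matrix is why compute and memory grow quadratically with sequence length
    scores = q @ k.transpose(-2, -1) / d_model**0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # (batch, seq_len, d_model)

# Doubling the sequence length quadruples the number of attention entries
for seq_len in (512, 1024):
    x = torch.randn(1, seq_len, 64)
    out = scaled_dot_product_attention(x, x, x)
    print(seq_len, "attention matrix entries:", seq_len * seq_len)
```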
And I think there's a whole bunch of tools around that which, you know, for these more research-based projects, often take a bit of time to kind of, you know, coalesce into the general practitioner's toolbox. Yeah, and we will dig into a lot of these tools, many of which you and the Hugging Face company are involved in. You're really the leaders in making transformers accessible and easy to train. So we'll get into that in a moment. But before we get there, let's dig some more into transformers. So you have a whole chapter dedicated to transformer anatomy. So maybe you can give us a high-level overview of these key transformer anatomy concepts like encoders, decoders, and then some of them are encoder-decoders. So how do these different kinds of transformers vary, and why would you use one in a particular scenario or another? That's a great question. So the original transformer, as I mentioned before, was trying to model basically machine translation. And in this task, you've got some input sequence of text that you're trying to translate into an output sequence. And so the actual original transformer has this encoder-decoder architecture, where essentially you have an encoder which is taking your input sequence, and the role of this encoder is to essentially convert all of these kind of raw tokens, so basically bits of words and so on, into a sequence of embeddings. And these embeddings are essentially the numerical representation associated with each token in your sequence. And then the decoder part of the transformer takes that sequence of embeddings and then does, as the name suggests, decoding, which basically says, okay, given that input sequence, now how can I, for example, predict the next token in that sequence? So if you imagine that my input sentence is, you know, "my name is Lewis", and I want to translate it to German, then the input to the decoder will be basically these embeddings of "my name is Lewis". And the role is to now predict, given that input, that the first word should be "Mein", so in German, "Mein Name ist Lewis". And so that would be the kind of main distinction of these two components. And it actually goes back to before transformers: for people who were using RNNs, there was a very well-known sequence-to-sequence paper by Ilya Sutskever and others at Google, and that's where they kind of pioneered this approach. And it's very good at modeling these kinds of input-sequence-to-output-sequence tasks. And then there are basically the two main branches off that encoder-decoder. The first big one was the GPT model from OpenAI. And so what they did was they said, okay, we're really interested in generative tasks, and so for these, the more important part is the decoder, and maybe we can basically save some compute by just throwing away the encoder part of the original transformer. And then we just get the model to predict the next word in the sequence, and we don't have to worry so much about, you know, this kind of sequence-to-sequence mapping. And that obviously turned out to be a very impactful branch, or type, of transformer. And these transformers are called decoder-only transformers. And then the other side of this was when Google, a few months later, basically released BERT. And BERT was the sort of first encoder-type transformer, where they did the opposite thing. So they threw away the decoder.
And then they said, let's just focus on getting very good and rich representations of NLP sequences. And what that model can do very well is it can basically handle tasks where you're trying to extract information. So for example, let's say you're doing text classification: the representations that come from BERT, from all these encoder models, are very good. You can do question answering, you can do, you know, named entity recognition. Encoders typically do well on these kinds of core NLP tasks, whereas the decoder ones are typically for where you want to do things like summarization or chat or, you know, things like that today. Now, the boundaries are blurring, because that was the story when we wrote the book, but then there were other models that came out. So for example, T5 is a model from Google researchers where they showed that you can actually frame most NLP tasks as a sequence-to-sequence task. So if you ask, for example, how do I classify a movie review, then the conventional approach would be: okay, take your transformer encoder, feed in the review, you now get some sort of embeddings from that, and you can then look at those embeddings and say, okay, can I measure the sentiment associated with that input? Whereas the T5 architecture is this encoder-decoder, and instead what they say is, well, you can formulate every task in text. Like, you can say, classify the following review as positive or negative, and then you put in the review, and then the decoder now has to output, you know, "positive" or "negative" as a word. And this model is far more versatile, because you can now do many tasks with just the same architecture. But traditionally, I would say, the field has typically, you know, split into this encoder and decoder branch. And so these T5 models are widely used, but at least from what I've seen in practice, people tend to kind of fixate on one or the other. Thanks. Yeah, that all makes perfect sense to me. Yeah. This episode is supported by the AWS Insiders podcast, a fast-paced, entertaining, and insightful look behind the scenes of cloud computing, particularly Amazon Web Services. I checked out the AWS Insiders show myself and enjoyed the animated interactions between seasoned AWS expert Rahul, who's managed over 45,000 AWS instances in his career, and his counterpart Hilary, a charismatic journalist turned entrepreneur. Their episodes highlight the stories of challenges, breakthroughs, and cloud computing's vast potential that are shared by their remarkable guests, resulting in both a captivating and informative experience. To check them out yourself, search for AWS Insiders in your podcast player. We'll also include a link in the show notes. My thanks to AWS Insiders for their support. So the encoder-decoder structure was the original concept, and I imagine that's the "Attention Is All You Need" paper. That's right. And so with that original transformer paper, I think quite naturally it makes a lot of sense to think, okay, it follows along with this concept that we've had in deep learning for a longer period of time, the autoencoder structure, where you're taking some kind of information, in this case strings of tokens, strings of words. But this idea of encoding information into an abstract space is something we've been doing with all different kinds of data types with autoencoders for years. So it could be an image or a video, it could be a sound wave, and you can encode it from the raw input information.
So pixels, in the case of an image, get converted into this abstract representation where, provided enough training data, that abstract representation is consistent regardless of the specific pixels. So the encoded representation could be like, you know, "this is a brown dog by a red fire hydrant" or whatever, and it's abstractly represented. It's not written in language like that; it's based on a location in a high-dimensional space. But you could have one image of a brown dog by a red fire hydrant and the pixels could be completely different, like there's no relationship between these two images of that same scene, but the encoded representations could be very similar. And then the decoder structure takes that abstract representation and can return it back into the pixel version. And so, yeah, that kind of idea of going from encoder to decoder, I can see how that's where transformers started, because it makes a lot of sense conceptually. It's surprising to me, and I still have a hard time really wrapping my head around, how GPT architectures in particular work with the decoder only. Because for me, it's so sensible to think about that intermediate step where we have that encoded representation. And so, yeah, there's a bit that I still need to wrap my head around with these decoder-only structures that specialize in natural language generation, like the GPT family, most recently GPT-4 and the other architectures that we have behind the ChatGPT models. And yeah, because they have this decoder only, they end up specializing in being able to predict the next word in a sequence very well, whereas, as you highlighted there, the encoder-only structures like BERT specialize in tasks that don't require that kind of generation. So, more natural language understanding as opposed to natural language generation. And yeah, as you said, with that natural language understanding, we create that abstract representation from the raw natural language, and then that abstract representation can be used downstream for all manner of tasks. Yeah, so you gave lots of examples there: question answering, named entity recognition, text classification. And yeah, so to me conceptually, the BERT thing, the encoder only, also makes a lot of sense. I'm like, cool, yeah, we go from a string of characters to this abstract representation, and then we can do things with those abstract representations, we can compare their similarity, and yeah, it allows for fast semantic retrieval of information and that kind of thing. Anyway, I think I've now been speaking for a very long time and I can kind of tell that you're ready to go. Yeah, sure. I mean, I would say the thing that you mentioned about the decoder is predicting the next word, or the next token, in a sequence. It's a surprisingly hard task, right? So if you just give any human a random piece of text from the internet and say, you know, what is the next word in that sequence, I think you'll have a hard time getting, you know, very good performance on that.
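As a concrete companion to the encoder-only, decoder-only, and encoder-decoder distinction discussed above, here is a minimal sketch using the Hugging Face pipeline API. The specific checkpoints (bert-base-uncased, gpt2, t5-small) are illustrative choices, not ones named in the episode.

```python
from transformers import pipeline

# Encoder-only (BERT-style): good at "understanding" tasks such as
# filling in a masked token or classifying text
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Transformers are a [MASK] architecture for NLP.")[0]["token_str"])

# Decoder-only (GPT-style): predicts the next token, so it generates text
generator = pipeline("text-generation", model="gpt2")
print(generator("My name is Lewis and I come from", max_new_tokens=10)[0]["generated_text"])

# Encoder-decoder (T5-style): frames tasks as text-to-text, e.g. translation
text2text = pipeline("text2text-generation", model="t5-small")
print(text2text("translate English to German: My name is Lewis.")[0]["generated_text"])
```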
And so my understanding of these decoder models is that, typically, because this task is rather hard when it's done at scale, the models kind of pick up enough, let's say, surface-level capabilities around, you know, grammar and all the linguistic things that we do as humans online, that then when you want to do a task like, okay, sentiment analysis or, I don't know, write me a recipe for scrambled eggs, they've kind of seen enough examples that that kind of generation itself is relatively straightforward. Of course, the hard part is that if you try to do things that are very out of domain, I think you typically find that's where, you know, not just these decoder models but most of these models tend to struggle. So they're very impactful, but they still, you know, today have some fairly, you know, serious limitations. Yep, nice summary there. And that does actually, really conceptually, I think you might have just cracked it for me. That was a really elegant explanation as to how these generators, yeah, just because they are specialized in this next-word generation, that's what the model weights are structured to be able to do. We don't need that intermediate abstract representation. We can just skip right to predicting what the next word should be, as tricky as that can be. And so maybe just one comment to make is that the models that come from this are these so-called pre-trained models, and these models are kind of like very sophisticated autocomplete. But if you play with ChatGPT, it's clearly, you know, much more than autocomplete. And so there is another kind of whole secret sauce of ingredients around reinforcement learning and how you model human preferences and things. And that's kind of machinery that's sort of tacked on top of this, like, you know, predicting the next word. So even though, kind of mechanically, we do say that the model is predicting the next word in a sequence, for these very impressive models there's a fair bit more happening. But we can talk about that later. Nice, yeah, we will talk about that later for sure. RLHF, really exciting topic. So, yeah, before we get there, with respect to these kinds of tasks that transformers can perform, in your book you specifically highlight feature extraction as something that transformers are really great at. So what is feature extraction, and how do transformers differ from traditional ways that we might have extracted features in natural language processing? Yeah, sure. That particular part of the book was, I think, born from the experience Leandro and I had working as data scientists at Swiss companies at the time. And, you know, a lot of the time as a data scientist, you want to train the next fancy thing and you want shiny new toys. But then almost immediately, you know, your manager will be like, well, we've got no labeled data, or we've got very little labeled data, or something like this. And so then doing this whole fine-tuning process tends to be a bit of a struggle. And so we showed in the book essentially how you can extract these embeddings from transformer models. And the idea here was to say that, essentially, you know, the conventional way that people did this kind of thing pre-transformers was to take a model like word2vec or, you know, some extension of this, where you essentially had kind of like universal representations for every word in the vocabulary.
So for example, the word "dog", you know, is just one vector, one representation, that you could use to build the features that you would then build, say, a classifier on top of. And obviously these transformers have these contextual representations, which means that the representation of "dog" will actually depend on the surrounding words in the sequence. And so when you do feature extraction using transformers, you get this kind of nice representation in these embeddings, which pick up that contextual information, and then you can use those embeddings for downstream tasks. For example, we do text classification in the book. But as you mentioned before, a very common one is doing things like, you know, semantic search. So if I want to embed all of the documents in my company, I can feed them through a transformer, I get vectors, and then I can compare, you know, which documents are semantically similar to each other. Now, what we did in the book was sort of the vanilla thing, which was take BERT and just feed, in this case it was emotion tweets, through it to see, you know, what the representations of these tweets are according to their emotion. But there are better models, for example sentence transformers. They have a special kind of training process where it's essentially a Siamese network of two transformers, kind of learning how to model the semantic similarity of documents. And so if you ever want to actually do feature extraction for things like search and stuff, you're much better off using these special sentence transformers than, you know, just an off-the-shelf BERT. Very cool, sentence transformers. I'll do my best to remember to include a link to those in the show notes. That sounds super useful for people doing these kinds of applications. So I guess this builds on the conversation we were having earlier with architectures like BERT being encoder only and converting things into that semantic space. We now have specialized approaches like sentence transformers that are even better for getting those abstract representations well aligned. Very cool. And this idea of the token "dog" being represented in an abstract, high-dimensional space, as opposed to as, like, a one-hot encoded word: in the traditional way of doing natural language processing, for that word "dog", you might have needed a taxonomy to say, okay, like, you know, dog is related to cat in this way, they're all in the animal family. And now with transformers, we can have this totally data-driven approach where we don't need to be maintaining all these manual taxonomies. And it has way more flexibility, because, as you say, when we come across the word "dog" in a sentence like "I ate a hot dog with relish", it doesn't consider that "dog" to be in any way related to a cat. And so yeah, I wrote about this a lot in my book, Deep Learning Illustrated, which came out a few years before yours, so it didn't have anything about transformers. But even in that era, working with LSTMs and approaches like word2vec or doc2vec, document to vector, that's a big point that I make in my book: you're going to get way better results using deep learning and this data-driven approach, as opposed to trying to manually hard-code meaning in natural language processing. And yeah, there are so many benefits, obviously, in terms of human time on tasks.
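Since Lewis recommends sentence transformers over an off-the-shelf BERT for feature extraction and semantic search, here is a minimal sketch of that workflow with the sentence-transformers library; the checkpoint name and example sentences are illustrative assumptions, not ones from the book.

```python
from sentence_transformers import SentenceTransformer, util

# A small, widely used sentence-transformer checkpoint (illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The dog chased the ball in the park.",
    "A puppy played fetch on the grass.",
    "I ate a hot dog with relish.",
]
# Each document becomes a single contextual embedding vector
embeddings = model.encode(docs, convert_to_tensor=True)

# Cosine similarity between embeddings: the first two sentences should
# score higher with each other than either does with the third
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```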
And just in terms of quality, like, it ends up working way better, as probably most of our listeners have now seen with tools like ChatGPT. Yeah, I totally agree. And it's funny to think that I think we're old enough to have been the generation who lived, you know, pre- and post-transformers. So, you know, I remember doing NLP in the ancient days where you had to think about stemming and, you know, how you actually pre-process your data, like whether you strip out, you know, punctuation and stuff. And you have all these nightmares and you're not quite sure if it's going to work. And then when you suddenly have a transformer and you just say, well, I just tokenize it, and more or less, you know, for most tasks the fine-tuning will work. That was, for me, quite a big, you know, update to my way of working. The future of AI shouldn't just be about our productivity. An AI agent with the capacity to grow alongside you long-term could become a companion that supports your emotional well-being. Paradot, an AI companion app developed by With Feeling AI, re-imagines the way humans interact with AI today. Using their proprietary large language models, Paradot AI agents store your likes and dislikes in a long-term memory system, enabling them to recall important details about you and incorporate those details into dialogue without LLMs' typical context window limitations. Explore what the future of human-AI interactions could be like this very day by downloading the Paradot app from the Apple App Store or Google Play, or by visiting paradot.ai on the web. Yeah, exactly. There was a chapter of my book on all these NLP pre-processing techniques that you needed to go through. And it was actually funny, when the book was being copy-edited and the copy editor finished that one, she was like, oh my goodness, that was such a crazy journey, it's so complicated, it's such a long chapter. And now, yeah, it's probably just some one-liner that I can do with the Hugging Face Transformers library. I don't need to worry about it at all. Cool. So these kinds of conversations that we're having about what's going on inside a transformer model, or even more broadly within a deep learning architecture: why should a practitioner care, or should a practitioner care? Like, why does somebody need to understand how a transformer works, Lewis, if they're working on NLP problems? Yes, I think it's a bit of a philosophical question, because in some sense it depends a little bit on, you know, how deep and curious you want to go into a topic. So I would say, at a very technical level, if you're training transformers, so whether you're fine-tuning them or especially if you're pre-training them, at some point you're going to hit some errors. And those errors are going to be, maybe the data is not set up right, maybe, you know, you have things on the wrong CUDA device, all this annoying stuff. And when you start looking at the stack trace, you're going to see some lines of code, and it's going to say, hey, in modeling_bert.py on this line, in this attention layer, there's a problem. And at least for me personally, having an understanding of how the computations are running in the network helps you iterate faster and debug things much quicker. And that's more just the practical side of things. But then at the sort of, let's say, more fundamental level, it's like any other piece of knowledge, right?
Like, if you're trying to build something, it's really, really useful if you know how the things you're building with work, because not only does it help you, as I said before, debug stuff, but it also helps you think about how you can extend them. Because if you never go lower than just the sort of high-level API of Transformers, you may encounter some tasks in your work where you need to do more sophisticated things, like, for example, you know, blending different types of heads on the transformer for multitask training. And then it's going to get to a point where it's going to be very, very useful to have a good understanding. So I would say those are roughly the two main things I would suggest. And, to be honest, in general, it's just fun, right? At least for me, intellectually, it's fun to know, you know, how these things work. And I do recommend everyone, just once in their life, implement a very simple transformer like we do in the book. Just in the same way that, you know, everyone has to implement backprop once in their life. This is, I think, you know, the next step. Nice, yeah, I agree with you on both your points, for debugging as well as for building creatively. It's the second one that I, in particular, think is valuable. The more that a data scientist can dig into the underlying fundamentals, like the linear algebra, partial-derivative calculus, and probability theory that underlie machine learning, including deep learning, which is a kind of machine learning, and transformers, which are a specialized deep learning architecture, the better. Fundamentally, under the hood, you have these relatively simple mathematical operations going on. And if you understand those things that are going on under the hood, it allows you to have way more flexibility and creativity in the ways that you can be solving problems. So, like, you gave that example there about blending attention heads for a multitask problem, for training a multitask architecture. And there's an unlimited number of ways. Who was giving this analogy recently? I think it might have been Harpreet Sahota, who was in episode number 693, so just a week ago that episode was released. And in that episode, he talks about Lego blocks. I think he has young kids. And so, you know, when you understand what all of these different blocks are, it allows you to have an effectively infinite amount of flexibility in how those blocks are combined and the things you can do with them. And I see with my team at my company, Nebula, on our data science team, when we're trying to, in particular, productionize what we're doing: in order to have that work efficiently for the specific use case that we have at the platform, almost every time there isn't a ready-made solution. Like, there are lots of tools out there that allow you to, with one line of code, productionize your model from a Jupyter notebook or whatever, and those tools are great and they're really amazing. But they also only work in a relatively narrow set of circumstances. There are all kinds of situations that we encounter regularly in production, and that I imagine lots of companies do, where there is no turnkey approach.
You're going to need something unique that nobody has ever done before in order to have a performant, real-time experience for your users, one that blends together all of the back-end things that are going on. And yeah, so I'm a huge evangelist, I guess, for understanding the building blocks, so much so that, as regular listeners will know, I have this machine learning foundations series that is mostly available on YouTube already, and there's a GitHub repo where all the code is available, covering linear algebra, calculus, probability theory, algorithms and data structures, and statistics. Because, yeah, I think it's so important. It's so fundamental to know these fundamentals. Totally agree. Nice. All right, so thank you for letting me talk so much. No worries, it's interesting. It's good stuff. So, all right, let's move on to another topic from your book. So, one of the challenges that we encounter as data scientists, particularly when we're working with large amounts of data, like we want in transformer architectures: you already mentioned this earlier when you were talking about how you and Leandro were trying to come up with feature extractors, or trying to come up with some model, and your manager would say, oh, but we don't have any labeled data. This is super common in natural language processing: we have access to large amounts of data, like, you know, just a scrape of all of the internet, but we don't have any labels for those data. So we just have the sequences. We don't have labels for whatever task; it could be some classifier task, where a common example is sentiment. So, is this tweet or is this movie review a positive review or is it a negative review? You know, we might have access to billions of tokens, but maybe only a few hundred of these labels, or maybe none of these labels. So, I know this is something that you talk a lot about in your book. Would you mind sharing some of your favorite strategies for tackling this common NLP problem? Sure. And this was, I think, born out of pain, basically. I mean, there's a pain that you have when you're a data scientist, you know, trying to solve a business problem with little labeled data. So, I'll tell you the way we kind of did it in the book, and then I'll tell you a little bit about what's changed since the, I think, the advent of, you know, ChatGPT and GPT-4, which for me personally have kind of made me rethink a little bit how I would tackle this. So, the first one is, if you've got, like, no labeled data, and this can be relatively common, especially for extractive tasks like named entity recognition or question answering, because the price of labeling the data is quite high, there aren't a huge number of tools available yet for tackling those kinds of tasks. And there, you might be better off just going for a generative model like GPT-4 or ChatGPT and saying, hey, here are a few examples of what I'm trying to do, this is so-called few-shot prompting, can you please, you know, now complete the final task? And these generative models are quite good at following those types of instructions. But if you're doing something that is more like, say, text classification, then there are far more tools available for this. So, for example, in the Transformers library we have zero-shot pipelines, or zero-shot classification pipelines.
And what these pipelines do is they basically formulate the classification task as what's called an NLI task, or natural language inference task, where you take the context, which is the thing you're trying to classify, you have a sentence that is like a template to say, you know, is this positive or negative, and then you get the model to fill in the third part of that sequence. And this personally has always been a good baseline. You just run this, it's like two lines of code, you run it over your data set, and it gives you a rough idea of where you are. But then if you want to go beyond that, there are probably two approaches I would recommend. One is called SetFit. It's a technique I developed with researchers at Intel, and it's essentially few-shot learning with sentence transformers. And essentially we showed that you can classify documents across different domains with usually around eight to 16 examples per class, and you get results that are fairly comparable to training on the full data set. And this, as I mentioned, works well for text classification. But if you want to go beyond that, then there are these other techniques called parameter-efficient fine-tuning techniques. And here the idea is to use a transformer like T5, which, as I discussed briefly, is kind of a general-purpose transformer that can solve many tasks, and then you try to basically prompt it in a certain way so that, if you've only got a couple of labeled examples, you can still solve, say, a question answering task in a fairly efficient way. And so I would say those are the two main things I would recommend. But today, seriously, with these large language models, if you don't have security concerns with your data because of your company, just testing the API is often a good start, I would say. Deploying machine learning models into production doesn't need to require hours of engineering effort or complex homegrown solutions. In fact, data scientists may now not need engineering help at all. With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy in your notebook and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com. That's m-o-d-e-l-b-i-t dot com. Yeah, I couldn't agree with you more on all the solutions that you suggested. The one that I didn't know, of the ones that you mentioned, is SetFit. I just quickly looked it up and it looks super cool. So I'll be sure to include a link to the SetFit GitHub repo. Yeah, it looks like a really cool way to be using few-shot learning without needing to prompt yourself in order to classify sentences. And then parameter-efficient fine-tuning, PEFT, is something I've talked about on the show a fair bit. I have an episode dedicated to the low-rank adaptation, or LoRA, method for doing that, episode number 674. Very cool. But yeah, it is crazy how you can be using a tool like GPT-4 and an API like GPT-4's, particularly for complicated tasks. Like, I was using GPT-3.5 prior to March of this year, and there were all manner of tasks where I was like, oh, it would be amazing if I could just ask a model to do this. And GPT-3.5 might be able to do it a portion of the time, but not with an accuracy that was high enough that I could be confident about using those data for training a model.
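To illustrate the zero-shot classification baseline Lewis describes (the "two lines of code" option), here is a minimal sketch with the Transformers pipeline API; the NLI checkpoint and candidate labels are illustrative assumptions, and SetFit or PEFT would be the natural next step once a handful of labeled examples per class are available.

```python
from transformers import pipeline

# Zero-shot classification reformulates the task as natural language
# inference (NLI): does the premise entail "This example is <label>"?
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # illustrative NLI checkpoint
)

result = classifier(
    "The cinematography was gorgeous but the plot put me to sleep.",
    candidate_labels=["positive", "negative"],
)
print(result["labels"], result["scores"])  # labels ranked by predicted score
```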
But then with GPT-4, overnight, with its release in March, I was like, oh, let's try some of those use cases again, and it nails it every single time. But yeah, as you say, there are potentially reasons why you might not want to use GPT-4. So your company might not be comfortable with sending those data off. And then also the OpenAI terms of service do not allow you to be using GPT-4 to create a competitor to GPT-4. So it depends: if you're not going to be creating a chatbot with the data that you label, then it's probably fine. I'm not a lawyer. This is not legal advice. But yeah, we're going to talk about this more in the episode later on. But we're getting really powerful open-source alternatives to GPT-4 emerging every week, as you and I were talking about before we started recording this episode. Every week it seems there's some major release of an open-source approach that gets closer and closer to being as good as GPT-4. And some of those don't have commercial-use constraints. And so it might be, by the time this episode is live, it might be the case that you can be using a completely open-source, commercial-use model that's as good as GPT-4. And then you can be running it on your own infrastructure. You don't need to be worrying about sending proprietary data off to a third party. And you might get comparable results. So, really, really exciting. Yeah, something that you and I were touching on also before we started recording is just that, with how quickly things are moving and these capabilities that are emerging from so many people like yourself getting so deep into the open-source opportunities here and releasing these models, these capabilities, for all of us, it's an unprecedentedly exciting time for me in my career as a data scientist. That's cool to hear. I just had one more thing on where you might want to use an open-source model. So I was playing with ChatGPT the other day, and I wanted to see if I could use it as a writing assistant. And so I started just taking some passages of text from George R.R. Martin's Game of Thrones, and I just asked it, can you complete this part of the text or rephrase it? And because Game of Thrones is so gory and violent and all that stuff, it just refused. It said, no, as a language model, I will not engage in blah, blah. And I think what this shows is that for these next frontier models, which have a lot of this so-called alignment built into them, it's great to have that for general-purpose chatbots. But if you want a very domain-specific thing, you probably want to have something that is more adapted to your data or all the things that you're interested in. And so I can imagine a future where you have these very powerful, capable systems from OpenAI and others, but then companies use a lot of open-source models to just do the more domain-specific stuff where, you know, for the reasons you mentioned about data leaving, but also the use case itself may not be supported, you know, just through the API. Right, when you need more violent language. Yeah, sorry, that's maybe the last example. The thing is that I want George R.R. Martin to finish his book. I've been waiting for Game of Thrones for years now, and I just want him to finish the last one. So, you know, if he's listening, he should just use chatbots. Yeah, it's a good example. Like, the folks at OpenAI spent six months, from when the GPT-4 architecture was trained, to put barriers around it in terms of safety. And, you know, I think they've done an exemplary job.
In retrospect, I think it's amazing that they spent those six months, because I imagine, and this is completely speculative, no one has ever said anything to me to suggest this, but I just speculate that, you know, in a big organization like that, there were probably some people that were like, this is safe enough, we've got to get this out, this is crazy. But then, you know, some other factions were able to be like, no, there are still these really dangerous use cases that we need to handle; we need to do more testing before this goes out. But it does mean that, you know, there are all kinds of perfectly legitimate, classic books, like you're saying, like the Game of Thrones series. You know, people love that series, but it has a level of violence that is completely fine to buy in a commercial product, yet, you know, the folks at OpenAI have decided to have these safeguards that mean that a level of violence that's okay in a commercial product that you can buy as a book is not acceptable in their particular tool. And so, there are probably violent use cases that we don't want a chatbot to be able to do under any circumstances, but being able to generate, you know, fantasy-novel prose maybe shouldn't be one of them. Yeah, totally. All right. So, Lewis, once we've created our George R.R. Martin bot that can generate violent prose, when we want to make that model efficient in production, that's something that's a huge challenge with transformers, and I know that it's something that you tackle in your book. So, could you outline for our listeners the kinds of practical things that we can do to take these very large language models and have them be useful in real time in production for, say, a user of a platform that has an LLM running in the background? Yeah, sure. I would say also that the kinds of techniques I'll mention often depend very much on the use case and also the size of the model. So, there are things that kind of work okay for smallish models, but then they just don't really apply at larger scales. And in the book, we cover, I would say, three main topics or areas. One is called knowledge distillation. And this is an old idea; it goes back to Geoff Hinton and others before him. The idea is that you've got a capable model, so this might be something that is, let's call it the teacher, and this model is too big to deploy efficiently. Maybe you want to deploy something on the edge or, you know, in a cheaper way. And so, this knowledge distillation technique allows you to basically take, essentially, information from this teacher model and kind of imbue it in a much smaller, more efficient model. And when it works, you typically get comparable performance. You take a small hit in, say, your accuracy, but often the tradeoff is worth it because, you know, in real-life situations, accuracy isn't the only metric. You're worried about latency, you're worried about cost, and things like that. So, this works quite well for models in the sort of 100-million-parameter range. So for things like BERT, it works well; for the small GPT models, it works okay. But no one has kind of figured out how to crack this effectively at the very large scales of, like, you know, tens of billions of parameters. And that's why, for example, we haven't yet seen, as far as I know, something analogous to, like, you know, a distilled LLaMA 55B or something like that. So, that will often get you roughly maybe a 2X reduction in latency.
You can usually compress your model by about half. And then there are other techniques which we discussed. So, the most common one that's used for many use cases is called quantization. And the basic idea here is to take the precision of the weights that the model was trained in and just cast them to a lower precision. So, typically, things like 8-bit, or 4-bit is now the sort of new standard. And then, because you've now got lower bit widths, you can basically, you know, do your matmuls, or matrix multiplications, faster and use less memory. And, you know, there are a bunch of different quantization strategies we can talk about. But the other element we mentioned is this idea of pruning. So, here the basic idea is: how can you kind of delete weights in the network altogether, but still preserve the overall performance of the model? And when we wrote the book, the sort of current state of the art was a technique developed at Hugging Face called movement pruning, which is a pruning technique designed specifically for fine-tuning transformers. But all of the kind of consumer hardware that existed basically didn't really help you, because even though you delete all these weights, you need to save them as sparse matrices, and these sparse matrices don't really get any big speed boost on standard, you know, Intel hardware. So, we kind of concluded that pruning at the time wasn't quite mature enough to be used in production. But my colleagues at Intel have said that, you know, they've now got some quite impressive approaches where you do get genuine sparsity and, you know, low latency. So, those are the three main techniques, and we can kind of dive deeper if you want. Yeah, yeah, model distillation, quantization, pruning. Yeah, these are the three that come to mind for me as well in terms of production deployments, the examples that you gave there, and cool to have you break down for us so clearly the kinds of circumstances where one of these approaches might work well versus another. I think I will actually leave that topic there and not dig too much deeper, because there are still so many more things that I want to get into while we're recording. But maybe I could mention one thing specifically about, like, generative models. So, these kinds of techniques I mentioned are very generic, and you can usually apply them whether it's an encoder or a decoder; it doesn't really matter. But one of the big bottlenecks when you're doing chatbots is having a fast response, right? So, if I ask a question, like, you know, what's the weather like today, I don't want to wait a minute to get my text back. And there's been a lot of cool innovation around streaming tokens. So, this idea of, like, sending the user the answer kind of bit by bit. And that's what you see in ChatGPT, right? You don't have to wait and then get the full answer; you can see the answer being kind of generated on the fly. And one of my colleagues at Hugging Face called Olivier, he built this very, very cool server called text-generation-inference, which not only does this token streaming, but it does really impressive optimizations of the transformer architecture. So, you can do things like fusing operations in the transformer to basically run faster with certain CUDA kernels. And you can do, like, cool things with, basically, how you shard the model across different GPUs.
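Two of the three compression techniques Lewis walks through can be sketched compactly. Below is a hedged example: post-training dynamic quantization of a model's linear layers to 8-bit using plain PyTorch, plus the classic soft-target distillation loss in the spirit of Hinton's formulation. The checkpoint, temperature, and loss weighting are arbitrary illustrative choices, not values from the book.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification

# --- Quantization: cast the weights of Linear layers to int8 after training ---
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# --- Knowledge distillation: soften teacher and student logits and penalize
# --- their divergence, alongside the usual supervised cross-entropy loss
def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```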
So, if you're doing any sort of generative text task, as far as I know today, this is, like, the kind of current state of the art for deployment. Nice. Yeah. What was the name of that library or approach? Text-generation-inference. Wow. Super cool, Lewis. I had not heard of the Hugging Face text-generation-inference library for LLMs before, but I will definitely be checking that out, because it sounds like exactly what I need for a lot of use cases at Nebula, with our production deployments. Thank you so much for sharing that. And it ties in perfectly to the next topic that I wanted to cover, which is the Hugging Face ecosystem in general, and all of these open-source tools that Hugging Face releases for the public. So, what is the Hugging Face ecosystem, and what role does it play in the practical application of transformers and NLP? You've obviously given us a bit of a taste here already with the text-generation-inference library, but that only scratches the surface. I mean, the Hugging Face Transformers library is fundamental to this entire movement. That's right. Yeah. And just, I think, last week it crossed 100,000 GitHub stars. So, that was a pretty nice milestone. I think it's one of the first machine learning libraries to hit that. And as you said, right, in the origins of Hugging Face, for those who don't know, Hugging Face started out as a chatbot company building a chatbot for teenagers. And then Thomas Wolf and Victor Sanh, they saw this transformer release from Google with BERT, and they were like, okay, we need to put this in PyTorch, because TensorFlow is not what we want to program in. And so, they did a fast port of that to PyTorch, and then it just exploded. So, you know, I think it coincided almost with the perfect time when the community was very quickly getting excited about PyTorch, and they had seen, you know, the performance of BERT, and now they could just run this themselves. And of course, the first thing that you face when you're trying to build such a library is, okay, where do you get models from? And in those days, a lot of models were basically shared on Google Drive or on GitHub. And the challenge that you have is that, as the field moves very fast, how do you kind of synthesize all of these different pre-trained models? And this gave birth to the idea of the Hugging Face Hub, which originally started off as a model hub, where basically you had the pre-trained weights of BERT and, you know, other transformers that followed it. And then you had a very nice integration between the Transformers library and the Hugging Face Hub, so you could basically pull models from the Hub, run them locally on your machine, and you could also then, you know, push your trained models back to the Hub, so that you didn't have to again share them with your colleagues via Google Drive. You could just say, hey, check out my model, you can now test it yourself. And in machine learning, right, models are kind of often the focus of attention, but in reality, there's a much wider range of things that you have to worry about. So of course, there's data. Like, where do you get your training data from? And how do you kind of curate that data? And so eventually the Hub kind of expanded in scope to now host data sets. So we now have several tens of thousands of data sets. And these data sets are contributed primarily by the community.
So we have some very cool ones, like, you know, the classic ones from NLP, but nowadays you've also got not just NLP but many modalities. So we have vision data sets, we have time series data sets, and NLP, of course. And what's cool about this is you get this kind of ecosystem building, where people go, oh, now I can take a data set from the Hub, I can take a pre-trained model, I can train a new model on that combination, I can push that model back to the Hub, and then other people can build on top of that, or they can feed it into their demos. And so the ecosystem today is, broadly speaking, a collection of open-source libraries built around this layer of the Hub. So the Hub is basically tightly integrated with all of these open-source libraries. And the kind of mission that we have at Hugging Face is to basically provide these tools to the community, so they can then go and build, you know, cool companies, cool products. And, you know, we have our own paid services on top of this, but at the core of the company, it's fundamentally open source. Very cool. That's a great breakdown. And there were some details in there that I wasn't aware of, particularly with respect to the initial history of where the Hugging Face Hub emerged from. However, I did know about the 100,000 GitHub stars. And yeah, there are only a few machine learning libraries, like PyTorch and TensorFlow, that have that. Yeah, exactly. So very, very cool. Yeah, we're really grateful for your work, stretching back even pre-pandemic. I don't know the exact year, but I know it was pre-pandemic because I was still working in an office, which I haven't since. And a colleague of mine, Grant Beyleveld, was, I guess around 2018, 2019, just marveling at all the cool things that Hugging Face was doing. And he was like, this is the coolest machine learning company in the world. So awesome that you work there. It must be an amazing atmosphere. Yeah. So specifically, you are a machine learning engineer at Hugging Face. What does that mean? How does that intersect with the kind of stuff that you wrote about in your book? And what are some of the exciting projects that you're working on? Yeah. So at Hugging Face, the roles that we have are quite broad in scope. So even though, formally, I'm an engineer, I've also done a lot of work on education. So previously, I worked on a course for Transformers that we offer to the community. And nowadays, what I focus more on is typically the research side of things. So how can we build tools and artifacts around this domain of RLHF, which we mentioned earlier? So I would say, depending on which branch of the company you're working in, a day in the life can look a little bit different. But more or less, we all collaborate over our open-source repos. And this can range from building features for libraries like Transformers, or just patching bugs and so on. But the core goal in general is to always try and pick the most impactful projects to work on. And as a result, this means that you have to be very reactive to what's happening externally to the company. So, for example, when Stable Diffusion landed, my colleagues at Hugging Face very, very quickly had this Diffusers library, you know, with the integration of this model from Stability AI. And when I say quickly, I'm talking on the scale of days to weeks.
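The pull-train-push loop around the Hub that Lewis describes looks roughly like the sketch below; the dataset, checkpoint, and repo names are illustrative assumptions rather than anything named in the episode, and the fine-tuning step itself is elided.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pull a community dataset and a pre-trained checkpoint from the Hub
dataset = load_dataset("emotion")  # illustrative dataset name
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6
)

# ... fine-tune the model on the dataset here (e.g. with the Trainer API) ...

# Push the fine-tuned artifacts back to the Hub instead of sharing via Google Drive
# (requires `huggingface-cli login` or an HF token in the environment)
model.push_to_hub("my-username/distilbert-emotion")      # hypothetical repo id
tokenizer.push_to_hub("my-username/distilbert-emotion")
```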
So, you know, it's like you have to be very, very fast to keep up with what the community is doing. And we see that also today with large language models; you know, as you said before, we have these new models landing. So within the Transformers side of things, you need to be able to quickly decide whether to integrate a model into the core library or not. So that's, in broad terms, what I do. Specifically today, as I mentioned before, it's more around trying to figure out if this reinforcement learning stuff actually works. So we have, I would say, a few existence proofs from OpenAI and Anthropic that it does. And, you know, talking to ChatGPT gives you a sense that it does work. But we haven't yet seen in the community a very clear end-to-end example showing that not only does reinforcement learning work in a technical sense, but it actually makes for a better model that is, you know, more aligned with human preferences. And there have been a few attempts to do this, but the conclusions have always been a bit murky, because the evaluation of these systems is very complex. And so what I'm primarily looking at now is this aspect of training and evaluating these more complex beasts. Yeah, could you break down for us a bit more this RLHF concept? So, you know, what is it? What's involved? What data are needed? You know, why did people try this at all in the first place? Sure. So, again, OpenAI were the pioneers here, and they actually built towards ChatGPT in several kind of impressive papers. So their first foray in this direction was learning to summarize. And what they were interested in was: we know that language models, especially generative models, are good at generating summaries, but people often complain that these summaries aren't very good. So, when you try to measure, you know, how good is a summary, you have some automatic metrics like the ROUGE score, which try to measure the overlap of your summary with a reference summary. But generally speaking, people had always recognized that summarization models weren't great. So, what they did instead was they said, well, why don't we get the model to generate some summaries, and we show those summaries to humans, and then we'll get the humans to rate which of the summaries is best. And so the idea was that instead of trying to use some metric like ROUGE, which always has limitations, the thing we really care about is people reading summaries, so let's just teach the model to learn from that directly. And the recipe is relatively simple on paper. Basically, you take your summarization model, you generate some summaries, you show them to humans, they label them, then you train a second model, which is basically a classifier. This is called a reward model, and this classifier is basically learning how to distinguish good and bad summaries. And then what you do is you take those two pieces and you do a third step, which is where the reinforcement learning comes in. Essentially, what you're trying to do is optimize the model to produce better summaries, and reinforcement learning essentially has a loop where you generate some summaries from the model, and your reward model will basically rank them and say, okay, that's a good summary, that's a bad summary.
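For readers who want to see the shape of that reward-model step in code, here is a minimal, illustrative sketch: a sequence classifier with a single scalar output, trained on (chosen, rejected) pairs with a pairwise ranking loss so that preferred summaries score higher. The backbone model and the toy example pair are placeholders, not the setup from OpenAI's paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "distilroberta-base"  # placeholder backbone for the reward model
tokenizer = AutoTokenizer.from_pretrained(base)
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)  # scalar "reward" head
reward_model.train()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# Each record: the same prompt with a human-preferred and a rejected summary (toy examples)
pairs = [
    {"chosen": "Article: ... Summary: Prices rose 3% last quarter.",
     "rejected": "Article: ... Summary: Something happened with money."},
]

for pair in pairs:
    chosen = tokenizer(pair["chosen"], return_tensors="pt", truncation=True)
    rejected = tokenizer(pair["rejected"], return_tensors="pt", truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)
    r_rejected = reward_model(**rejected).logits.squeeze(-1)
    # Pairwise ranking loss: push the preferred summary's score above the rejected one's
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```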
And that gives you essentially a signal to update the weights of the model in a direction that is more aligned with whatever the reward model is telling you. And if you do those three steps, what they showed in their paper is that the resulting models were preferred much more by humans for summaries than, you know, the baseline. And that's kind of the recipe that most people today are trying to follow, but now at much larger scales, and not just for one task like summarization, but for multiple tasks. And the modern version of that recipe is that instead of having just summarization data, you now try to collect a large amount of what's called instruction data. So these are things like: write me a recipe for an omelette, give me 10 things to do in Paris, and all these kind of very creative tasks that we have as humans, or, you know, how do I write Python code for X? And you train a model that is able to follow those instructions, but this model will always have this kind of problem that it may produce outputs that are a bit problematic, or it just veers off in the wrong direction. And so you do again the human preference step and the reinforcement learning step, and then if everything works, you should get something like, you know, ChatGPT, but no one has quite succeeded yet. And I think that's where there's a bit of an arms race at the moment in the open source community to see who is first to do that. Yeah, very cool. So I'll quickly try to summarize back to you what RLHF is, or kind of paraphrase it, and then let's dig right into that exciting arms race. So the idea with this reinforcement learning from human feedback, at a high level, is that humans provide feedback on model outputs. So probably most of our listeners have used ChatGPT, and if you haven't, you've got to. A study actually recently came out saying that only something like 15% of Americans have used ChatGPT. Hopefully in the data science community it's above 90%, and if you're listening to this right now and you haven't used ChatGPT yet, you've got to. Or maybe, if you're using, like, the GPT-4 API but haven't used the ChatGPT interface, I will forgive you. But in the ChatGPT interface, you have the opportunity, after every single output that you get from the model, to give it a thumbs up or a thumbs down. And that thumbs up or thumbs down can then be used as training data for this RLHF, and Lewis outlined the steps as to how this happens in more detail. But the summary point is that it allows the model to have outputs that are more aligned with the kind of thing that you would like to see. So, going way back to earlier in our conversation, this means that these state-of-the-art generative models like GPT-4 are more than just a sophisticated auto-complete, because there's this additional layer, at least this one, maybe even more that we don't know about, of sophistication that means that the outputs are more like what you expect in a conversation, maybe with another human, or maybe not even with another human, but just the kind of output that you want when you provide the kind of input that you do. And because of how popular ChatGPT is, there's a huge amount of this training data, presumably, that allows OpenAI to be building a moat around what they've done. We have, however, seen a lot of open source groups, so there are lots of open source models that have come out in recent months, that have built on things like the LLaMA architecture that I talked about back in Episode 670.
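For the third, reinforcement-learning step of the RLHF recipe described above, here is a heavily simplified sketch using Hugging Face's trl library. The model name, prompts, and reward values are placeholders (in a real run the rewards come from a trained reward model), and exact argument names have shifted across trl versions, so treat this as an outline under those assumptions rather than a drop-in script.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # stand-in for your fine-tuned generator
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)      # model being optimized
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference for the KL penalty

ppo_trainer = PPOTrainer(PPOConfig(batch_size=2, mini_batch_size=1), policy, ref_policy, tokenizer)

prompts = ["Summarize: The cat sat on the mat all afternoon.",
           "Summarize: It rained in Bern for three days straight."]
queries = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# 1) Generate candidate summaries with the current policy
responses = [ppo_trainer.generate(q, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)[0][len(q):]
             for q in queries]

# 2) Score them; placeholder scalars here, a trained reward model in practice
rewards = [torch.tensor(0.8), torch.tensor(-0.3)]

# 3) PPO update: nudge the policy toward higher-reward outputs without drifting far from the reference
stats = ppo_trainer.step(queries, responses, rewards)
```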
And so they're doing things like taking that LLaMA architecture, which was just a sophisticated auto-complete, and then using instruction fine-tuning afterward, using open source versions of these kinds of thumbs-up, thumbs-down human data, in order to fine-tune LLaMA to be more like GPT-4. So some of these models are Alpaca, and Vicuna is one that is really popular. And there are ones that also have permissive commercial use terms; something like GPT4All-J is completely suitable for commercial use. But anyway, the main point is that with RLHF, yeah, we get way better models, and it's really cool that there are folks out there, with the relatively limited open source data, relative to what someone like OpenAI probably has, doing the best that we can to approximate the way that GPT-4 performs. And that brings me to my next question, which is that, as we talked about, these are really exciting times. We have folks like you, everyone at Hugging Face, and thousands of other people around the world racing to build open source tools that are as good as GPT-4. Maybe it's even conceivable, and this isn't actually something that I've thought out loud about before, so I'd love to hear your input on this, maybe it's even conceivable that the next big breakthrough in these conversational agents, or in generative AI, or in machine learning in general, will be open source as opposed to coming from a commercial entity like OpenAI. Yeah, I think that's definitely possible. And we already see a wide variety of directions that the community has taken to tackle some of the engineering challenges. So, for example, we talked briefly about LoRA, this low-rank adaptation method. This is kind of the driving trend at the moment in all of these instruction fine-tuning experiments that the community is doing. Because, for example, if you want to try to fine-tune LLaMA 65B, so 65 billion parameters, you're going to need several hundred gigs of GPU memory. And for the average person, right, that's kind of out of reach. And just recently, Tim Dettmers and his collaborators, he's a really impressive PhD student at the University of Washington, they wrote a paper called QLoRA, so this was like quantized LoRA. And they showed that, you know, with a four-bit quantization, you can run and even train LLaMA 65B on a kind of consumer-grade GPU. And I think those kinds of innovations are things that you wouldn't see from a private company, because it would be your competitive advantage, right? Why would you share that knowledge with the community? And it just shows that when you've got a tough problem, which is, how do you train large models with limited resources, people get very creative. The other thing that I think has been quite interesting is that the evaluation of these models, especially these chat models, is gradually growing in maturity. So a lot of the early evaluation was done using something called the Vicuna benchmark. The idea here was: let's get GPT-4 to write a bunch of questions, for example, you know, how do I solve this coding puzzle? And then you give that question to the models that you're interested in rating, and then, you know, you get GPT-4 to act as a judge and compare which model is better than the other. And in the early days, this showed, oh, Vicuna is like 90% as good as ChatGPT, according to that benchmark. But most people who then interact with Vicuna versus ChatGPT can see a fairly big capability gap.
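For readers curious what the QLoRA-style setup Lewis just described can look like in practice, here is a minimal sketch using transformers, bitsandbytes, and peft: load a causal LM in 4-bit and attach small low-rank adapters so that only a tiny fraction of the parameters is trained. The checkpoint name and LoRA hyperparameters are illustrative, and the exact option names have evolved across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "huggyllama/llama-7b"  # illustrative checkpoint; any causal LM works in principle

# 4-bit NF4 quantization keeps the frozen base weights small enough for a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# LoRA: train small rank-r update matrices on the attention projections instead of the full weights
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters
```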
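And to illustrate the LLM-as-judge evaluation pattern behind the Vicuna benchmark that Lewis mentions, below is a minimal sketch: ask GPT-4 to compare two candidate answers to the same question and pick a winner. It assumes the pre-1.0 openai Python client that was current at the time (newer releases use a different interface), and the prompt wording is an illustrative placeholder, not the actual benchmark prompt. As the conversation goes on to point out, judgments like this can reward verbosity over factual accuracy, so they are best treated as a noisy signal.

```python
import openai  # assumes the pre-1.0 openai client and OPENAI_API_KEY set in the environment

question = "How do I reverse a linked list in Python?"
answer_a = "Iterate through the list, re-pointing each node's next pointer to the previous node."
answer_b = "Linked lists cannot be reversed."

judge_prompt = f"""You are judging two assistant answers to the same question.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is better? Reply with "A" or "B" and one sentence of justification."""

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0,  # keep the judging as deterministic as possible
)
print(response["choices"][0]["message"]["content"])
```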
I mean, you can see that Vicuna can't hold conversations over many turns effectively. ChatGPT can do things like, for example, you just dump a stack trace into it and it will then debug it for you, like, unprompted. And so these models were lacking in certain areas. And the community has now kind of realized that a lot of these evaluations are often evaluating the style. So basically GPT-4, as an example of a judge, will often prefer outputs that are just very wordy, because, you know, ChatGPT is always a kind of wordy chatbot, rather than whether they're factually correct or not. And even humans fall for this. So there's a very nice paper from Berkeley where they essentially showed that, you know, even human evaluators would get tricked by ChatGPT. And I think that's a general challenge today in the community: how do we know if the models are actually very good? And again, it's something that I suspect OpenAI has cracked internally. But of course, that's their competitive advantage, right? So the community is going to have to make the innovation there. And yeah, I mean, we can talk about other things. I think one thing that's kind of been an open question is, do you even need reinforcement learning in the first place? And, you know, we have this kind of existence proof from OpenAI. But there are other researchers who are sort of skeptical that you truly need reinforcement learning, which has its own finicky problems. And it's kind of exciting to think that, you know, we already have a few candidate alternatives, you know, on arXiv, which may prove to be more efficient and also simpler to achieve the same objective. Yeah, everyone's always trying to squeeze out reinforcement learning. It's like, a few years ago, it was, deep reinforcement learning is going to be fundamental to artificial general intelligence. And then it's kind of having this renaissance right now, where we're like, okay, in order for us to have these LLMs be really well aligned with the responses we want, we're going to need reinforcement learning after all. And then you're like, no, actually, we might not need it. We might be able to use simpler approaches. And yeah, it seems like a lot of these instruction tuning approaches, they're just supervised learning. They don't require any reinforcement learning. And yeah, I can personally vouch that we're getting amazing results without reinforcement learning. So very cool. If people are listening out there who haven't done open source before, and they want to get involved with it after they hear about the kinds of cool things that you're working on in particular, you know, they want to get involved with the Hugging Face Transformers library or some other library, like the PyTorch library, how do you recommend they get started? Yeah, that's a really great question, and a common one that I often get. I would say there's a few different ways you can contribute. It depends a bit on your background. So if you're already a very proficient, say, PyTorch developer, then reading the Transformers source code is relatively, you know, straightforward. So you can immediately, for example, pick up open issues on GitHub or look for open bugs that haven't been tackled yet and work on those. But for people who are a bit more like myself, so I started off being a non-coder, I was quite a late bloomer. I think I was like 28, 29 when I started learning how to code.
Oh, really? Yeah, yeah. I was very, very late. And for me, the thing was, I was looking at this stuff and thinking, I have no idea how to contribute. And actually, starting off with just trying to read the documentation and improve the docs was often the sort of gateway drug to then actually writing code, because often when you're trying to understand something, you realize, oh, there's a gap in the way it's explained. So I would say those are the two main routes. One is to go through the docs, which is more kind of high level, and the other one is to just pick up issues that are on GitHub. But the open source landscape for machine learning is also more diverse than just code, right? So some of the most impactful things that we've seen on the Hugging Face Hub have been from community members who, for example, created a translation of a popular dataset, or curated their own dataset, which turned out to be very useful. So an example of this, right, is the Alpaca dataset, which was the dataset that sort of launched this whole revolution in LLaMA instruction models. It was three grad students at Stanford who basically used, I think it was ChatGPT, to generate a dataset of instructions, and they trained a model on that. And, you know, it was like 300 bucks and, I think, probably a few days' work. So there are different ways you can contribute, and the other one that maybe is worth mentioning is that we often have a lot of events at Hugging Face. So we have hackathons where people, for example, can get access to things like Google TPUs and train, you know, very cool projects. And so if you want to be part of the community itself, that's another way of getting your hands dirty and sharing in, you know, the excitement. Very cool. Great tips for getting started in open source: reading and improving the docs, picking up GitHub issues, and things like dataset curation. Very cool. All right, Lewis. So I actually had a ton more questions that I could have gone over with you, but I also want to get to some of the audience questions that we had. So when I posted, a week before recording, that I was going to have you on the show, the post got an extreme amount of engagement. At the time of recording, over 36,000 impressions, almost 400 reactions, 23 comments, 13 reposts. It's crazy. And some of these questions are really cool. All right. So the first question that I'm going to go over is from Sangeeta. She's an NLP engineer, and she is interested in hearing your views on the notion of synthesizing a dataset using an enterprise model and then fine-tuning an open source LLM on it. So you and I, Lewis, did talk about this earlier, but she mentions that your recent blog post on LLM evaluation was amazing. I wasn't aware of that; I'm going to have to make sure to include it in the show notes. And yeah, I don't know if there's anything else that you want to add for Sangeeta on this concept, which we talked about a bit earlier, but you might have more to add for her. Yeah, so the basic process here is, you know, getting a very good instruction dataset is quite a costly endeavor, because if you use humans, you need to get people to sit down and come up with creative ideas, like, you know, give me a recipe for pasta, and then actually write out the recipe, right?
So you have this kind of very arduous task, and you can pay companies to do that, but it will cost you quite a lot of money. So the shortcut today that most people take is they say, well, let's just try to derive this from GPT-4 or from ChatGPT. But as you mentioned earlier, OpenAI have this kind of thing in their terms of service which says, you know, you can't use these outputs from our models to train competitors to our stuff. And I'm not a lawyer, so of course don't take this as legal advice, but I'm not sure how enforceable terms of service really are; I mean, who knows, that has to be tested in court, I think. But of course, if you're a company, you don't want to go near that. I think that's too high a risk to take today. So the alternative I would suggest is to maybe see if some of the newer, permissively licensed models like Falcon can get you maybe half the way there. So I've lately been prompting StarCoder, which is a kind of code generation model, and it's quite okay at generating some of this synthetic data for coding applications. So I think it's probably only maybe a few months away before we're able to do something analogous to the sort of ChatGPT-style generation using a permissive model, and then those issues will no longer be with us. Nice, great answer. Thank you for elaborating some more for us, Lewis, on that point. The next question here comes from Murillo Gustinelli, who is a data scientist at a firm called Insight. And this also builds upon some of our conversation. It seems like you and I, through our conversation, hit on a lot of the topics that are most interesting to the audience at large, because this again will build upon something that we already talked about. So Murillo points out how Hugging Face has undoubtedly played a significant role in lowering barriers to open source ML. With the emergence of LLMs and the increasing complexity and cost of deep learning models, how critical do you believe the continued democratization of ML models will be in the near future? So yeah, what are the challenges and opportunities associated with having better democratization of ML models, particularly these really large ones? You already made a great point, shortly before we started tackling these audience questions, on techniques like QLoRA, quantized low-rank adaptation. I'm not sure if you have anything else you'd like to add. Yeah, I think probably for me personally, the biggest reason to try and make sure that we still have open models is that, as we've seen in the last few months, the community, or the collective intelligence of humanity, is able to learn and discover a wide range of impressive, cool things. So QLoRA is one, but also this whole thing about evaluation and trying to deeply understand, you know, how these language models actually work. A lot of this would be much, much harder if we only had an API from a small number of companies to work with. So I think from a purely scientific perspective, it's really important that we're able to continue making such models and releasing them. But there are, of course, several risks, and one of the main risks I see is that at some point we will have a model that is, you know, fairly capable, and someone's going to do something bad with it. I think that's a bit of an inevitability, and bad could be, you know, some large misinformation campaign or even worse.
And then the question will be, you know, who bears responsibility for that? Like, is it the organization that open-sourced the model? Is it the company that hosts the model? Is it the individual? And I feel like society doesn't quite have the mental model yet for dealing with that. And so what I suspect will probably happen is that techniques like RLHF will become progressively more important in the release of new open models. Because if you can make some level of guarantee that, okay, this model has some guardrails, then you're at least able to partially limit the downstream risks. But I mean, you've probably heard this is a super big topic in legal and political circles, like the Congress hearings and the EU AI Act. And so I think all of this stuff is really being negotiated at the societal level. But I feel fundamentally that we would want to have a future with open models, because there are a lot of parallels to, you know, science. Like, when I used to be a physicist, there were eras in the Cold War where people didn't share any information, and it was kind of to the detriment of humanity to do so. And we'll see how it plays out. But yeah, exciting times either way. Yeah, exciting times for sure. And there certainly are pros and cons to these two different schools of thought. On one side, should companies like OpenAI keep the secrets to themselves, where, you know, maybe some government auditing body gets access to what they're doing, but we don't want just any actor to have access to a system that approaches artificial general intelligence, or even systems like we have today that could be used for misinformation. And then, yeah, in the other camp, it's this idea that if everything's open source, then you can really get in there and understand exactly what's going on. Yeah, it's complex, but it's nice to see that, unlike some of the other issues that we've had in recent history with digital platforms, things like social media feeds polarizing politics, which politicians didn't get ahead of, I think they're realizing the mistake there and trying to make sure that the same kind of issue doesn't show up here. Well, I mean, some issues are going to show up, but there are a lot of people in government, in commerce, and in the open source community who are trying to get ahead of the issues that we have in AI. And so I'm personally optimistic that, while it is inevitable that some bad things will happen, the worst things hopefully won't, and even some of those bad things will be mitigated. So very cool. There were lots of other questions, but looking over them now, it seems like we tackled all of them, or at least the main points of those questions, over the course of our conversation. And then I've got to apologize to our listeners. I actually promised someone, just yesterday at the time of recording, on LinkedIn, there's a listener named Jonathan Bound out there, and he said, I just got to the end of your most recent episode, and at the end of that episode you mentioned a book giveaway, and he was like, I guess I'm too late. Because the way that I run these book giveaways is when the episode comes out. So your episode will be out on a Tuesday morning, New York time, and I'll make a post around 8 a.m.
New York time announcing the episode, and I'll say the first five people who respond that they'd like a copy of Lewis's book will get a copy, generously provided by O'Reilly. So that is happening again. And I was supposed to mention it at the beginning of episodes, obviously, so that listeners know. Anyway, I promised Jonathan Bound, he was like, I just got to the end of this episode and you have this great deal, but I got to your post a week late, so I'm sure all the books are given away, and in fact, they are. And so apologies again to those of you in that situation. On the other hand, it rewards the person who listens right to the end, right when the episode comes out. Exactly. It does reward that behavior. So yes, there's a book giveaway again. Thank you very much to O'Reilly for offering those up. So yeah, the first five people who write on my LinkedIn post announcing Lewis's episode, and I'll mention it in that post as well, will get a book. And it's a fantastic book. I've got my copy here, and it's been invaluable to me as my company Nebula has moved more and more into generative AI, particularly with open source approaches. So thank you, Lewis, and thank you, Hugging Face, for everything that you've done for us. Now, Lewis, before I let you go, I always ask for a book recommendation other than your own book. Do you have one for us? Yeah, so I've been reading something that has nothing to do with transformers or machine learning. It's called The Making of the Atomic Bomb. And it was recommended to me by a friend who was saying, hey, you know, I've been thinking about existential risks and stuff, and there are some parallels. And what's really interesting in the book, which is a really in-depth history from basically pre-World War One all the way through to the bomb, is the extreme amount of government-level coordination that was required, first of all to build the technology, but then later to figure out how to regulate it. And I think the cool part of this is that we managed to more or less survive that period and figure out how to live in a world with very, very scary weapons. And so I'm still optimistic, like you said, that we will find a way through the next few years of AI development. I think the book is very nice, and if you like, you know, technical physics and stuff, it's got a lot of that in there too, so I recommend it. Nice, cool recommendation. Lewis, thank you so much for being generous with your time today. I know we've run over the allocated recording slot. I really appreciate it. It's been a fascinating episode, and I've learned a ton. I'm sure the audience has as well. Lewis, before I let you go, how can people follow you after this episode if they would like to hear more from you? Sure. So these days, I'm mostly on LinkedIn. So just look up Lewis Tunstall. You can see my face; I don't think there are too many of us on there. And until recently, I used to be on Twitter: underscore L-E-W-T-U-N, so _lewtun. And yeah, my phone broke and I got locked out. And unfortunately, Elon seems to have fired all of the support staff, so I can't get back in. But one day I'll get back in, and then, you know, you might see me on Twitter. Nice. All right. Good luck. Maybe someone at Twitter is listening and can fix this situation. Statistically speaking, there probably isn't. Okay. All right. Lewis, thank you so much. Awesome to have you on the show.
Amazing to be able to go full circle with this amazing book of yours that I'm reading. And yeah, best of luck to you. And maybe we can catch up with you again someday in the future and hear how your journey's coming along. Thanks a lot, John. It's been a pleasure. Boom. What a sensational guest who made for a sensational episode. In today's episode, Lewis filled us in on how transformers are all you need for state-of-the-art NLP models. How we can efficiently label data using few-shot prompts to the APIs of cutting-edge models like GPT-4. How we can distill, quantize, and/or prune LLMs to make them affordable and fast in production. How RLHF uses human-labeled data to align LLM outputs with what users are hoping for. And how you can get involved in open source yourself by improving GitHub documentation, resolving GitHub issues, or curating datasets. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Lewis's social media profiles, as well as my own social media profiles, at SuperDataScience.com slash 695. That's SuperDataScience.com slash 695. If you too would like to ask questions of future guests of the show, like several audience members did during today's episode, then consider following me on LinkedIn or Twitter, as that's where I post who upcoming guests are and ask you to provide your questions for them. And if you enjoyed this episode, nothing's more valuable to me than if you take a few seconds to rate the show on your favorite podcasting app, or give it a thumbs up on the SuperDataScience YouTube channel. And of course, if you have friends or colleagues who would love the show, let them know. Alright, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks, of course, to Ivana, Mario, Natalie, Serge, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another fantastic episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors. Please consider supporting the show by checking out our sponsors' links, which you can find in the show notes. And finally, thanks, of course, to you for listening. I'm so grateful to have you tuning in, and I hope I can continue to make episodes you love for years and years to come. Well, until next time, my friend, keep on rockin' it out there, and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.