687: Generative Deep Learning, with David Foster

This is episode number 687 with David Foster, author of the book Generative Deep Learning. Today's episode is brought to you by Posit, the open-source data science company, by Anaconda, the world's most popular Python distribution, and by withfeeling.ai, the company bringing humanity into AI.

Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple.

Welcome back to the Super Data Science Podcast. Today, I'm joined by the brilliant and eloquent author David Foster. David wrote the O'Reilly book called Generative Deep Learning. The first edition from back in 2019 was a best-seller, while the immaculate second edition, which was released just last week, is poised to be an even bigger hit. He's also a founding partner of Applied Data Science Partners, a London-based consultancy specialized in end-to-end data science solutions. He holds a Master's in Mathematics from the University of Cambridge and a Master's in Management Science and Operational Research from the University of Warwick, both in the UK.

Today's episode is deep in the weeds on generative deep learning pretty much from beginning to end, and so will appeal most to technical practitioners like data scientists and machine learning engineers. In the episode, David details how generative modeling is different from the discriminative modeling that dominated machine learning until just the past few months. He talks about the range of application areas of generative AI, how autoencoders work, and why variational autoencoders are particularly effective for generating content. He talks about what diffusion models are and how latent diffusion in particular results in photorealistic images and video. He tells us what contrastive learning is, why world models might be the most transformative concept in AI today, and lots on transformers: what transformers are, how variants of them power different classes of generative models, such as BERT architectures and GPT architectures, and how blending generative adversarial networks with transformers supercharges multimodal models. All right, you ready for this profoundly interesting episode? Let's go.

David, welcome to the Super Data Science Podcast. It's great to have you here. I understand that you're actually a listener of the show. Yeah, a massive long-time listener. Thanks for having me on, Jon. I really appreciate it, and I'm really looking forward to getting into an in-depth conversation with you about generative AI. Pleased to be here. I'm glad that you reached out to us about having an episode, because you have this amazing book that just came out. It's really exceptional. I wish somehow I could have written this book. It's so timely and so comprehensive around generative AI models, which are obviously the hottest thing right now in the world. There's nothing that people are talking about more than generative AI. Whether or not they call it that, when people are talking about platforms like ChatGPT or Midjourney, they are talking about generative AI, and so I was delighted that you reached out as a listener to be on the show. You're like a celebrity listener out there. Thanks, David. Where are you calling in from today? I'm based in London. This is our office here in London, in Old Street.
Yeah, it's actually sunny here in the UK, which is a first. Finally, some sun's dawning on us. Nice to be talking with you. All right, well, let's rock and roll and get right into the content that we have planned for you. There's so much to cover today, because I know I'm going to learn a ton filming this episode, and no doubt our listeners are going to learn a lot about generative AI as well. So you just released the second edition of your popular book. It's called Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play. The first edition came out in 2019, it did very well, and I know that this one is going to be huge. Can you explain the differences between generative modeling, which is the focus of your book, and discriminative modeling, which up until recently was probably the much more common type of machine learning?

Yeah, you're absolutely right. It was, and I think the reason for that is, first of all, it's just a lot easier than generative AI. If you look back at the history of machine learning, the field has been driven by discriminative modeling, primarily because, first of all, it's really useful in business. It's really useful in a ton of applications: you've got a labeled data set, and there's a very clear outcome that you want to drive. You want to drive predictive accuracy against that label. With generative AI, first of all, the application isn't perhaps as clear, or at least it wasn't when the field was in its infancy. But also, secondly, it's really difficult to determine how well you're doing, because it's kind of subjective as to whether a piece of text or a piece of art is good. There's no such label that you can use to determine that. So in terms of the differences: discriminative modeling is all about being able to predict a specific label that you're given about an input, and typically you're moving from something that is high-dimensional, like an image or a block of text, or highly structured data, for example, through to something that's low-dimensional, like a label or a continuous variable, maybe a house price or something like this. Now, generative AI moves in the other direction. It's saying, can we start with the label and move back to the data? And so it really focuses on whether the model has understood what patterns are present in this data set, so that it not only can do something like collapse the dimensionality from an image to a label, but it can say: here's a label, dog, cat, boat, ship, go and find me the data that would produce this label. Produce me an image that looks like a ship. And why is this difficult? Well, the reason is that when you're moving to this higher-dimensionality space of, say, pixels or word tokens, there's a lot more that can go wrong. Our eye is very, very good at detecting something in an image that looks off, or something in a paragraph that just grammatically doesn't make sense. And so we really have to try hard when we're building models like the ones that we've seen, such as GPT or the diffusion models that I'm sure we'll come onto later, to make them good enough to be plausible. And so it's like finding a needle in a haystack, right, to find that one image of a boat that looks real. We are working in maybe a thousand-dimensional space. When we're collapsing stuff down in terms of discriminative modeling, we've got to collapse maybe to just one dimension. And that's a lot easier.
So yeah, I would encourage anyone who's getting started with machine learning to start with discriminative modeling, because even though generative AI is the hype, you've got to know the fundamentals, and a lot of the techniques that come up in generative AI are still fundamentally based in good old-fashioned discriminative modeling. They often have within them a slant that means you're predicting something in a higher-dimensional space, but you're still using the same concepts, like loss functions, like modeling a density function, for example. And so discriminative modeling will give you that basis. If you just want to get started, you start there, but you can move pretty swiftly onto generative AI, which is the current hype.

Yeah, and speaking of swiftly, I mentioned how your first edition came out in 2019, which is just four years ago, and the field has changed dramatically since. So run down for us how different the content is from your first edition to the second edition that's newly released. It is a totally new book, I've got to be honest with you. I sat down with the publisher and they said, do you want to write a second edition? And this was basically this time last year, maybe slightly earlier. So this was before DALL·E 2. It was before anything with Stability. And I sat down and thought, yeah, actually, I think this is about the right time to write a second edition. There's a lot of change, but ultimately I can move some chapters around, refresh the examples, refresh the content. And the moment I signed that contract to say I'd write the second edition, it all went nuts. DALL·E 2 was released, and then suddenly there was just this explosion of large language models and text-to-image models, which is, first of all, incredibly terrifying if you've just agreed to write a second edition. And I realized through the writing process that I needed to completely rewrite the whole book. So there is so little content that is the same; I think basically none of it is the same. It's a new book, effectively. And I'm proud of that, because it means it's current, it means it's up to date, and I can honestly say I'm really proud of it. It's something I think takes you from beginner through to understanding the entire landscape of generative AI models. It doesn't just focus on one model type or whatever's currently in vogue. It tries to take you on the journey from, let's just lay the fundamentals down in the foundations, through to, okay, now let's talk about Stability AI and Stable Diffusion, or DALL·E 2, or Midjourney, and really get to grips with what these models are doing. And obviously GPT and the OpenAI series. So yeah, I'm really proud of it, and I feel privileged to be in the position where I can write this book. Hopefully lots of people will get a lot out of it, and I'm really excited to see it in the market.

I wouldn't be surprised if this edition became like a standard in the field, based on what's covered in here and how well you covered it. It's so comprehensive, and the kind of praise that you got on the outside of the book kind of backs me up. You've got François Chollet, the creator of Keras, writing about how great he thinks the book is. You've got the head of strategy at Stability AI, the company behind Stable Diffusion. You've got senior people at Microsoft Azure.
You've got people from EleutherAI, which, in recent episodes, Five-Minute Friday episodes, I've been talking a lot about the open-source large language models that Eleuther have made available. You've got Aishwarya Srinivasan, who is this extremely famous content creator who works at Google Cloud. I mean, yeah, so I'm just kind of backing myself up quantitatively now; I've given so many of these. So yeah, I think your book has a lot going for it. You were going to say something, I'll let you go ahead. No, yeah, I feel really privileged that these people have taken the time to leave a review and to actually read the book and say that they think it's a useful addition to the library. I think when I look back, I'm standing on the shoulders of these giants, really. I mean, I'm just reporting on their incredible work. So I wouldn't be able to write this book without what they've done, particularly someone like François Chollet, who basically created the library that I'm using throughout the book to build practical examples of generative models. So yeah, really privileged.

And then you used open-source LLMs from Eleuther to just write all the content. I wish. Yeah, I missed the ChatGPT stuff; I missed it by a year. If I was starting to write the book now, maybe it'd be pretty easy. Yeah, well, I joke. There was actually a really interesting discussion on the Last Week in AI podcast, which is hosted by Jeremie Harris, yeah, and Andrey, I can't remember his last name, but Jeremie's been a guest on the show a number of times. They were recently talking about how, for online content that's like listicles, like BuzzFeed-type stuff, that's very easy to automate. But New York Times journalists, where you have to be doing investigative reporting, where you could be working on one story for months and really digging into things and interviewing people, that kind of job isn't going to go away. It can be augmented with these tools: they can help with making sure that you're doing everything around it correctly, and suggest some small parts of what you're doing. But with a book like yours that is so technical, so advanced, so cutting-edge, while these tools could be augmenting your writing, absolutely, they couldn't actually be generating all the content. Not yet. Exactly. Yeah.

So language generation, like text, as well as audio, these are some of the examples of generative AI, and images you talked about with DALL·E 2. What other application areas are there? Yeah, we cover lots in the book. So, for example, music is a field that I find particularly fascinating. I'm a musician myself. I can see you've got a guitar there in the background on the YouTube video. I'm really surprised, actually, that music generation hasn't really taken off in the same way that language generation has, because in many ways you'd think it's perhaps a little bit easier: there are so many genres of music, and we've just got to arrange these audio waves in a way that's pleasant to the ear, whereas words have a grammatical structure and there are very strict rules about what we want to see. But I sort of think to myself, why is that? And I wonder if it's in part because of the lack of data that's available. There's obviously a ton of text data available on the web.
Not as easy, perhaps, to find music data in such quantity. Perhaps it's also driven in part by necessity, in that large language models are also extremely useful. So yeah, we cover it in the book, though, music generation. I'll just quickly interrupt you on the music thing. I think that you're absolutely right. Actually, I don't think it's the paucity of data; although there is obviously a lot of language data out there, there is a lot of music data as well. I think you hit on it right at the end there, which is that very few people are employed in creating music, but for almost all white-collar workers, our lingua franca, the medium that we take in as well as output, is text. And this became even more obvious through the pandemic, when you saw that so many jobs could be done remotely, where it's just emails and Slack messages, and so it's text in, text out, for a lot of what we do. So I think that's why it's something that's talked about more, but it is interesting. There has been an explosion in generative music, so Spotify apparently has a hundred thousand tracks uploaded to it every day, a hundred thousand tracks a day, and almost all of that is AI-generated music. And the reason that happens, because you'd think, well, what's the point, why are people wasting server time uploading that, is that they also have bots that listen to those fake tracks, which brings in royalties for these people. But Spotify is starting to crack down on that. Anyway. Yeah, I can imagine. Yeah, I think it's interesting to see where this goes, because I'm acquainted, I guess, with the VP of Audio at Stability AI, and he is first and foremost a composer. So he's not someone coming at this from a machine learning perspective; first and foremost, he comes to this as a composer, so he cares deeply about the rights and the authenticity of the music that's being generated, but also sees the potential for a different kind of music that we'll be listening to in future. So yeah, it's been exciting to see how platforms like Spotify jump on the bandwagon here. Absolutely.

This episode is brought to you by Posit, the open-source data science company. Posit makes the best tools for data scientists who love open source, period, no matter which language they prefer. Posit's popular RStudio IDE and enterprise products like Posit Workbench, Connect, and Package Manager all help individuals, teams, and organizations scale R and Python development easily and securely. Produce higher-quality analysis faster with great data science tools. Visit posit.co, that's P-O-S-I-T dot co, to learn more.

All right, so I interrupted you a while ago. You were going to transition away from music to another application area for generative AI. Yeah, sure. So we cover music in the book, but also other modalities, especially cross-modalities. So we're talking about things like text-to-image, and also, interestingly, things like text-to-code, which I guess is another kind of language model, but a very specific kind of language model. But also, in the final chapters, how reinforcement learning plays a part when we're talking about things like world models, where there's a generative model at the heart of the agent, which is trying to simply understand how its environment evolves over time.
And then, layered onto that, is the ability for the agent to use this generative model to understand what its future might look like, and therefore hallucinate different trajectories through its action space. So yeah, we might come onto this in a little more detail later. Yeah, we've got it all covered in the book. Awesome. Yeah, there are so many exciting topics to come in this episode. So, application areas that I've now, I think, jotted down relatively comprehensively: you've got text generation, voice, music, images, video, code, multimodal models, tons of different areas, really exciting times. So in what way do density functions serve to distinguish these different generative AI techniques from each other?

Yeah, that's a great question. If I briefly talk about how we cover this in the book: the first section of the book is what we call methods, and this is where I lay out the six fundamental families of generative AI model, and the second half is based on applications, so, what can you do with them? Now, the six families of model are differentiated by how they handle the density function. So let me give you an example. The first split that you can make is between those that implicitly model the density function and those that explicitly model it. And what I mean by that is, imagine the density function is basically like a landscape over which you're trying to move to find images that are more likely to be real than others. The images that are most likely to be real are, say, at the bottom of the valleys, and the least likely to be real are at the top of the mountains. So you're always trying to move downhill in this model, and you're trying to come up with the landscape that truly reflects how real images are produced. So we're kind of postulating that this landscape really does exist and that we need to find a model, an abstraction of reality if you like, that captures the true nature of this. If you imagine the different dimensions of this landscape are the pixels in an image, then there are some configurations of pixels that are in the valleys, because they produce very realistic images, and there are some configurations of pixels that are on the mountains, and they aren't very realistic. So the question always becomes, firstly, how do you model this landscape? What does it actually look like in this really high-dimensional space? And secondly, how do we navigate it? How do we move downhill to find images that look real? So implicitly, you can model this with something like a GAN, where you don't actually write down an equation of what this model looks like, but you play a game between what's known as the generator and the discriminator. Can I quickly jump in? For sure. For those who don't know that term, GAN, it's a generative adversarial network. Yeah, exactly, generative adversarial network, GAN. And you're basically playing a game here between the generator, which is trying to create images that look real, and the discriminator, which is trying to pick between those that are real and not. And so at no point in that process do you write down an equation that says, yeah, this is what I believe the density function to be. You're implicitly modeling it through this game, and that is in contrast to pretty much every other kind of model, which does in some way try to create this density function, which we usually call p(x).
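To make that implicit-versus-explicit distinction a little more concrete for readers following along in code, here is a minimal sketch of the GAN game in TensorFlow/Keras. This is not code from David's book; the layer sizes, the flat 28x28 "images", and the optimizer settings are arbitrary placeholders. The point is structural: p(x) is never written down anywhere, and the density is only ever modeled implicitly through the two competing losses.

```python
# Minimal GAN training step (illustrative sketch, not the book's implementation).
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 64
IMG_DIM = 28 * 28  # assumed flat "image" size, purely a placeholder

generator = keras.Sequential([
    keras.Input(shape=(LATENT_DIM,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(IMG_DIM, activation="sigmoid"),   # fake image as a flat vector
])
discriminator = keras.Sequential([
    keras.Input(shape=(IMG_DIM,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1),                               # real-vs-fake logit
])

g_opt = keras.optimizers.Adam(1e-4)
d_opt = keras.optimizers.Adam(1e-4)
bce = keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real_images):
    batch = tf.shape(real_images)[0]
    z = tf.random.normal([batch, LATENT_DIM])
    with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
        fake_images = generator(z)
        real_logits = discriminator(real_images)
        fake_logits = discriminator(fake_images)
        # Discriminator: push real images toward label 1, fakes toward label 0.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: try to fool the discriminator into outputting 1 for fakes.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```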
So in this other set of models, there are different ways of dividing it up, and one of the ways, for example, is to say, okay, we can approximate it in some way. We're not going to try and find it perfectly, but we're going to approximate it. Variational autoencoders do this, and some other model types as well. On the other side, you can also find models that try to model it really explicitly, such as your autoregressive models, where you place some constraints on how the generation is produced. Autoregressive models always look to produce one sequence step ahead, so something like GPT is a good example of this, where you're just predicting the next word or token at a time, and you can write down an equation that says, this is what I believe the landscape to look like, because I am restricting it to just predicting the next word. The equation would be huge, but you can write down what it looks like. Then you've got some other types, like normalizing flows, where you enact a change of variables on the landscape and try to morph the landscape into something that is easier to sample from. You've got energy-based models, which are the fundamental root of diffusion models, which we can talk about later; this is basically saying, how can I come up with a function that tells me how to move downhill in this landscape? And I think that covers it; those are our six kinds of model. So they all try to model this density function slightly differently, but ultimately it's a fundamental part of generative AI, understanding what we mean by a density function, and we cover that in the first chapter of the book.

Sweet. That's exactly why we've kicked off with that here. So, something that's happened recently, at the time of recording, is that Geoff Hinton, who is perhaps the single most important person in the history of deep learning, and deep learning is essential to all of these generative techniques that you've just been describing, indeed your book is called Generative Deep Learning, and I'm not really aware of contemporary generative approaches that don't use deep learning. Correct, yeah, pretty much everything does now. So Geoff Hinton, sometimes called the godfather of artificial intelligence, but probably more accurately the godfather of deep learning, won the Turing Award with Yoshua Bengio and Yann LeCun, which is like the equivalent of a Nobel Prize for computer science. He was at Google for a very long time, and he recently left, at the time of recording this at least, and he cited significant concerns about the misuse of generative AI as the key reason for leaving. He wanted to be able to express himself more clearly. He's actually clarified since that he doesn't think Google is doing a bad job, but that there are pressing concerns here and that he needs to be able to speak freely about them. So do you agree with Geoff Hinton? What do you think about this whole situation? Okay, there are a few things to unpack here. First of all, I massively respect Geoff Hinton's work. I think a lot of us wouldn't be doing what we're doing without his fundamental breakthroughs in the field, around things like backpropagation, obviously, in the early days of deep learning. So it's worth listening to what he says, because I think he's got valid points and he puts them across very eloquently in his interviews. First of all, I would say it's important to note here that the difficult position to take in this is to say that we're going to be fine.
And the reason I say that is because it's very hard to prove somebody wrong who says AI is an existential threat, because if it hasn't happened yet, then they can just say, well, it hasn't happened yet, but it will happen. So you're always in this position of, well, how do I show that this argument I don't particularly agree with, how do I show that I don't think it's an existential threat, and that we can put things in place to prevent the threat from happening, or that it's just not viable in the first place? So you've got to, first of all, think really hard if you're going to come out and say, I think AI isn't an existential threat. And I have been doing a lot of thinking about this, listening to arguments on both sides, and I think there are hugely valid points to be made, but ultimately I've come down on the side of not thinking it's as great a threat as perhaps the likes of Geoff Hinton are putting out there. One of the criticisms I would perhaps make of the argument that it is, is that I don't like the idea of just waving the hands and saying that the AI will want to take control. I think there's a huge leap from saying that we have large language models which now predict the next word very, very accurately, and of course can be chained with tools and all of those things, to then saying that those same language models will have wants and desires and long-term aspirations to achieve a particular goal. I really don't think that a model which is ultimately interpolative... these models, whilst they look as if they're doing very clever extrapolation, I believe are ultimately still confined by the dataset that they are trained on. I don't think, and I might be wrong about this, but I just don't think that they're going to have the capacity to want to eliminate us, and that is ultimately what he's saying. And to be clear, this is very separate from saying bad people will do bad things with AI, and I think they will; there's no question. We see it with every technology: if bad people want to do bad things with the technology, they will. And I agree with him there, that we need to be extraordinarily cautious that we don't let that happen, and put the regulation in place to ensure that it doesn't. But there's a huge leap to then say that the AI itself is going to want to dominate us just because it's more intelligent than us, or apparently more intelligent than us. I think we're downplaying our own capabilities here. The counter-example I would make, perhaps, is this: if you trained a large language model on all scientific data, or just all data, up until, say, 1910, would it come up with general relativity? I just don't think it could. I don't think it can make that extrapolative leap that says, given the data I have available to me at the time, I can run this thought experiment myself, and want to run the thought experiment, to come up with something as profound as relativity. I can't see that happening, and therefore it leads me to the conclusion that we've got something worth fighting for against this AI, and we shouldn't just lay down and say, yep, we're now on a path to existential annihilation because we've built something that can predict the next word very, very well. I'm optimistic, basically.

Did you know that Anaconda is the world's most popular platform for developing and deploying secure Python solutions faster?
Anaconda's solutions enable practitioners and institutions around the world to securely harness the power of open source, and their cloud platform is a place where you can learn and share within the Python community. Master your Python skills with on-demand courses, cloud-hosted notebooks, webinars, and so much more. See why over 35 million users trust Anaconda by heading to superdatascience.com/anaconda. You'll find the page pre-populated with our special code SDS, so you'll get your first 30 days free. Yep, that's 30 days of free Python training at superdatascience.com/anaconda.

Yeah, there are a lot of different ways we could go with this, and I'm not going to let us... I mean, we could literally spend this entire episode talking about this stuff, but we have a lot of technical material that I'd like to get into with the generative AI that you specialize in, so I'm not going to drag this out too long. There are interesting things where, yes, today, models like GPT-4 are predicting the next word; they're not in and of themselves a risk. But we have tools like Auto-GPT that were built on it, where Auto-GPT could potentially be given a large amount of resources, including a lot of its own GPT-4 agents, and we could give that Auto-GPT a broad task like, here's a million dollars, increase the amount of money. And one person might say, increase the amount of money but also don't break any laws, whereas another person might not give that qualifier. Or even without breaking any laws, it might figure out a way that takes advantage of some people to generate more money in the bank account. And while Auto-GPT today might not be too sinister, you know, with how crazy things have become just in the last year... You talked about signing your book contract a year ago and the incredible progress that's happened over that year. If somebody had asked me a year ago whether I thought a model like GPT-4 could exist in our lifetimes, I might have said I don't know, it's really good. So I don't know. And even just scaling: that huge innovation has come about through just scaling the same architecture, transformers, and so scaling that another ten times or another hundred times before it gets prohibitively expensive to train... there aren't that many orders of magnitude left before we're talking about something like a hundred billion dollars to train a model. So there are probably also going to be scientific breakthroughs beyond just the engineering breakthroughs that we're doing today on scaling bigger and bigger. Anyway, I can get why people, including Geoff Hinton, are concerned about the existential risks. But tying more immediately into the kinds of concepts that are covered in your book, he also expresses concern about fake content and misinformation, which you alluded to there. That is the immediate risk: with the tools we have today, anybody who wants to misuse them can, and they can do things incredibly powerfully, just tying lawyers up, for example. A specific example I read yesterday, I think this was in the Economist: they gave this example of how a person could create a thousand-page document as to why, like a NIMBY, someone who doesn't want, so, not in my backyard, NIMBY, yeah, a NIMBYist, they could create this thousand-page proposal for government officials to read about why they don't want, you
know, electrical wires visible from their back window, and a human then probably is going to have to read that and respond to it. So there are all these interesting things, and that's not even really a misuse of the technology, but at the scale that we can now create language, it's going to cause problems. It doesn't seem like you're too concerned about it, so I guess, why aren't you that concerned about the immediate risks, or do you already have in your mind ways that we can overcome these risks, perhaps with AI itself? Yeah, so I would say the immediate risks I'm slightly more worried about; it's the existential risk that I perhaps think is overplayed. The immediate risks of disinformation and of large language models creating a huge amount of noise in our world, whether that's creating work for people like the lawyers reading the document that you just mentioned, or just the fact that it might nullify the power of things like social media platforms if we can't really determine what's real and what's fake, as well as democracy itself, if we're now influenced by media content that isn't correct or isn't real, I think that is more of a risk. And the way I would like to see this handled is, first of all, education. I think we are going to have to get used to a world where we need to be a lot more vigilant about what's real and what isn't real. I think we've been extraordinarily privileged, actually, to live through the start of the internet era being relatively free of fake content, and I think that has generated a huge amount of worth in things like, for example, programming, where before I would have had to go and buy the book on Python if I wanted to learn how to do something, and now I can just go online and I know I'm going to find an article written by a human that tells me exactly how to do what I want to do. So there's a huge amount of value that's been created by that, and I think that value is now being condensed into models such as GPT-4, which is going to be even more powerful than me trawling through hours and hours of Stack Overflow content to find out how to do something in pandas, which is what I usually end up doing. So on the one hand it's going to actually improve efficiency like this, but also, like you say, I think we just need to be extraordinarily careful that we don't let this thing run away with itself. And humans are incredibly slow to react to new technologies like this; we often need some sort of event before we go, yeah, we don't want that happening anymore, and nobody really knows what this event is going to be. I was talking to an AI IP lawyer earlier, and she's along the same lines: she said it's very hard to get people to take notice or listen before something happens that makes us go, yeah, that's the thing we didn't want to happen. So I think this is in line with some of Sam Altman's comments, and also Yann LeCun's comments, around how we can start legislating against something that we don't yet know. And you don't want to stifle innovation, you don't want to stifle research, just because you're worried that something might happen; otherwise you'd just legislate everything. So look, I don't have all the answers, but I'm optimistic, and I would like to see more people optimistically trying to come up with solutions rather than just
pointing out that there's annihilation around the corner, which I just don't think is credible at the moment. Yeah, and I think that AI itself can be used to solve a lot of these issues. So Jeremie Harris, whom we've already talked about, has a lot on his show, Last Week in AI, about ways that we can be mitigating some of these risks, and one thing that he talks about regularly, and that he talked about even on our show, we had him on for a GPT-4 risks episode, it's episode number 668, and in that episode he talked about how we can be using AI to monitor AI, because it's much faster than us. So we can't have people monitoring for slight aberrations, but we could train AI to try to keep it in line. So for the existential risk thing, that's, I guess, a leading approach today for how we deal with that. And even with the misinformation stuff, we can have misinformation detectors that are automated. And I'm usually pretty skeptical about the crypto hype and blockchain in general, but a real-life application of the blockchain that was first brought to my attention by Sadie St., I believe in episode number 537 of this show, was that you can use the blockchain to verify that a document is real. So if there's a source that you trust, like the New York Times or the Economist or whatever, then an image or a video can be tagged, I don't know the technology very well, but you can verify on some blockchain that, okay, this actually really came from that trusted source. Mm-hmm, yeah, attribution is going to be a critical thing that we have to care about going forward, and I think what's important is that we don't make it black and white, this is AI-generated, this is not AI-generated, because ultimately it is a gray zone. If I use an AI tool to generate the structure of a document, but then I fill in the blanks or I extrapolate, I don't really want to have to label that as AI-generated, because ultimately it's had my eyes on it; I've overseen the process. It's a bit like if I use a tool like a spell checker: I don't have to declare, yes, I spell-checked this document. I just put it out there, because it's had my eyes on it. But what I think we need to label is anything that is AI-generated that has had no human eye look over it, and that's where I think we might need to start saying, okay, if this content has been produced and no human has had any part in the production of that content, I think people should know about that. And I think it's important that we can distinguish, or at least label in some way, any content that has gone out unverified, because that's where you might start to see the problems. And I go back to my example of Stack Overflow: if there's content on there that is an answer to a question and has been AI-generated, I kind of want to know, if I was reading it, to take it with a pinch of salt, because it might not be something that someone has actually produced. It might still be useful, but it's not human-generated; it's AI-generated. Nice, yeah. So there are risks, but we can mitigate most of them, and I think it's good that people are calling these risks to our attention, so hopefully we can get ahead of them to some extent and the most audacious issues can be tackled upfront. So let's move back to technical stuff. One of the fundamentals of generative AI is autoencoders. We talked about density functions earlier; let's talk about
autoencoders. These are a really key concept in generative AI, and there's this idea of encoding information. So let's take the example of a text-to-text model. This is like the ChatGPT experience: we provide text to the model, it encodes that text into something called a latent space, and then there's a decoder that takes that latent-space information and converts it into new text, which, in this example I'm giving, ChatGPT provides back out to us. So: encoding text, latent-space representation, decoding. I think altogether this describes an autoencoder. So yeah, maybe fill us in a bit more on what these terms mean and what important role they play in generative AI systems. Yeah, cool, great question. I'll take it back, actually, to an example with images, because I think it's slightly easier to visualize for your listeners. So let's imagine we've got an image and it's, like, 1,024 pixels across, so a high-dimensional space, and every single one of these pixels has three color channels, so you've got a lot of numbers, basically, to describe that picture. And as we've previously mentioned, there is some density function that describes why that image is very likely to be a true image and other, noisy images aren't. Now, what autoencoders look to do is say, can we map this high-dimensional space of the image domain to what is known as a latent space of a lower dimension? You could even map to a latent space of two dimensions; then it's very, very easy to visualize, because you're just imagining a plane, and on that plane there are mountains and valleys, and that determines whether points in the latent space are likely to produce real-looking images or not. Now, the reason this is useful is because it forces the model to make generalizations over the pixel space so that it can compress that information into the latent space. It's a bit like, and I always give the example of cookie jars or biscuit tins, which are cylindrical, how many numbers do you need to describe the shape of that biscuit tin? The answer is two: you need to know the height, and you need to know the diameter of the cross-sectional circle. If you know those two numbers, you could reproduce the biscuit tin. So even though this thing exists in three dimensions, and you could view it from different angles and come up with different pixel pictures of it, actually you can describe it using two numbers, and in that latent space you could move around to produce different kinds of cylinder. Exactly the same thing is true with models like diffusion models, or even variational autoencoders: you're basically saying to the model, find me a low-dimensional latent space where, if I choose any point within that latent space from some distribution, like a normal distribution centered on the origin, I am pretty likely to find something that is truly a real image. And then what the decoder is trying to do is move back from the latent space to the pixel space: can you take these two numbers and recreate the biscuit tin? So if you join those two things together, you've got what's known as an autoencoder, because it's trying to effectively compress the information down to something small and then expand it back out again to the original image. It's auto-encoding itself. Nice, great explanation. I love that: a three-dimensional biscuit tin cylinder represented in two dimensions. That's such a crisp way of describing how this latent space can contain information like that. Awesome.
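For readers who like to see the idea in code, here is a bare-bones autoencoder sketch in Keras, in the spirit of the biscuit-tin analogy: squeeze a flattened image down to two numbers and reconstruct it. It is illustrative only, not an excerpt from the book, and the 28x28 input size and layer widths are arbitrary assumptions.

```python
# Plain autoencoder sketch: encoder -> 2-D latent space -> decoder.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

IMG_DIM = 28 * 28      # assumed flattened image size (placeholder)
LATENT_DIM = 2         # the low-dimensional latent space ("two numbers")

encoder = keras.Sequential([
    keras.Input(shape=(IMG_DIM,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(LATENT_DIM),                      # z: the compressed representation
])
decoder = keras.Sequential([
    keras.Input(shape=(LATENT_DIM,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(IMG_DIM, activation="sigmoid"),   # back out to pixel space
])

inputs = keras.Input(shape=(IMG_DIM,))
autoencoder = keras.Model(inputs, decoder(encoder(inputs)))
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss only

# Training uses the input as its own target, which is the "auto" in autoencoder:
# autoencoder.fit(x_train, x_train, epochs=10)
```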
So there are different kinds of autoencoders. We've got variational ones, for example, which are more popular today. So how do variational autoencoders differ from traditional ones? What unique capabilities do they offer? Yeah, so the problem with vanilla autoencoders, let's call them vanilla, so not variational, is, well, there are a few problems. First of all, if you just let the model map to any old latent space, like you just say, take the pixel space and I want you to find two numbers that represent what that image is so that you can decode it, the problem is it's very hard to sample from that two-dimensional space. Let's say the point (100, 100): is that a valid image? What about (200, 200)? Or (2 million, 2 million)? Where in this vast space should I be sampling? So what you end up with is a latent space that is, first of all, very difficult to sample from, and without much structure; it's got no incentive to pull similar concepts together, because ultimately it's unconstrained. What a variational autoencoder does is make a very slight change to the loss function. It effectively says, you've got to include a term which makes sure that the points, when you map them into this latent space, are as close to a standard normal distribution as possible. By a standard normal distribution, what I mean is a normal distribution with a mean of zero and a covariance, or standard deviation, of one. We know how to sample from this object; it's really common, and we know exactly how it works. And by doing that, what happens is that everything gets compressed into something that looks like a normal distribution, and that helps us in two ways. First, it means that there is a degree of continuity in the latent space, so you can move around it and be pretty sure that anything within this normal distribution is going to be something that's likely to be a real image, and if you move to the extremes then you're going to find something that's less likely. Second, because we understand what this distribution means, we can sample from it really easily and be sure that if we choose random points from a standard normal, we're going to be able to decode those points into a real-looking image. So the variational autoencoder is a bit like the glue that glues everything together and makes it a true generative model that we can sample from, and not just this abstract autoencoder object, which isn't very easy to work with, basically. Nice.

The future of AI shouldn't be just about productivity. An AI agent with the capacity to grow alongside you long-term could become a companion that supports your emotional well-being. Para, an AI companion app developed by withfeeling.ai, reimagines the way humans interact with AI today. Using their proprietary large language models, Para's AI agents store your likes and dislikes in a long-term memory system, enabling them to recall important details about you and incorporate those details into dialogue without LLMs' typical context-window limitations. Explore what the future of human-AI interactions could be like this very day by downloading the Para app from the Apple App Store or Google Play, or by visiting para.ai on the web.

Great explanation, crystal clear. So variational autoencoders allow us to constrain the latent distribution toward a standard normal, which leads to better behavior in the autoencoder: we get better results.
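Sketching the variational twist on top of the plain autoencoder above (again, an illustration rather than the book's implementation): the encoder now outputs a mean and log-variance, a latent point is sampled with the reparameterization trick, and a KL term pulls the latent distribution toward a standard normal. For a diagonal Gaussian measured against N(0, I), that KL term has the closed form David mentions shortly, 0.5 * sum(exp(log_var) + mean^2 - 1 - log_var), so no sampling is needed to evaluate it.

```python
# Variational autoencoder sketch: reparameterization trick + KL "glue" term.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

IMG_DIM, LATENT_DIM = 28 * 28, 2   # placeholder sizes, as in the sketch above

class Sampling(layers.Layer):
    """Reparameterization trick: z = mean + exp(0.5 * log_var) * epsilon."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

enc_in = keras.Input(shape=(IMG_DIM,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(enc_in, [z_mean, z_log_var, z])

dec_in = keras.Input(shape=(LATENT_DIM,))
x_hat = layers.Dense(IMG_DIM, activation="sigmoid")(
    layers.Dense(256, activation="relu")(dec_in))
decoder = keras.Model(dec_in, x_hat)

def vae_loss(x, x_recon, z_mean, z_log_var):
    # Reconstruction term: how well the decoder rebuilds the input.
    recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_recon), axis=1))
    # KL term in closed form: pulls the latent distribution toward N(0, I).
    kl = tf.reduce_mean(0.5 * tf.reduce_sum(
        tf.exp(z_log_var) + tf.square(z_mean) - 1.0 - z_log_var, axis=1))
    return recon + kl
```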
Precisely, yeah. And the term in the loss function that does that is called the KL divergence, the Kullback-Leibler divergence, and it's the glue that makes this the first kind of generative model I would recommend everyone starts with. Nice, yeah, that KL divergence, that's big in information theory. Yeah, it's a way of measuring the difference between two distributions. So if you've got your distribution of points and you want to compare it to the standard normal, you can use the KL divergence to do that. The beautiful thing is that it actually has a closed-form solution for a standard normal, which means you don't need to do any sampling to work out its value; you can just write down the answer, which is super powerful. Cool.

All right, so we've got kind of the key terms now under our belts for generative AI. We know about density functions, we know about the application areas, we know about autoencoders. So let's talk now about the big breakthrough that captured the public's imagination. Even before you signed your book deal a year ago, there was already a lot of hype around DALL·E. This was released by OpenAI, the same company that released ChatGPT, and it's a text-to-image generator. And the original DALL·E, while miles behind DALL·E 2, which came out shortly after you signed your book deal, even the original DALL·E, for some kinds of requests, created stunning imagery. On their website, for example, there were examples of being able to say, I want a shark, you know, walking a crocodile or whatever, and it could create a cartoon of that. Compared to DALL·E 2 or Midjourney, it wasn't that many pixels, and it was definitely better at cartoony-type stuff relative to photorealistic stuff, but still, this was the first time that I, and probably most people, were able to have this unbelievable creative outlet of taking any text that comes into your head and automatically generating an image. So that DALL·E model leverages diffusion, and your book has an entire chapter devoted to diffusion. Can you explain what diffusion is and how noise can be employed in the generation process? Yeah, sure, let me start with diffusion then. DALL·E actually is made up of a few components, and diffusion is used in a few of them; it's definitely a core component of DALL·E 2. So yeah, a great place to start is to first explain what diffusion is. The best way I can describe it, using a metaphor, is to imagine that you've got a set of TV sets all linked together in a long line. The first TV shows just random noise, complete random static, and the last TV in that sequence shows an image from your data set. Now, if you want to move from the image in your data set on that television to the random noise, it's very simple: you can just add tiny, tiny bits of random noise to that image in tiny steps, Gaussian noise, which basically means noise sampled from a normal distribution, and eventually, over enough time steps, you won't be able to tell what that image was; it's basically as good as random noise. So you've moved from the image domain of your data set through to the noise domain, which we can sample from. And with generative AI, where you're always trying to get to is, can I sample from this thing? Because if you can sample from it, that means you've got this random point that you now just need to decode. And so, you know, we
talked just now about encoders and decoders; the adding of noise is a bit like encoding. It's not quite the same, because it's not a learned model that does this, it's just noise addition, but the beauty of the diffusion model is that it learns the reverse process. It learns how to undo the noise and get back to the original image. Now, you might say, well, how on earth does it do that? How do you just, out of random noise, find an image? But you can think to yourself, well, if you do this in enough, small enough steps, then this is kind of possible. Let's imagine your data set was just images of houses, outdoors. Most of the time the upper pixels will be blue, because they're the sky, and you're going to have some kind of greeny pixels down the bottom. So to get from random noise to an image, you might train a model to say, let's try and keep some of the green pixels at the bottom, I think they're the ones that need to be adjusted in such a way that they're slightly more green, and the pixels at the top, I want you to adjust those so that they stay roughly more blue than the pixels in other parts of the image. And it turns out that if you do this over enough time steps, and in small enough steps, the model, by taking what it already has and making a slight adjustment that makes it slightly more like an image, can make random noise turn, almost like before your eyes, magically back into something in the image domain. And the way that the diffusion model actually works, the nuts and bolts of it, is something called a U-Net model. Unlike a variational autoencoder, which tries to move from, say, the latent space back to the original pixel space in the decoder, this U-Net model simply maps the image to another variation of the image with slightly less noise; that's what it's trying to do. And if you do this over enough time steps, it turns out you can train a pretty good model to learn how to decode noise back into the original image domain. So that's how they work: diffusion models are all about U-Nets, and they're all about adding noise through a forward step and then trying to remove the noise through a backward step. Nice, and so I guess that's how Stable Diffusion works as well, so that's what's behind Midjourney. At the time of recording, Midjourney version five is the state of the art; it creates amazing photorealistic images. And the same kind of approach is behind there, probably just scaled up, right? Probably a larger model architecture and more training data? Yeah, and the beauty of Stable Diffusion is in an advancement that they made called latent diffusion, and this is where all of the ideas we've talked about tie together, because latent diffusion works in the latent space. So there's actually an initial part of the model that tries to compress the image down to something that isn't pixels anymore but is, effectively, a latent space of concepts, and then the diffusion model works on that latent space, and there's a decoder, effectively, that sits after this and takes the denoised latent space back into pixel space. So what they realized was that you don't need to work on the pixel space itself, because you've got a lot of redundant information; you can work in a much smaller and faster latent space. That's the beauty of it; that's why it's so good.
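Here is a toy numerical sketch of that forward-and-backward idea (not Stable Diffusion itself): the forward process blends an image with Gaussian noise according to a schedule, and the model, a U-Net in practice but just a hypothetical noise_predictor placeholder here, is trained to predict the noise that was added so it can later be removed step by step.

```python
# Toy diffusion forward process and training objective (illustrative sketch).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # noise schedule
alphas_bar = np.cumprod(1.0 - betas)        # cumulative signal-retention terms

def forward_noise(x0, t, rng=np.random.default_rng()):
    """Jump straight to step t: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise
    return x_t, noise

# Training loop, in pseudocode: the network learns to recover the added noise.
# `noise_predictor` is a stand-in for the U-Net David describes.
# for x0 in dataset:
#     t = rng.integers(0, T)
#     x_t, noise = forward_noise(x0, t)
#     loss = mean((noise_predictor(x_t, t) - noise) ** 2)   # simple MSE objective
```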
Nice, that makes perfect sense. So the distinction between latent diffusion, this newer technique that powers, say, Midjourney version five, relative to the diffusion that's been around for all these years, yeah, all these months, is that it allows for diffusion on the latent space. As we talked about earlier in our discussion of how we use an autoencoder to go from an encoder into a latent space, which you then need to decode later, the latent space there is like your 3D biscuit tin and how it can be represented with just two pieces of information. Similarly here, when we're doing diffusion on the latent space, we're doing diffusion on more compressed information, and so it's more computationally efficient, easier to scale up, and we get better results. Yeah, perfect, exactly that. Nice, cool. A related topic is CLIP models. So what are CLIP models, and how are they leveraged in these kinds of text-to-image tasks that we've been talking about, like DALL·E and Stable Diffusion? Yeah, cool. So a CLIP model is one part of DALL·E 2, and I'll come onto exactly which part and how it's used, because CLIP itself isn't a generative model. CLIP actually uses a technique called contrastive learning to effectively map pairs of text and images. So imagine you've got a data set with loads of pairs of images and their corresponding descriptions. Let's say you've got a picture of a field with a tractor, and then you've got a text description that says, this is a field with a tractor in it on a sunny day. What CLIP does is try to learn a model that can match the image to its matching text description, and the way it does that is by training two different kinds of transformer, and we can come onto the details of a transformer later. The transformer on the text side basically says, can you encode this text description into a vector, and the transformer on the image side says, can you encode this image into a vector. And then what it's doing is taking these two vectors and quite simply calculating the cosine similarity between them. What you want is for true pairs to have a very high cosine similarity score and for mismatched pairs to have a very low similarity score. That is what the CLIP training process does: it tries to find this kind of identity-matrix pattern where, along the diagonal, you get very high scores, because these are the matching pairs, if you imagine the images on the rows and the texts on the columns, and on the off-diagonal you want the scores to be as small as possible, because you don't want those things to be regarded as similar. So it's a bit like a recommendation algorithm: is this image recommended to go with this text? And so this isn't generative, right? We're not going to be producing more images through this. One of the cool things about this, I think, because OpenAI released this CLIP model standalone as well, one of the cool things about this approach, and it follows on from what you were just describing, is that it allows you to have an image classification algorithm that didn't necessarily have the label that you'd like to extract in the training data.
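A small NumPy sketch of the contrastive objective David is describing (a generic CLIP-style loss, not OpenAI's actual training code): normalize the two sets of embeddings, take all pairwise cosine similarities, and use a symmetric cross-entropy that rewards the diagonal, the true pairs, and penalizes everything off it.

```python
# CLIP-style contrastive loss over a batch of matched image/text embeddings.
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(img))              # the matching pair sits on the diagonal

    def xent(l):
        # Softmax cross-entropy against the diagonal targets.
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(p[np.arange(len(l)), labels]).mean()

    # Symmetric: rows classify images -> texts, columns classify texts -> images.
    return (xent(logits) + xent(logits.T)) / 2

# e.g. with random stand-in embeddings:
# loss = clip_contrastive_loss(np.random.randn(8, 512), np.random.randn(8, 512))
```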
So, ten years ago, and really until very recently, the state of the art in image classification was models like Geoff Hinton's AlexNet, which came out in 2012 and was trained on the ImageNet data set, which had tens of thousands of different labeled categories: cats, horses. It had tons of different kinds of dogs, because they wanted the model to demonstrate that not only is it good at classifying a wide range of images, but also that, for a specific category of images, it could distinguish fine details and tell a Yorkshire Terrier from an Australian Silky Terrier, even though these are extremely similar-looking dogs. And so the state of the art was, going back to one of our first topics in this conversation about discriminative models, where we're discriminating down to specific class labels, that even with tens of thousands of labels, you still could not use a model trained in that discriminative approach to guess a label outside of the labels it was trained on. But with CLIP, we get exactly that. With CLIP, you can just ask it to label images that it's never seen before, in class categories it's never seen before, and it uses this approach that you just described to map it to any natural language. Yeah, precisely, and the reason it can work is because it's encoding everything into the same latent space. It doesn't matter if it's not a label in the data set; you can make it a label by pushing it through the encoder, whether it's an image or a text. Right, so it's a latent space, and the meaning that is embedded in this latent space we can extract either visually or linguistically. Exactly, and that's what DALL·E 2 excels at. It basically takes the text embedding from your input, so say you've written something about wanting to see a cat riding a skateboard, then it takes that text embedding and tries to predict what the corresponding image embedding looks like; that's called the prior. And then the final step takes the image embedding and uses diffusion to generate the image. So it's a three-step process: text goes through the text encoder to create the text embedding, and that's just the CLIP text embedding; you've then got a prior which sits in the middle and says, now go and predict what the equivalent image embedding looks like in the latent space of the image model; and then just decode it. I mean, I say just, there's a lot of work that's gone into that, but that is how DALL·E 2 works. Nice, okay, super cool. So this CLIP approach is great not only for associating natural language that wasn't in the label training data, but also for allowing DALL·E 2 to be so much more effective than its predecessor, DALL·E. And I guess we already talked about this; I was going to ask you a question about how CLIP can be used for zero-shot prediction, but I think we've already covered that. This idea of zero-shot prediction is using a machine learning model, typically a large language model, to do some task that it wasn't trained on, without any training examples at all. So you take the model weights as they were trained and you say, do this task, you know, is there a skateboard in this image, and it can answer that question even if it's never been trained to do that. Precisely, that's exactly it. Even if you've never given it that task before, it can have a good go at it. Sweet. All right, so we've got lots of great foundational generative AI knowledge now under our belts.
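As a concrete illustration of that zero-shot idea, here is roughly how it looks with OpenAI's open-source clip package (assuming it and PyTorch are installed; "photo.jpg" and the candidate labels are placeholders): the image and each candidate caption are pushed through their respective encoders into the shared latent space, and cosine similarity picks the best match, even for labels the model was never explicitly trained to classify.

```python
# Zero-shot image classification with CLIP: labels are just free-form text.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path
labels = ["a photo of a skateboard", "a photo of a dog", "a photo of a boat"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then cosine similarity in the shared latent space.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2%}")
```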
All right, we've got lots of great foundational generative AI knowledge under our belts now. A really cool topic that we alluded to earlier in the episode is world models, and you've got a chapter in your book dedicated to it. What are world models, and how can a model learn inside its own dream environment? Yeah, I love this topic. It's so fascinating to me, and it's actually the reason I started writing the book in the first place: a 2018 paper by David Ha and Jürgen Schmidhuber, simply called "World Models". It's effectively a collision between two of my favourite fields, generative AI and reinforcement learning. In the paper they describe how you can build an agent (an agent, in reinforcement learning, is something that takes actions within an environment), and the agent has within it the variational autoencoder that we've just talked about. What that's doing is trying to collapse down what it's seeing (in the example in the paper it was a car racing around a track) into a latent space, which it can then predict chronologically. So it's trying to model how its future looks, given its latent understanding of what it's seeing and the action it has just taken. And this is where everything collides for me, because you've got the VAE, the variational autoencoder, creating the latent space of the environment and understanding what it's seeing; you've then got an autoregressive model (they used an RNN, a recurrent neural network, in the paper) which tries to predict, autoregressively, how that latent space will evolve over time given its actions; and then you've got reinforcement learning, an entirely different field, which says: how do you take actions that maximise the reward, given that the environment you're in is your own hallucination of how this latent space evolves? And the latent space, of course, includes how the reward evolves over time and what kind of episode reward you're going to get. So I love this field, because a world model, for me, encapsulates everything about machine learning that we've learned so far: there's discriminative stuff involved, but also a generative component and a reinforcement learning component. I think it's a really powerful concept for teaching agents to behave in an environment with their own sort of generative understanding of how that world operates. It feels very close to how we do it as humans. When we're learning a new topic, it's not really that we expect the environment to give us a nicely packaged reward function; we seem to have an inherent understanding of how the world operates, and we layer our actions on top of that understanding. So if I'm shooting a basketball through a hoop, I kind of know what's going to happen, because I can imagine what the action is going to do to my latent interpretation of what I'm seeing. And so it makes me learn, well, I'm still terrible at it, but in theory it should make me learn a lot faster, because I have an internal representation; I'm not just operating on the pixel space of my eyes. So yeah, world models are the reason I wrote the book, really, so I've got a lot of love for them. Super cool. All right, so world models blend variational autoencoders, autoregression and deep reinforcement learning to allow machines to visualise, to imagine, to dream some time steps into the future, as to what the most likely outcomes are given a current state, and this allows them, with the deep reinforcement learning component, to then take actions that achieve some objective.
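Here is a minimal sketch of the rollout just recapped, purely for illustration: `vae`, `dynamics_rnn` and `controller` are hypothetical objects standing in for the paper's VAE, recurrent dynamics model and controller, and the method names are assumptions rather than any real library's API.

```python
def dream_rollout(initial_obs, vae, dynamics_rnn, controller, horizon=50):
    """Roll an episode forward entirely inside the agent's own latent 'dream':
    encode one real observation, then let the dynamics model imagine how the
    latent state (and reward) evolves under the agent's own actions."""
    z = vae.encode(initial_obs)          # VAE collapses pixels down to a latent vector
    h = dynamics_rnn.initial_state()     # hidden state of the autoregressive model
    total_reward = 0.0
    for _ in range(horizon):
        action = controller.act(z, h)                    # policy over latent + hidden state
        z, reward, h = dynamics_rnn.step(z, action, h)   # predict the next latent and reward
        total_reward += reward                           # no real environment is touched here
    return total_reward
```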
Just to break down a few of the terms you used there from reinforcement learning: you talked about a reward function, and you also talked about agents. Reinforcement learning has been around for decades, and it's a class of machine learning problem where you want an agent, which could be a person or a machine, to take a series of actions. A really big example of deep reinforcement learning in recent years is the AlphaGo algorithm by Google DeepMind, which was able to beat the world's best Go players. So it's this kind of thing where you have a board game with a sequence of actions, and you want the agent to predict which actions are likely to lead to winning the game of Go, or winning a video game; Atari video games were a very popular choice a few years ago for training these deep reinforcement learning algorithms. And I should say that a reinforcement learning algorithm is a deep reinforcement learning algorithm when we use deep learning to solve the reinforcement learning problem. Exactly. And I think that ties together all the terms; oh, reward was the last one. In reinforcement learning, let's say we have the agent playing a video game: we provide it with the pixels on the screen, and that's the state of play, but in addition to that we have a reward function. In video games that's often really easy, which is why Atari games were such a popular choice for tackling deep reinforcement learning problems: they have an in-built score, a point score that we're trying to maximise. We feed that reward to the algorithm and it learns: okay, if I take this action, if I press right on the joystick or left on the joystick, is that likely to increase my reward in the future, or decrease it, or keep it flat? So reinforcement learning algorithms are trying to maximise their reward. And the point is that with most reinforcement learning approaches, in fact, as far as I was aware until this conversation, all reinforcement learning approaches, we had to make that reward function the algorithm is trying to maximise explicit. So if we go outside of the video game scenario, if we're, say, teaching an algorithm to drive a car, we'd have to manufacture some function like: you get one extra point for every metre travelled towards the destination, but you lose a thousand points if you hit a pedestrian. And so what you were just saying is fascinating to me, because I think you said that with these world models we can have a deep reinforcement learning model learning real-world problems without needing to specify explicitly what that reward function is. Yeah, it's a case of the world model itself not needing the reward function. The world model is simply trying to understand how its actions can be used to effectively model and predict how the environment will move in future. The power of it is that you can then layer a particular task on top of that, and of course that task would have to have a reward function, but this is a lot faster than learning a reinforcement learning task from scratch with a reward function. It's almost like the world model gets you 80% of the way there, because you have an inherent understanding of the physics of your environment before you say to it: now try and drive the car fast.
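For contrast, here is a minimal sketch of the kind of hand-crafted reward function John described a moment ago, the thing you only need to bolt on at the very end once the world model already understands its environment. The point values, and the extra collision term, are illustrative assumptions rather than anything from the paper.

```python
def driving_reward(metres_progressed, hit_pedestrian, collided):
    """A hand-crafted, explicit reward of the kind described above: a small
    bonus for progress towards the destination, a huge penalty for hitting anyone."""
    reward = 1.0 * metres_progressed      # +1 per metre travelled towards the goal
    if hit_pedestrian:
        reward -= 1000.0                  # catastrophic outcomes dominate the signal
    if collided:
        reward -= 100.0                   # assumption: a lesser penalty for other crashes
    return reward

# print(driving_reward(metres_progressed=12.5, hit_pedestrian=False, collided=False))  # 12.5
```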
And so in the paper, for example, what they do is train the world model completely task-independently, so there's no reward. They just say: take some actions and observe what happens. Drive the car forward, drive the car left, drive the car right, brake, and just see what your observations do. Don't worry about going fast; just drive randomly and see what happens, which feels like what a baby does when it's crawling around on the floor. My eight-month-old is doing this, hopefully more and more every day, until the point where we want her to definitely not do this. You've been raising a newborn baby the entire time you've been writing this book. Yeah, it's a bit mad. It's all a bit of a blur, to be honest with you; I think sleep is what's been sacrificed. But I'm delighted to have a new daughter, and the book is actually dedicated to her. Ah, that's lovely. The loveliest vector of them all, Alina. That's the one, yeah, exactly. She'll be embarrassed by that in about 16 years' time, I think, but hopefully by then the hype will have died down. Cool. All right, so, a fascinating area. Now, the final topic I want to get into, at least related to your book, is GPT. To some extent I grappled with this: should we even just have started the episode with the GPT stuff? But I think that by going through these kinds of foundational concepts first, we can get more into the weeds on GPT and how it relates to generative AI, to generative deep learning, than we could have if we'd just started with it. So, GPT: generative pre-trained transformers. They have become by far the most widely known transformer models; in fact, I recently learned that OpenAI is trying to trademark those three letters, GPT, generative pre-trained transformer. The generative part is obviously like everything we've been talking about in this episode so far: it generates something, in this case text. At least for now that's all it does, and I'm sure that'll change soon. And pre-trained, meaning that it can do the kind of zero-shot learning we described: it can perform lots of kinds of tasks because it's trained on so much data and has such rich encodings of meaning that we can ask it to do something it's never encountered before, something nobody has ever thought to ask a machine or a person to do before, and it can do it, at least in the case of GPT-4, magnificently. So that's the G, generative, and the P, pre-trained, and then the T, transformer. So, David, what is a transformer?
Yeah, so transformers came into the world in 2017. It seems like a lifetime ago, but if you think about it, it's only five or six years. And what they are based around is this concept called attention. To understand transformers you first have to understand what attention is, because the whole transformer architecture, at its heart, the large majority of it, is just how these attention mechanisms work and how you build them up together into what's called multi-head attention. So let's talk about attention first. Attention is a different way of modelling sequential data that is the complete opposite of the way recurrent neural networks do it. A recurrent neural network says: I'm going to take each token one at a time, in sequence, and update my latent understanding of what this sentence or stream of tokens means so far; then I'll get to the end of the sequence and use that latent understanding to predict the next token, because I've built up enough understanding within that vector to do so. Attention takes a different approach. It says: what you need to do instead is care about all of the previous tokens within your context window at once. Don't try to maintain a hidden state, because there are a ton of problems associated with that. Instead, look at those previous tokens and first of all make a decision about where you think the information you need lies. So instead of trying to incorporate all information from all tokens, the first step is simply to say: where do you want to look? Part of this model is about it building up an understanding of where it needs to look for information. An example would be: "The elephant tried to get into the car, but it was too..." Okay, big, right? The missing word is something to do with its size. Now, what are we using to work that out? The word "elephant" is clearly important; "car" is important, because we need to understand what it's trying to get into. But say it was "the pink elephant": the colour pink is just irrelevant to this whole scenario. Having said that, if we change the context slightly and say "the pink elephant was trying to hide", then suddenly the colour becomes all-important; a pink elephant is probably harder to hide than an elephant of a different, darker colour. So the attention mechanism says: first, come up with a way of combining what you're trying to do, which is known as the query, with all previous context tokens, which are known as the keys. A little bit like CLIP, which we just talked about, it's constantly comparing the keys with the query, and then pulling through a certain amount of information from each token, which is called the value, and combining it in a clever way, through multiplication by weights, into the next latent understanding, which is passed to the next layer, and so on. You build up enough of these layers and you get such depth of understanding of the entire context of the sentence that you can mimic intelligence, it turns out, and that's what GPT-4 does. So that's basically how attention works, and with transformers you really don't need to know much more than that. There are a few extra layer types, like layer normalisation and positional encoding, which is what tells the model where in the sentence a particular word sits, but ultimately what you've got to know is that it's all about attention.
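A minimal numpy sketch of the single-head, scaled dot-product attention just described: queries decide where to look, keys are what they are compared against, and values carry the information actually pulled through. The shapes and random inputs are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """One attention head: score every context position against the query,
    softmax the scores, then pull through a weighted blend of the values."""
    d_k = keys.shape[-1]
    scores = query @ keys.T / np.sqrt(d_k)                 # how relevant is each token?
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)     # softmax over the context
    return weights @ values                                # information pulled through

# Toy shapes: a context of 5 tokens, each embedded in 16 dimensions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))   # queries: what each position is looking for
K = rng.normal(size=(5, 16))   # keys: what each position offers
V = rng.normal(size=(5, 16))   # values: the information to blend
out = scaled_dot_product_attention(Q, K, V)   # shape (5, 16)
```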
"Attention is all you need," you might say. Precisely, yeah. That punchy title is still one of the biggest memes in all of deep learning. Which is cool. So there are different kinds of architectures that rely on transformers in different ways, and GPT relies heavily on the decoder part of the transformer. Earlier in today's episode we talked about encoding and decoding in the latent space: encoding takes, say, text, so tokens of natural language, or pixels of images, and encodes it into the latent space, and then the decoder part of an autoencoder decodes that lower-dimensional representation into some desired output, which could again be text, or an image, or video, or code, whatever. Similarly, transformers can encode and decode, but in some architectures, like GPT, we rely more on the decoder part, whereas other architectures like BERT, which came out a few years earlier but is still enormously useful for a lot of applications, only have an encoder in their transformer. So why would somebody want to encode only? What are the key differences between these encoder-based transformers and GPT? Right, this is the biggest misunderstanding I come across when people talk about transformers: this differentiation between encoders and decoders and everything in between. There are some architectures that have both. The very first transformer, and I think this is where the confusion comes from, was an encoder-decoder architecture, which means it had both, and so people now think that all transformers are still based around that initial architecture. They're not: as you rightly pointed out, GPT is decoder-only; they dropped the encoder. So what's the difference? Well, there is basically one difference you need to know about, and that is something called masking. An encoder like BERT doesn't care where it pulls information from in the sentence to build a contextual understanding of a particular word: it can look forwards in the sentence and it can look backwards. So let's say I wanted to come up with an embedding for the token "elephant" in that previous example: it can look into the future of the sentence and pull information from future context to come up with a realistic embedding for that word. A decoder can't do that, because if you want your model to produce, to generate, to go into the future, you can't rely on future information; it doesn't exist yet. So the only difference is that a decoder says: mask future information at every step of the process; don't ever pull information from the future; only use where you're currently at to determine the next token. And that is why you can use a decoder model like GPT for generation, but you can't use an encoder model like BERT. BERT is for natural language understanding, not natural language generation. That's the difference. Cool, yeah, so NLU and NLG. BERT, an encoder-only architecture, we use for natural language understanding: we can take natural language, encode it into this space, and then do useful things with it. We could train a discriminative model on top of that encoder to do interesting things, for example classifying text, you know, does this have a positive sentiment or a negative sentiment, that kind of thing, whereas these decoder-only architectures like GPT specialise in sequence generation.
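Extending the sketch above, the single practical difference just described between a BERT-style encoder and a GPT-style decoder is a causal mask on the attention scores. Again, this is a toy numpy illustration, not any particular library's implementation.

```python
import numpy as np

def attention(query, keys, values, causal=False):
    """causal=False: BERT-style, free to look both forwards and backwards.
    causal=True: GPT-style, future positions are masked out so the model
    can only use what has already been generated."""
    d_k = keys.shape[-1]
    scores = query @ keys.T / np.sqrt(d_k)
    if causal:
        # Upper triangle = future tokens; set them to -inf so the softmax
        # assigns them zero weight.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ values

# causal=False -> natural language understanding (encoder)
# causal=True  -> natural language generation (decoder)
```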
Yeah, and the thing you should use to decide which one you need is this: if you want to build something on top of it, like a discriminative model, as you say, then you should be looking at encoder architectures; if you want to produce a word, like the next word in a sentence, look at a decoder. Now, there are some examples, like GPT-4, where you can actually do pretty good discriminative work using a decoder model, because you can just get it to output the predicted label as a token. So decoders are kind of ruling and dominating at the moment, because they're just incredibly powerful generalist learners. Yeah, but you might be able to do it more efficiently: if you want to encode language to do a classification task, you could probably be more computationally efficient using an encoder-only architecture. Definitely, and there are small versions of these things, like DistilBERT, which you can fit on smaller hardware. So our first port of call when we're approaching this kind of work is always to go for the encoder models first and see how they do, because you're in dangerous territory with decoders: you don't actually know what they're going to produce next. Whereas with an encoder you've got the vector, so you can do what you want with it. Nice.
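As a sketch of that "encoder first" workflow: embed each document with a BERT-style encoder and fit an ordinary classifier on the vectors. The `encode` function below is a hypothetical stand-in for whichever encoder (DistilBERT, for instance) you would actually load; only scikit-learn's LogisticRegression is a real API here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_sentiment_classifier(texts, labels, encode):
    """Discriminative model layered on top of an encoder: embed each review
    into a fixed-length vector, then fit a simple classifier on those vectors."""
    X = np.vstack([encode(t) for t in texts])   # one embedding per review
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return clf

# clf = train_sentiment_classifier(reviews, sentiments, encode)
# clf.predict(encode("The plot was thin but the acting was superb.").reshape(1, -1))
```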
And so we talked earlier about music, and how that's one of the more exciting areas for you. We do have some isolated cases of well-known music generation by AI: there was a song, which candidly I haven't listened to, it's not really my genre, featuring Drake and The Weeknd, two of Canada's best-known musical artists. It's actually wild to me, as a Canadian, how in comedy, in acting and in music, Drake was the most dominant person globally in music for years, and he's Canadian, and then he was replaced by The Weeknd, who's also Canadian. Anyway, somebody took it upon themselves to create an AI-generated track where Drake and The Weeknd appear together, and if my memory serves me, they sing about being in love with Selena Gomez. Yeah, I'm in the same boat; I haven't listened to it, it's not my genre either, but I think that's correct, and obviously I've heard the story. And briefly, on the Canadian thing: there's a guy called David Foster who's a really famous musician, so every time you Google me you just come up with the Canadian David Foster, which I quite like, to be honest. There you go, you can hide behind him: all the musicians and all the David Fosters you need. So yeah, how can we use transformers for music generation? I think they can play a key role in doing it well. Right, yeah, definitely. The first port of call for anybody doing any sort of generative task these days is transformers, and music is no exception. In my book we cover this: we go through the process of single-track music, where you're looking to generate a single stream of notes. That in itself has problems, because you have to care not only about pitch but also about duration. Unlike text tokens, where you're just dealing with a single integer and words come in discrete units, one at a time, with no notion of duration, in music you've got to care about not only where the note sits harmonically but also how long it lasts. So there's a modelling choice to be made about how you do that. There are a few ways: you can code up the duration as its own token, or you can model both streams in parallel, almost like a dual stream of tokens. But ultimately you use the same ideas that you do in text modelling: you've still got attention looking back at previous notes and deciding what note comes next, and it makes sense harmonically, so if you're in the key of D, the notes that follow should also be in the key of D. It's the same idea: there's a grammar to music, just like language. But then we also talk about polyphonic music, meaning music that has more than one track at once, and there you've got a ton of challenges. What do you do about parts that drop out for a few bars? How do you model it if two of the parts continue and two of the parts drop out? It's no longer one stream of tokens; you've got maybe a four-stream token if you've got a quartet of musicians. So there are different ways of approaching it. One of the first attempts was something called MuseGAN, and this was back in the day when GANs were all the rage; it looked at how you can model polyphonic music as a picture. Imagine something called a piano roll, which is basically where you draw the notes out. Picture one of those music boxes with a punch card, where you can almost see the music being fed in as a picture, and then you spin the little crank and it makes a ballerina dance on top; the thing generating the music is effectively a picture of the music being fed through, so you can model it that way. But obviously transformers are now making waves in music generation too, even polyphonic music. So there are lots of different options, but there's always a modelling choice you need to make up front about how you approach it.
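To illustrate the modelling choice described for single-track music, here is one toy way of tokenising a short melody in D major, keeping pitch and duration as separate token types. The exact vocabulary (NOTE_x, DUR_y) is an assumption for illustration only, not the book's own scheme.

```python
# A melody as a stream of (pitch, duration) pairs. Pitches are MIDI note
# numbers; durations are counted in sixteenth-note steps.
melody = [
    (62, 4),  # D4,  quarter note
    (66, 4),  # F#4, quarter note
    (69, 8),  # A4,  half note
    (67, 2),  # G4,  eighth note
    (66, 2),  # F#4, eighth note
    (62, 8),  # D4,  half note
]

# Interleaved single-stream alternative, with duration given its own token type:
tokens = []
for pitch, duration in melody:
    tokens.append(f"NOTE_{pitch}")
    tokens.append(f"DUR_{duration}")
# ['NOTE_62', 'DUR_4', 'NOTE_66', 'DUR_4', ...]: an autoregressive transformer
# can now model this exactly like a sentence of text tokens.
print(tokens)
```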
Super cool. I'd love to hear about all this music stuff; I'm really excited about it. Something that, and I think this is the first time I'm saying it publicly, I'm really excited about doing is generating music where I'll be involved. There's a guitar behind me, as people who have watched the video version of the show will have seen, and I actually can play guitar and sing. There was an episode, the year-end episode a couple of years ago, episode number 536, where I ended it with a song I played. I can't play guitar very well, but I'm competent at rhythm guitar to accompany my voice, and I have this idea for attracting really big-name guests. I'd love to have Geoff Hinton on the show; we've had emails back and forth with him, but he's always too busy. Something that I'm hoping would get his attention, or if not his attention then at least lots of other people's attention, is performing a song about Geoff Hinton. I haven't yet experimented much with the generative AI tools for music, but I have this idea that they could enrich the songwriting process, because I could have drums and bass in the background that I generated. So yeah, anyway, cool idea. You've got to find something that rhymes with Hinton, though; I think you'll be struggling. GPT-4 will help me out, I'm sure. Oh, that's true, yeah, good shout. And then the very last technical topic for you is GANs, generative adversarial networks. We talked about them really early on in the episode: a few years ago they were the way to generate things, and while I haven't read your first edition, I suspect it was really heavy on GANs. Yeah, definitely. I sort of see GANs as, in many ways, the trailblazer, because a lot of the techniques, and the way in which we approached generative AI, were founded through the GAN movement. There was like a GAN a week at one point, and it was kind of a running joke: which GAN are we going to see this week, doing some niche thing? And I hear people now saying GANs are dead, asking why I've included them in the second edition when it should all be about diffusion models and transformers. But first of all, the GAN discriminator is still used in so many really powerful models. The concept of a discriminator constantly operating over the top of whatever you're using as the generator, to distinguish real from fake, and using that in the loss function, is still very much alive today. Take a model like VQ-GAN, the vector-quantised GAN: it and its variations are still among the most powerful image generation models out there. It's not the case that diffusion models are ruling the world just yet, and StyleGAN-XL, for example, is still incredibly powerful and dominates a lot of the leaderboards. So look, I never like to chase the latest thing and say this is it, all innovation has stopped. What I hope people can take from the book, and into their own learning, is that it's good to have a general understanding of what has come before, because you never know what's coming next and what might come back into fashion. So GANs are a super interesting idea that I think is going to be around for a long while. And in addition, you have a bit in your book about combining GANs with transformers. Yes. Basically, what I would say to anyone looking to get into generative AI is: look for the crossover between these fields. Whilst I bucket them up in the book, now we're doing the GAN chapter, now we're doing the transformer chapter, in general a lot of the powerful models out there actually have components of all of them. Like you mentioned, there is a type of model in the book that effectively has a transformer within it to do part of the encoding of a piece of text, but there's a GAN discriminator in there as well. And when you look at a lot of these multimodal models, they've got diffusion in there, they've got GANs in there, they've got transformers in there. They're using the right tool for the job; they're not saying, I'm going to use one model type and that's all I'm going to use, because a transformer is brilliant at modelling sequences, a GAN is brilliant at telling fake from real, and diffusion models are fantastic at working with very rich latent spaces that you can sample from. The best models out there use all of these techniques, and I think they will do in the future as well.
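As a rough sketch of the "discriminator in the loss function" idea mentioned above, the VQ-GAN-style blend, here is how an adversarial term might be added to an ordinary reconstruction loss. `discriminator` and `reconstruction_loss` are hypothetical callables and the weighting is illustrative; real implementations add further terms (perceptual and codebook losses, for example).

```python
import numpy as np

def generator_loss(real_images, generated_images, discriminator,
                   reconstruction_loss, adversarial_weight=0.1):
    """Blend a standard reconstruction term with a GAN-style term that
    rewards the generator for producing images the discriminator calls real."""
    recon = reconstruction_loss(real_images, generated_images)
    # discriminator(x) is assumed to return a probability of "real" per image.
    fooled = discriminator(generated_images)
    adversarial = -np.mean(np.log(fooled + 1e-8))   # low when the fakes look real
    return recon + adversarial_weight * adversarial
```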
Nice, really exciting. And briefly, because I imagine this could go on for a long time, what do you see as the future of generative AI? Yeah, that's a huge question. Maybe I'll break it up into the technological and the societal. Technologically, I think we'll continue to see the field accelerate, and I don't see any need for, nor, I guess, any feasible way of applying, a pause; I just don't think it's feasible to run something like that. So the field will continue to evolve, but I think we'll see more emphasis on the alignment side and less on the raw power. I think GPT-4 is plenty powerful for us for the time being, and I don't think GPT-5 will be a huge technological improvement over 4, but I think we'll see improvement in alignment, in flexibility for customers, and in the stuff that goes around productionising a model like this: user management, ChatGPT for business, which I know is coming out, all of the things that make it a viable product in the real world that we have control over. So that's one side. And then, societally, I think we're going to see wide-scale adoption of these tools, and like all good technology they'll be baked in to the point where you don't really know you're using them. I don't think people will be going into ChatGPT to type in their prompts for long; it will be baked into other tools in the market, and we're already seeing this with ChatGPT integrations into different applications, or wrappers around it. So I think the future is bright. I'm really optimistic, I'm excited by it, and I hope everyone else is too, because it's just the best thing ever to happen to the machine learning field, I think. Yeah, as a regular listener you'll know that I'm a techno-optimist, and certainly there are issues that we need to sort out with any new technology, but really, the last few months have been mind-blowing for me. With GPT-4, still, every day I do something new with it where I think, I can't believe how well you do this. I don't know how you feel, but I'm amazed at the number of people who haven't heard of it yet. I know I live in my little bubble of generative AI and data science, and yet you talk to a lot of people who just go, "Oh yeah, I think that was that thing I saw in a BBC News article," and they haven't even tried it. I feel like I'm in a really privileged position, having access to this incredible technology before the rest of the world gets to see it. It's amazing. Yeah, I was just at the Open Data Science Conference East in Boston last week at the time of recording, so about a month ago by the time this episode is published, and I gave a brand-new half-day training on NLP with LLMs. I focused a lot on GPT-4: how you can use it to automate parts of your machine learning model development, including things like labelling, but also how you should be using it all the time in your life. It's insane not to be paying the twenty dollars a month; I save so much time.
I was able to have so many more coding examples in that half-day training, because whenever I ran into an error I'd just say, tell me why I'm getting this error and fix the code, and it does it perfectly. But I had this really surprising conversation with somebody after I gave that training. He came up to me at a drinks session and said, "What would it take for me to train something like GPT-4, but that works in Arabic?" And I said, it does that; you're looking for something that translates into Arabic? He said yes, and I said, yeah, it does that out of the box. Then he reached out to me later and said, "Sorry, I didn't ask that question right. I get that it can translate into Arabic, but what if I want to train an Arabic version of GPT-4 that can do everything?" And I was like, it already is that; why are you messaging me? Just try it. Everything you want to do in English, you can just ask it in Arabic and it'll output Arabic, no problem. Yeah, it's amazing, isn't it? And like you say, the barrier to entry is so low: just set up an account and you're away. It's even free; just give it a try, even if you're running with 3.5 or whatever. I feel like we're watching everyone walking around with candles while I'm holding a light bulb, going, this seems really useful. Yeah. So, really quickly: beyond raising your newborn daughter and writing this book, you're also the founder of a consultancy called Applied Data Science Partners, which you run. Really briefly, tell us what the consultancy does, and I understand that you're hiring, so let us know. Listeners out there have probably been blown away by your impressive depth of knowledge and your clear ability to explain things, and no doubt you have a thriving consulting practice, so there are probably people who would love to work with you. Let us know what roles are open and what you look for in your hires. Sure, yeah. So our consultancy, Applied Data Science Partners: myself and my amazing co-founder, Ross Witeszczak, started it six years ago with the vision of delivering AI and data science in a way that's practical and sustainable for businesses, because we found that at the time a lot of the practices were still very throwaway and proof-of-concept-like. So we set up the consultancy to base data science and AI practices around best-practice software engineering: containerisation, continuous integration, and all the things you expect from software engineering, built around data science. In terms of our client base, we work with large private institutions all around the world, but also the public sector, so we have a broad range of work; it's something different every month, which I think makes it a really interesting place to work. And you're right to say we're hiring: we're always actively hiring and looking for the best people. There are a few different roles we hire for. Our bread and butter is data scientists, everything from people who are just finishing their degree; we look for people who are hungry to learn, hungry to get stuck in, and who don't shy away from difficult problems, because we solve difficult problems every day for our clients.
That's the spectrum of people we look for, right up to leads, people who can lead projects and conceptually understand what a client wants. We've got data engineers as well; that's a different track of our business, and they work closely with our data scientists to deliver solutions in a best-practice software engineering way. And then our analysts: we hire people who don't necessarily have a background in what's traditionally called data science but who are very, very good at explaining concepts to senior stakeholders, so we've got analysts internally as well. We also hire software engineers, web developers, people who can build applications. As I say, because our consultancy is growing so rapidly, we're hiring for all of these roles, so if you have any of those particular talents we definitely want to hear from you. Tell us why you think you'd be a great fit for the company, because what we look for above everything is, first, people who are hungry to learn. Secondly, attention to detail is absolutely paramount for our business: we like people who can dive deep into a problem and not get scared by the weeds. Not everything is rosy in business consultancy: you get messy data, you get stuff that doesn't work, you have to fix problems quickly, so we're looking for people who care about the detail. And thirdly, just be a nice person. It's really easy: be friendly, be optimistic, be positive, and you'll find that at ADSP you meet like-minded people with the same attitude. Nice, sounds awesome. So you're looking for people who are hungry to learn, don't shy away from difficult problems, have great attention to detail, and are nice, and there are lots of great data roles: data analysts, data scientists, data engineers. And what kind of stack do you use, Python, I guess? Yeah, Python for pretty much everything that we build. To take you through some tools and technologies: VS Code is our IDE of choice. In terms of cloud we're fairly agnostic, we work with what the client wants, but our recommendation would always be Azure; we work pretty heavily in that stack, so we're particularly looking for engineers who have that on their CV. In terms of machine learning models, we like to say we use the tool that's right for the job, so we're not always going to go down the neural network route; actually, XGBoost does the job most of the time, or any of the variants like LightGBM. But for some projects, especially now that we're getting a lot of work on generative AI, particularly gen AI strategy, we're obviously using a lot more deep learning than we have done, particularly for fine-tuning, especially fine-tuning open-source models. So tech-stack-wise, we don't use anything out of the ordinary, and we're very much aligned with what the clients want; we're tech-agnostic in terms of platform, so we use Tableau and Power BI, for example, and if the client wants Tableau we're not going to insist on Power BI, or vice versa. That makes sense. Awesome. David, this has been a sensational episode. I have learned so much, and it's been so nice to get deep into the weeds with you and hear so much about your Generative Deep Learning book, such a fantastic book.
I couldn't recommend it strongly enough. And oh, I can't believe I didn't mention this at the beginning: listeners who have stuck with us this far get a bonus treat, something we've done with O'Reilly authors on the show before. When I post about this episode on LinkedIn from my personal account, and we've had some people commenting in YouTube comments or on posts from the Super Data Science account, but no, it's tied to my personal account on LinkedIn, because for this to work fairly it has to be just one post, and that's the post that gets the most engagement each week when I announce these episodes. So when I announce this episode, which will be in the morning, New York time, usually around 8 a.m. Eastern, the first five people who comment will get a free digital version of David's book. So Generative Deep Learning could soon be yours for free. And, something I don't know if I've mentioned enough recently: if you happen not to win that contest, that race, you can still access the book with a 30-day free trial of the O'Reilly platform using my special code SDSPOD23, that's SDSPOD23. Either way you can access the book; with the 30-day free trial you just don't have it forever. Nice. All right, and beyond your own book, David, do you have a recommendation for us? Yeah, actually, something that has really caught my eye recently is a concept called active inference, originally laid down by Karl Friston, one of my absolute heroes in generative modelling. He wrote the foreword for my book, which I'm absolutely privileged and honoured to be able to say. Active inference, very briefly, is a way of describing how agents learn that dresses action and perception up as two sides of the same coin. It's a very elegant idea, and at the heart of it is a generative model. I'll leave that as a dangling carrot for anyone who's interested, because he, along with his associates, has written a book called Active Inference, with the subtitle The Free Energy Principle in Mind, Brain, and Behavior. It's one of my absolute favourites: a very complex topic explained extremely eloquently, and it's a very recent book as well, only published last year. It's basically the book you need on active inference if you're going to start learning about this fascinating concept, and it's something I'd recommend you read once you start getting into generative modelling, because it's a really interesting kind of theory of everything for intelligence and the mind. It puts the action into perception, if you like. Wow, very exciting. And for people who want to hear more of your brilliant thoughts, David, what are the best ways to follow you after this episode? There are a few ways. You can follow me on LinkedIn, which is probably the best way, and you can find me on Twitter, where I'm davidADSP. And by all means follow our company as well; we post loads of interesting stuff about data and AI, so if you're interested in general updates, feel free to follow Applied Data Science Partners on LinkedIn. Nice. And you also have a podcast coming out soon, don't you?
Yeah, that's right, we're launching into the podcast space. We can't pretend we're going to be anywhere near the quality of your podcast initially, but we're going to be learning as we go. The podcast is called The AI Canvas, and it focuses primarily on generative AI and its application to people. So if you want to know how generative AI is going to impact loads of different professions, law, teaching, art, music, the creative arts, the performing arts, we've got interviews lined up with people from a ton of different professions about their fears and also their great hopes for the technology in the future, because I think it's really important we talk with everybody across the spectrum, not just those involved on the technical side but specifically those who are going to be impacted by the technology. We've had a few of these conversations already, and it has blown my mind how eloquently these people are able to talk about the topic, so do follow us: it's at podcast.adsp.ai. Nice. All right, David, thank you so much for taking the time today, a brilliant episode, and I look forward to catching up with you again in the future, hopefully on air, so that we can sample your wisdom, which has very little noise in it. Oh, thank you, John, it's been an absolute pleasure talking to you about such fun stuff, so thanks, and thank you again.

Boom! What a gripping and educational conversation in today's episode. David filled us in on how discriminative modelling predicts some specific label from data, while generative modelling does the inverse: it predicts data from a label. He talked about how generative modelling can output text, voices, music, images, video, software, and combinations of all of the above; how autoencoders encode information into a low-dimensional latent space and then decode it back to its full dimensionality; and how variational autoencoders constrain distributions to produce better outputs than the vanilla variety. He talked about how diffusion converts noise into a desired output, while latent diffusion, which operates on dense latent representations, is particularly effective for producing stunning photorealism, such as in Midjourney version 5. He talked about how world models, that super cool concept, blend variational autoencoders together with autoregression and deep reinforcement learning to enable agents to anticipate how their actions will impact their environment. He talked about how transformers facilitate attention over long sequences, enabling them to be the powerful technique behind both natural language understanding models, like BERT architectures, and natural language generation models, like GPT architectures. Finally, he talked about how GANs such as StyleGAN-XL still produce state-of-the-art generated images, and how GANs show particular effectiveness when combined with transformers in multimodal generative models. All right, as always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for David's social media profiles, as well as my own, at superdatascience.com/687. That's superdatascience.com/687. If you like book recommendations, like the awesome ones we heard about in today's episode, check out the organised, tallied spreadsheet of all the book recs we've had in the nearly 700 episodes of this podcast by making your way to superdatascience.com/books.
All right, thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you, and thanks, of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another profoundly interesting episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors; please consider supporting this free show by checking out our sponsors' links, which you can find in the show notes. Finally, thanks of course to you for listening all the way to the very end of the show. I hope I can continue to make episodes you enjoy for years to come. Well, until next time, my friend, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science podcast with you very soon.