This is episode number 687 with David Foster, author of the book,
Generative Deep Learning. Today's episode is brought to you by
Posit, the open source data science company, by Anaconda, the world's most popular Python
distribution, and by withfeeling.ai, the company bringing humanity into AI.
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science
industry. Each week, we bring you inspiring people and ideas to help you build a successful
career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now,
let's make the complex simple.
Welcome back to the Super Data Science Podcast. Today, I'm joined by the
brilliant and eloquent author, David Foster. David wrote the O'Reilly book called
Generative Deep Learning. The first edition from back in 2019 was a best-seller, while the
immaculate second edition, which was released just last week, is poised to be an even bigger hit.
He's also a founding partner of Applied Data Science Partners, a London-based consultancy
specialized in end-to-end data science solutions. He holds a Master's in Mathematics
from the University of Cambridge, and a Master's in Management Science and Operational Research
from the University of Warwick, both in the UK. Today's episode is deep in the weeds on
generative deep learning, pretty much from beginning to end, and so will appeal most
to technical practitioners like data scientists and machine learning engineers.
In the episode, David details how generative modeling is different from the discriminative
modeling that dominated machine learning until just the past few months. He talks about the
range of application areas of generative AI, how autoencoders work, and why variational autoencoders
are particularly effective for generating content. He talks about what diffusion models are,
and how latent diffusion in particular results in photorealistic images and video. He tells us
what contrastive learning is, why world models might be the most transformative concept in AI today,
and lots on transformers: what transformers are, how variants of them power different classes
of generative models, such as BERT architectures and GPT architectures, and how blending generative
adversarial networks with transformers supercharges multimodal models. All right, you ready for
this profoundly interesting episode? Let's go. David, welcome to the Super Data Science podcast.
It's great to have you here. I understand that you're actually a listener of the show.
Yeah, a massive long-time listener. Thanks for having me on, John. I really appreciate it,
and I'm really looking forward to getting into an in-depth conversation with you about generative AI.
Pleased to be here. I'm glad that you've reached out to us about having an episode because
you have this amazing book that just came out. It's really exceptional. I wish somehow I could
have written this book. It's so timely and so comprehensive around generative AI models,
which are obviously the hottest thing right now in the world. There's nothing that people are
talking about more than generative AI. Whether or not they call it that, when people are talking about
platforms like ChatGPT or Midjourney, they are talking about generative AI. And so I was
delighted that you reached out as a listener to be on the show. You're like a celebrity listener
out there. Thanks, David. Where are you calling in from today? I'm based in London. This is our
office here in London, in Old Street. Yeah, it's actually sunny here in the UK, which is a first.
Finally, the sun's dawning on us. So yeah, it's nice to be talking with you.
All right. Well, let's rock and roll and get right into the content that we have planned for you.
There's so much to cover today because I know I'm going to learn a ton filming this episode,
and no doubt our listeners are going to learn a lot about generative AI as well. So you just released
the second edition of your popular book. It's called Generative Deep Learning: Teaching Machines
to Paint, Write, Compose, and Play. And so the first edition came out in 2019. It did very well,
and I know that this one is going to be huge. Can you explain the differences between generative
modeling, which is the focus of your book, and discriminative modeling, which up until recently
was probably the much more common type of machine learning?
Yeah, you're absolutely right. It was, and I think the reason for that is, first of all, it's just
a lot easier than generative AI. If you look back at the history of machine learning, the field has
been driven by discriminative modeling primarily because, first of all, it's really useful in business.
It's really useful in a ton of applications: you've got a labeled data set, and there's a very
clear outcome that you want to drive. You want to drive predictive accuracy against that label.
With generative AI, first of all, the application isn't perhaps as clear, or at least it wasn't when
the field was in its infancy. But also, secondly, it's really difficult to determine how well
you're doing, because it's kind of subjective as to whether a piece of text or a piece of art is good.
There's no such label that you can use to determine that. So in terms of the differences,
like, discriminative modeling is all about being able to predict a specific label that you're
given about an input, and typically you're moving from something that is high-dimensional,
like an image or a block of text, or highly structured data, for example, through to something
that's low-dimensional, like a label, or a continuous variable, maybe a house price or something
like this. Now, generative AI moves in the other direction. It's saying, can we start with the label
and move back to the data? And so it really focuses on whether the model has understood
what patterns are present in this data set, so that it not only can do something, like,
collapse the dimensionality from an image to a label, but it can say: here's a label, dog, cat,
boat, ship; go and find me the data that would produce this label. Produce me an image that looks
like a ship. And why is this difficult? Well, the reason is because when you're moving to this higher
dimensionality space of, say, pixels or word tokens, there's a lot more that can go wrong.
Our eye is very, very good at detecting something in an image that looks off, or something in a paragraph
that just grammatically doesn't make sense. And so we really have to try hard when we're building
models like the ones that we've seen such as GPT or the diffusion models that I'm sure we'll
come on to later to make them good enough to be plausible. And so it's like finding a needle in a
haystack, right, to find that one image of a boat that looks real. We are working in maybe like
a thousand dimensional space. When we're collapsing stuff down in terms of discriminative modeling,
we've got to collapse to maybe just one dimension. And that's a lot easier. So yeah, I would encourage anyone who's kind of getting started with machine learning to start with discriminative modeling, because even though generative AI is the hype, you've got to know the fundamentals, and a lot of the techniques that come up in generative AI are still fundamentally based in good old-fashioned discriminative modeling. They often have within them a slant that means you're predicting something in a higher-dimensional space, but you're still using the same concepts, like loss functions, like modeling a density function, for example. So discriminative modeling will give you that basis. Start there, and you can move pretty swiftly onto generative AI, which is the current hype.
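To make the direction-of-mapping contrast concrete, here is a minimal sketch in Keras, the library the book uses throughout; the image size, layer widths, and two-dimensional latent input are illustrative assumptions rather than anything from the conversation:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Discriminative: high-dimensional input (a 32x32 RGB image) collapses
# down to a low-dimensional output, a distribution over 10 labels.
discriminative_model = keras.Sequential([
    layers.Flatten(input_shape=(32, 32, 3)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),  # p(label | image)
])

# Generative (decoder-style): a low-dimensional input expands back out
# to the high-dimensional pixel space.
generative_model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(2,)),  # 2-D latent point
    layers.Dense(32 * 32 * 3, activation="sigmoid"),
    layers.Reshape((32, 32, 3)),  # after training: pixels that resemble real data
])
```

The same building blocks (dense layers, loss functions) appear in both; only the direction of the mapping differs.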
Yeah, and speaking of swiftly: I mentioned how your first edition came out in 2019, which is just four years ago, and the field has changed dramatically since. So yeah, run down for us how different the content is from your first edition to the second edition that's newly released. It is a totally new book. I've got to be honest with you. I sat
down with the publisher and they said, do you want to write a second edition? And this was about
the time, this was basically this time last year, maybe slightly earlier. So this was before, like, DALL-E 2. It was before anything from Stability. And I kind of sat down and thought, yeah, actually, I think this is about the right time to write a second edition. There's a lot of change, but ultimately, I can move some stuff around. I can move some chapters around. I can refresh the examples, refresh the content. And the moment I signed that contract to say I'd write the second edition, it all went nuts. Like, DALL-E 2 was released, and then suddenly there was just this explosion of large language models and text-to-image models, which is, first of all, incredibly terrifying if you've just agreed to write a second edition. And I realized through the writing process that I needed to completely rewrite the whole book. So there is so little content that is the same. I would almost say there's basically none of it that's the same. It's a new book, effectively. And I'm proud of that, because it means it's current.
It means it's up to date. And I can honestly say, I'm really proud of it. It's something I think takes you from beginner through to understanding the entire landscape of generative AI models. It doesn't just focus on one model type or whatever's currently in vogue. It tries to take you on the journey from, let's just lay down the fundamentals and the foundations, through to, okay, now let's talk about Stability AI and Stable Diffusion, or DALL-E 2, or Midjourney; let's really get to grips with what these models are doing. And obviously GPT and the OpenAI series. So yeah, I'm really proud of it, and I feel privileged to be in the position where I can write this book. I think hopefully lots of people will get a lot out of it, and I'm really excited to see it in the market. I wouldn't be surprised if this edition became like a standard
in the field, based on what's covered in here and how well you've covered it; it's so comprehensive. And the kind of praise that you got on the outside of the book kind of backs me up. You've got François Chollet, the creator of Keras, writing about how great he thinks the book is. You've got the head of strategy at Stability AI, the company behind Stable Diffusion. You've got senior people at Microsoft Azure. You've got people from EleutherAI, which, in recent Five-Minute Friday episodes, I've been talking a lot about the open-source large language models that Eleuther have made available. You've got Aishwarya Srinivasan, who is this extremely famous content creator who works at Google Cloud. I mean, yeah, I'm just kind of backing myself up quantitatively now; I've given so many of these examples. So yeah, I think your book has a lot going for it. You were going to say something; I'll hand over to you. No, yeah, I feel really privileged that these people have taken the time
to leave a review and to actually read the book and say that they think it's a useful
addition to the library. I think when I look back, I'm standing on the shoulders of these
giants, really. I mean, I'm just reporting on their incredible work. So I wouldn't be able to write this book without what they've done, particularly someone like François Chollet, who's basically created the library that I'm using throughout the book to build practical examples of generative models. So yeah, really privileged. And then you used open-source LLMs from Eleuther to just write all the content. I wish! Yeah, I missed the ChatGPT stuff; I missed it by a year. Like, if I was starting to write the book now, maybe it'd be pretty easy. Yeah, well, I joke.
I think, you know, there was actually a really interesting discussion on the Last Week in AI podcast, which is hosted by Jeremie Harris and Andrey (I can't remember his last name); Jeremie's been a guest on the show a number of times. And they were recently talking about how, for online content that's like listicles, BuzzFeed-type stuff, that's very easy to automate. But New York Times journalism, where you have to be doing investigative reporting, where you could be working on one story for months, really digging into things and interviewing people, that kind of job isn't going to go away. It can be augmented, you know; these tools can help with making sure that you're doing everything around it correctly, and suggest some small parts of what you're doing. But with a book like yours, that is so technical, so advanced, so cutting-edge, while these tools could be augmenting your writing, absolutely, they couldn't actually be generating all the content. Not yet.
Exactly. Yeah. So language generation, like text, as well as audio: these are some of the examples of generative AI. Images you talked about, with DALL-E 2, for example. What other application areas are there? Yeah, we cover lots in the book. So, for example, music is a field that I
find particularly fascinating. I'm a musician myself. I can see you've got your guitar there in the background on the YouTube video. I'm really surprised, actually, that music generation hasn't really taken off in the same way that language generation has, because in many ways you'd think it's perhaps a little bit easier: there are so many genres of music, and we've got to arrange these audio waves in such a way that's pleasant to the ear, whereas words have a grammatical structure and there are very strict rules about what we want to see. But, you know, I sort of think to myself, why is that? And I wonder if it's perhaps in part because of the lack of data that's available. There's obviously a ton of text data available on the web; it's not as easy, perhaps, to find music data in such quantity. Perhaps it's also driven in part by necessity: large language models are also extremely useful. So yeah, we cover it in the book, though: music generation. I'll just quickly interrupt you on the music thing. I think that
you're absolutely right. I don't think it's the paucity of data; although there is obviously a lot of language data out there, there is a lot of music data as well. I think you hit on it right at the end there, which is that very few people are employed in creating music, but for almost all white-collar workers, our lingua franca, the medium that we intake as well as output, is text. And this became even more obvious through the pandemic, when you saw so many jobs could be done remotely, where it's just emails and Slack messages, text in, text out, for a lot of what we do. So I think that's why it's something that's talked about more. But it is interesting: there has been an explosion in generative music. Spotify apparently has a hundred thousand tracks uploaded to it every day, a hundred thousand tracks a day, and almost all of that is AI-generated music. And the reason why that happens, because you think, well, what's the point, why are people wasting server time uploading that, is that they also have bots that listen to those fake tracks, which brings in royalties for these people. But Spotify is starting to crack down on that.
Anyway. Yeah, I can imagine. Yeah, I think it's interesting to see where this goes, because, for example, I'm acquainted, I guess, with the VP of Audio at Stability AI, and he is first and foremost a composer. So he's not someone coming at this from a machine learning perspective; first and foremost, he's someone who comes to this as a composer, so he cares deeply about the rights and the authenticity of the music that's being generated, while seeing the potential for a different kind of music that we'll be listening to in future. So yeah, it's been exciting to see how platforms like Spotify jump on the bandwagon here. Absolutely.
This episode is brought to you by Posit, the open-source data science company. Posit makes the best tools for data scientists who love open source, period. No matter which language they prefer, Posit's popular RStudio IDE and enterprise products like Posit Workbench, Connect, and Package Manager all help individuals, teams, and organizations scale R and Python development easily and securely. Produce higher-quality analysis faster with great data science tools. Visit posit.co, that's P-O-S-I-T dot co, to learn more. All right, so I interrupted you a while ago. You were going to
transition away from music to another application area for generative AI. Yeah, sure. So we cover
music in the book, but also other modalities, especially cross-modalities. So we're talking about
things like text to image and also interestingly kind of things like text to code, which I guess
is another kind of language model, but a very specific kind of language model. But also, in the
final chapters, how reinforcement learning plays a part when we're talking about things like
world models where there's a generative model at the heart of the agent, which is trying to
just simply understand how its environment evolves over time. And then layered onto that is the
ability for the agent to use this generative model to understand what its future might look like
and therefore hallucinate different trajectories through its action space. So yeah, we might come on to this in a little bit more detail later. Yeah, we've got it all covered in the book. Awesome. Yeah,
there's so many exciting topics to come from this episode. Yeah, so application areas that I've
now, I think, jotted down relatively comprehensively: you've got text generation, voice, music, images, video, code, multimodal models. Tons of different areas, really exciting times.
So in what way do density functions serve to distinguish these different generative AI techniques
from each other? Yeah, that's a great question. So, if I just briefly talk about how we cover this in the book: the first section of the book is what we call methods, and this is where I lay out the six fundamental families of generative AI model; the second half is based on applications, so, what can you do with them? Now, the six families of model are differentiated by how they handle the density function. So let me give you an
example. The first split that you can make is between those that implicitly model the density function and those that explicitly model it. And what I mean by that is: imagine the density function is basically like a landscape over which you're trying to move to find images that are more likely to be real than others. The images that are most likely to be real are, say, at the bottom of the valleys, and the least likely to be real are at the top of the mountains. So you're always trying to move downhill in this model. And you're trying to come up with the landscape that truly reflects how real images are produced. So we're kind of postulating that this landscape really does exist, and that we need to find a model, an abstraction of reality if you like, that captures the true nature of it. So if you imagine the different dimensions of this landscape are the pixels in an image, then there are some configurations of pixels that are in the valleys, and they produce very realistic images, and there are some configurations of pixels that are on the mountains, and they aren't very realistic. So the question always becomes, firstly, how do you model this landscape? What does it actually look like in this really high-dimensional space? And secondly, how do we navigate it? How do we move downhill to find images that look real? So, implicitly, you can model this with something like a GAN, where you don't actually write down an equation of what this model looks like, but you play a game between what's known as the generator and the discriminator.
To quickly jump in, for those who don't know that term GAN: it's a generative adversarial network. Yeah, exactly, generative adversarial network, GAN. And you're basically playing a
game here between the generator that's trying to create images that look real, and the discriminator that's trying to pick between those that are real and not. At no point in that process do you write down an equation that says, yeah, this is what I believe the density function to be; you're implicitly modeling it through this game. And that is in contrast to pretty much every other kind of model, which does in some way try to create this density function, which we usually call p(x). So in this other set of models there are different ways of dividing it up, and one of the ways, for example, is to say, okay, we can approximate it in some way; we're not going to try and find it perfectly, but we're going to approximate it. Variational autoencoders do this, and some other model types as well. On the other side, you can also find models that try to model it really explicitly, such as your autoregressive models, where you basically place some constraints on how the generation is produced. Autoregressive models always look to produce one sequence step ahead, so something like GPT is a good example of this, where you're just predicting the next word or token at a time, and you can write down an equation that says, this is what I believe the landscape to look like, because I am restricting it to just predicting the next word. So if you wanted, the equation would be huge, but you can write down what that looks like. And then you've got some other types, like normalizing flows, where you enact a change of variables on the landscape and try to morph the landscape into something that is easier to sample from. You've got energy-based models, which are the fundamental root of diffusion models, which again we can talk about later; this is basically saying, how can I come up with a function that tells me how to move downhill in this landscape? And then, yeah, I think that covers it; that's our six kinds of model. So they all try to model this density function slightly differently, but ultimately it's a fundamental part of generative AI, understanding what we mean by a density function, and we cover that in the first chapter of the book.
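As a concrete illustration of the explicit, autoregressive case: a GPT-style model writes the density over a token sequence as a product of one-step-ahead conditionals (standard notation, not David's exact formulation):

```latex
p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each factor is just a next-token distribution the network outputs, which is why the equation, though enormous when expanded, can be written down explicitly.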
Sweet; that's why we kicked off with that here. So, something that's happened recently, at the time of recording, is that Geoff Hinton, who is perhaps the single most important person in the history of deep learning, and deep learning is essential to all of these generative techniques that you've just been describing; indeed, your book is called Generative Deep Learning. I'm not really aware of contemporary generative approaches that don't use deep learning. Correct, yeah, it's pretty much all deep learning now. So, Geoff Hinton, sometimes called the godfather of artificial intelligence, but probably more accurately the godfather of deep learning: he won the Turing Award with Yoshua Bengio and Yann LeCun, which is like the equivalent of a Nobel Prize for computer science. And he was at Google for a very long time. He recently left, at the time of recording this at least, and he cited significant concerns about the misuse of generative AI as the key reason for him leaving: he wanted to be able to express himself more clearly. He's actually clarified since that he doesn't think Google is doing a bad job, but that there are pressing concerns here and that he needs to be able to speak freely about them. So, do you agree with Geoff Hinton? What do you think about this whole situation? Okay, there's
a few things to unpack here. So, first of all, I massively respect Geoff Hinton's work. I think a lot of us wouldn't be doing what we're doing without his fundamental breakthroughs in the field, around things like backpropagation, obviously, in the early days of deep learning. So yeah, it's worth listening to what he says, first of all, because I think he's got valid points and he puts them across very eloquently in his interviews. I would say it's important to note here that the difficult position to take in this is that we're going to be fine, and the reason I say that is because it's very hard to prove somebody wrong who says AI is an existential threat, because if it hasn't happened yet, then they can just say, well, it hasn't happened yet, but it will happen. So you're always in this position of, well, how do I show that this argument I don't particularly agree with is wrong? How do I show that I don't think it's an existential threat, and that we can put things in place to prevent the threat from happening, or that it's just not viable in the first place? So you've got to, first of all, think really hard if you're going to come out and say, I think AI isn't an existential threat. And I have been doing a lot of thinking about this, listening to arguments on both sides, and I think there are hugely valid points to be made, but ultimately I've come down on the side of not thinking it's as great a threat as perhaps the likes of Geoff Hinton are putting out there.
And I think one of the criticisms I perhaps would make of the argument that it is, is that I don't like the idea of just waving the hands and saying that the AI will want to take control. I think there's a huge leap here from saying that we have large language models which now predict the next word very, very accurately, and of course can be chained with tools and all of those things, to then saying that those same language models will have wants and desires and long-term aspirations to achieve a particular goal. I really don't think that a model which is ultimately interpolative; these models, whilst they look as if they're doing very clever extrapolation, I believe are ultimately still confined by the dataset that they are trained on. I don't think, and I might be wrong about this, but I just don't think that they're going to have the capacity to want to eliminate us, and that is ultimately what he's saying. And to be clear, this is very separate from saying bad people will do bad things with AI, and I think they will; there's no question. I mean, we see it with every technology: if bad people want to do bad things with the technology, they will. And I agree with him there that we need to be extraordinarily cautious that we don't let that happen, and put the regulation in place to ensure that it doesn't. But there's a huge leap to then say the AI itself is going to want to dominate us just because it's more intelligent than us, or apparently more intelligent than us. I think we're downplaying our own capabilities here. The example I would make as a counterexample, perhaps, is: if you trained a large language model on all scientific data, or just all data, up until say 1910, would it come up with general relativity? I just don't think it could. I don't think it can make that extrapolative leap that says, given the data I have available to me at the time, I can run this thought experiment myself, and want to run the thought experiment, to come up with something as profound as relativity. I can't see that happening, and therefore it leads me to the conclusion that we've got something worth fighting for against this AI, and we shouldn't just lie down and say, yep, we're now on a path to existential annihilation because we've built something that can predict the next word very, very well. I'm optimistic, basically.
Did you know that Anaconda is the world's most popular platform for developing and deploying secure Python solutions, faster? Anaconda's solutions enable practitioners and institutions around the world to securely harness the power of open source, and their cloud platform is a place where you can learn and share within the Python community. Master your Python skills with on-demand courses, cloud-hosted notebooks, webinars, and so much more. See why over 35 million users trust Anaconda by heading to superdatascience.com/anaconda. You'll find the page pre-populated with our special code SDS, so you'll get your first 30 days free. Yep, that's 30 days of free Python training at superdatascience.com/anaconda. Yeah, there's a lot of different ways we
could go with this, and I'm not going to let us; I mean, we could literally spend this entire episode talking about this stuff, but we have a lot of technical stuff that I'd like to get into with the generative AI that you specialize in, so I'm not going to drag this out too long. There are interesting things where, yes, today, models like GPT-4 are predicting the next word; they're not in and of themselves a risk. But we have tools like Auto-GPT that were built on it, where Auto-GPT could potentially be given a large amount of resources, including a lot of its own GPT-4 agents, and we could give that Auto-GPT a broad task like: here's a million dollars, increase the amount of money. And one person might say, increase the amount of money but also don't break any laws, whereas another person might not give that qualifier. And even without breaking any laws, it might figure out a way, you know, that takes advantage of some people to generate more money in the bank account. So, while Auto-GPT today might not be too sinister, with how crazy things have become just in the last year, like, you talked about signing your book contract a year ago and the incredible progress that's happened over that year: if somebody had asked me a year ago whether I thought a model like GPT-4 could exist in our lifetimes, I might have said, I don't know.
That's really good. Yeah, so, I don't know. And even just scaling: that huge innovation has come about through just scaling the same architecture, transformers. And, you know, scaling that another ten times, or another hundred times, before it gets prohibitively expensive to train; there aren't that many orders of magnitude left before we're talking about like a hundred billion dollars to train a model. So there are probably also going to be scientific breakthroughs beyond just the engineering breakthroughs that we're doing today on scaling bigger and bigger. So, anyway, I can get why people, including Geoff Hinton, are concerned about the existential risks. But,
tying more immediately into the kinds of concepts that are covered in your book: he also expresses concern about, you know, fake content, misinformation, which you alluded to there. That is the immediate risk: with the tools we have today, anybody who wants to misuse them can, and they can do things incredibly powerfully, just tying lawyers up, for example. A specific example I read yesterday, I think this was in The Economist: they gave this example of how a person could create a thousand-page document as to why, like a NIMBY, someone who doesn't want, so, not in my backyard, NIMBY; a NIMBYist could create this thousand-page proposal for government officials to read about why they don't want electrical wires visible from their back window, and a human then probably is going to have to read that and respond to it. So there are all these interesting things, and that's not even really a misuse of the technology, but with the scale now at which we can create language, it's going to cause problems. And so, it doesn't seem like you're too concerned about it. So I guess, yeah, why aren't you that concerned about the immediate risks? Or do you already have in your mind ways that we can overcome these risks,
perhaps with AI itself? Yeah, so I would say it's the immediate risks I'm slightly more worried about; it's the existential risk that I perhaps think is overplayed. The immediate risks of disinformation, and the ability for large language models to create a huge amount of noise in our world, whether that's creating work for people, like the lawyers reading the document that you just mentioned, or just the fact that it might nullify the power of things like social media platforms if we can't really determine what's real and what's fake, as well as democracy itself, if we're now influenced by media content that isn't correct or isn't real: I think that is more of a risk. And the way I would like to see this handled is, first of all, education. I think we are going to have to get used to a world where we need to be a lot more vigilant about what's real and what isn't real. I think we've been extraordinarily privileged, actually, to live through the start of the internet era being relatively free of fake content, and I think that has generated a huge amount of worth in things like, for example, programming, where before, I would have had to go and buy the book on Python if I wanted to learn how to do something; now I can just go online and I know I'm going to find an article written by a human that tells me exactly how to do what I want to do. So there's a huge amount of value that's been created by that, and I think that value is now being condensed into models such as GPT-4, which is going to be even more powerful than me trawling through hours and hours of Stack Overflow content to find out how to do something in pandas, which is what I usually end up doing. So on the one hand it's going to actually improve efficiency like this, but also, like you say, I think we just need to be extraordinarily careful that we don't let this thing run away with itself. Humans are incredibly slow to react to new technologies like this; we often need some sort of event before we go, yeah, we don't want that happening anymore, and nobody really knows what this event is going to be. I was talking to an AI IP lawyer earlier, and she's along the same lines: it's very hard to get people to take notice or listen before something happens that makes us go, yeah, that's the thing we didn't want to happen. So I think this is in line with some of Sam Altman's comments, and also Yann LeCun's comments, around: how can we start legislating against something that we don't yet know? You don't want to stifle innovation; you don't want to stifle research just because you're worried that something might happen; otherwise you'd just legislate everything. So look, I don't have all the answers, but I'm optimistic, and I would like to see more people optimistically trying to come up with solutions rather than just pointing out that there's annihilation around the corner, which I just don't think is credible at the moment.
Yeah, and I think that AI itself can be used to solve a lot of these issues. So Jeremie Harris, whom we've already talked about: he has a lot on his show, Last Week in AI, about ways that we can be mitigating some of these risks. And one thing that he talks about regularly, and that he talked about even on our show (we had him on for a GPT-4 risks episode, episode number 668), is how we can be using AI to monitor AI, because it's much faster than us. We can't have people monitoring for slight aberrations, but we could train AI to try to keep it in line. So that's, I guess, a leading approach today for how we deal with the existential risk thing. And even with the misinformation stuff, we can have misinformation detectors that are automated. And I'm usually pretty skeptical about the crypto hype and blockchain in general, but a real-life application of the blockchain, first brought to my attention by Sadie St. Lawrence, I believe in episode number 537 of this show, is that you can use the blockchain to verify that a document is real. So if there's a source that you trust, like the New York Times or The Economist or whatever, then an image or a video can be tagged (I don't know the technology very well), and you can verify on some blockchain that, okay, this actually really came from that trusted source. Mm-hmm. Yeah, attribution is going to be a
critical thing that we have to care about going forward. And I think what's important is that we don't make it black and white, this is AI-generated, this is not AI-generated, because ultimately it is a gray zone. If I use an AI tool to generate the structure of a document, but then I fill in the blanks, or I extrapolate, I don't really want to have to label that as AI-generated, because ultimately it's had my eyes on it; I've overseen the process. It's a bit like if I use a tool like a spell checker: I don't have to declare, yes, I spell-checked this document; I just put it out there, because it's had my eyes on it. But what I think we need to label is anything that is AI-generated that has had no human eye look over it. That's where I think we might need to start saying, okay, if this content has been produced and no human has had any part in the production of that content, I think people should know about that. And I think it's important that we can distinguish, or at least label in some way, any content that has gone out unverified, because that's where you might start to see the problems. And I go back to my example: on Stack Overflow, if there's content on there that is an answer to a question and it has been AI-generated, I kind of want to know that if I'm reading it, like, take this with a pinch of salt, because it might not be something that someone has actually produced. It might still be useful, but it's not human-generated; it's AI-generated. Nice, yeah. So there are risks, but we can mitigate most of the risks, and I think it's good that people are calling these risks to our attention, so hopefully we can get ahead of them to some extent and the most audacious issues can be tackled upfront. So let's move back to technical stuff.
So, one of the fundamentals of generative AI is autoencoders. We talked about density functions earlier; let's talk about autoencoders. These are a really key concept in generative AI, and there's this idea of encoding information. So, let's take the example of a text-to-text model; this is like the ChatGPT experience. We provide text to the model, it encodes that text into something called a latent space, and then there's a decoder that takes that latent-space information and converts it into new text, which in this example, ChatGPT, provides text back out to us. So: encoding text, latent-space representation, decoding. I think altogether this describes an autoencoder. So, yeah, maybe fill us in a bit more on what these terms mean and what important role they play in generative AI systems. Yeah, cool, great question.
I'll take it back, actually, to an example with images, because I think it's slightly easier to visualize for your listeners. So, let's imagine we've got an image, and it's, like, a thousand and twenty-four pixels, so a high-dimensional space, and every single one of these pixels has three color channels, so you've got a lot of numbers, basically, to describe that picture. And as we've previously mentioned, there is some density function that describes why that image is very likely to be a true image while other, noisy images aren't. Now, what autoencoders look to do is say: can we map this high-dimensional space of the image domain to what is known as a latent space, of a lower dimension? You could even map to a latent space of two dimensions; then it's very, very easy to visualize: you're just imagining a plane, and on that plane there are mountains and valleys, and that determines whether some of those points in the latent space are likely to be real or not. Now, the reason why this is useful is because it forces the model to make generalizations over the pixel space, so that it can compress that information into the latent space. It's a bit like, I always give the example of cookie jars, or biscuit tins, which are cylindrical: how many numbers do you need to describe the shape of that biscuit tin? The answer is two: you need to know the height, and you need to know the diameter of the cross-sectional circle. If you know those two numbers, you can reproduce the biscuit tin. So even though this thing exists in three dimensions, and you could view it from different angles and come up with different pixel pictures of it, actually you can describe it using two numbers, and in that latent space you could basically move around to produce different kinds of cylinder. And exactly the same thing is true with models like diffusion models, or even a variational autoencoder: you're basically saying to the model, find me a low-dimensional latent space where, if I choose any point within that latent space over some distribution, like a normal distribution centered on the origin, I am pretty likely to find something that is truly a real image. And then what the decoder is trying to do is move back from the latent space to the pixel space: can you take these two numbers and recreate the biscuit tin? So if you join those two things together, you've got what's known as an autoencoder, because it's trying to effectively compress the information down to something small and then expand it back out again to the original image. It's auto-encoding itself.
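A rough sketch of that encoder/decoder pairing in Keras, the library used throughout the book; the image size, layer widths, and the two-dimensional latent space (echoing the biscuit-tin example) are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Encoder: compress a 64x64 RGB image down to a 2-D latent point.
encoder = keras.Sequential([
    layers.Flatten(input_shape=(64, 64, 3)),
    layers.Dense(256, activation="relu"),
    layers.Dense(2),  # two numbers, like the tin's height and diameter
])

# Decoder: expand a 2-D latent point back out to pixel space.
decoder = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(2,)),
    layers.Dense(64 * 64 * 3, activation="sigmoid"),
    layers.Reshape((64, 64, 3)),
])

# Autoencoder: encode then decode, trained to reconstruct its own input.
inputs = keras.Input(shape=(64, 64, 3))
autoencoder = keras.Model(inputs, decoder(encoder(inputs)))
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss
```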
Nice, great explanation. I love that: a three-dimensional biscuit tin, a cylinder, represented in two dimensions. That's such a crisp way of describing how this latent space can contain information like that. Awesome. So, there are different kinds of autoencoders; we've got variational ones, for example, which are more popular today. So how do variational autoencoders differ from traditional ones? What unique capabilities do they offer? Yeah, so the problem with vanilla, let's call them vanilla autoencoders, so not variational, is, there are a few problems. First of all, if you just let the model map to any old latent space, like you just say, take the pixel space and I just want you to find two numbers that kind of represent what that image is so that you can decode it, the problem is it's very hard to sample from that two-dimensional space. Because, okay, let's say the point a hundred, a hundred: is that a valid image? What about two hundred, two hundred? Two million, two million? Like, where in this vast space should I be sampling? And so what you end up with is a latent space that is, firstly, very difficult to sample from, without much structure; it's got no incentive to pull similar concepts together,
because ultimately it's unconstrained. What a variational autoencoder does is make a very slight change to the loss function. It effectively says: you've got to include a term which makes sure that the points, when you map them into this latent space, are as close to a standard normal distribution as possible. By a standard normal distribution, what I mean is a normal distribution with a mean of zero and a covariance, or standard deviation, of one. We know how to sample from this object; it's really common, and we know exactly how it works. And by doing that, what happens is, first of all, everything gets compressed to something that looks like a normal distribution, and that helps us in two ways. First, it means that there is a degree of continuity in the latent space, so you can move around it and be pretty sure that anything within this normal distribution is going to be something that's likely to be a real image; and if you move to the extremes, then you're going to find something that's less likely. But we understand what this distribution means, so we can sample from it really easily, and just make sure that if we're choosing random points from a standard normal distribution, we're going to be able to decode those points to a real-looking image. So the variational autoencoder is a bit like the glue that glues everything together and makes it a true generative model that we can sample from, and not just this abstract autoencoder object, which isn't very easy to work with, basically. Nice.
The future of AI shouldn't be just about productivity. An AI agent with the capacity to grow alongside you long-term could become a companion that supports your emotional well-being. Paradot, an AI companion app developed by WithFeeling.AI, reimagines the way humans interact with AI today. Using their proprietary large language models, Paradot agents store your likes and dislikes in a long-term memory system, enabling them to recall important details about you and incorporate those details into dialogue, without LLMs' typical context-window limitations. Explore what the future of human-AI interactions could be like this very day by downloading the Paradot app from the Apple App Store or Google Play, or by visiting paradot.ai on the web.
Great explanation, crystal clear. So variational autoencoders allow us to constrain distributions to a standard normal, which leads to better behavior in the autoencoder; we get better results. Precisely, yeah. And the term in the loss function that does that is called the KL divergence, the Kullback–Leibler divergence, and it's the glue that makes it the first kind of generative model I would recommend everyone starts with. Nice, yeah, that KL divergence, that's big in information theory. Yeah, it's a way of measuring the difference between two distributions. So if you've got your distribution of points and you want to compare it to the standard normal, you could use the KL divergence to do that. The beautiful thing about this is that it's actually got a closed-form solution for a standard normal, which means that you don't actually need to do any sampling to work out what its value is; you can just write down the answer, which is super powerful.
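For reference, the closed-form expression David is referring to, the KL divergence between the encoder's diagonal Gaussian and a standard normal, is conventionally written as:

```latex
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big)
  = -\tfrac{1}{2} \sum_{i} \left( 1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2 \right)
```

In a VAE this term is added to the reconstruction loss, pulling every encoded point toward the standard normal that is later sampled from.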
Cool. All right, so we've now got kind of the key terms under our belts for generative AI: we know about density functions, we know about the application areas, we know about autoencoders. So let's talk now about the big breakthrough that captured the public's imagination. Even before you signed your book deal a year ago, there was already a lot of hype around DALL-E. This was released by OpenAI, the same company that released ChatGPT; it's a text-to-image generator. And the original DALL-E, while miles behind the DALL-E 2 that came out shortly after you signed your book deal, even the original DALL-E, for some kinds of requests, created stunning imagery. On their website, for example, there were examples of being able to say, I want a shark walking a crocodile, or whatever, and it could create a cartoon of that. And, you know, compared to DALL-E 2 or Midjourney, it wasn't that many pixels, and it was definitely better at cartoony-type stuff relative to photorealistic stuff. But still, this was like the first time that I, and probably most people, were able to have this unbelievable creative outlet of being able to take any text that comes into your head and automatically generate an image from it. So, that DALL-E model leverages diffusion, and your book has an entire chapter devoted to diffusion. So can you explain what diffusion is and how noise can be employed in the generation process? Yeah, sure, let me start with diffusion then.
So, DALL-E actually is made up of a few components, and diffusion is used in a few of them; it's definitely a core component of DALL-E 2. So yeah, a great place to start is to first of all explain what diffusion is. The best way I can describe it, using kind of a metaphor, is: imagine that you've got a set of TV sets all linked together in a long line, and the first TV shows just random noise, complete random static, and the last TV in that sequence shows an image from your data set. Now, if you want to move from the image in your data set on that television to the random noise, it's very simple: you can just add tiny bits of random noise to that image in tiny steps, Gaussian noise, which basically means noise sampled from a normal distribution, and eventually, over enough time steps, you won't be able to tell what that image was; it's basically as good as random noise. So you've moved from the image domain of your data set through to the noise domain, which we can sample from. And with generative AI, where you're always trying to get to is: can I sample from this thing? Because if you can sample from it, that means you've got this random point that you now just need to decode. And so, you know, we talked just now about encoders and decoders; the adding of noise is a bit like encoding. It's not quite the same, because it's not a learned model that does this, it's just noise addition. But the beauty of the diffusion model is that it learns the reverse process: it learns how to undo the noise and get back to the original image. Now, you might say, well, how on earth does it do that? How do you find an image just out of random noise? But you can think to yourself, well, if you do this in small enough steps, then this is kind of possible. Say your data set was just images of houses, outdoors. Most of the time the upper pixels will be blue, because they're the sky, and you're going to have some kind of maybe greeny pixels down the bottom. So to get from random noise to an image, you might train a model that says, well, let's try and keep some of the green pixels at the bottom; I think they're the ones that need to be adjusted in such a way that they're slightly more green; and the pixels at the top, I want you to adjust those in such a way that they stay roughly more blue than the pixels in other parts of the image. And it turns out that if you do this over enough time steps, and in small enough steps, the model, through taking what it already has and making a slight adjustment that makes it slightly more like an image, can make random noise turn, almost magically before your eyes, back into something in the image domain. And the way that the diffusion model actually works, the nuts and bolts of it, is something called a U-Net model. Unlike a variational autoencoder, which tries to move from, say, the latent space back to the original pixel space in the decoder, this U-Net model simply maps the image to another variation of the image with slightly less noise; that's what it's trying to do. And if you do this over enough time steps, it turns out you can train a pretty good model to learn how to decode noise back into the original image domain. So yeah, that's how they work: diffusion models are all about U-Nets, and they're all about adding noise through a forward step and then trying to remove the noise through a backward step.
a backward step nice and so I guess that's how stable diffusion works as well so that's the so
behind mid-journey at the time of recording mid-journey version five is the state of the art it
creates amazing photorealistic images and so this same kind of approach is in behind there it's
probably just scaled up right probably more like yeah larger model architecture and more more
training data yeah and the beauty of stable diffusion is in something an advancement that they
made called latent diffusion and this is where and all of these ideas are kind of tying together now
that we talked about because what latent diffusion does is it works in the latent space so there's
actually an initial part of the model that tries to compress the image down to something that isn't
pixels anymore but it's like it's it's a latent space of concepts effectively and then latent
diffusion works in the latent space the diffusion model just works on this and then there's a decoder
effectively that sits after this that takes the denoise latent space back into pixel space so
yeah what they realized was that you don't need to work on the pixel space itself because you've
got a lot of redundant information you can work in a much smaller and faster latent space that's
the beauty of it that's why it's so good nice that makes perfect sense so the distinction between
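A sketch of that pipeline with hypothetical stand-ins for the trained components (the function arguments, the latent shape, and the simplified update are illustrative assumptions, not any library's actual API):

```python
import numpy as np

def latent_diffusion_generate(prompt_embedding, unet, decoder, num_steps=50):
    """Denoise in a small latent space, then decode to pixels once at the end."""
    latent = np.random.normal(size=(64, 64, 4))  # compact latent grid, not pixels
    for step in reversed(range(num_steps)):
        # The U-Net denoises the latent, conditioned on the text prompt.
        predicted_noise = unet(latent, step, prompt_embedding)
        latent = latent - 0.1 * predicted_noise  # simplified denoising update
    # Only at the very end do we pay the cost of mapping back to pixel space.
    return decoder(latent)  # e.g. a 512x512 RGB image
```

Because every denoising step runs on the compact latent grid rather than on full-resolution pixels, the loop is far cheaper, which is the efficiency win David describes.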
latent diffusion this newer technique that powers same mid-journey version five relative to
the diffusion that's been around for all these years yeah all these months is that it allows for
diffusion on the latent space which as we've talked about earlier in our discussion of how we
use an autoencoder for example to go from an encoder into a latent space and I mean you need to
decode that later the latent space there like your 3d biscuit tin in how that can be represented with
just two pieces of information and similarly here when we're doing diffusion on the latent space
we're doing diffusion on more compressed information and so it's more computationally efficient
easier to scale up we get better results yeah perfect exactly that nice cool and related topic is
CLIP models. So what are CLIP models, and how are they leveraged in these kinds of text-to-image tasks that we've been talking about, like DALL-E and Stable Diffusion? Yeah, cool. So a CLIP model is one part of DALL-E 2, and I'll come onto exactly which part and how it's used, because CLIP itself isn't a generative model. CLIP actually uses a technique called contrastive learning to effectively match pairs of text and images. So imagine you've got a data set with loads of pairs of images and their corresponding descriptions; let's say you've got a picture of a field with a tractor, and then you've got a text description that says, this is a field with a tractor in it, on a sunny day. What CLIP does is try to learn a model that can match the image to its corresponding text description, and the way it does that is it trains two different kinds of transformer, and we can come onto the details of transformers. The transformer on the text side basically says, can you encode this text description into a vector? And the transformer on the image side says, can you encode this image into a vector? And then what it's doing is taking these two vectors and quite simply calculating the cosine similarity between them. What you want is for true pairs to have a very high cosine similarity score, and mismatched pairs to have a very low similarity score, and that is what the CLIP training process does. It tries to find this kind of identity matrix: if you imagine the images on the rows and the texts on the columns, along the diagonal you get very high scores, because these are the matching pairs, and on the off-diagonal you want the scores to be as small as possible, because you don't want those things to be regarded as similar. So it's a bit like a recommendation algorithm, you know: is this image recommended to go with this text?
this clip model standalone as well like this and so one of the cool things about this
approach and it follows on from what you were just describing is that this allows you
to have an image classification algorithm that didn't necessarily have the label that you'd like
to extract in the training data so you so when we were 10 years ago the state of the art
and not until very recently the state of the art image classification so with models like Jeff
Hinton's Alex net that came out in 2012 that was trained on the image net data set which had
tens of thousands of different labeled categories cats horses it had tons of different kinds of dogs
because they use that as like they wanted the model to to be able to demonstrate that not only
is it good to classifying a wide range of images but also for a specific like category of images it
could distinguish fine details and be able to distinguish a Yorkshire Terrier from an Australian
silk Terrier even though these are extremely similar looking dogs and so the state of the art was
that you needed to I guess going back to one of our first topics in this conversation talking about
discriminative models where we were discriminating down to specific class labels and even if it was
tens of thousands of labels you still you could not use a model trained in that approach in this
discriminative approach to be able to guess a label that's outside of the 10,000 labels that's
been trained on but with clip we get exactly that so with clip you can just say he could just ask
it to label images that it's never seen before in class categories it's never seen before but it uses
yeah it uses this approach that you just described to map it to any natural language
yeah precisely and it you know it's the reason it can work is because it's encoding everything
into the same latent space it doesn't matter if it's not a label in the data set you can make it
a label by doing by pushing it through the encoder whether it's an image or a text.
Right, so it's a latent space: the meaning that is embedded in this latent space we can extract visually or linguistically.
Exactly, and that's what DALL-E 2 excels at. It basically takes the text embedding from your input, so say you've written something like 'I want to see a cat riding a skateboard,' then it takes that text embedding and tries to predict what the corresponding image embedding looks like; that's called the prior. And then the final step takes the image embedding and uses diffusion to generate the image. So it's a three-step process: text goes through the text encoder to create the text embedding, and that's just the CLIP text embedding; you've then got a prior sitting in the middle that says, now go and predict what the equivalent image embedding looks like in the latent space of the image model; and then you just decode it. I say 'just'; there's a lot of work that's gone into that, but that is how DALL-E 2 works.
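Structurally, the three-step pipeline David describes looks something like the following sketch, where each stub stands in for a large trained network. The function names and shapes are illustrative assumptions, not OpenAI's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_text_encoder(prompt: str) -> np.ndarray:
    """Step 1: prompt -> CLIP text embedding (untrained stub)."""
    return rng.standard_normal(512)

def prior(text_emb: np.ndarray) -> np.ndarray:
    """Step 2: predict the corresponding CLIP *image* embedding (stub)."""
    return text_emb + 0.1 * rng.standard_normal(text_emb.shape)

def diffusion_decoder(image_emb: np.ndarray) -> np.ndarray:
    """Step 3: decode the image embedding into pixels via diffusion (stub)."""
    return rng.standard_normal((256, 256, 3))

prompt = "a cat riding a skateboard"
image = diffusion_decoder(prior(clip_text_encoder(prompt)))
```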
Nice, okay, super cool. So this CLIP approach is great not only for associating natural language that wasn't in the labeled training data, but also for allowing DALL-E 2 to be so much more effective than its predecessor, DALL-E. And I was going to ask you a question about how CLIP can be used for zero-shot prediction, but I think we've already covered that. This idea of zero-shot prediction is using a machine learning model, typically a large language model, to do some task that it wasn't trained on, without any training examples at all. You just take the model weights as they were trained and you say, do this task: you know, 'is there a skateboard in this image?' And it can answer that question even if it's never been trained to do that.
Precisely, that's exactly it. Even if you've never shown it, never given it that task before, it can have a good go.
Sweet.
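Here's a minimal sketch of that zero-shot recipe, assuming we already have a trained pair of CLIP-style encoders (stubbed out here with random vectors): any natural-language string can serve as a candidate label.

```python
import numpy as np

def zero_shot_classify(image_emb, label_texts, text_encoder):
    """Zero-shot labelling with a CLIP-style model: a string becomes a
    'class' just by pushing it through the text encoder.

    image_emb: (dim,) embedding of the query image
    label_texts: candidate labels never seen as training classes
    text_encoder: function mapping a string to a (dim,) embedding
    """
    txt = np.stack([text_encoder(t) for t in label_texts])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    scores = txt @ img                        # cosine similarity per label
    return label_texts[int(np.argmax(scores))], scores

# Toy stand-in encoder; a real system would use CLIP's trained transformer.
rng = np.random.default_rng(0)
fake_text_encoder = lambda s: rng.standard_normal(64)
label, scores = zero_shot_classify(
    rng.standard_normal(64),
    ["a photo of a skateboard", "a photo of a tractor"],
    fake_text_encoder,
)
```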
All right, we've got lots of great foundational generative AI knowledge now under our belts. A really cool topic that we alluded to earlier in the episode is world models. You've got a chapter in your book dedicated to it, so: what are world models, and how can a model learn inside its own dream environment?
Yeah, I love this topic; it's so fascinating to me, and it's actually the reason I started writing the book in the first place. It was a paper in 2018 by David Ha and Jürgen Schmidhuber, simply called 'World Models,' and it's effectively a collision between two of my favorite fields, which are generative AI and reinforcement learning. In the paper, they describe how you can build an agent (an agent, in reinforcement learning, is something that takes actions within an environment), and the agent has within it the variational autoencoder that we've just talked about. What that's doing is trying to collapse down what it's seeing (in the example in the paper, it was a car racing around a track) into a latent space which it can predict chronologically. So it's now trying to model how its future looks, given its latent understanding of what it's seeing and the action that it's just taken.
And this is where everything collides for me, because you've got the VAE, the variational autoencoder, creating the latent space of the environment and understanding what it's seeing. You've then got an autoregressive model (they used an RNN, a recurrent neural network, in the paper) which tries to predict autoregressively how that latent space will evolve over time, given its actions. And then you've got reinforcement learning, which is an entirely different field, which then says: how do you take actions that maximize the reward, given that the environment you're in is your own hallucination of how this latent space evolves? And the latent space, of course, includes how the reward evolves over time and what kind of episode reward you're going to get. I love this field because a world model, for me, encapsulates everything about machine learning that we've learned so far: there's discriminative stuff involved, but also a generative component and a reinforcement learning component. And I think this is a really powerful concept in teaching agents to behave in an environment with their own sort of generative understanding of how that world operates. It feels very close to how we do it as humans. When we're learning a new topic, it's not really something where we expect the environment to give us a nicely packaged-up reward function; we seem to have an inherent understanding of how the world operates and then layer our actions on top of this understanding. So if I'm shooting a basketball through the hoop, I kind of know what's going to happen, because I can imagine what the action is going to do to my latent interpretation of what I'm seeing, and so it makes me learn... I mean, I'm still terrible at it, but in theory it should make me learn a lot faster, because I have an internal representation; I'm not just operating on the pixel space of my eyes. So yeah, world models are the reason I wrote the book, really, so I've got a lot of time for them.
Super cool. All right, so world models blend variational autoencoders, autoregression, and deep reinforcement learning to allow machines to visualize, to imagine, to dream some time steps into the future as to what the most likely outcomes are given a current state, and this allows it, with the deep reinforcement learning component, to then take actions that allow it to achieve some objective.
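In code, the three components from the Ha and Schmidhuber paper fit together roughly like this untrained, stubbed-out sketch; the shapes and update rules are placeholders, not the paper's actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Structural sketch of the World Models agent (Ha & Schmidhuber, 2018),
# mirroring its V (VAE), M (recurrent dynamics model), C (controller) split.
def vae_encode(observation):
    """V: collapse the raw frame into a small latent vector z (stub)."""
    return rng.standard_normal(32)

def dynamics_model(z, action, hidden):
    """M: autoregressively predict how the latent will evolve given an
    action -- the agent's 'dream' of its environment (stub)."""
    new_hidden = np.tanh(0.5 * hidden + 0.5 * np.concatenate([z, action]).sum())
    z_next = z + 0.1 * rng.standard_normal(z.shape)
    return z_next, new_hidden

def controller(z, hidden):
    """C: a tiny policy acting on the latent state; in the paper this is
    the only part trained against a task reward (stub)."""
    return np.tanh(rng.standard_normal(3))   # e.g. steer, accelerate, brake

# Rollout entirely inside the learned model -- no real environment needed.
z, hidden = vae_encode(rng.standard_normal((64, 64, 3))), 0.0
for t in range(10):
    action = controller(z, hidden)
    z, hidden = dynamics_model(z, action, hidden)
```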
And just to break down a few of the terms that you used there from reinforcement learning: you talked about a reward function, and you also talked about agents. So, in a reinforcement learning paradigm (reinforcement learning has been around for decades), reinforcement learning is a class of machine learning problem, really, where you want an agent, which could be a person or a machine, to take a series of actions. A really big example of deep reinforcement learning in recent years is the AlphaGo algorithm by Google DeepMind, which was able to beat the world's best Go players. So this kind of thing where you have a board game with a sequence of actions, and you want the agent to predict what actions are likely to lead to winning the game of Go, or winning a video game; Atari video games were a very popular choice a few years ago for training these deep reinforcement learning algorithms. And oh yeah, I should say that a reinforcement learning algorithm is a deep reinforcement learning algorithm when we use deep learning to solve the reinforcement learning problem.
Exactly.
And so I think that ties together all the terms. Oh, reward was the last one there. So in reinforcement learning, let's say we have it playing a video game: we provide it with the pixels on the screen, and that's like the state of play, but in addition to that we have a reward function, which in video games is often really easy. That's why Atari video games were such a popular choice for tackling deep reinforcement learning problems: they have an inbuilt score. Atari games, all of them, have a point score that we're trying to maximize, and so we feed that reward to the algorithm and it learns: okay, if I take this action, if I press right on the joystick or left on the joystick, is that likely to increase my reward in the future, or decrease it, or keep it flat? And so reinforcement learning algorithms are trying to maximize their reward. And your point there was that with most reinforcement learning approaches (in fact, as far as I was aware until this conversation, all reinforcement learning approaches), we had to have this reward function made explicit, the one the algorithm is trying to maximize. So if we go outside of the video game scenario, once we're, say, teaching an algorithm to drive a car, we'd have to manufacture some function like: you get one extra point for every meter traveled towards a destination, but you lose a thousand points if you hit a pedestrian.
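That hand-crafted reward might look as simple as this; the specific numbers are just the illustrative ones from the conversation:

```python
def driving_reward(meters_toward_destination: float, hit_pedestrian: bool) -> float:
    """A hand-crafted reward of the kind described above: +1 per meter of
    progress, a large penalty for hitting a pedestrian. The numbers are
    illustrative, not from any real system."""
    reward = 1.0 * meters_toward_destination
    if hit_pedestrian:
        reward -= 1000.0
    return reward

print(driving_reward(50.0, False))   # 50.0
print(driving_reward(50.0, True))    # -950.0
```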
And so what you were just saying is fascinating to me, because I think you said that with these world models we can have a deep reinforcement learning model learning real-world problems without needing to specify explicitly what that reward function is.
Yeah, it's a case of the world model itself not needing the reward function. The world model is simply trying to understand how its actions can be used to effectively model and predict how the environment will move in future. The power of it is that you can then layer a particular task on top of that, and of course that task would have to have a reward function, but this is obviously a lot faster than learning a reinforcement learning task from scratch with a reward function. It's almost like the world model gets you 80% of the way there, because you have an inherent understanding of the physics of your environment before you say to it, 'now try and drive the car fast.' And so in the paper, for example, what they do is actually train the world model completely task-independently, so there's no reward. They just say: take some actions and observe what happens. Drive the car forward, drive the car left, drive the car right, brake, and just see what your observations do. Don't worry about going fast; just drive and see what happens, randomly, which feels like what a baby does when it's crawling around on the floor. My eight-month-old is doing this, hopefully more and more every day, up until the point where we definitely want her to not do this.
You've been raising a newborn baby the entire time you've been writing this book?
Yeah, it's a bit mad; it's all a bit of a blur, to be honest with you. I think what's been sacrificed is sleep, so that's how it is. But yeah, I'm delighted to have a new daughter, and the book's actually dedicated to my daughter.
Yes, I saw; it was lovely. 'The loveliest noise vector of them all,' Alina?
That's the one, yeah, exactly. She'll be embarrassed by that in about 16 years' time, I think, but hopefully by then maybe the hype will have died down.
Cool. All right, so, fascinating area. Now, the final topic area that I want to get into, at least related to your book, is GPT. To some extent I grappled with this; I was like, should we even just be starting the episode with the GPT stuff? But I think that by going through these kinds of foundational concepts first, it allows us to get more into the weeds on GPT and how it relates to generative AI, to generative deep learning, than we could have if we'd just started with it. So GPT: generative pre-trained transformers. They have become by far the most widely known transformer model; in fact, I recently learned that OpenAI is trying to trademark GPT, those three letters, generative pre-trained transformer. The 'generative,' obviously, like everything we've been talking about in this episode so far, means it generates something; in this case it generates text. At least for now, that's all it does; I'm sure that'll change soon. And 'pre-trained' means that it can do the kind of zero-shot learning that we described: it can perform lots of kinds of tasks, because it's trained on so much data and has such rich encodings of meaning that we can ask it to do something it's never encountered before, that nobody has ever thought to ask a machine or a person to do before, and it can do it, at least in the case of GPT-4, magnificently. So that's G generative, P pre-trained, and then T transformer. So, David, what is a transformer?
Yeah, so transformers came into the world in 2017. It seems like a lifetime ago, but when you think about it, it's only five or six years. What they are based around is this concept called attention, and to understand transformers you first of all have to understand what attention is, because the whole transformer architecture, at its heart, the large majority of it, is just how these attention mechanisms work and how you build them up into what's called multi-head attention. So let's talk about attention first. Attention is basically a different way of modeling sequential data that is the complete opposite of the way recurrent neural networks do it. A recurrent neural network says: I'm going to take each token one at a time, in sequence, and I'm going to update my latent understanding of what this sentence or stream of tokens means so far; then I'm going to get to the end of the sequence, and I will try to use that latent understanding to predict the next token, because I've built up enough understanding as part of this vector to do so. Attention takes a different approach. It says: what you need to do instead is care about all previous tokens within your context window. Don't try to maintain a hidden state, because there's a ton of problems associated with that; instead, look at those previous tokens and first of all make a decision about where you think the information you need lies. So instead of trying to incorporate all information from all tokens, the first step is to simply say: where do you want to look? And part of this model is about building up an understanding of where it needs to look for information.
So an example would be: 'the elephant tried to get into the car, but it was too...'
Big, right?
So the missing word there is something to do with its size. Now, what are we using to work that out? The word 'elephant' is clearly important; 'car' is important, because we need to understand what it's trying to get into. But say it was 'the pink elephant': the color pink is just irrelevant to this whole scenario. Having said that, if we change the context slightly and say 'the pink elephant was trying to hide,' then suddenly the color becomes all-important, because a pink elephant is probably harder to hide than an elephant that's a different, darker color. So the attention mechanism says: first of all, come up with a way of combining what you're trying to do, which is known as the query, with all previous context tokens, which are known as the keys. A little bit like CLIP, which we just talked about, it's constantly comparing the key and the query, and then pulling through a certain amount of information from each token, which is called the value, and combining it in a clever way, through multiplication by weights, into the next latent understanding, which is passed to the next layer, and so on. You build up enough of these layers and you get such depth of understanding of the entire context of the sentence that you can mimic intelligence, it turns out, and that's what GPT-4 does.
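Here's a minimal single-query numpy sketch of that query/key/value computation, scaled dot-product attention; real transformers batch this over all positions and stack it into multi-head attention:

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention for a single head and a single query.

    query:  (d,)      what the current token is looking for
    keys:   (n, d)    one key per previous context token
    values: (n, d_v)  the information each token offers
    A minimal sketch; real transformers compute this for every position
    at once and combine several heads.
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)          # where should I look?
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()           # softmax over context tokens
    return weights @ values                     # pull through the values

rng = np.random.default_rng(0)
out = attention(rng.standard_normal(16),
                rng.standard_normal((10, 16)),
                rng.standard_normal((10, 16)))
```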
So yeah, that's basically how attention works, and with transformers you really don't need to know a lot more than that. There are just a few extra layer types, like layer normalization, and positional encoding, which basically tells the model where in the sentence a particular word sits, but ultimately what you've got to know is that it's all about attention.
Attention is all you need, you might say.
Precisely. Yeah, that punchy title is still one of the biggest memes, I think, in all of deep learning, which is cool.
So there are different kinds of architectures that rely on transformers in different ways. GPT relies heavily on the decoder part of the transformer. Earlier in today's episode, we talked about encoding and decoding in the latent space: encoding takes, say, text, or in your analogies images, but whatever it takes in, tokens of natural language or pixels of images, it encodes that into the latent space, and then we have the decoder part of an autoencoder that decodes that lower-dimensional representation into some desired output. It could again be text, it could be an image, it could be video, it could be code. Similarly, transformers can both encode and decode, but in some architectures, like GPT, we rely more on the decoder part, whereas other architectures, like BERT, which came out a few years earlier but is still enormously useful for a lot of applications, only have the encoder part of the transformer. So why would somebody want to encode only? What are the key differences between these encoder-based transformers versus GPT?
Right, this is the biggest misunderstanding I come across with people when they're talking about transformers: this differentiation between encoders and decoders and everything in between. There are some architectures that have both; the very first transformer, actually (and I think this is where the confusion comes from), was an encoder-decoder architecture, which means it had both, and so people now think that all transformers are still based around this initial architecture. They're not. As you rightly pointed out, GPT is decoder-only; they dropped the encoder. So what's the difference? Well, there is basically one difference that you need to know about, and that is something called masking. An encoder like BERT doesn't care where in the sentence it pulls the information from to build a contextual understanding of a particular word; it can look forwards in the sentence and it can look backwards. So let's say I wanted to come up with an embedding for the token 'elephant' in that previous example: it can look into the future of the sentence and pull information from future context in order to come up with a realistic embedding for that word. A decoder can't do that, because if you want your model to produce, to generate, to go into the future, you can't be reliant on future information, because it doesn't exist yet. So the only difference is that a decoder simply says: mask future information at every step of the process. Don't ever pull information from the future; only use where you're currently at to determine the next token. And that is why you can use a decoder model like GPT for generation, but you can't use an encoder model like BERT. BERT is for natural language understanding, not natural language generation. That's the difference.
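That masking step is tiny in code. Here's a sketch for a length-5 sequence, where scores above the diagonal are set to negative infinity so they get zero weight after the softmax:

```python
import numpy as np

# The one difference described above: a decoder masks out future positions
# so each token can only attend backwards.
n = 5
scores = np.random.default_rng(0).standard_normal((n, n))   # query x key scores

causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above diagonal
scores[causal_mask] = -np.inf                               # hide the future

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)
# Row i now has non-zero attention only on tokens 0..i; an encoder like
# BERT simply skips the masking step and attends in both directions.
print(np.round(weights, 2))
```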
Cool, yeah: NLU and NLG. So BERT, an encoder-only architecture, we use for natural language understanding. We can take natural language, encode it into this space, and then do useful things with that. We could train a discriminative model to do interesting things with that encoder; we could use it to classify text, so, you know, does this have a positive sentiment or a negative sentiment, that kind of thing. Whereas these decoder-only architectures, like GPT, specialize in sequence generation.
Yeah, and the thing you should use to determine which one you need is: if you want to build something on top of it, like a discriminative model as you say, then you've got to be looking at encoder architectures; if you want to produce a word, like the next word in a sentence, look at decoders. Now, there are some examples, like GPT-4, where you can actually do pretty good discriminative stuff using a decoder model, because you can just get it to output the predicted token. Decoders are kind of ruling and dominating at the moment because they're just incredibly powerful generalist learners.
Yeah, but you might be able to do it more efficiently: if you want to encode the language to do a classification task, you could probably be more computationally efficient using an encoder-only architecture.
Definitely. And there are small versions of these things, like DistilBERT, which you can fit on smaller hardware. So I think our first port of call when we're approaching this kind of stuff is always to go for the encoder models first and see how they do, because you're in dangerous territory with decoders: you don't actually know what they're going to produce next, whereas with an encoder you've got the vector, so you can do what you want with that.
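That encoder-first pattern might look like this sketch: embed the text with an encoder-only model, then put a small discriminative head on the vector. The encoder and weights here are untrained stand-ins:

```python
import numpy as np

def sentiment_classifier(text, encoder, weights, bias):
    """Embed the text with an encoder-only model (e.g. a BERT/DistilBERT-style
    model), then apply a tiny logistic-regression head on top of the vector.
    All components here are stand-ins for trained ones."""
    vec = encoder(text)                        # (dim,) sentence embedding
    logit = vec @ weights + bias               # small discriminative head
    return 1.0 / (1.0 + np.exp(-logit))        # P(positive sentiment)

rng = np.random.default_rng(0)
fake_encoder = lambda s: rng.standard_normal(64)   # stand-in for a real encoder
p = sentiment_classifier("what a great episode", fake_encoder,
                         rng.standard_normal(64), 0.0)
```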
Nice. And so, we talked earlier about music, and how that's one of the more exciting areas for you. We do have some isolated cases of well-known music generation by AI. There was a song, which, candidly, I actually haven't listened to, not really my genre, featuring Drake and The Weeknd, two of Canada's best-known musical artists. It's actually wild to me, as a Canadian, how in comedy, in acting in general, and in music, Drake was the most dominant person globally in music for years, and he's Canadian, and then he's replaced by The Weeknd, who's also Canadian. Anyway, somebody took it upon themselves to generate an AI-generated track where Drake and The Weeknd appear together, and if my memory serves me, they sing about being in love with Selena Gomez, or something like that.
I'm also in your boat; I haven't listened to it, as it's not my genre either, but I think that's correct; obviously I've heard the story. And briefly, on the Canadian thing: there's a guy called David Foster who's really famous as well, a musician, so every time you google me you just come up with the Canadian David Foster, which I quite like, to be honest; I can hide behind him.
There you go: of all the musicians and all the David Fosters, you need a track together. So yeah, how can we be using transformers for music generation? I think they can play a key role in doing it well, right?
Yeah, definitely.
I think the first port of call for anybody doing any sort of generative task these days is transformers, and music's no exception. In my book we cover this. We go through the process of single-track music, where you're looking to generate a single stream of notes, and that in itself has problems, because you have to care about not only pitch but also duration. Unlike text tokens, where you're just dealing with a single integer and words come in discrete units, one at a time, in music you've got to care about not only where the note is pitched harmonically but also how long it is. So there's a modeling choice to be made there about how you do that, and there are a few ways of doing it: you can code up the duration as its own token, or you can model both streams in parallel, almost like a dual stream of tokens. But ultimately you use the same ideas that you do in text modeling, so you've still got attention, where it's looking back at previous notes and deciding what note should come next, and it would make sense harmonically if, say, you're in the key of D, to have notes that also follow in the key of D. So it's the same idea: there's a grammar to music, just like language. But then we also talk about polyphonic music, which means music that has more than one track at once, and you've got a ton of challenges there. What do you do about parts that just drop out for a few bars? How do you model it if two of the parts continue and two of the parts drop out? It's no longer one stream of tokens; you've got maybe a four-stream token if you've got a quartet of musicians. So there are different ways of approaching it. One of the first attempts was something called MuseGAN, and this was back in the day, I think, when GANs were all the rage. It was looking at how you can actually model polyphonic music as a picture. Imagine something called a piano roll, which is basically where you draw the notes out: you can imagine one of these music boxes with a punch card, where you can almost see the music being fed in as a picture, and then you spin the little crank and it makes a ballerina dance or something on top. You can see that the way in which the music is being generated is effectively a picture of the music being fed through, so you can model it that way. But obviously transformers are now making waves in music generation too, even polyphonic music. So yeah, lots of different options, but there's always a modeling choice you need to make up front as to how you approach it.
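One of those modeling choices, coding duration as its own token interleaved with pitch, can be sketched in a few lines; the vocabulary and melody here are invented for illustration:

```python
# Code pitch and duration as separate, interleaved tokens so a standard
# autoregressive transformer can treat a melody exactly like a stream of
# text tokens. Vocabulary and melody are invented for illustration.
melody = [("D4", 1.0), ("F#4", 0.5), ("A4", 0.5), ("D5", 2.0)]  # (pitch, beats)

pitch_vocab = {"D4": 0, "F#4": 1, "A4": 2, "D5": 3}
duration_vocab = {0.5: 100, 1.0: 101, 2.0: 102}   # offset keeps IDs disjoint

tokens = []
for pitch, beats in melody:
    tokens.append(pitch_vocab[pitch])       # what note
    tokens.append(duration_vocab[beats])    # how long it lasts

print(tokens)   # [0, 101, 1, 100, 2, 100, 3, 102] -- ready for next-token modeling
```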
Super cool. I loved hearing about all this music stuff; I'm really excited about it. Something that, and I think this is the first time I'm saying it publicly, I'm really excited about doing is generating music where I'll be involved. There's a guitar behind me, which people who have watched the video version of the show will have seen; the guitar's always there, and I actually can play guitar and sing. There was an episode, the year-end episode a couple of years ago, episode number 536, where I ended it with a song I played on the episode. I can't play guitar very well, but I'm competent at rhythm guitar to accompany my voice. And I have this idea for attracting really big-name guests. I'd love to have Geoff Hinton on the show; we've had emails back and forth with him, but he's always too busy. So something that I'm hoping would get his attention, or if it doesn't get his attention, I think will at least get lots of other people's attention, is performing a song about Geoff Hinton. I haven't yet experimented much with the generative AI tools for music, but I have this idea that they could enrich the songwriting process, because I could have drums and bass in the background that are AI-generated.
Cool idea. You've got to find something that rhymes with Hinton, though; I think that's the struggle.
GPT-4 will help me out there, I'm sure.
Oh, that's true. Yeah, good shout.
And then the very last technical topic for you is GANs, generative adversarial networks. We talked about them really early on in the episode. A few years ago they were the way for generating things, and so, I haven't read your first edition, but I suspect it was really heavy on GANs.
Yeah, definitely. I sort of see GANs as in many ways the trailblazer, because a lot of the techniques, and the way in which we approached generative AI, were founded through the GAN movement. There was like a GAN a week at one point, and it was kind of a running joke: which GAN are we going to be seeing this week, to do some niche thing? I hear people now saying GANs are dead, and 'I don't know why you've included them in the second edition; it should just all be about diffusion models and transformers, and that's it.' Well, first of all, the GAN discriminator is still used in so many really powerful models. The concept of a discriminator constantly operating over the top of whatever you're using as the generator, to distinguish real from fake, and using that in the loss function, is something that's still very much alive today. Take a model like VQ-GAN, the vector quantized GAN: it and variations of it are still among the most powerful image generation models out there. It's not the case that diffusion models are ruling the world just yet, and StyleGAN-XL, for example, is still incredibly powerful and dominating a lot of the leaderboards. So look, I never like to chase the latest thing and say this is it, all innovation has stopped. What I hope people can take from the book, and into their own learning, is that it's good to have a general understanding of what's come before, because you never know what's going to come next and what might come back into fashion. So GANs are a super interesting idea that I think is going to be around for a long while.
And in addition, you have a bit in your book about combining GANs with transformers.
Yes.
Basically, what I would say to anyone looking to get into generative AI is: look for the crossover between these fields. Whilst I bucket them up in the book into 'now we're doing the GAN chapter, now we're doing the transformer chapter,' what I would say in general is that a lot of the powerful models out there actually have components of all of them. Like you mentioned, there is a type of model in the book that effectively has a transformer within it in order to do part of the encoding of a piece of text, but there is a GAN discriminator in there as well. And when you're looking at a lot of these multimodal models, they've got diffusion in there, they've got GANs in there, they've got transformers in there. They're using the right tool for the job; they're not just saying, well, I'm going to use one model type and that's all I'm going to use, because a transformer is brilliant at modeling sequences, a GAN is brilliant at determining fake from real, and diffusion models are fantastic at working with very rich latent spaces that you can sample from. The best models out there use all of these techniques, and I think they will do in the future as well.
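As a sketch of that mix-and-match idea, here's how a discriminator's judgement can be folded into a generator's loss alongside a reconstruction term, in the spirit of VQ-GAN's adversarial component; all components here are untrained stand-ins, and the weighting is arbitrary:

```python
import numpy as np

def generator_loss(real, fake, discriminator, adv_weight=0.1):
    """A generator (it could be transformer- or VAE-based) trained on
    reconstruction error PLUS a GAN discriminator's judgement of realism.
    The discriminator here is an untrained stand-in."""
    recon = np.mean((real - fake) ** 2)                  # pixel-level fidelity
    realism = -np.log(discriminator(fake) + 1e-8)        # fool the critic
    return recon + adv_weight * realism

rng = np.random.default_rng(0)
fake_discriminator = lambda img: 1 / (1 + np.exp(-img.mean()))  # P(real), stub
loss = generator_loss(rng.standard_normal((8, 8)),
                      rng.standard_normal((8, 8)),
                      fake_discriminator)
```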
Nice, really exciting. What do you, briefly (as I imagine this could go on for a long time), what do you see as the future of generative AI?
Yeah, that's a huge question. Maybe I'll just break it up briefly into technological and societal. Technologically, I think we'll continue to see the field accelerate, and I don't see any need for, nor I guess a feasible way of implementing, a pause; I just don't think it's feasible to enforce something like this. So the field will continue to evolve, but I think we'll see more emphasis on the alignment side and less on raw power. I think GPT-4 is plenty powerful for us for the time being, and I don't think we're going to see GPT-5 be a huge technological improvement over 4, but I think we'll see it improve in terms of alignment, and in terms of customizability, and just the stuff that goes around productionizing a model like this, like user management, and GPT for business, which I know is coming out: all of these things that make it a viable product in the real world that we have control over. So that's one thing. And then, societally, I think what we're going to see is wide-scale adoption of these tools, and I think, like all good technology, they will be baked in to the point where you don't really know you're using it. I don't think people are going to be going into ChatGPT to type in their prompts for long; it will be baked into other tools in the market, and we're already seeing this with ChatGPT integrations into different applications, or wrappers around it. So yeah, I think the future is bright. I'm really optimistic, I'm excited by it, and I hope everyone else is too, because it's just the best thing ever to happen to the machine learning field, I think.
Yeah, as a regular listener, you'll know that I'm a technology optimist, and certainly there are issues that we need to sort out with any new technology, but really, the last few months have been mind-blowing for me. With GPT-4, still, every day I do something new with it where I'm like, I can't believe how well you do this thing.
Yeah, I don't know how you feel, but I'm amazed at the number of people who haven't heard of it yet. I know I live in my little bubble of generative AI and data science, and yet you talk to a lot of people who just go, 'oh yeah, I think that was that thing I saw in a BBC News article,' and they haven't even tried it. I feel like I'm living in a really privileged position of having access to this incredible technology before the rest of the world gets to see it. It's amazing.
Yeah, I was just at the Open Data Science Conference East in Boston, last week at the time of recording, so about a month ago at the time of this episode being published, and I gave a brand-new half-day training on NLP with LLMs. I focused a lot on GPT-4: how you can use GPT-4 to automate parts of your machine learning model development, including things like labeling, but also just how you should be using it all the time in your life. It's insane to not be paying the twenty dollars a month. I saved so much time; I was able to have so many more coding examples in that half-day training, because whenever I ran into an error, I was like, 'oh man, tell me why I'm getting this error,' and it would just fix the code, and it does it perfectly. But I had this really surprising conversation with somebody after I gave that training. He came up to me at a drinks session and said, 'what would it take for me to train something like GPT-4, but it works in Arabic?' And I was like, it does that; you're looking for something to translate into Arabic? He's like, yeah. I'm like, yeah, it does that out of the box. And then he reached out to me later on and said, 'sorry, I didn't ask that question right. I get that it can translate into Arabic, but what if I want to train an Arabic version of GPT-4 that can do everything?' And I was like, why are you messaging me? Just try it! Everything that you want to do in English you can just ask it in Arabic, and it'll output in Arabic, no problem.
Yeah, it's amazing, isn't it? And like you say, the barrier to entry is so low: just set up an account and you're away. It's even free, if you're running with 3.5 or whatever, so just give it a try. I feel like we're watching everyone walking around with candles while I'm holding a light bulb going, 'this seems really useful.'
Yeah.
So, really quickly: beyond raising your newborn daughter and writing this book, you're also the founder of, and you run, a consultancy called Applied Data Science Partners. So really briefly, tell us what the consultancy does, and I understand that you're hiring, so let us know about that too. There are probably listeners out there who have been blown away by your impressive depth of knowledge and your clear ability to explain things, and no doubt you have a thriving consulting practice, so there are probably people who would love to work with you. Let us know what roles are open and what you look for in your hires.
Sure, yeah.
So, our consultancy, Applied Data Science Partners: myself and my amazing co-founder, Ross Witeszczak, started this six years ago with the vision, really, to deliver AI and data science in a way that's practical and sustainable for businesses, because we found, at the time, a lot of the practices were still very throwaway and kind of proof-of-concept-y. So we set up the consultancy to base data science and AI practices around best-practice software engineering: containerization, continuous integration, and all of these things that you expect from software engineering, we built around data science. In terms of our client base, we work with large private institutions all around the world, but also the public sector, so we have such a broad range of work; it's something different every month, which makes it, I think, a really interesting place to work. And you're right to say we're hiring; we're always actively hiring and looking for the best people. There are a few different roles, I would say, that we hire for. Our bread and butter is data scientists, everything from people who are just finishing their degree (we look for people who are hungry to learn, hungry to get stuck in, and who really don't shy away from difficult problems, because we solve difficult problems every day for our clients) right up to leads: people who can lead projects and conceptually understand what a client wants. We've got data engineers as well; that's a different track of our business, and they work closely with our data scientists to deliver solutions in a best-practice software engineering way. And then our analysts as well: we hire people who don't necessarily have a background in what's traditionally called data science, but who are just very, very good at explaining concepts to senior stakeholders. We do hire software engineers too, web developers, people who can build applications. As I say, because our consultancy is growing so rapidly, we're hiring for all of these roles, so if you have any of those particular talents, we definitely want to hear from you. Tell us why you think you'd be a great fit for the company, because what we look for above everything are people who, first of all, are hungry to learn. Secondly, attention to detail is absolutely paramount for our business; we like people who can dive deep into a problem and not get scared by the weeds. Not everything is rosy in business consultancy: you get messy data, you get stuff that doesn't work, you have to fix problems quickly, so we're looking for people who care about the detail. And thirdly, just be a nice person. It's really easy: just be friendly, be optimistic, be positive, and you'll find that at ADSP you'll meet like-minded people who have the same attitude.
Nice, sounds awesome. So yeah, you're looking for people who are hungry to learn, who don't shy away from difficult problems, who have great attention to detail, and who are nice. And lots of great data roles there: data analysts, data scientists, data engineers. And what kind of stack do you guys use? Python, I guess?
Yeah, Python for pretty much everything that we're building.
To take you through some tools and technologies: VS Code is our IDE of choice. In terms of cloud, we're fairly agnostic; we work with what the client wants, but our recommendation would always be Azure. We work pretty heavily in that stack, so we're particularly looking for engineers who have that on their CV. In terms of machine learning models, we like to say that we use the tool that's right for the job, so we're not always going to go down the neural network route; you know, XGBoost does the job most of the time, or any of the variants, like LightGBM, etc. But obviously for some projects, especially now we're getting a lot of work through on generative AI, particularly gen AI strategy, we're using a lot more deep learning than we have done, particularly for fine-tuning, especially if we're fine-tuning open-source models. So yeah, I would say, tech-stack-wise, we don't use anything out of the ordinary, but we're also very much aligned with what the clients want, and we're kind of tech-agnostic in terms of platform. We use Tableau and Power BI, for example; if the client wants Tableau, we're not going to say you have to use Power BI, and vice versa.
That makes sense. Awesome. David, this has been a sensational episode. I have learned so much; it's been so nice to get deep in the weeds with you and hear so much about your Generative Deep Learning book. Such a fantastic book; I couldn't recommend it strongly enough. And oh, I can't believe I didn't mention this at the beginning: our listeners who have listened to this point get a bonus treat here, which is that, as we've done with O'Reilly authors on the show before, when I announce this episode on LinkedIn from my personal account... and we've had some people commenting on YouTube or on posts from the Super Data Science account; no, this is on my account on LinkedIn, because in order for this to work fairly it has to be just one post, and that's the one post that gets the most engagement each week when I announce these episodes. So when I announce this episode, which will be in the morning, New York time, usually around 8 a.m. Eastern, from my personal account on LinkedIn, the first five people who comment will get a free digital version of David's book. So Generative Deep Learning could soon be yours for free. And, something I don't know if I've mentioned enough recently: if you happen to not win that contest, that race, then you can still get a 30-day free trial of the O'Reilly platform using my special code SDSPOD23. SDSPOD23. So either way you can access the book; you just don't have it forever with that 30-day free trial. Nice. All right, and then beyond your own book, David, do you have a recommendation for us?
Yeah, actually, something recently that has really caught my eye is something called active inference. It's a concept originally, I guess, laid down by Karl Friston, one of my absolute heroes in generative modeling.
He wrote the foreword for your book.
He wrote the foreword for my book, which I'm absolutely privileged and honored to say. So, active inference, very briefly, is a way of describing how agents learn that dresses action and perception up as two sides of the same coin. It's a very elegant idea, and at the heart of it is a generative model. I'll leave that as a dangling carrot for anyone who's interested in this book, because he, along with his associates, has written a book called Active Inference, with the subtitle The Free Energy Principle in Mind, Brain, and Behavior. It's one of my absolute favorites: a very complex topic that is explained extremely eloquently. It's a very recent book as well; it was only published last year, and it's basically the book you need on active inference if you're going to start learning about this fascinating concept. It's maybe something I would recommend you read once you start getting into generative modeling, because it's a really interesting kind of theory of everything for intelligence and the mind, and, yeah, it puts the action into perception, if you like.
Wow, very exciting. And so, for people who want to hear more about your brilliant thoughts, David, what are the best ways to follow you after this episode?
There are a few ways. You can follow me on LinkedIn; that's probably the best way. You can find me on Twitter; I'm davidADSP. And by all means follow our company as well: we post loads of interesting stuff about data and AI, so if you're interested in general updates, feel free to follow Applied Data Science Partners on LinkedIn.
Nice. And you also have a podcast coming out soon, don't you?
Yeah, that's right, we're launching into the space of podcasts. We can't pretend that we're going to be anywhere near your quality of podcast initially, but we're going to be learning as we go. The podcast is called The AI Canvas, and it's a podcast that focuses primarily on generative AI and its application to people. So if you want to know how generative AI is going to impact loads of different professions: law, teaching, art, music, the creative arts, the performing arts; we've got interviews lined up with people from a ton of different professions about their fears and also their great hopes for the technology in the future, because I think it's really important we talk with everybody across the spectrum, not just those who are involved on the technical side, but specifically those who are going to be impacted by the technology. We've had a few of the conversations already, and it's blown my mind how eloquently these people are able to talk about the topic; we've had some fascinating conversations already. So do follow us on that: it's podcast.adsp.ai.
Nice. All right, David, thank you so much for taking the time today; brilliant episode, and I look forward to catching up with you again in the future, hopefully on air, so that we can sample your wisdom, which has very little noise in it.
Oh, thank you, John, it's been an absolute pleasure talking to you about such fun stuff, so thanks, and thank you again.
Boom! What a gripping and educational conversation in today's episode. David filled us in on how discriminative modeling predicts some specific label from data, while generative modeling does the inverse: it predicts data from a label. He talked about how generative modeling can output text, voices, music, images, video, software, and combinations of all of the above; how autoencoders encode information into a low-dimensional latent space and then decode it back to its full dimensionality; and how variational autoencoders constrain distributions to produce better outputs than the vanilla variety. He talked about how diffusion converts noise into a desired output, while latent diffusion, which operates on dense latent representations, is particularly effective for producing stunning photorealism, such as in Midjourney version five. He talked about how world models, that super cool concept, blend variational autoencoders together with autoregression and deep reinforcement learning to enable agents to anticipate how their actions will impact their environment. He talked about how transformers facilitate attention over long sequences, enabling them to be the powerful technique behind both natural language understanding models, like BERT architectures, and natural language generation models, like GPT architectures. Finally, he talked about how GANs, such as StyleGAN-XL, still produce state-of-the-art generated images, and how GANs show particular effectiveness when combined with transformers in multimodal generative models.
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for David's social media profiles, as well as my own, at superdatascience.com/687. That's superdatascience.com/687. If you like book recommendations, like the awesome ones we heard about in today's episode, check out the organized, tallied spreadsheet of all the book recs we've had in the nearly 700 episodes of this podcast by making your way to superdatascience.com/books. All right, thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you, and thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another profoundly interesting episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors; please consider supporting this free show by checking out our sponsors' links, which you can find in the show notes. Finally, thanks of course to you for listening all the way to the very end of the show. I hope I can continue to make episodes you enjoy for years to come. Well, until next time, my friend, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science podcast with you very soon.