711: Image, Video and 3D-Model Generation from Natural Language, with Dr. Ajay Jain

This is episode number 711 with Dr. Ajay Jain, co-founder at Genmo AI. Today's episode is brought to you by the Zerve Data Science Dev Environment, by Grafbase, the unified data layer, and by Modelbit for deploying models in seconds. Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple. Welcome back to the Super Data Science Podcast. Today we've got the brilliant machine learning researcher and entrepreneur Dr. Ajay Jain on the show. Ajay is co-founder of Genmo AI, a platform for using natural language to generate stunning, state-of-the-art images, videos, and 3D models. Prior to Genmo, he worked as a researcher on the Google Brain team in California, in the Uber Advanced Technologies Group in Toronto, and on the Applied Machine Learning team at Facebook. He holds a degree in computer science and engineering from MIT and did his PhD within the world-class Berkeley AI Research (BAIR) Lab, where he specialized in deep generative models. He's published highly influential papers at all of the most prestigious machine learning conferences, including NeurIPS, ICML, and CVPR. Today's episode is on the technical side, so it will likely appeal primarily to hands-on practitioners, but we did our best to explain concepts so that anyone who'd like to understand the state of the art in image, video, and 3D model generation can get up to speed. In this episode, Ajay details how the creative general intelligence he's developing will allow humans to express anything in natural language and get it. He also talks about how feature-length films could be created today using generative AI alone, how the stable diffusion approach to text-to-image generation differs from the generative adversarial network approach, how neural nets can represent all the aspects of a visual scene so that the scene can be rendered as desired from any perspective, why a self-driving vehicle forecasting pedestrian behavior requires similar modeling capabilities to text-to-video generation, and what he looks for in the engineers and researchers that he hires. All right, you ready for this horizon-expanding episode? Let's go. Ajay, welcome to the Super Data Science Podcast, delighted to have you here. Where are you calling in from today? I'm calling in from San Francisco, California. Nice, a popular choice for our AI entrepreneurs, for sure. And particularly unsurprising, given that you were suggested to me by an amazing recent guest, Joey Gonzalez, in episode 707. Wow, what an episode. That guy knows an unbelievable amount and acts so quickly, talks so quickly, and in such detail. A really A-plus episode. And we're grateful to him for suggesting you as a guest as well. Joey is incredible. He and I worked on some projects together while I was at Berkeley. Yeah, and so we'll get into some of that Berkeley AI research shortly, but let's start off with what you're doing today, and Genmo. So you've co-founded a generative AI startup, Genmo.ai, and in that company, you provide an approach to creative general intelligence, which isn't a term that I've heard used before, I don't think. Can you explain the vision behind Genmo and its role in advancing this creative general intelligence idea? Yeah, absolutely. Happy to talk about that. So Genmo is a startup I've been working on since December.
We wrote the first line of code on Christmas and shipped it right after my PhD. Genmo is a platform and research lab where we build the best visual generative models across different modalities, whether that's imagery, video, or 3D synthesis. And the goal of Genmo is to allow anybody to express themselves and create anything. Now, that's a huge, huge, lofty vision. How are we going to break it down? How are we going to get there? We see creativity and creative content production as a very detailed, actually really reasoning-deep process, which involves many steps, many tools, many people working together. It also involves a ton of creativity and inspiration. And so we build tools and models that kind of get out of people's way and allow them to express themselves in an intuitive interface. And so what I mean by that is that we try to make the software not the bottleneck in your content production. So let's say you want to make a feature-length film or you want to make a movie. We want to produce tools that will allow you to produce clips in a controllable fashion. We also make software for 3D synthesis. So if you want to get down into the weeds and get a little bit more control in your pipeline, actually manipulate the underlying assets going into the product, whether it's a game or a video, you can get that out of the platform too. And so Genmo is kind of, it's an all-in-one place where you can create all this visual content. We are also primarily in a research and development phase and doing a lot of work at the cutting edge of visual generative media. Very cool. And so yes, you give an example there that you'd like to be able to create, say, film clips, probably in the medium term. And I've got some questions for you about kind of big picture and kind of feature-length films later. But just to kind of give our listeners a sense of what this creative general intelligence means, I guess. So the idea is that, yeah, as you express, people should be able to ask for anything and you just provide it. I know that a lot of your initial work is with visual things, particularly 3D renderings. But do you envision that eventually the creative general intelligence that you're working towards would be able to create other kinds of modalities as well? Like maybe there would be natural language stuff or audio stuff. Yeah, yeah, absolutely. So let me break down what we mean by creative general intelligence a little bit. We see a really big opportunity in creating software and AI that can understand any type of media, whether it's visual imagery, video, audio, or text. The reason this AI needs to understand and consume that media is that it can then understand the user's intent. It can understand the visual world that we humans live in and kind of level up there. So I think a lot of the AI models that are available today, like ChatGPT, are a little bit deprived of sensory inputs. They're text-to-text models, right? They can observe this really signal-dense, meaning-rich textual format and produce it as an output. But people, we are multimodal creatures, right? Our sensory density in touch, first of all, is under-appreciated and incredibly dense. Sight and sound, taste, smell. At Genmo, we focus a little bit more on the visual, the sight side, for consumption, because there's a lot we can contribute there. Long term, these models should be able to understand many different modalities. And I think one of the interesting things is that we know how to build AI that can be very general purpose.
So with similar substrates, with similar architectures, similar learning algorithms, and similar pipelines, we can build software that can, once it's learned one modality, expand into another one, and start to consume data from there and learn how to reason over it. That's on the input side. So one part of creative general intelligence is building artificial intelligence that can take in any modality and understand it correctly. The next piece is to be able to cooperate with the user in order to create visual content. And that's where the creative element comes in here. So you've already seen that language models have a lot of capability and creativity. They can write poems and stories and jokes, and they can even understand them as well. A lot of the work we do is around really high-fidelity visual creation, and I'm sure we'll get into that. Yeah, exactly. That fidelity is key to it. And that's something that, in the past year, there's been an enormous explosion in. It was in spring of last year, Northern Hemisphere spring, I was giving a TED-format talk, and we'd had a few OpenAI guests at that time. And so I was able to get them to create some images using DALL-E 2. There wasn't public access yet, but I could provide them with a prompt, they would create an image, and I would put it in my talk. And I had them pretty small on my slides. And some of them looked pretty compelling when they were small like that. But now, just a little over a year later, we have stunning, high-resolution, photorealistic images, thanks to the kinds of approaches that you've helped bring forward. So we'll get into that in a moment as well, when we're talking about your research, this high-fidelity stuff. But yeah, let's kind of focus on Genmo a little bit more right now. What caused you, given the tremendous research experience that you have, the tremendously impactful papers that you've had, what was it on Christmas that caused you to say, you know, now's the time for me to be jumping in, creating my own AI startup? Absolutely. So I've wanted to get products out to the public for a long time. One of my original motivations for working on AI was to kind of enable new products that weren't possible before, where really the model enables a new kind of interface and a new capability. And the work we were doing in creative visual models felt extremely ripe for distribution. And actually, you know, so we got started because it felt like it was taking too long for technology to get out of the lab and into the industry. The second reason was in order to integrate across different modalities more effectively than we could do in an academic setting. And so in terms of that first bit, about it taking too long to get technology out: while it seems like generative AI is this very recent half-year, one-year revolution taking the world by storm, a lot of these advances have been developing for many years, decades really. And what I was observing in academia was it would take sometimes six months to three years before advances we made in the lab started to get into the hands of creators, where they could start actually using these models in their pipelines, right? At Genmo, we want to tightly integrate the model development in the research lab with product, enabling people to access it very quickly, on a matter of weeks. So in terms of the Christmas timeline specifically, you know, I mean, what better time to get to hack and to build the product.
So we launched with a video generation product that we had queued up, but there was a ton of engineering to do to actually scale up the systems. Yeah. And so what are the kinds of applications for this? So with Genmo, you're able to create, and people can, our listeners can go to Genmo.ai right now, try it out, right? Yeah, absolutely. You can go to Genmo.ai, sign up for a free account, and start creating images, videos, and 3D assets right away. And so what kinds of people would want to do this? Creatives, I guess? I mean, I guess it would be anybody that wants to be creating images. So in any kind of scenario where you could be using Midjourney, you could equally be using Genmo, but then Genmo is also going to be useful tapping into that general aspect of this creative intelligence. It isn't just text to 2D image. It's text to 3D as well. So I guess it's probably relatively easy for people to imagine the kinds of applications. Like for us on the podcast, we sometimes want to have a YouTube thumbnail for an episode topic. Like we recently had one, at the time of recording, on Llama 2. And so we used Midjourney to create this llama that's at a computer. And so that kind of thing is fun. But yeah, you already talked about video creation as well, which is something that is just starting. The same way that I was describing 18 months ago, when DALL-E 2 came out, it was a little bit cartoony. It was in most cases obvious that this was generated, particularly if you looked at it in a higher resolution. And now, 18 months later, everything's photorealistic. How much further are we away from that being the case with video, where it goes from being a little bit cartoony, a little bit obvious, to being really slick? So that's one question. But then beyond that, the kind of 3D renderings that you're doing, who is that useful for? Yeah, absolutely. It's been amazing how quickly we've been able to expand the quality in the visual space. And a lot of the methods haven't actually changed, to be honest. It's the stuff that data scientists are used to doing: data processing, data cleaning, really careful analysis of metrics, and the iteration loop that every ML engineer is familiar with. And in terms of the integration of video and 3D, I actually see these as very naturally coupled. We, as humans, take in raw visual sensory media, and then we're the ones who do the decomposition into different assets, right? Like a mesh is not a natural thing. A mesh is a common graphics representation of a 3D asset. It's a representation we use in the computer to store an object. But at the end of the day, that representation is used in different media forms, whether you're using it for computer-generated imagery in a movie like Dune, whether you're using it in a game, whether you're using it for industrial design like in SolidWorks. So we have all these different representations of the physical world that humans have created in order to be able to manipulate them and to be able to efficiently render them into compelling visual media. A video is, in some sense, the rawest form of visual experience, right? You know, it's easy to capture, we constantly consume it, and it's kind of like a rendering of the real world, if you will. And so there are many approaches that we could take to directly generate video, and we work on a lot of those at Genmo, and that will advance very quickly over the next few months.
But there's also, along that pathway to high-fidelity video generation, there are opportunities to give people a bunch of value by synthesizing out these interpretable assets that they can inject into their pipelines. Tired of hearing about citizen data scientists? Fed up with enterprise salespeople peddling overpriced, mouse-driven data science tools? Have you ever wanted an application that lets you explore your data with code without having to switch environments from notebooks to IDEs? Want to use the cloud without having to become a DevOps engineer? Now you can have it all with Zerve, the world's first data science development environment that was designed for coders to do visual, interactive data analysis and produce production-stable code. Start building in Zerve for free at zerve.ai. That's zerve.ai. That makes a lot of sense. So basically the kind of idea here is, probably most people have seen this kind of thing where, yeah, you have these meshes. So you can imagine some shape like a vase. And that vase has this particular, well, a vase is actually kind of simple. You could imagine maybe a pitcher is kind of a better example, because the pitcher has a handle, maybe just on one side. And so you could use Genmo to, you know, say, create me a pitcher with one handle or with two handles. And then it can create this mesh. And I kind of imagine it like, I don't know why, it's like black and green, like green lines around this black object. And so it kind of gives you the sense, even though you're looking at it in 2D, if it's rotating, it becomes very clear that this is a 3D shape. And you can see, okay, when it spins around, that one handle, you know, comes out and pops out at me and then it goes back around the left. And then, oh, here it is on the left, here it is on the right, as it spins around. So you can create these 3D assets, like the pitcher, and then a production company can use that in their pipeline for video production, where you have this, like the Dune example you gave, you know, and tons of films today, especially action films, have these 3D renderings. And so, yeah, so it can be there. Characters can walk around it and it will have this sense of being quite real because it's been 3D generated. It's not this video asset. It's this 3D object that is downstream converted into a video asset. Yeah, absolutely. And it's one of the things that we face a lot in machine learning: choosing the right representation to express the world. And so when you choose these representations, a lot of it is due to what's easy to process and what's easy to store and learn over. I try to take a tack, during my PhD and also at Genmo, to not just build the representations that are convenient for the modeler, but also the representations that people actually want. And so oftentimes, for many people, they want these assets that they can import into their existing software and manipulate, even if, you know, even if we're still early in generative AI. Yeah, yeah. And then what you were saying is that these kinds of 3D structures, they make it easier to create photorealistic video, really realistic video, because you could have a collection of these objects and then they're combined together to create a particular kind of scene. And I can see why that's, yeah, that's like a really nice pipeline. But it sounds like you are also working on kind of skipping that intermediate step.
And eventually, I guess, maybe through kind of, I guess the model will kind of have these latent representations of those just inherently, where you don't need to specifically dictate, you know, that you want this vase of this particular look to be in this particular part of the scene and do that separately. You can just ask in natural language for a scene with a vase in the corner, whatever, next to the couch, and all of those objects, they don't need to be rendered as this discrete 3D step, because somehow the latent representation of the model just handles that seamlessly. Yeah, so absolutely, we're going to be moving towards a future in which, as the model grows, as the model improves in its capabilities, it's able to learn these representations of the world itself. And so we can already see this with some of the video models that Genmo is working on. We have some things coming out, which should be out by the time of this podcast, where these video synthesis models will learn objects which are geometrically consistent. What I mean by that is, the model is synthesizing pixels directly, you know, pixels straight out as a stack of frames, but the object you're observing in those pixels appears like it's the same 3D-consistent object across the course of that video. So imagine it's a person rotating. As they rotate, you would expect certain physical properties to be preserved, such as conservation of mass, right? The amount of stuff in the frame should be about the same. If you had taken the approach today of an interpretable pipeline, which synthesizes the 3D assets, you import that 3D character asset made on Genmo into Blender, rig them, animate them, you get that conservation of mass for free, right? They obey those physical properties that we expect because they've been built into the pipeline. And this is extremely useful because it allows you as a creator to directly take the reins; the AI is just there to help you in steps of the process. As we build models, there'll be people who don't know how to take that asset, that mesh, and take it into Blender. They don't know how to rig it and animate it, but they still want to express themselves. For them, we're building these end-to-end trained video models: text in or images in, video frames out. Today, you sacrifice a bit of control by doing that, but it's a lot easier to use and over time will become the highest-fidelity option. Very cool. Yeah, that's exactly what I was imagining. Thank you for articulating it so much better than I could. And kind of a tangential question that came to me as I was thinking about this: there have been examples recently in the literature of multimodal models being able to do better in one modality because they have information from another modality. So an example of this would be a model that is trained on natural language data as well as visual data might be able to describe a visual scene better than one that doesn't have any of that visual data in training. And so, yeah, maybe talk a little bit about that and how that kind of feeds into this idea of creative general intelligence. Absolutely. I think this is a trend we see across machine learning as an industry, not even specific to visual: let's let the model learn, let's let the model learn its internal structure from data, and let's train on as much data as possible that is diverse. In particular for visual modalities, there's a lot of synergy across different formats. So training of video models is extremely expensive.
By exploiting image data, we've been able to improve the efficiency of training and sampling those models quite a lot, making it easy for us to offer it for free to the public to get started. Text data is also extremely useful. It's one mode of control, which is a big reason to use text: your users like to communicate in natural language. Another reason is it actually helps the model learn the structure of the visual world, because the visual world is being described in the training data, and that textual input categorizes all these different categories for you. In the future, in building towards creative general intelligence at Genmo, we are building a research lab that is working across all these different modalities. A lot of people ask us, why don't you focus on one? Why don't you just focus on 3D synthesis? Why don't you just focus on video generation? And I tell them, we are building it in the way we think will be the most scalable long term, which is building models which can actually understand and synthesize the full 4D world, full 4D meaning 3D and time, with natural language controls in addition to visual controls for when people want to get really down into the weeds. That makes perfect sense, and it's very exciting. So going back to a question that we touched on earlier, or we touched on this idea: it seems like we're now not far off, thanks to technologies like you're developing at Genmo, from having consistency for video clips. I don't know, maybe you can give us a little bit of insight into, is it like a few seconds, can we stretch, I don't know, what kind of time horizon can we get to now where we have consistent video, and how realistic do you think it is that maybe in our lifetimes we'll be able to have feature-length films with dialogue, sound effects, music, all of that created at the click of a button with a natural language description? So it's happening today. So a few months ago, when we got started, we had what I would call a flickergram, something that's more of a flickery style of video, trippy and psychedelic, and people love this for a lot of effects, perfect for water, gas, cloth rippling, or hair flying. And what we exposed as a product was that you would upload a picture, you'd select a region of the picture, and you'd paint it in as a video, right? So you select a region of your personal photo or your AI-generated photo, and then animate that particular region of the photo. And people could do things like add butterflies or make the water ripple, make the waterfall flow. I think there's actually a whole subreddit called Cinemagraphs that does things like this, these mostly still photos where parts of them move. And this got extremely popular, right? And this is one very low-level effect, subtle animation. We had some knobs where you could dial it up, dial up the chaos, but you could lose coherence. A couple of months after that, we released a model for coherent text-to-video generation, where you put in a caption, you get a coherent clip out, but it would only be two seconds long, low frame rate, think about eight frames per second. And so this is, you know, interesting, but it wasn't yet at the quality that's actually really truly useful for people. So people, in fact, preferred our first animation style, which could do subtle effects, but it could do them really, really well.
We're continuing to improve that technology. That's really just a quality problem, where people want coherent motion, but they also want high fidelity, and that means high resolution, high frame rate. What's coming soon from Genmo does 24 frames per second or more, really buttery-smooth video, high resolution, higher than 1280p video synthesis. It's still short clips, so about four or five seconds, but we're moving that bar up over time. And really the bar is extremely high, you know, reality is a high bar to meet. In this industry, one of the things that I love about working on generative models is that the world just sets a really high standard for how good we can get. Yeah, very cool. And then so what do you think about this idea of, in our lifetimes, having feature-length films where you could just type in, I want a podcast episode with Jon Krohn and Ajay Jain, it's got to be 75 minutes long, and they're going to talk about creative general intelligence, render it please. With the human in the loop, we can get there, you know, arguably today, probably in a couple of months much better. Because I was watching this TV show Dark. I don't know if you've heard of it. It's a slightly dystopian time-travel TV show, and just really beautiful cinematography. And I was kind of carefully observing some of the cinematographic effects. And I realized that almost all the clips in each of these TV episodes are only a few seconds, you know, max 10, maybe max 30 seconds, in a single cut. And then all these little clips are stitched together, and that actually, you know, has benefits. It allows the director to have some control over the story, allows you to zoom in on aspects that are the focus of what's happening in the narrative. 30 seconds, it's quite conceivable that Genmo is going to get there. And there's another problem that's probably on your mind, which is, okay, yes, each of these clips is 10 to 30 seconds, but they're all of the same scene, right? Like, you know, the content needs to be consistent across that film. That is a little bit of a problem for these visual generative models today. It's also part of the reason why we go for this integrated, general type of architecture, because we can imagine having a model, our same model, that produces a video clip not just conditioned on the text that you've typed in, I want, you know, a TV show about time travel, but that also conditions on the previous clips. And so at test time, just in the forward pass, this model can preserve the identities of the subjects. It knows that it's Jon, it knows that it's me, Ajay, in the past clip, so it should produce the same identity; it knows the style should be consistent, and so on. I would say we're not there yet; it would take a lot of manual work and hand-holding by the user to produce that TV show. I think that in terms of using these tools to produce clips that a human puts together with their great vision, that can happen by the end of the year, I think. Now, that's an ambitious timeline, but I think it's going to get to the point where we can start to have, you know, YouTube-level quality. As for full text in to full movie out, it's a little tough to say. I think that's actually something that I leave more to the users, to build those kinds of pipelines. Yeah, we expose the individual tools. This episode is brought to you by Grafbase.
Grafbase is the easiest way to unify, extend, and cache all your data sources via a single GraphQL API deployed to the edge closest to your web and mobile users. Grafbase also makes it effortless to turn OpenAPI or MongoDB sources into GraphQL APIs. Not only that, but the Grafbase command-line interface lets you build locally, and when deployed, each Git branch automatically creates a preview deployment API for easy testing and collaboration. That sure sounds great to me. Check Grafbase out yourself by signing up for a free account at grafbase.com. That's g-r-a-f-b-a-s-e.com. Yeah, that's obviously a much bigger stretch. There's that consistency over very long time windows and creating a good story and all of that. But I don't know, it's like, if you'd asked me that a year ago, I'd say maybe it's not possible in our lifetime. Today, you ask me that, I'm like, that's going to happen. Like, I don't know how long it's going to take, but 10 years seems like a long time in AI terms now. Oh, yeah. So, yeah, very, very cool. So, a lot of your work today is inspired by research that you've done in the past. You're prolific for how young your career is and how young you are; things like your h-index, so this marker of how many citations you have across how many papers you have, is very high. And so, perhaps the most well known of all of your papers so far is on denoising diffusion probabilistic models, DDPM. That came out in 2020, and this paper laid the foundation for modern diffusion models, including DALL-E 2 and the first text-to-3D models, like the kinds of things that you've been talking about so far at Genmo. So, can you explain DDPM, these denoising diffusion probabilistic models, and how they relate to other kinds of approaches, like generative adversarial networks, GANs, which were the way that people were generating images and videos primarily until recently? And perhaps, I realize I'm asking lots of questions here, but perhaps most importantly, tie all of this into stable diffusion, which many of our listeners are probably aware of (in Midjourney, that's the approach that they use), and how it relates to your own work. Yeah, I'm happy to talk about the DDPM project. Let me give a little bit of context to lay out the landscape. So, this was a paper that I worked on in 2019 and 2020, early in my PhD at Berkeley. I did my PhD with Pieter Abbeel, who I believe has been on this podcast before. And I was very excited about the promise of generative models, but it felt like we had an incomplete picture, an incomplete set of tools. It would take a lot of hand-holding, a lot of really careful tuning and data curation, to even get something to barely work in visual generative media. Let me give an anecdote. So, there was a project I was working on for in-painting. And the idea here was taking models like GPT that generate one word at a time, and instead training them to generate one pixel at a time in an image. If you did that kind of thing, these models would only work at low resolution, and secondly, they wouldn't be able to edit. They wouldn't be able to do something like replace, you know, some object in an image. So, let's say you have that vase with a handle on a table, and you want to remove it, clean up the background. So, even if we could train these visual models to generate one pixel at a time, they would work at, let's say, 32 by 32 pixels, little tiny, tiny images. But we wouldn't be able to use them for manipulations.
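For readers who want to see what "one pixel at a time" looks like in practice, here is a toy sketch of that kind of autoregressive sampler. The tiny LSTM is purely a stand-in for the architectures Ajay experimented with; the point is only the shape of the loop: one network call per pixel, so doubling the side length quadruples the number of sampling steps.

```python
# Toy illustration of GPT-style image generation: sample one pixel at a time,
# each conditioned on all previously generated pixels (via the RNN state).
import torch
import torch.nn as nn

H = W = 32          # 32x32 images, the scale Ajay mentions for this approach
NUM_LEVELS = 256    # 8-bit pixel intensities treated as discrete tokens

class TinyPixelLSTM(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(NUM_LEVELS + 1, hidden)  # +1 for a start token
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, NUM_LEVELS)

    def forward(self, tokens, state=None):
        h, state = self.rnn(self.embed(tokens), state)
        return self.head(h), state

@torch.no_grad()
def sample_image(model):
    tokens = torch.full((1, 1), NUM_LEVELS)   # start token
    state, pixels = None, []
    for _ in range(H * W):                    # one step per pixel: O(H*W) steps
        logits, state = model(tokens, state)
        nxt = torch.multinomial(logits[:, -1].softmax(-1), 1)
        pixels.append(nxt.item())
        tokens = nxt                          # feed the sampled pixel back in
    return torch.tensor(pixels).view(H, W)

img = sample_image(TinyPixelLSTM())
print(img.shape)  # torch.Size([32, 32]); doubling H and W quadruples the loop length
```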
So, I did a project in that space where we tried to scale it up and we tried to make these models more flexible at editing. We succeeded to some extent at both of those. But there seemed to be a hard limit where, you know, to get to interesting levels of resolution, like 256 by 256 pixel images, still pretty low resolution, but you can make out what's in the image, the model would take many minutes to sample an image, because it's going one pixel at a time. You can kind of imagine, as you scale up the resolution, you're quadratically expanding the amount of time it takes to generate that image. So, compute became a problem. The second problem with using GPT-style models to generate images is that they would kind of become unstable and stop generating coherent content somewhere along the line. So, if you're sampling thousands, tens of thousands, hundreds of thousands of pixels, the errors would accumulate: you would screw up one pixel, the model would sample something wrong. It's supposed to sample a blue pixel, but all of a sudden it samples a white pixel. That error would start to propagate. And the model would soon start losing the capability to understand what was going on and produce more of like a blurry mess. In addressing that computational challenge of tens of thousands of pixels, we asked the very obvious question of, can't we just generate multiple pixels simultaneously? GANs, the generative adversarial networks, were able to do that. We started to work on Markov chain Monte Carlo, or MCMC, samplers that could in parallel synthesize all of these pixels. That project started to evolve over time into the denoising diffusion probabilistic models paper. All right, nice. So, yeah. So, you're tackling this problem. You're realizing that maybe by predicting multiple pixels at once, you're going to get better results than a single pixel at a time. You realize that Markov chain Monte Carlo, MCMC sampling, can be effective for that. So, how does that then go on to lead to this kind of stable diffusion approach? Yeah. So, once you start to sample multiple pixels at a time, the landscape is actually wide open for the architectures you can use. The GPT-style models, these ones that generate one token, one pixel at a time, have some constraints in them. They are only allowed to look at the past, because if the model could look at the future, the rest of the pixels, the rest of the sentence, it would be able to cheat trivially and just repeat its input. And so, you have to constrain the architecture. With DDPM, we realized that we could train a model which can see all the pixels simultaneously. But now, there's another problem: how do we actually synthesize an image? So, with an autoregressive model, it's clear: you generate one pixel at a time. You look at the past ones, you generate the next one. In DDPM, we have a slightly different architecture where you start with pure noise. So, the image is all noise. And the question we asked is, if we learn a denoising autoencoder, an autoencoder that can look at this noisy image and strip out a little bit of noise, people use this for video processing and fine-grained noise removal, is that denoising autoencoder actually capable of generating an image, not just removing a little bit of noise? Turns out the answer is yes. We found this, my collaborator, Jonathan Ho, found this project from Jascha Sohl-Dickstein from 2015, called the diffusion probabilistic models paper.
That builds a Markov chain that maps from Gaussian noise, then iteratively denoises it in order to produce a clean sample. There were a lot of architectural limitations in that diffusion probabilistic models paper from Jascha, that early work in 2015; it didn't produce that high-quality samples. So, in the DDPM project, we ended up making a lot of improvements to that framework, greatly simplifying it, changing the way we parameterize the model, coming up with a new neural network architecture that worked a lot better. That neural network architecture is actually extremely similar to the architecture that stable diffusion uses. One of the key things we did was reweighting the loss function. And so, not to get into too much technical detail, but essentially, when you do this denoising process, it turns out that most of the things that people care about are high-level structure. What is the object? That there's an object kind of in the middle of the image. I'm looking at you right now, Jon, on the screen. There's a person in the middle of the image, there's a guitar in the background. These are high-level semantic concepts. Those are most salient to people. And so, part of what we did was reweight our loss so that it would focus the model on learning these high-level things: denoising really high noise levels. So, images that have extremely large amounts of noise in them, learning how to remove that noise. You might ask, removing noise from an image, how does that allow you to synthesize an image? The answer is, at sufficiently high noise levels, you actually need to fill in the missing content in order to remove the noise. So, let's say you have an image that is so noisy, you can barely make out what it is. You can kind of make out that there's a circle in the middle of the image. If you're able to actually strip out that noise, you're forced to learn what that circle might actually be, that it might be a face. So, this formed the foundation for DDPM, this insight that learning how to remove noise from an image could allow us to synthesize an image. A lot of that project was improving the architecture to make it actually happen, to allow the model to learn this. Because in order to learn to denoise, you need to learn everything about what an image can look like. And that's hard to pack into a neural network. One of the interesting things that happened with that project is that it turned out to be a very stable loss function in the end. All you do is, as you take an image during training, you add noise and you learn how to remove that noise.
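To make that training loop concrete, here is a minimal sketch of a DDPM-style training step, assuming a linear noise schedule and using a tiny convolutional net as a placeholder for the real U-Net. A real model would also take the noise level t as an input; that conditioning is omitted here for brevity.

```python
# Minimal sketch of the training step described above: corrupt a clean image
# with noise at a randomly chosen level, and train a denoising network to
# predict that noise (the "simple", reweighted objective).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule (assumed linear)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal fractions

denoiser = nn.Sequential(                         # placeholder for a U-Net
    nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(), nn.Conv2d(64, 3, 3, padding=1)
)
opt = torch.optim.Adam(denoiser.parameters(), lr=2e-4)

def train_step(x0):
    """x0: batch of clean images in [-1, 1], shape (B, 3, H, W)."""
    t = torch.randint(0, T, (x0.shape[0],))               # random noise level per image
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)                            # the noise we add...
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps    # ...to get the noisy image
    loss = ((denoiser(x_t) - eps) ** 2).mean()            # predict the noise itself
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.rand(4, 3, 32, 32) * 2 - 1))
```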
Nice. Yeah, so this makes a huge amount of sense, this idea of iteratively removing noise while using a cost function that prioritizes the kind of salient objects in the image. And I'm wondering, this is generally something that I know nothing about, I'm mostly working in the natural language world, so my machine vision stuff is relatively weak, does this kind of stability help reduce hallucinations? Or is that something that is a big issue in machine vision? Yeah, so I think hallucinations are definitely a problem in visual synthesis. They take a slightly different form. So text in natural language is really dense in meaning. So there are stylistic things: that it's coherent in English, it abides by the grammar, right? The analogous low-level details visually are that objects aren't at the very ends of the color spectrum, super saturated, that things change smoothly, images change smoothly. There are sharp gradients around object boundaries, and those boundaries are smooth, like a curve or a line. These are low-level things akin to the grammar of language. These higher-level semantic concepts are what we start to notice when we talk about hallucinations. When a language model hallucinates, it's saying things that are coherent in English. It's perfect text, but it's incorrect. And it's just made up, right? Well, with a visual generative model, we have exactly the same problem. Oftentimes it's actually a boon. A hallucination is a good thing, because it allows people to create stuff that has never existed, and never could exist in the real world. So fantastical combinations of two characters. You see people do mashups of animals and vehicles, so like a furry car, or a bat swimming, or things like that. People with their hair on fire to create incredible artistic effects. So these are hallucinations in some sense, in that they don't exist, but they are what people want a lot of the time. So we need to build general-purpose models that are able to generalize to those different semantic concepts. Another form of what might be called hallucination is when the model is getting some of these semantic ideas wrong, getting some of that grammar wrong. So let's say, one thing we saw when we scaled up to high-resolution faces is that the model would generate an eye with one color, like blue, and another eye with a different color, like green. And this can happen in reality. But in the vast majority of the training data, people have the same color eyes. And so the model is actually underfitting that distribution and hallucinating something that shouldn't be correct. Very cool. I like that analogy between the grammar of natural language and the details of visual imagery. That makes a lot of sense to me at an intuitive level. Awesome. Yeah. All the stable diffusion stuff is obviously making a huge impact in your own work, as well as in these kinds of popular tools like Midjourney. And it's wild to see how this is going to continue to get refined with the kinds of things that you're doing: these 3D representations, longer video. It's a really exciting area that you're working in. It must be really satisfying for you to kind of have a new architecture, and you and your team probably hit your heads on the wall for weeks or months at a time, sometimes trying to get something to work, and then you finally crack it and you're like, wow, you get these stunning results, this whole new level of visual capability. Yeah, you're really working in an exciting nexus of AI. Yeah, it's been fascinating. And seeing this explosion of creativity that's been enabled by these models has been extremely satisfying; I'm very happy about it. Let me give one anecdote about this. So once we had identified some of these architectural building blocks that were really scalable, things were actually very easy to extend and to build in new settings. So here's an anecdote. For that DDPM project, we built a model that could synthesize faces on the celebrity face dataset. It generalized to more in-the-wild faces out of the box. These are things that GANs could do, though. GANs like StyleGAN were able to synthesize high-fidelity faces, even higher fidelity than we were achieving. However, what they couldn't do is be transferred out of the box to a new setting. They needed months of engineer time, tuning the hyperparameters, calibrating it.
They're extremely unstable. What we did is we took that same model, same architecture, same loss, same hyperparameters, and just swapped the dataset from 30,000 face images to 1.5 million images of different settings. Cats was one of the datasets. Another dataset was churches. And another really interesting one was a roughly million-image dataset of bedrooms, LSUN Bedrooms, so the interiors of people's bedrooms. Deploying machine learning models into production doesn't need to require hours of engineering effort or complex homegrown solutions. In fact, data scientists may now not need engineering help at all. With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy in your notebook and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com. That's m-o-d-e-l-b-i-t.com. I've seen that even from years ago with relatively early GANs, in the first year or two of GANs. I remember that dataset as being a really mind-blowing moment for me, because you could show how you can move through the latent space of the GAN that was creating these bedrooms, and it would slowly shift the perspective on the bedroom or slowly change the color of the bedroom wall. That dataset was really important for me in realizing how crazy things were becoming in AI. That's awesome. Yes, these latent space interpolations are so fascinating. They reveal some of the interior structure of the model. And I think it's also very humanistic, because an interpolation is the first step to a video in some sense. A video is a very particular kind of interpolation, right, if you want to look at it that way. It's like an interpolation over time. Yeah, an interpolation over time, where each frame is similar to the last one, but they change in a particular structured way. And so those bedroom interpolations that you probably looked at, they don't change in these physically preserving ways. The conservation of mass is definitely not there. One of the interesting things about the LSUN dataset is that the model started to generate photos of art on the walls. So we generate these pictures of someone's bedroom, right? And people have art on their bedroom walls that they purchase online and hang up. And so you could actually see in the model's samples there would be little pieces of artwork just hanging on the wall of someone's bedroom, sampled by the model. And this bedroom doesn't exist. That art doesn't exist. But it would be there in the image nonetheless. And the second thing that's really interesting is that that just worked out of the box, right? Same hyperparameters. It worked well enough. I did a little extra tuning. I doubled the model parameter count after we submitted the paper. So then in our rebuttal, when the reviewers came back with their critical feedback, we could say, oh, actually, we are now better than the GANs; we forgot to include that experiment. Nice. Yeah, very cool. Yeah, that makes a lot of sense. That anecdote is super helpful for me to be able to understand how this kind of technology allows you to move more quickly than folks who are working on GANs, where, yeah, the hyperparameters on that are really tricky.
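As an aside, the latent-space walks described above come down to interpolating between two latent codes and decoding each intermediate point. Here is a tiny sketch, with a placeholder network standing in for a trained GAN generator (or whatever decoder you are exploring); spherical interpolation is one common choice, not the only one.

```python
# A tiny sketch of a latent-space walk: pick two latent codes, interpolate
# between them, and decode each intermediate point with the generator.
import torch
import torch.nn as nn

latent_dim = 128
generator = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())  # stand-in

def slerp(z0, z1, t):
    """Spherical interpolation, which tends to stay on the latent prior's shell."""
    omega = torch.arccos(torch.clamp(
        torch.dot(z0 / z0.norm(), z1 / z1.norm()), -1.0, 1.0))
    return (torch.sin((1 - t) * omega) * z0 + torch.sin(t * omega) * z1) / torch.sin(omega)

z0, z1 = torch.randn(latent_dim), torch.randn(latent_dim)
frames = [generator(slerp(z0, z1, t)).view(3, 64, 64)       # e.g. bedroom -> bedroom
          for t in torch.linspace(0, 1, steps=8)]
print(len(frames), frames[0].shape)
```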
Because really quickly, for our listeners, when you have a generative adversarial network, the adversaries are two different neural networks. One of them is this discriminator that's judging the quality of the work: you basically have your real image dataset, and then you have this generative neural network that creates fake images, and that discriminator is tasked with trying to figure out which ones are the real images and which ones are the fake ones. And then you backpropagate through both of them together, but you only change the weights in the generator, so that your generator starts to be able to figure out what kinds of images it needs to create to fool the discriminator into thinking that they are real. And getting the hyperparameters right, like the learning rates of that discriminator network versus the generator network, is really tricky. Yeah, I've certainly fought with those myself. I've done GANs stuff. I haven't done these more stable approaches that you're working on.
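Here is a minimal sketch of that adversarial training loop, with toy stand-in networks rather than any production architecture; the two separate learning rates are exactly the knobs that make this balance so finicky.

```python
# Minimal sketch of GAN training: the discriminator learns to tell real from
# generated images, and the generator is updated (backpropagating through the
# discriminator, but only changing the generator's weights) to fool it.
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 32 * 32
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)   # generator learning rate
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)   # discriminator learning rate
bce = nn.BCEWithLogitsLoss()

def gan_step(real):                                  # real: (B, img_dim) in [-1, 1]
    b = real.shape[0]
    fake = G(torch.randn(b, latent_dim))

    # 1) Discriminator step: real images should score 1, generated images 0.
    d_loss = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator step: push the discriminator's verdict on fakes toward "real",
    #    updating only G's weights.
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

print(gan_step(torch.rand(8, img_dim) * 2 - 1))
```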
So in the same year that you published this denoising diffusion probabilistic models paper, this DDPM paper, you and your brother, Paras Jain, co-authored a paper called Contrastive Code Representation Learning. And that's another big paper of yours. How does that paper relate to everything that we've been talking about so far? Yeah, so this was harking back to some of my research origins, where I used to work on compilers. By chance in undergrad I found a professor working on compilers and started working in that area, and I got very interested in performance engineering. Then I got to Berkeley and I was working on these language models and visual synthesis models. And I realized that code was a ripe area where, you know, if we could learn neural networks that could understand code, we would be able to use them all over our developer pipeline: you know, fix bugs for us, detect issues, write types automatically, summarize complicated, hairy codebases, because researchers don't write the cleanest code, so if you could get an AI to summarize some of that, it would be helpful. And so ContraCode was a step in this direction, kind of a new way to learn representations, neural representations, of software. It was a very small community at the time. This area has exploded in popularity with the advent of Copilot and OpenAI Codex, models that can synthesize code. But it's a little bit of an old field with a lot of prior work. And so ContraCode, this project I worked on with Paras, who is also my co-founder by the way, was a method that could represent code in a way that's more robust than past approaches. By robustness, what I mean is that, let's say you take a function, right, that needs to do something like sort an array. There are many different ways to implement that same function. That function can be implemented with different algorithms. It can be implemented with different variable names. It can be implemented with comments or without comments. But for the purposes of a neural network, these should have the same functional output, right? At the end of the day, they implement the same functionality. And so ContraCode was a method that was structured to learn the same representation, regardless of how you implemented the function, as long as it had the same functional output. In doing that, we were able to demonstrate a lot more robustness against little changes that people would make, little changes that would otherwise completely change the behavior of your learning algorithm. Very cool. And so, yeah, an early effort at these kinds of code suggestion tools that now are abundant and make it so much easier and so much more fun to write code. Like, it's been game changing for me. I can do things so much more quickly, particularly with older libraries that are already embedded into the model weights of, like, GPT-4. And it's just so easy to be like, I got an error, just please fix it. And you actually do get really helpful learning steps along the way. So you can really now just dive in, like, you're like, oh, cool, there's this package that I want to learn how to use, I've got this project that, you know, would allow me to learn that package, and you can just dive right into it. You have this instructor that is able to really help you out, and in a really friendly way. I love how friendly GPT-4 is with me. When I make mistakes, it's like, oh, I can see why you did that, good effort, if you just make this change. It's tough to work with people, right? I think OpenAI has done a great job calibrating some of the expectations of the model against people's expectations. Yeah, yeah, for sure. It's so friendly. But yeah, related to this, how do you think about these kinds of code-reading, or I guess even natural-language-reading, models? What are the implications for security, or security analysis, or the potential vulnerabilities in adversarial settings? There are tons of vulnerabilities. I mean, you can dramatically affect the behavior of your code model or your language model by changing the phrasing of your question or by introducing extra tokens and text into the context. And it's not always clear that scaling the model addresses all these vulnerabilities. Some of them are pretty fundamental, in that there will always be avenues to attack and affect the behavior of the model. If it wasn't possible to affect the behavior of the model, it wouldn't be able to understand what you write. Because fundamentally, the rest of your codebase or the rest of your docs expresses some meaning that the model needs to understand. So it has to be sensitive to that. However, there are certain classes of changes that we can make our models more robust to, things like expressing the same functionality differently, or subtle perturbations to the data. We can make our models more robust to that, little by little. Some of this is through data, by training on more diverse data. Some of it is architectural, in cases where you can make it a little bit more robust. I think a lot of the work we did in ContraCode on loss function changes that would make the model more robust turns out to not really be required at scale: by scaling up your data, a simpler loss function can learn similar things, because by seeing 100 different implementations of merge sort, the model will automatically learn similar representations even with the GPT objective. If you have a smaller dataset, or you want to fine-tune your model and specialize it to a new circumstance, like your enterprise's data, objectives like ContraCode's become pretty useful, in that you can handle having a much smaller dataset and still retain that robustness.
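Here is a hedged sketch of the idea as described above: produce functionality-preserving variants of a program (the real work uses compiler-based transforms; the identifier renamer below is a trivial stand-in) and train an encoder so variants of the same program land near each other in representation space. The InfoNCE-style loss is a generic contrastive objective, not necessarily the paper's exact formulation.

```python
# Sketch of a ContraCode-style setup: variants of the same program are
# positives, everything else in the batch is a negative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment(code: str, i: int) -> str:
    """Toy 'compiler pass': rename an identifier without changing behavior."""
    return code.replace("arr", f"xs{i}")

class ByteEncoder(nn.Module):
    """Stand-in code encoder: embeds raw bytes and mean-pools them."""
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Embedding(256, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, programs):
        reps = []
        for p in programs:
            tokens = torch.tensor(list(p.encode())[:256])
            reps.append(self.embed(tokens).mean(0))
        return F.normalize(self.proj(torch.stack(reps)), dim=-1)

def contrastive_loss(z1, z2, temperature=0.07):
    """Pull matched variants together, push mismatched programs apart."""
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.shape[0])
    return F.cross_entropy(logits, targets)

programs = ["def sort(arr): return sorted(arr)", "def total(arr): return sum(arr)"]
enc = ByteEncoder()
z1 = enc([augment(p, 1) for p in programs])
z2 = enc([augment(p, 2) for p in programs])
print(contrastive_loss(z1, z2).item())
```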
Yeah, using it like that, are there people using this commercially now, or are there open-source implementations that are easy for people to use if they want to be doing that kind of fine-tuning on their own enterprise data with ContraCode? Can they do that today? So we have an implementation of ContraCode. It's research-ware, so we haven't pushed to that repo in a couple of years, but it's on GitHub. It's possible that people are using some of these ideas, but I don't personally know. One of the core things there is a data augmentation approach for programming languages, where we do data augmentation by recompiling the code into a different format automatically. So if anything is being used out of it, I would imagine something similar could be used for code language models in industry, where you take your dataset, which, even though it's a small fine-tuning dataset, there are off-the-shelf compiler tools which will rewrite that code in 100 or 1,000 different ways, and you could train your model on that augmentation set. Very cool. It's amazing how many different areas of generative modeling you've touched on in the relatively few number of years that you've been doing research. It's wild. Another area that is super fascinating to me, and this is something that you were doing before Berkeley, is at Google Brain you worked on generative neural radiance fields. So the short form for that is NeRF, like the Nerf guns: capital N, lowercase e, capital R, capital F, neural radiance fields. And so these generative neural radiance fields, generative NeRFs, allow 3D scene representation, which obviously we've talked about a lot already in this episode. How did the NeRF stuff lead to the kind of 3D object generation stuff that you're doing today? Yeah, I'd love to hear a bit more about that. Let me give a little bit of background on NeRF. So NeRF is a Berkeley homegrown paper that came out around the same time as our DDPM paper. What a NeRF is, is a representation of a 3D scene in a neural network's weights. So with a normal neural network, you picture a model which can take in data and then output some predictions and generalize to new data. During training time, it's trained on certain input-output pairs, and then it can generalize at test time. A NeRF is actually extremely overfit. It's basically like a JPEG, where you take some neural network representation of a scene and you pack the visual content for a particular scene in the world into that network. So you can sort of imagine, let's say you have this 3D environment, a room, an indoor house. You can represent that with a bunch of photos. You can represent it with these meshes that I talked about; the meshes are a human-made representation. A NeRF would automatically learn a representation of that scene that stores the colors, the wall is white, the guitar is yellow, but it also stores some elements of geometry: at this coordinate XYZ, there's a lot of matter, there's something here that's absorbing light; this other area, this other XYZ, is free space. So really it's a look-up function mapping from XYZ coordinates in space to color and to density, telling you the amount of matter in that space. That's what a NeRF is. I think of a NeRF kind of like that JPEG. It's just a representation of a 3D scene that allows you to interpolate really nicely to new camera poses. So unlike a JPEG that expresses a single perspective, with a NeRF, if you train it, you can now render it from a different perspective and interpolate between the different input images.
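Here is a tiny sketch of the "look-up function" described above: an MLP mapping an (x, y, z) coordinate to a color and a density, plus a heavily simplified volume-rendering pass along a single camera ray. Real NeRFs add positional encoding and view-direction inputs, both omitted here, and this is an illustration rather than the original paper's implementation.

```python
# Toy NeRF-style scene representation: query points along a camera ray and
# alpha-composite their colors, weighted by accumulated density.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # 3 color channels + 1 density
        )

    def forward(self, xyz):
        out = self.mlp(xyz)
        rgb = torch.sigmoid(out[..., :3])      # color at this point in space
        sigma = torch.relu(out[..., 3:])       # density: "how much matter is here"
        return rgb, sigma

def render_ray(field, origin, direction, n_samples=64, near=0.1, far=4.0):
    """Composite colors along one ray to get a single pixel's color."""
    ts = torch.linspace(near, far, n_samples)
    points = origin + ts[:, None] * direction             # sample points along the ray
    rgb, sigma = field(points)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)   # opacity of each segment
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha + 1e-10])[:-1], dim=0)
    weights = trans * alpha                                # each point's contribution
    return (weights[:, None] * rgb).sum(dim=0)             # final pixel color

field = TinyNeRF()
print(render_ray(field, torch.zeros(3), torch.tensor([0.0, 0.0, 1.0])))
```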
And so how does this connect to generative media? Like, how does that help you create? I mean, there's obviously a kind of connection there. So you're using a neural network to store information about a scene, so that regardless of where in the scene or what angle in the scene you want to render from, all the information that's needed is there in the NeRF representation. And so it's actually pretty obvious to me how that's useful for the kinds of applications that we were talking about much earlier in the episode, where if you want to be rendering scenes for a film or you want to be rendering 2D images of some 3D space, this is going to allow you to do that. I guess my question is, how does that NeRF work relate to the kind of work you've been doing more recently at Genmo, for example, or at Berkeley? I suspect that there's some kind of connection, some kind of continuity and improvement over the years. Yeah, so I think all of this is connected to this vision of creative general intelligence. These are different instantiations of what I see as that general-purpose creative model. Some of that model learns how images work. You know, by denoising images, it learns what's in the content of visual worlds, but it doesn't know anything about motion. It doesn't explicitly know anything about 3D geometry. We also trained models on video. Those models know more about geometry, because they see objects moving, they see cameras moving. They know about how objects move, so they learn some interesting things. Then we develop a lot of algorithms that can take these general-purpose visual priors, which have learned how the world looks and how the world works at an abstract level, and distill them down into something low level, like a NeRF. I call a NeRF low level because, again, it's just storing the contents of the scene. So we need these really powerful generative models that learn how the world works, and then a powerful algorithm that we develop at Genmo to distill these visual priors into this interpretable 3D representation. Right. And so it's kind of like a post-processing step to take this foundation model you are developing and then extract out not an image, but a 3D scene. Very cool. That's awesome. So yeah, so they're related in the sense that we're still talking about, yeah, reconstruction of a real-world scene, but the NeRF stuff doesn't actually generate. It's a map for regenerating something that has already been conceived. But with the Genmo stuff that you're working on today, Genmo could equally output pixels or this NeRF representation, and that NeRF representation would have much more flexibility, in the sense that somebody could take that NeRF representation and render it however they like. And it actually kind of ties in, maybe, with that idea that you were talking about, if we wanted to have a bunch of different shots in the same scene in a film or TV show, and you want to have that consistency scene over scene; this kind of NeRF representation could be perfect for that. Yes. Yes, absolutely. By synthesizing out a sample that's 3D consistent, you're 3D consistent by default when you go ahead and render a camera trajectory, and a camera trajectory rendered out is a video. Now, there's still a gap here, I think, when we come to video and motion, where these NeRFs are static. They don't actually express motion. But this is some of the things we work on at Genmo. We build a foundation model, and then for different customers that want a particular format, we build these algorithms that can extract out that really high-fidelity version.
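One way to picture "distilling a 2D visual prior into a 3D representation" is the schematic loop below: render the scene from a random camera, corrupt the rendering with noise, ask a frozen image-denoising prior how it would correct it, and nudge the scene parameters in that direction. Both the renderer and the prior here are placeholders, and this reflects one published style of distillation, not necessarily Genmo's actual algorithm.

```python
# Schematic, heavily simplified sketch of distilling a frozen 2D denoising
# prior into 3D scene parameters.
import torch
import torch.nn as nn

scene = nn.Parameter(torch.randn(3, 64, 64) * 0.01)    # stand-in for NeRF parameters
prior = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),
                      nn.Conv2d(32, 3, 3, padding=1))  # stand-in frozen denoiser
for p in prior.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam([scene], lr=1e-2)

def render(scene_params, camera_seed):
    """Placeholder differentiable renderer: a real system would render the
    scene from the sampled camera pose; here we just jitter the parameters."""
    g = torch.Generator().manual_seed(camera_seed)
    return scene_params + 0.01 * torch.randn(scene_params.shape, generator=g)

for step in range(100):
    image = render(scene, camera_seed=step)                 # random viewpoint
    noise = torch.randn_like(image)
    noisy = image + 0.5 * noise                             # corrupt the rendering
    predicted_noise = prior(noisy.unsqueeze(0)).squeeze(0)  # prior's guess at the noise
    # Score-distillation-style update: move the rendering so the prior's
    # correction shrinks, i.e. the prior finds the view more plausible.
    grad = (predicted_noise - noise).detach()
    loss = (grad * image).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(scene.abs().mean().item())
```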
We build a foundation model, and then for different customers that want a particular format, we build these algorithms that can extract out that really high-fidelity version. Very cool. And so, going even further back into your career history, all the way back to five years ago, which seems like forever in AI time, you were working at Uber. So I guess that was an internship. Yes. But you were there for a while; it was like a nine-month internship or something. And so you were working in their Advanced Technologies Group, ATG. You were a machine learning researcher there working on self-driving cars, and specifically, you were forecasting pedestrian behaviors. So this is super cool, and it's obvious to me why this is so important. If you want to have a self-driving car, you need to be able to predict: if there's somebody walking on the sidewalk, and they're walking parallel to the sidewalk, in the same direction as the sidewalk, that's a very different kind of signal from somebody standing on the sidewalk and walking into the road. And without having spent any time on this kind of self-driving problem myself, it seems intuitive to me that that kind of recognition matters: having an AI system that can recognize that, notice it in advance, and say, okay, that pedestrian is 200 yards away, but that behavior of stepping into the road is a very different kind of signal from the dozens of other pedestrians who are just walking along the sidewalk. Yeah, this is really a proving ground for developing these really capable foundation models that know how people behave, the same kind of capability you need to synthesize video. At Uber ATG, we weren't interested in synthesizing pixels, but we were interested in forecasting behavior, and the reason you need to forecast behavior is to be able to plan. There are multiple steps in a self-driving pipeline: there are the sensory inputs; there's perception of how the world is at this instant, where all the objects are, what their spatial relationships are, what that object is; then there's the problem of forecasting how those objects are going to interact and behave, whether those objects are static, like a traffic light, or a car, or people interacting with each other. Once you have this complete picture of the past, the present, and the future, now you can run planning algorithms and the robotics stack to try to plan a safe trajectory for your vehicle to navigate that world. But predicting the future is really critical, right? I think people do this inherently. When we walk down the street and you see someone coming towards you, you don't want to bump into them, so you need to be able to predict how they're going to behave, because if you could assume that they were just going to stand still, there'd be no problem. We built generative models, honestly not that different from GPT, that could predict how people will behave over the course of time: whether they're going to stay on the sidewalk, whether they're going to turn to avoid you, whether they're going to cross the street. This was a really interesting problem space that got me hooked on this problem of forecasting the future and learning generative models of the world. It still remains to this day a very important research area and one of the core problems in self-driving. No doubt.
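To sketch what a GPT-like behavior forecaster could look like in its simplest form, here is a toy autoregressive trajectory model in PyTorch: a GRU encodes a pedestrian's observed (x, y) positions and then rolls out predicted future positions step by step. This is purely illustrative, under assumptions I'm adding (point forecasts, no map or multi-agent context, made-up tensor shapes); it is not Uber ATG's actual model, which would predict multi-modal distributions and condition on much richer scene information.

```python
import torch
import torch.nn as nn

class TrajectoryForecaster(nn.Module):
    """Toy autoregressive forecaster: past (x, y) positions -> future positions."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.decoder_cell = nn.GRUCell(input_size=2, hidden_size=hidden)
        self.head = nn.Linear(hidden, 2)  # predicts the next (x, y) displacement

    def forward(self, past: torch.Tensor, horizon: int = 12):
        # past: (batch, T_obs, 2) observed positions, e.g. 8 frames of history per pedestrian
        _, h = self.encoder(past)            # summarize the observed history
        h = h.squeeze(0)
        pos = past[:, -1, :]                 # start the rollout from the last observed position
        preds = []
        for _ in range(horizon):             # autoregressive rollout, GPT-style but over positions
            h = self.decoder_cell(pos, h)
            pos = pos + self.head(h)         # predict a step and add it to the current position
            preds.append(pos)
        return torch.stack(preds, dim=1)     # (batch, horizon, 2)

# Example: 8 observed frames for 4 pedestrians, forecast 12 future frames.
model = TrajectoryForecaster()
history = torch.randn(4, 8, 2)
future = model(history)
print(future.shape)  # torch.Size([4, 12, 2])
```

Trained with a regression or likelihood loss on logged trajectories, a model of this general shape answers exactly the question in the conversation: given what this person has been doing, where are they likely to be over the next few seconds?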
So when we think about this, when you're thinking about pedestrian behaviors, does that involve, yeah, I guess I actually know, I just kind of answered my own question, because I was thinking, does that include if they're in a car? I mean, I guess if they're stepping out of a car, that starts to bridge the two worlds, but otherwise you'd probably have some kind of model of vehicle behavior, where even though there's a human in the vehicle, they're not a pedestrian, because a pedestrian by definition is somebody walking. I think that's whatever the Latin root is right there, the "ped" part of the word. But there's this interesting transition state between being in a car, with that vehicular behavior, and then becoming a pedestrian, which might be particularly tricky to bridge. But yeah, I haven't really asked a question, but you might have something interesting to say anyway. Yeah, yeah, I think those are some great observations. It feels like an arbitrary distinction, right, between different categories of objects. Why are we having different interns working on different categories of objects? Does it make sense? Well, conceptually I would argue it doesn't, but practically it does make sense, because the challenges are different. It makes sense because a pedestrian moves much slower, so you need architectures that are much finer-grained and look at a smaller region of the world. It makes sense because pedestrians have social interactions and are more agile, while vehicles have constraints on how they can move and operate over longer distances. So there are some practical reasons why they're treated differently. But conceptually, as people, we share a lot of the same underlying neural machinery for understanding whether a biker is going to behave a certain way or whether a pedestrian is going to behave a certain way. A swimmer, a kayaker, or a car, a lot of that machinery is shared substrate. And this gets back to creative general intelligence: we should be learning foundation models that can understand all these categories of objects and people and behaviors in one unified way. Then, if we need to extract a certain subset of that capability, like we just need a really good pedestrian predictor, that can be read off of the same underlying model. And I think this is the direction the industry is moving toward: more general-purpose models. But the transition is slower, because there are different practical constraints that need to be met. For sure. Yeah. There are safety issues in that kind of world that are different from your world. If a pixel gets rendered incorrectly, maybe you just don't use that sample and you generate another sample, but in driving, it's not like you can generate another pedestrian. You get one chance. Cool. So yeah, we've covered this really interesting arc, from the Uber ATG stuff, where you were forecasting pedestrian behavior, and even though you're not rendering pixels as an output, the model needs to be able to represent the world in a similar kind of way to when you're rendering video like you are today. From Uber, you move to the NeRF stuff, where you have these neural representations of 3D scenes. At BAIR, there's the ContraCode project and the, well, I don't mean to say Stable Diffusion, but the diffusion model stuff, where you're more stably rendering visuals with that kind of algorithm relative to GANs.
And yeah, all of that together brings us nicely to the creative general intelligence stuff, the truly groundbreaking work that you're doing at Genmo today at the state of the art. Over all of that time, over your entire research career, is there a paper that you're most excited about or most proud of? Is there some kind of research contribution that really stands out to you, maybe one that we haven't covered yet? It's a good question. I would say, you know, there are the impact metrics. The Denoising Diffusion Probabilistic Models paper was an extremely impactful project, and I'm proud that the community has taken notice and grown around it. That took a lot of time, and I think it made me realize something about when an idea is really new: it's one thing to move a benchmark; it's another thing to change the way people work. And it took quite a lot of time to change the types of methods people used in practice. So that's obviously one of the most impactful things. Another thing I'm proud of is this arc of 3D work that we talked about. So we talked about NeRF. That was a sequence of two years of trying to make 3D synthesis better and better, and it culminated in this project at Google Brain, DreamFusion, where you could put in a caption and get out a high-fidelity 3D object from a diffusion model. In terms of one project that I like that I think is a little underappreciated: I had a hard time staying away from compilers. That's partly why ContraCode happened. There was another compiler-flavored project in there, Checkmate. What was the flashy name? "Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization." This is another project with my co-founder and brother, Paras, from an idea that he had. And the idea here is that as we're training all these neural networks, which are getting bigger and bigger and learning all these things about the world, it's getting really hard to train them from a systems perspective. And so that project tried to make it so that researchers like us, in academia before the startup with just a couple of GPUs at the time, could actually train big models. How do we do that? We built a system to reduce the memory consumption of these neural networks a lot, so you could take a model which would take, let's say, 80 gigabytes of GPU memory and train it with only 16 gigabytes of GPU memory. Now your lab can save tens of thousands or hundreds of thousands of dollars on hardware. That work has had some impact; it's now a very common technique for people to use, this idea of reducing memory consumption with some of these algorithmic techniques. But because it's a systems paper, I don't talk about it too much; I focus more on the... But this is something that immediately seems super interesting to me. So, I mean, I'm at a small startup; we're trying to spend as little as possible but train the biggest models that we can. So an approach like this is something we'd use regularly. Especially for model serving, which can be many, many orders of magnitude more expensive than your training, because you only need to train the model once, but once it's in production, hopefully you have a lot of users, and they're going to be calling that API a lot, and you're going to end up spending tons of money on inference. So like 99% of your cost in training and running a model is going to be on the running part. So you're specifically talking about the training problem here.
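As a concrete, minimal illustration of the recomputation trade-off behind Checkmate, the sketch below wraps part of a model in PyTorch's built-in torch.utils.checkpoint.checkpoint, which discards that block's activations during the forward pass and recomputes them during the backward pass. This shows only the general technique; Checkmate's contribution, automatically choosing which layers to rematerialize optimally, is not part of this sketch, where the wrapped layers are chosen by hand.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """MLP whose middle block trades extra compute for lower activation memory."""
    def __init__(self, width: int = 4096, depth: int = 8):
        super().__init__()
        self.stem = nn.Linear(width, width)
        # Activations of this block are NOT stored during the forward pass;
        # the block is re-run during the backward pass to get them back.
        self.middle = nn.Sequential(*[
            nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)
        ])
        self.head = nn.Linear(width, 10)

    def forward(self, x):
        x = torch.relu(self.stem(x))
        # use_reentrant=False is the mode recommended in recent PyTorch versions.
        x = checkpoint(self.middle, x, use_reentrant=False)
        return self.head(x)

model = CheckpointedMLP()
batch = torch.randn(64, 4096, requires_grad=True)
loss = model(batch).sum()
loss.backward()  # the middle block runs its forward computation a second time here
```

Wrapping larger chunks saves more memory at the cost of more recomputation, which frees room for bigger batch sizes on the same GPU; choosing which chunks to wrap is exactly the selection problem the Checkmate paper formalizes as an optimization.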
But even there, you know, I would love to not have to rent as many beefy GPU servers in Google Cloud and be able to train more cheaply, because that also allows you to iterate more quickly, even iterate more recklessly. If you're not too worried about an experiment maybe being a throwaway, you can do more experiments, and some percentage of those risks are going to end up being really high reward. So, like you mentioned, some people are implementing this kind of stuff. What can I be doing today to be training models with a smaller memory footprint and saving some money? Yeah, there are a lot of things. Some of this technology has trickled down into deep learning frameworks, so your deep learning framework of choice has some of this functionality. What we had in that project was a way to optimally select certain layers to compute twice. When you're training, you typically run each layer once during the forward pass, and you run each layer once during the backward pass to actually compute updates. And we said that you can recompute some of these layers, run them twice. That increases the computational cost, but in doing so, you don't need to store the output of that layer; you can delete it. And it turns out that GPUs have gotten a lot faster, but their memory capacity, the amount of memory they have, has increased more slowly. So that's a perfect trade-off to make. Even if you do have a high-memory GPU, you might not be utilizing it fully; you might want to train with a bigger batch size. So it's still worth it to make this trade-off and recompute some layers. You can double your batch size, triple your batch size; you can use half the GPUs that way. If you want to get started there, PyTorch has a function called checkpoint. Checkpointing here isn't storing a checkpoint of the model; rather, it's a function that you can wrap certain layers of your model with so that their activations are recomputed during the backward pass instead of being stored. The downside, though, is that you have to manually select which layers to recompute, and that takes a little bit of black magic. Our project, again, was research-ware that we never ended up fully integrating with deep learning frameworks, but it would select those layers for you automatically, in the optimal way. Oh, wow. It sounds like there's a big opportunity there. Maybe there's some listener out there who can take advantage of the open source research-ware we've already made. Uh-huh. And incorporate that in. Yeah, that'd be really cool. Wow. Awesome. So, yeah, those are some great practical tips: increasing your batch size to make sure you're taking advantage of the hardware that you have, and then this PyTorch layer checkpointing, which sounds like a great idea. More generally, getting into some general questions now: it is mind-blowing to me, the breadth and the impact of the achievements that you've had in your career already. What kinds of tips do you have for our listeners who would like to become tremendous AI researchers or AI entrepreneurs like you are? And, you know, maybe something that would help us frame your guidance on that would be: how did you end up deciding to do this for a living? How did this all come about? Well, I wanted to be a designer for a little bit back in college. Like a graphic designer? Yeah, like a graphic designer. There's this design studio called IDEO, a big consultancy.
They held events in Massachusetts for us in college, and I started to go to some of their events. I started to take drawing classes and architecture classes, media design. I really loved it. Then I tried to get a design internship at Figma, actually, in case there's a Figma listener out there, but I could only get a job as a front-end engineer. And I loved that stuff, loved doing it, but I thought, you know, this isn't going to get me to be able to express myself visually as much as I would like. So I ended up focusing more on research to make it easier to make graphics. I think what I always found helpful getting started in research was to have kind of a toy problem, kind of a challenging problem, in my mind at any given time, whether that would be something like: how do we make high-fidelity imagery? Let's say that's the problem. I would go into the classes I was taking, or into the conversations I was having with other researchers, with that framing in mind, so I could recast the different tools I heard about and figure out how I could apply them to that problem of choice. And 99% of the time, the things you hear about aren't useful for your problem, but in that 1% of the time that something is useful, it could be the make-or-break, right? And it also kept me motivated through all the stuff that wasn't connected to my core problem, to wade through all these classes in the PhD and so on, by being motivated by a core problem, which was getting to human-level, high-fidelity synthesis. Another tip would be, I think, to actually implement things. The difference between something completely failing and getting stunning results is sometimes a couple of implementation decisions and really well-tuned software. So if you're an engineer and you're a little bit intimidated by the AI, the stats, and the math, do know that really good engineering skills are critical to making these kinds of systems work, and so it's worth investing in those and worth getting your hands dirty, because AI software is just another type of software. Awesome tips, yeah. For sure, being able to really get your hands dirty with the software is going to make you a much better AI researcher or data scientist today. Especially as data sets get bigger and bigger and our models get bigger and bigger, even the DevOps around being able to train these models gets more and more complex, so the better your software skills are, the better off you are. And that relates to things like the kind of hiring you're doing. I know that you're hiring engineers and research engineers, and when I ask guests on the show if they're doing any hiring, they're almost always hiring engineers. But open data scientist roles, like this kind of standalone data scientist, those are relatively rare. So, yeah, I couldn't agree more. I've got a question for you about the hiring that you're doing, but before I get there, there's an interesting thing that I just want to call out explicitly, which is that you wanted to become a graphic designer, you couldn't get the job that you wanted, and so now you've created an AI system that does the job automatically. You've now created AI systems that already exist today and that people can be using for free.
But in the coming months and years, those are going to get even more and more powerful, allowing people to take a natural language input and do graphic design. So, yeah, you've really showed them. Well, the way I like to think about it is that the reason I didn't get the job is because I probably was not qualified; I was doing all this coding, right? And I think what this does is let people like me, who want to create but just don't have the skills even after many classes, start to create, whether it's as a hobby or for work. And that's what I think is the beautiful thing: technology and art are closer than most people think, and by leveling up the technology, we enable new forms of art. We also enable new forms of work. But yes, it was easier for me to write code to make visual content than to actually create it myself. You don't want to see some of those early proto-drawings. Great. All right. So, back to the hiring thing. What do you look for in the people that you hire? What makes a great, I mean, what you're doing at Genmo is right at the cutting edge, where the things that you come up with at your startup are driving what's possible in the world in 3D rendering and video rendering. I'm sure there are tons of listeners who would love to be working at a company like that. What does it take? Yeah, definitely. We are ramping up our hiring a lot right now, so there are a bunch of opportunities. You need to be able to work in a fast-paced startup environment. You need to be excited by some of these visions. I think one of the core things we look for at Genmo is people who are able to see an ambitious future and make forward progress on it, break it down into steps. And so we set the vision, but we trust our people a lot to solve really hairy problems. I think that's one of the things that research teaches you, but engineers are very familiar with this too: taking a seemingly insurmountable problem, figuring out how to break it down, and breaking through those walls. We are hiring for product engineers, front-end engineers, full-stack and infrastructure engineers. On the research and development side, we are also recruiting for our Oceans team, which is responsible for large-scale data infrastructure, in particular curating and improving the quality of the data sets that we use in our models. We are additionally hiring research scientists. I know a lot of research scientists from my network, but if there are people listening to this podcast looking for research scientist roles, we are actively growing our research team as well. Awesome. All right, Ajay, this has been a tremendous conversation for me to be able to enjoy, and I'm sure our listeners have enjoyed it as well. Before I let you go, do you have a book recommendation for us? Yeah. I worked on Kevin Murphy's team back at Google, and I have to give a shout-out to his book. Oh, really? Probabilistic Machine Learning. Yeah. He has a new edition out this year. Great guide, great textbook. It has one of the figures from our paper, so I have to recommend it. I also like The Code Breaker, the biography of Jennifer Doudna and the development of CRISPR. It points out a lot of the subtleties in technology development: the people behind the technology, the scientists, as well as the impacts of the work they do. And that's a huge topic to keep in mind as we work on AI. Nice. Great recommendations.
With Kevin Murphy's book, I haven't read this year's edition yet, but an earlier edition was certainly helpful for me in my machine learning journey; it's an excellent machine learning textbook. Cool that you were able to work with him. It happens in this industry; it's smaller than you think. But it's wild that names like Kevin Murphy, to me, are just names, while to you, he's a real person. He's a great guy. Awesome. So, if people want to be able to follow your thoughts after this episode, where should they follow you? Yes. If you want to reach out about the roles, you can email hi@genmo.ai. You can follow me on Twitter, Ajay Jain; I post stuff there. Genmo.ai is our handle on all social media platforms as well. Nice. All right. Thank you, Ajay. This has been such a great episode. I can't wait to see where the Genmo journey takes you next. Maybe in a few years, you can pop back on and fill us in on those full-length movies that you're rendering. Love it. We can generate the visuals. Sounds great. All right. Take care. Truly incredible what Ajay has accomplished in his young career already. There's a terrifically impactful future ahead for him, for sure. In today's episode, Ajay filled us in on how his generative models can create cinemagraphs by allowing you to automatically animate a selected region of a still image, and how his diffusion approach laid the foundation for well-known text-to-image models like DALL-E 2 and for the first text-to-3D models. He also talked about how 3D generations are useful for video editing pipelines today, but increasingly, models will be able to go effectively from natural language directly to video pixels. He talked about how, in the coming years, we'll likely be able to render compelling 30-second video clips, making it even easier than it is today to stitch together a feature-length film from generated video, and he filled us in on how generative neural radiance fields, NeRFs, enable neural networks to encode all of the aspects of a 3D scene so that perspectives of the scene can be rendered from any angle. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Ajay's social media profiles, as well as my own, at SuperDataScience.com slash 7-1-1. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you, and thanks, of course, to Ivana, Mario, Natalie, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another horizon-expanding episode for us today. You can support this show by checking out our sponsors' links, by sharing, by reviewing, by subscribing, but most of all, just keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, my friend, keep on rockin' it out there, and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.