I think we're going to do something that we've never done before.
All right, welcome everybody to the latest Hashing It Out episode with Leo.
He's a member of the Codex team and also a very prolific PhD researcher.
He's worked with the EF in the past on various different grants.
He's here with us to talk about data availability sampling.
I'll hand it off to you, Leo.
Hi, thank you for the introduction. Nice to be here.
It's a deal.
We're talking about data availability sampling.
The obvious question is what is sampling and then the follow-up immediately to that is when do we need it?
Right, so in the context of a decentralized system, which is what we're talking about here,
the context of Codex, Web3, and peer-to-peer networks,
we need to be aware that not all the players in the network are necessarily good players.
Also, there are other players in the system that, even if they have good intentions,
might not achieve the objective that they are supposed to achieve or provide the performance
that they are expected to provide to the network.
Because of that, it's necessary to guarantee that a certain number of properties or rules hold,
depending on what you are designing.
For that, sometimes you will need sampling.
Say you are developing a decentralized storage system.
The nodes that are going to store the data in this decentralized system are just random nodes that can join at any moment.
You need to make sure that they are behaving properly and that they are actually storing the data that they are supposed to store.
The best way to do that is sampling.
That means you're going to ask them for the data that they are supposed to have.
There are many ways of doing that, but I'm not going to get there yet.
Basically, you need to make sure that the data that they are supposed to store, they have it, and everything is fine.
There will be some nodes that are not malicious, but they, for whatever reason, go offline from time to time,
either because of network conditions, bad internet, hardware failures, or many other reasons.
Perhaps at that moment, they will not be available for relaying the data that they are supposed to have.
There are other nodes that are just malicious and they are just trying to attack the system.
Those are the things that you have to take into account, and that's why you need sampling.
Nice.
If I were a student in your classroom, my follow-up is just to seek understanding.
You do the data sampling to make sure that the data is there.
If I could distill it down, we've got to take samples of different nodes and what data they have to see if we have the complete set of data we need.
Exactly.
You take a data set and you want to distribute it among a group of people. Let's imagine it's a book.
It has 20 chapters and you're going to divide it among 20 people.
What you do is you just give a chapter to each person, and then from time to time, you ask a certain person, can you give me a proof that you still have the chapter that you are supposed to have?
That proof can take many forms. One of the ways is to just show me the chapter itself.
That's a way to do it.
Now, there are other ways.
For example, instead of sending the whole data, the whole chapter over the internet, because maybe it's too heavy, you could send just a hash of it.
A hash is like a cryptographic fingerprint. You send me the hash, and I know that that's the hash of that chapter.
I say, okay, yeah, he sent the right hash.
I assume that everything went well, and that means you still have the data.
Now, a hash is not really a good proof, because you could take the chapter, calculate the hash, discard the chapter, and then just store the hash.
And so a hash, per se, is not a really good proof, but there are many other types of proofs.
And sampling is just a way of analyzing, statistically, how many people I should check in a system, under a certain number of assumptions, to make sure that I will be able to recover the data that I'm supposed to have.
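To make the point about hashes concrete, here is a tiny illustrative sketch in Python (my own toy example, not anything from Codex): a node that keeps only the digest can still answer every hash challenge, even after throwing the chapter away.

```python
# Toy example of why a bare hash is a weak storage proof: the prover can
# compute the digest once, discard the data, and keep answering challenges.
import hashlib

chapter = b"Chapter 1: It was a dark and stormy night..."
expected = hashlib.sha256(chapter).hexdigest()  # known to the verifier

class LazyNode:
    def __init__(self, data):
        self.digest = hashlib.sha256(data).hexdigest()
        # ...and then the node silently deletes the data itself.

    def prove(self):
        return self.digest  # always matches, yet the chapter is gone

node = LazyNode(chapter)
print(node.prove() == expected)  # True -> this "proof" tells us nothing
```

Real storage proofs avoid this by challenging unpredictable pieces of the data, which is where sampling comes in.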
Okay, I'm tracking. I can keep rattling off questions left and right, Jesse, like a machine gun, but if you want to hop in, you gotta let me know.
So to solve the sampling issue of like sampling the entire chapter of the book that you're referencing, you mentioned sampling hashes, or sampling like some sort of cryptographic proof that they have the fraction of the chapter maybe.
So can you go a little bit into how that chapter is specifically turned into maybe smaller sub-chapters, or, like, using some sort of encoding scheme?
So sampling is, you know, one part of it: asking the set of players in your network whether they still have the data that they are supposed to have or not.
But then, you know, what happens when one of them just disappears?
And it turns out that now you cannot read the book because there is a chapter missing, right?
So what we do in computer science is to use different techniques for enhancing the data, or augmenting the data, so that even if a few players, a few members of this group or a few nodes in the network,
disappear, you still can recover the data, right? The most simple way to do that is just replicating the data: you duplicate each chapter and then you give one chapter to each person. But now, if you have 20 chapters, you don't need 20 people, you need 40 people, because you will not give the same chapter twice
to the same person, because then if that person disappears, you still lose the chapter.
So you need more people, okay. And then if you want to tolerate multiple people missing, then you will have to replicate more times.
So you can replicate it three times, four times, five times. Now, this is extremely simple, fast and not complex. So it's easy to implement.
Now, the problem is that it's really, you know, space hungry. It takes a lot of space; it requires a lot of storage.
And if you really want to tolerate multiple failures or multiple missing chapters, then you will require a lot of storage and a lot of people helping you in this protocol.
So instead of that, there's another technique that we use that is called erasure coding. It's just a fancy term to refer to math: polynomials, mathematical equations, and systems of multiple equations that help us solve for something that is unknown in the system.
But I think there will be another podcast in the future, maybe, where I'll explain in more detail how erasure codes work, so perhaps we can leave those details for later.
But basically, just imagine that you have a machine, right, in which you give it data as input, and then on the other side, you get the data plus some extra data.
And what happens is that with this extra data, you can recover a certain number of missing pieces of the original data, right?
And this is used everywhere today. For example, if you want to communicate with satellites, you need it, because when you are communicating with satellites, you have to pass through the atmosphere, and in the atmosphere there is a lot of perturbation and noise.
And so you might lose small parts of the message that you are sending. So if you just send a plain message, you will never be able to communicate with satellites, which is annoying, because then we don't have GPS and we don't know where we are going when driving.
And that's why we need to enhance the message with some kind of encoded data, so that even if some parts of the message are missing, we are able to reconstruct them with this magical mathematical box that allows us to encode and decode the messages, right?
And so that's what we do. We take the data, we encode it, we produce some extra data, and then we divide all of this into little boxes, or blocks, and then we give each node in the network different blocks.
So imagine you have the same book with 20 chapters. You divide it into 20 chapters, you feed all the chapters to this magical mathematical machine, and then you get 10 extra chapters that don't look like anything; they are just not readable.
But it doesn't matter, the thing is that if you lose one of those chapters, you will be able to reconstruct it based on the encoded data that you got.
And so now you don't need 20 people, you need 30, because you have 20 original plus 10 encoded chapters. And so you disperse these chapters among 30 people, but now you can tolerate any 10 chapters missing in the entire set.
Okay, and you only need 10 more people. If you wanted to do the same with replication, you would basically need to multiply the number of chapters by roughly 10, so 20 times 10, that means you would need around 200 people to do that.
In this case, you just need 30, which is much more efficient in terms of space. And now you have a system that is much more reliable, where it is much, much harder to lose data.
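To make the 20-plus-10-chapters example concrete, here is a toy sketch in Python of the Reed-Solomon idea behind erasure coding (a simplified illustration, not Codex's actual code, and scaled down to 4-of-6 instead of 20-of-30): the original symbols are treated as points on a polynomial over a prime field, extra evaluations of the same polynomial are handed out, and any 4 of the 6 pieces are enough to rebuild everything.

```python
# Toy Reed-Solomon-style erasure coding via polynomial interpolation.
# k data symbols define a degree-(k-1) polynomial; we publish n > k
# evaluations, and ANY k surviving evaluations recover the data.

P = 2**31 - 1  # a prime field large enough for this toy example

def encode(data, n):
    """Extend k data symbols to n coded symbols (points on a polynomial)."""
    k = len(data)
    # data[i] is the polynomial's value at x = i (systematic code:
    # the first k coded symbols are the original data itself).
    return [(x, _interpolate_at(list(enumerate(data)), x)) for x in range(n)]

def decode(points, k):
    """Recover the original k symbols from any k surviving coded symbols."""
    assert len(points) >= k, "not enough surviving pieces to repair"
    pts = points[:k]
    return [_interpolate_at(pts, x) for x in range(k)]

def _interpolate_at(points, x):
    """Lagrange interpolation over GF(P), evaluated at x."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

chapters = [101, 202, 303, 404]                       # 4 "chapters"
coded = encode(chapters, 6)                           # 6 pieces handed out
survivors = [coded[1], coded[3], coded[4], coded[5]]  # pieces 0 and 2 lost
assert decode(survivors, 4) == chapters               # the book is recovered
```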
And then you apply sampling. Now you need to ask people from time to time, hey, do you still have the chapter?
But the question here is, do you really need to question everyone, or is it sufficient to question only a few members of the group about whether they have their data or not?
And the answer is we don't need to ask everyone. And that's what we call sublinear sampling: if you apply some statistical computations, you can be sure, or you can have a very high probability, that you will still be able to recover the data,
if you got a sufficient number of positive answers when you were doing the sampling. I don't know if this makes sense.
No, it actually does. I'm a little bit fascinated, because I've asked about this before, and I feel like you've explained it in a way that I finally can understand.
So let me try to run it back. So essentially what erasure coding does is it just provides some, like, redundancy in the message. So I'm like, OK, I'm sending out a recipe.
But you know, I want to make sure everyone who downloads this recipe is going to get all the ingredients. So some of these ingredients I'm going to encode and just attach it to the message.
And so I guess with that encoded extra bit, you're saying you can adjust how much or how little is encoded to increase or decrease the storage needs of the message.
So if I bring it down to only a small amount of it is encoded, then I need more participants because I need to ensure that redundancy is there.
But if a large amount is encoded, then I can have less participants.
So it depends what type of redundancy scheme you're using. If you are using erasure coding, you just need a small number of extra participants.
If you are using replication, which means you duplicate or triplicate the data without any mathematical encoding, then you will need a large number of participants.
OK. But yeah, exactly as you said, you can tune exactly how much encoded data you want to produce, in order to know how much extra space you will need.
And you can play with this parameter. So that's really convenient.
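As a rough back-of-the-envelope comparison of the two schemes (my own illustrative numbers, not a Codex specification): to survive any 10 losses, replication needs about 11 full copies of everything, while 20-of-30 erasure coding only adds 50% overhead.

```python
# Rough storage overhead for tolerating f missing pieces out of k original
# ones (illustrative only).

def replication_overhead(k, f):
    """Each piece needs f + 1 full replicas to survive any f losses."""
    return f + 1

def erasure_overhead(k, f):
    """k-of-(k + f) erasure coding only adds f extra coded pieces."""
    return (k + f) / k

print(replication_overhead(20, 10))  # 11  -> ~11x the original size
print(erasure_overhead(20, 10))      # 1.5 -> 1.5x the original size
```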
Nice. You realize, like, when I'm driving my kid around and they ask me how GPS works, now I get to sound like that dad: I know how all of this works, child.
So, erasure coding. No, I hope that's all in good form.
OK, I did have a question from probably a little while back. I know there are different ways to sample. So, what are the different ways to sample?
Right. So this goes into, I mean, one of the questions is how much you need to sample and how often you do this.
For example, in the case of a storage system, if you apply randomness, which is the best way to do it, if you randomly decide who you're going to sample, then with just a very small number of samples you already have
a very high degree of, you know, assurance that the data is there, if you got enough positive answers. So, for example,
imagine you're a teacher in a school, and you have a class of a hundred kids, right?
You give them homework for tomorrow, and you say, OK, tomorrow
I want to check the homework of the first half of the list, and then next week I'm going to check the homework of the second half of the list, right?
So what happens is that the next day, most likely, the first 50 kids in the class did their homework and the last 50 kids in the class didn't, because, you know, why would you do it if you were told that it will not be checked?
And then the next week, the opposite will happen, right? You will have the first 50 who didn't do the homework and the second 50 who did, right?
Now, if you tell them instead that, rather than picking the first 50 and then the second 50, you're going to pick 50 kids randomly,
then, you know, they cannot play with this anymore, because they don't know if they're going to be selected or not.
So most likely, you will have a large number of kids that actually do the homework, because they will be afraid of getting picked.
And then maybe a few will say, I'm just going to play my chances, and then maybe some of them will not do it, right?
But then, okay, let's imagine you pick a number of kids, and it turns out that the first week, every time you ask, 50% of them did the homework and 50% of them didn't do the homework.
So you know that the average homework completion is 50%.
Maybe the next month that decreases to, let's say, 25%,
or let's imagine it improves instead; you can track how it evolves.
Now, if you check a certain number of times, let's say 30 homeworks,
and every time you get all 30 homeworks done, then you know that with a probability of 99.99%,
everybody is doing the homework, right?
Because if you pick 30 homeworks at random and all of them were done,
then, basically, you know that, yes, we have a 99.99% probability that everybody is doing the homework.
It's like throwing a coin: you have 50% tails and 50% heads.
It is really hard to throw the coin 30 times and always get the same answer, right?
So the only way that happens is if the system really is producing those results.
And so that's what we do with sublinear sampling:
we try to minimize the number of questions that we have to ask,
and still have a very, very high probability that
the answer we're getting is positive,
which means that everybody in the network
is actually storing the data that they have to store.
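Here is a minimal sketch of that statistical argument (illustrative numbers, not Codex's real parameters): if enough pieces are missing for the data to be unrecoverable, each uniformly random query hits a hole with some probability p, so a run of all-positive answers quickly becomes overwhelming evidence that the data is fine.

```python
# If an adversary has dropped enough pieces to make the data unrecoverable,
# each random query finds a missing piece with probability at least p, so the
# chance that s independent queries all come back "I have it" is at most
# (1 - p)**s, which shrinks exponentially regardless of the data set's size.

import math

def samples_needed(p_missing, confidence):
    """Queries needed so an unrecoverable data set is detected with the given
    probability (assumes uniform, independent sampling)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_missing))

# If losing the data requires at least ~1/3 of the pieces to be gone
# (e.g. 20-of-30 erasure coding), roughly 23 random queries already give
# 99.99% confidence -- whether the set has 30 pieces or 30 million.
print(samples_needed(p_missing=1/3, confidence=0.9999))  # -> 23
```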
But then, yeah, as you say, there are other types of sampling,
like, for example, what type of answer do you want to get
when you are asking whether the nodes have the data?
One of the ways is saying, like, please send me the entire data,
and then I verify.
Another way is to send me a proof that you have the data,
and in this case, we're talking about a cryptographic proof.
Now, the problem is, if you want to store these proofs
on a dashboard or on the internet for everybody to see,
to check that the nodes are still storing the data,
then you cannot really ask them to send you the entire data set
as a proof that they still have it, right?
In particular, if we are talking about a decentralized system and Web3,
most of the time you want to store the proofs on a blockchain.
And storing data on a blockchain is extremely expensive,
at least today.
And so because of that, you cannot say, well,
I'm going to store my proofs in the blockchain,
because that would be just prohibitively expensive.
What you need instead is a way to create cryptographic proofs that are small enough
that you can post them on the blockchain.
And then everybody can see that, oh, yeah, my data is still there
because two hours ago, the protocol asked this node
and all of the 20 chapters that I store are there.
So my book is still safe, right?
And that's how we do it.
However, we have to keep asking these kinds of questions.
So the sampling is not done just once,
because that's not enough.
You need to do it at a certain frequency, right?
You need to say, okay, I'm going to check,
I don't know, two times per day or four times per day,
to see if my data is there.
And I assume that this is good enough for me.
So every six hours, I ask and I check.
But then if you have to do this check for the entire system
and you have a lot of data,
then you are talking about storing a very, very large number of proofs.
And that's where things become complicated.
When you have to store a very, very large number of proofs,
then you have to use other kinds of techniques
to see if you can compress those proofs all together
into a small one that still includes all of the others
and that you can use.
And there are techniques to do that.
In particular, we call them
succinct non-interactive arguments of knowledge.
SNARKs, for short.
And the idea is to use the mathematical tools that exist today
to compress proofs into very, very, you know,
packed things that really pack a lot of proofs together
into one small one.
And that's what we use today, for example,
in this specific context.
That's what we're planning to do.
So one of the questions I have,
you talked about the recursive compression scheme that you use.
But the other question I have is,
how do you determine the frequency of repair that's necessary
to maintain, you know, I guess, a certain number of nines
for the storage?
Right. Right.
So, yeah, good point.
So we talked about, you know,
erasure coding and enhancing or augmenting the data,
then we talked about sampling,
and now we've talked about compressing those proofs into a small one.
Now, what happens if you have to
act when the data, you know, is simply gone?
So there is a node that, you know,
for some reason, burned out, and it didn't provide
the verification, the proof that it had to provide.
You know that this node is gone.
You ask several times in a row and there is no answer.
So the data is gone.
In this moment, you have to repair the data.
And for that, you use
the same encoding technique that you used to augment the data.
You use the same algorithm backwards
in order to regenerate the original data
using the coded data.
Now, you could do this repair
every time that you see something missing.
You can do that, right?
But remember that in our example, we said,
okay, I take my 20 chapters and then I create 10 more
that are encoded data.
And so now, if I lose one chapter,
I still have a margin of nine, which means I can still
lose another nine chapters and I will still
be able to recover my original data.
Right.
So you can play with these thresholds.
You can say, okay,
instead of repairing all the time,
because, you know,
repairing is not free.
It requires generating a lot of new proofs,
and you would overwhelm wherever you're storing those proofs.
So that has a certain cost.
And so instead of doing that,
I can set a threshold, saying, okay,
from this point on,
I will start reconstructing the missing data.
And one of the interesting properties of this algorithm
is that when you reconstruct,
when you do the decoding,
it doesn't matter how many chapters of the book are missing,
because it will always regenerate all 20 original chapters.
Right.
So if you had one chapter missing,
or you had five chapters missing,
for the same calculation,
you will get all of them,
as long as you still have enough pieces left.
There's a caveat there, though, right?
In terms of, like, adversarial erasure,
withholding just enough data
so that repair is not possible.
Exactly.
So you have to be careful with that.
You have to make sure
you don't get too close to the danger zone,
and that you still have a certain buffer
and some protection there.
So, you know, we play with
tuning these systems so that,
on one side,
we don't incur too much overhead
by repairing everything all the time
when it's perhaps not necessary.
But also, we don't go to the other extreme,
where maybe we are getting too close
to actually losing data.
And that is catastrophic,
because this is definitely the case
that we want to avoid.
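A small sketch of what such a lazy-repair threshold could look like (illustrative only; the real Codex policy and parameters may differ):

```python
# Lazy repair: with k-of-n erasure coding we can lose up to n - k pieces and
# still decode, so instead of repairing on the first loss we wait until the
# number of missing pieces eats into a safety buffer, well before the point
# where the data would actually become unrecoverable.

def should_repair(n, k, missing, safety_margin=4):
    """Trigger reconstruction once fewer than k + safety_margin pieces remain."""
    surviving = n - missing
    if surviving < k:
        raise RuntimeError("data already unrecoverable")  # the case to avoid
    return surviving < k + safety_margin

# 20-of-30 coding: one or two lost pieces are tolerated silently,
# but once only 23 pieces survive (7 missing), repair kicks in.
print(should_repair(n=30, k=20, missing=2))  # False
print(should_repair(n=30, k=20, missing=7))  # True
```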
So how does this sampling kind of differ
from some of the other data availability sampling
that exists?
I know Ethereum is doing some.
Like, how is Codex different and better?
Right.
So, um, yes.
We are working with the Ethereum Foundation
on a research project together.
And basically, we are trying to help them,
you know,
model the data availability sampling strategy
for the next iteration of Ethereum.
And the reason why it was a good project
for Codex to participate in
is because many of the strategies
that we are using inside Codex
are actually used,
or planned to be used, in Ethereum.
Right.
For example, I was talking about erasure coding:
they are planning to use erasure coding.
I was talking about sampling:
they are planning to use sampling.
We were talking about data repair:
they are planning to use that too.
And this is all normal.
This is not a coincidence,
because the problem is the same,
in that what we want is to protect the data somehow,
and to do this in a system
that is a decentralized peer-to-peer network,
where nodes go down from time to time,
and other nodes are not nice and play badly.
And so there is just a limited number
of mathematical tools that can help us,
and we all try to use those tools
and the best ones for the task.
So that's why this collaboration makes sense
because we are using the same tools.
Now there are significant differences.
For example, in the context of Ethereum,
what we are working on is data availability.
We want to make sure that the block
that is going to be produced is available.
But we don't need to guarantee
that the block is available forever.
We just want to make sure that the block,
at some point in time,
became available to the network,
and that's it.
In the context of Codex,
the story is different,
because in Codex,
what we do is guarantee data durability,
which means you sign a contract
with a storage provider,
and you say, I want you to store this data
for the next three years, or whatever.
And we need to make sure
that the data is available for those next three years.
So that means,
and we go back to the frequency of sampling
that we were talking about before,
in the context of Codex,
we need to keep testing
that the data is available
every X amount of time.
And that means that the frequency
of sampling is completely different
between Ethereum and Codex.
In Codex, we have to think
long term;
in Ethereum, we don't.
So that's for Ethereum.
They dump the data after two weeks or so,
more or less, right?
Yeah, it's about, sort of, three weeks,
or whatever the timeline is.
It's a different scope,
and that is a significant difference.
And that leads to another point,
which is that
in Ethereum,
you can actually do sampling on your own.
And you say, okay, I'm going to check whether this
piece of this block is available,
and I participate in sampling.
And when I find that something is not available,
I just raise this alarm to the network,
and I say, hey,
I didn't find this piece of data.
Or I try to repair it if I want.
There are different actions you can take.
But, you know,
you don't need to store these proofs on a blockchain,
or at least
not in the same way that you need to do it in Codex.
And that leads to a difference in how much data
you need to store inside the blockchain
regarding the proofs of the system.
And also, because in Codex
we are looking at long-term data durability,
we really need to make sure
that the amount of space that we are
using for storing proofs is sufficiently small.
So we have to go and look into SNARKs
and ways to compress these proofs into a very,
very small space,
which is what I was explaining before.
In Ethereum, we don't necessarily need all that.
There are parts of it that we need,
but in a different way.
As I say,
there is no data durability in Ethereum.
So, you know, for a specific set of data,
you don't need to provide proofs for three years.
You need to provide proofs for a very short time.
Another thing that I think is different
is, you know, the way you encode the data.
There are many ways to do it,
and that really goes into
how you want to design your protocol.
And, you know, there are many, many flavors.
For example, in Ethereum, we are using
two-dimensional Reed-Solomon encoding.
Two-dimensional means that you take your data set,
you divide it into small blocks,
and then you put those blocks into a little square,
a two-dimensional square.
And then you take every line
and you
augment it with erasure coding,
so you produce a longer line for every line.
And then you do the same for every column,
and then you just do the same
for the extra columns and the extra lines.
And so, what happens is that you end up with a larger square
that is protected two-dimensionally.
It's just a pattern to protect the data.
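Here is a rough sketch of the shape of that 2D construction (using toy XOR parity in place of real Reed-Solomon, just to show the row-then-column extension; the actual Ethereum design uses Reed-Solomon and larger extension factors):

```python
# 2D extension of a data square: extend every row with parity, then extend
# every column of the resulting wider matrix, ending up with a larger square
# that can be repaired either row-wise or column-wise.

def xor_all(symbols):
    out = 0
    for s in symbols:
        out ^= s
    return out

def extend_rows(matrix):
    """Append one parity symbol per row (a stand-in for Reed-Solomon)."""
    return [row + [xor_all(row)] for row in matrix]

def extend_2d(matrix):
    extended_rows = extend_rows(matrix)              # wider matrix
    columns = list(map(list, zip(*extended_rows)))   # transpose
    extended_cols = extend_rows(columns)             # extend the columns too
    return list(map(list, zip(*extended_cols)))      # transpose back

block = [[1, 2],
         [3, 4]]
print(extend_2d(block))  # 3x3 square: original data plus row/column parity
```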
In the case of Codex, we are not doing exactly that.
We are also using Reed-Solomon encoding,
but we are using, you could think of it as
a one-dimensional flavor of Reed-Solomon encoding.
We also have a two-dimensional structure,
but it's not exactly the same as in the case of Ethereum.
So, that's different.
Something else that we are also using,
that we are planning to use for Ethereum,
is KZG commitments.
KZG commitments are basically just a way
to demonstrate that the
erasure-coded data that you produce,
so the coded data that you produce, is actually correct
and is part of the line that I'm encoding.
And then I can just send you a piece of that data
with the proof, the KZG proof,
and then you have a very easy way to verify
that the data that I'm sending you is not just garbage,
because that could be a possible attack.
Right, I could say, I'm going to encode the data,
but not really encode anything;
I just produce garbage information.
And then I just spread this information,
and then everybody believes that everything is fine and well.
But when you want to reconstruct,
you cannot, because the data that was dispersed
at the very beginning was actually not really useful.
And so, the KZG commitments
help you guarantee that what you are spreading
on the network is actually real encoded data
that will allow you to reconstruct in case of failures.
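The following toy sketch only illustrates the property a KZG proof certifies, namely that every coded symbol lies on the same low-degree polynomial as the original data, so garbage pretending to be coded data is detectable. Real KZG uses elliptic-curve pairings and tiny constant-size commitments, so the check works without holding the original data; this naive version does not reproduce that, it just shows the consistency condition being proven.

```python
# Toy consistency check: a coded symbol is valid only if it lies on the
# polynomial defined by the original data symbols (same field arithmetic as
# the earlier Reed-Solomon sketch).

P = 2**31 - 1

def interpolate_at(points, x):
    """Lagrange interpolation over GF(P), evaluated at x."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def consistent(data_points, x, claimed_y):
    """Does the claimed coded symbol lie on the data's polynomial?"""
    return interpolate_at(data_points, x) == claimed_y % P

data = [(0, 101), (1, 202), (2, 303), (3, 404)]
honest = interpolate_at(data, 5)
print(consistent(data, 5, honest))  # True: a genuinely coded symbol
print(consistent(data, 5, 12345))   # False: garbage posing as coded data
```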
So, it sounds like a huge technical undertaking.
What about, for the end user, and maybe businesses,
if they decide to participate in this network,
how do they do so with protections?
Like, how was their privacy protected
and the other users' privacy protected?
I guess what I'm saying is,
if I'm electing to use my hard drive to store stuff,
how do I know somebody's not putting some bad stuff
on my hard drive and how am I protected?
Like, I don't want to,
like, that's their data.
Not mine.
I know it's stored on my system,
but it's not my stuff.
You know what I mean?
You're talking about wanting some sort of
plausible deniability for data that is illegal.
Yeah, like, if I go to a storage unit,
I got to sign a bunch of papers
and say, I'm not storing illegal stuff in their storage unit.
Right?
And that's their protection.
Because what if I am?
And then the police bust into the storage unit,
and then, you know what I mean?
Like, what are the protections for the users?
Right.
Right, so this is of course a very difficult topic
and a very important one.
Because, exactly, if you want to participate in this network,
and you want to help the system
by offering the storage that you are contributing,
you want to have those protections.
One way to protect the user
would be, for example, to enforce encryption
on all data sets.
Now, we are not sure
this is the way we will go with Codex.
There are some systems that implement that.
But this would be, you know, a way to say,
okay, the data that I receive is encrypted,
and I don't have any way to decrypt it.
I don't have the keys.
So I'm just storing it.
That is already an extra protection
on top of the other, you know,
encoding system that we are implementing in Codex.
Because, you know, when somebody uploads a data set,
you don't even get to know what type of file it is,
because you don't get the file.
You just get a piece of the file.
Right.
You just get a block of data.
This block of data is also not necessarily a contiguous
part of the file.
It's probably, you know, a composition of many small blocks.
And perhaps many of those blocks are actually encoded blocks,
not only the original data.
And so what happens is that the data that you are getting
is really a scramble of many small blocks
produced after the encoding,
and you are getting original blocks as well as encoded blocks
all mixed together.
So in a sense, the data that you are receiving
is already very, very hard to understand
and to make any sense of.
But if you wanted, yeah, you could apply encryption on top of it
and say, well, you know, anyway, even if I managed to
recover the order between these small blocks
and everything, I will not be able to decrypt the content,
because I don't have the keys.
Um, so, yeah.
Okay, sorry, I interrupted you, but the light bulb went off.
So it's like, you know, there's a 5-million-piece puzzle
of, like, the sky, the hardest puzzle on the planet, right?
Like, here's a 5-million-piece
puzzle of just blue, right?
And then on my computer, you're like, and here's four
pieces. Don't know where I got them from.
They could be border pieces or they could be in the middle,
but here's three of them.
Good luck, right?
There's really no way for me to even know what
the data is.
Exactly.
And remember that one million out of those five
million pieces are actually not part of the original
picture of the sky; they are coded.
It's a coded sky, you know.
Oh, okay.
Well, there you go.
I would, I would go to court on those grounds.
I would be like, you put that piece together.
You put that puzzle together,
Judge.
And then I'd sit in jail.
But no, okay, I'm good.
No, but it's true that it's important to pay
attention to those details, because, yeah, we want to make
sure that we are doing things right.
Okay.
All right, so the light bulb went off.
I'll pass the rock to you, Jesse.
Yeah, so just a recap of like all the different things
we kind of talked about.
So it sounds like Codex and the work that's being done
with the EF,
again, like you said, use the same tools
from the same toolbox in order to be more efficient
with the amount of storage across the network.
So rather than doing full replicas,
like you said originally,
you use erasure coding in order to reduce the number of people
that need to hold parts of the entire data set.
And then you also on top of that,
try to be conscious of bandwidth for the nodes.
So rather than doing very frequent repair,
you have some sort of concept of laziness
in this repair, so you were saying that you're playing
with the parameters such that the persistence of the file
is guaranteed, and you're not crossing that threshold
in terms of not requesting proofs often enough,
where the user would actually end up losing their file.
So you want to make sure that they have their file forever
or for as long as the contract is,
but you're being conscious of the bandwidth
by regulating the frequency of the proofs down,
the requesting of those proofs.
Right.
And then in terms of the encoding scheme,
I think you just covered that you're not doing
the exact same data structure in terms of the 2D
Reed-Solomon coding that is used for the DAS, the EF stuff.
Codex has its own data structure.
Right.
And then they're tackling completely different problems.
Like you mentioned, one is for the Ethereum Foundation,
they're just trying to provide a way for validators
to have access to data in order to replay or verify state.
OK, to verify the state, yes.
OK, versus Codex is trying to solve this problem
of data durability, which is more persistence
and like probabilistic retrievability of data.
Right, exactly.
And there is also another difference
that I just remembered now and didn't mention,
and it's the size of the data.
In Codex, for example, we want to offer users
the possibility to upload any data set,
whether it's just a small file or whether it's
a huge backup of all your pictures or whatever,
or movies, or anything that is legal, of course.
Any size of data.
In Ethereum, on the other hand, you always have the same amount of data,
because what we are trying to protect in the case of Ethereum
is a block that is going to be produced.
And we have very specific sizes for this block.
It cannot be a huge block of one terabyte that is just
unthinkable.
And it's usually not one kilobyte.
I mean, that's not the numbers that we are talking about.
So here, for example, in the current model, which
doesn't mean that it's going to be like that at the end,
it can change,
we are thinking of, you know, something
around 32 megabytes that, after being encoded, can be
in the order of 128 megabytes.
And so now we have a very specific size of data
that we are planning to protect.
In the case of Codex, it can be, you know,
very small or it can be huge.
But when it's huge, that places
a certain burden on the system,
because, you know, you might have to encode this huge amount
of data very fast.
And, you know, that has a certain number of implications.
And so, yeah, that's another difference between the two
systems.
So what are some, I guess, open-ended points of research
or continued work that you definitely
want to discuss for Codex in the near future?
Right.
So in the context of Codex, we're still
working on the SNARKs: on what is the best way
to implement very succinct, small proofs,
and how to aggregate those proofs in a very efficient way.
We already have a few models that tell us how much space
is going to be consumed by those proofs,
given a system with, you know, a certain number of files
and, you know, average file size and so on and so forth.
So we already have some models that allow us
to predict, you know, these kinds of things.
But we are still missing a couple of inputs.
For example, we are still not sure how fast we can compute a proof
using SNARKs.
And that's something that we need
to put as an input in our models to say, OK,
this is the amount of time that we are going to spend
computing proofs or aggregating proofs in the system.
And, you know, the more files you get, the more proofs you get,
and the more time you will spend aggregating those.
And that's important because, you know, if you need many,
then you can say, OK, maybe we need a certain number
of aggregators in the system.
An aggregator would be a specific role in the system,
a node that only cares about aggregating proofs.
It's not storing any data.
It's not promising any long-term storage,
it's just aggregating proofs in the network.
So those kinds of things we are still not 100% sure about.
We are still working on those.
And that's still part of the ongoing research
in the Codex team.
Awesome.
Oh.
Can we have you back to talk about, like, use cases?
Excuse me?
Can we have you back to talk about use cases?
Ah, yeah, sure.
Of course, sorry.
I don't know,
I lost the connection for a second.
Oh, good.
I thought you were just like, no, no, you can't.