DA = Data Availability ≠ Historical Data Retrieval

Intermediate · Feb 28, 2024
This article focuses on what exactly data availability is.
  • Forwarded original title: Misunderstandings About Data Availability: DA = Data Publishing ≠ Historical Data Retrieval

Introduction:

What exactly is data availability? For most people, the first impression might be “being able to access the historical data of a certain moment,” but this is actually a major misunderstanding of the DA concept. Recently, a co-founder of L2BEAT, the proposer of Danksharding, and the founder of Celestia have all clarified this misconception. They point out that Data Availability (DA) should actually refer to “data publishing,” whereas most people interpret DA as “historical data retrievability,” which is really a question of data storage.


For instance, a while ago, Dankrad mentioned the mandatory withdrawal/escape hatch mechanism of Layer 2, noting that Validium’s mandatory withdrawal requires obtaining the latest L2 state to construct a Merkle Proof, whereas Plasma only needs the data from 7 days prior (this relates to their methods of determining a legitimate State root).

With this, Dankrad clearly pointed out that Validium requires DA to ensure the security of user funds, but Plasma does not. Dankrad’s example highlights the difference between DA and historical data retrieval: DA typically concerns only newly released data.


At L2BEAT, the distinction between Data Availability (DA) and Data Storage (DS) has been further emphasized. L2BEAT’s Bartek has repeatedly stressed that DA and data storage/historical data retrievability are two different things, and that users can access the L2 data they need only because the nodes providing that data are “kind enough” to serve it. Additionally, L2BEAT plans to use “whether permissionless data storage nodes are available” as a new metric for evaluating Rollups, beyond DA.


The statements from Ethereum community/Ethereum Foundation members show their intention to clarify and refine the concepts related to Layer 2 going forward, as well as to provide a more precise definition of Layer 2 itself. This is because many terms related to Rollups and L2 have never been clearly explained, such as how far back data must be to count as “historical”: some believe that, since smart contracts can only access the hashes of the most recent 256 blocks, data older than 256 blocks (roughly 50 minutes) is considered “historical.”
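As a quick sanity check on that figure, here is a minimal calculation assuming Ethereum’s post-Merge 12-second slot time and the EVM’s 256-block BLOCKHASH window:

```python
# Rough check of how far back the EVM's 256-block hash window reaches.
# Assumes the post-Merge 12-second slot time.
SLOT_SECONDS = 12
BLOCKHASH_WINDOW = 256  # the EVM only exposes hashes of the 256 most recent blocks

window_minutes = BLOCKHASH_WINDOW * SLOT_SECONDS / 60
print(f"256 blocks ≈ {window_minutes:.0f} minutes")  # ≈ 51 minutes
```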

However, the “Rollup” referred to by Celestia and the “Rollup” referred to by the Ethereum Foundation are, strictly speaking, two different things. This article aims to clarify the difference between the DA concept and data storage, tracing the origin of DA and data availability sampling through to the way DA is implemented in Rollups, to explain what data availability truly means: data publishing.

The origin of the DA concept

The concept of DA originates from the “data availability problem,” which Celestia founder Mustafa explains as follows: DA is about how to ensure that all of the data in a block has been published to the network when a block producer proposes a new block. If the block producer does not release all the data in the block, it is impossible to check whether the block contains erroneous transactions.

Mustafa also points out that Ethereum Rollups simply publish L2 block data on the Ethereum chain and rely on Ethereum to ensure data availability. On the Ethereum official website, the data availability problem is summarized as the question: “How do we verify whether the data of a new block is available?” For light clients, the data availability problem is about verifying that a block’s data is available without having to download the entire block.

The Ethereum official website also clearly distinguishes between data availability and data retrievability: data availability refers to the ability of nodes to download block data when it is proposed, in other words, it is related to the time before a block reaches consensus. Data retrievability refers to the ability of nodes to retrieve historical information from the blockchain. Although archiving may require blockchain historical data, nodes do not need to use historical data to verify blocks and process transactions.

In the view of Celestia’s Chinese contributor and W3Hitchhiker partner Ren Hongyi, Layer 2 assumes in advance that Ethereum is sufficiently secure and decentralized. The sequencer can confidently send DA data to Ethereum, and this data will propagate unobstructed to all Ethereum full nodes. Since L2 full nodes themselves run the Geth client, they can be regarded as a subset of Ethereum full nodes and can thus receive Layer 2’s DA data.

In the eyes of Dr. Qi Zhou, the founder of EthStorage, the definition of Data Availability (DA) is that no one can withhold the transaction data submitted by users to the network. The corresponding trust model is that we only need to trust the protocol of Layer 1 (L1) itself, without the need to introduce other assumptions of trust.

Qi Zhou points out that the current implementation of DA on Ethereum is essentially P2P broadcasting (using the gossip protocol), where every full node downloads and propagates new blocks and stores the Rollup data. However, Ethereum full nodes will not store historical blocks forever; after EIP-4844 goes live, they may automatically delete blob data older than a certain age (around 18 days). There are not many archive nodes worldwide that store all historical data. EthStorage plans to fill this gap in the Ethereum ecosystem and help Layer 2s establish their own dedicated data-permanence nodes.
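For a rough sense of that retention window, a back-of-the-envelope calculation, assuming the commonly cited EIP-4844 minimum blob retention of 4096 epochs and Ethereum’s 32 slots of 12 seconds per epoch:

```python
# Back-of-the-envelope estimate of how long consensus nodes keep blob data
# after EIP-4844. 4096 epochs is the commonly cited minimum retention period
# (MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS); actual client behavior may vary.
EPOCHS = 4096
SLOTS_PER_EPOCH = 32
SLOT_SECONDS = 12

retention_days = EPOCHS * SLOTS_PER_EPOCH * SLOT_SECONDS / 86400
print(f"Blob retention ≈ {retention_days:.1f} days")  # ≈ 18.2 days
```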

The early discussions about data availability by the Ethereum Foundation can be seen in Vitalik Buterin’s tweets and GitHub documents in 2017. At that time, he believed that to ensure the scalability and efficiency of the blockchain, it was necessary to increase the hardware configuration of full nodes (full nodes are the ones that download the complete block and verify its validity, and Validators, who participate in consensus, are a subset of full nodes). However, increasing the hardware requirements for full nodes would also increase operational costs, leading to blockchain centralization.

On this matter, Vitalik proposed designing a scheme to address the security risks brought about by the centralizing tendency of high-performance full nodes. He planned to introduce erasure coding and random data sampling to design a protocol that would allow light nodes with weaker hardware to verify that a block is problem-free without knowing the complete block.

His initial idea was actually related to an idea mentioned in the Bitcoin whitepaper, which states that light nodes do not need to receive the complete block but will be alerted by honest full nodes if there is a problem with the block. This concept could be extended to later fraud proofs but does not ensure that honest full nodes can always obtain sufficient data, nor can it judge after the fact whether the block proposer withheld some data from being published.

For example, a node A could publish a fraud proof claiming it received an incomplete block from node B. However, it’s impossible to determine whether the incomplete block was fabricated by A or sent by B. Vitalik pointed out that this problem could be solved by Data Availability Sampling (DAS), which obviously involves issues of data publication.

Vitalik briefly discussed these issues and their solutions in his article “A note on data availability and erasure coding”. He pointed out that DA (Data Availability) proofs are essentially a “complement” to fraud proofs.

Data availability sampling

However, the concept of DA is not so easy to explain, as evidenced by Vitalik’s GitHub document undergoing 18 corrections, with the last correction submitted on September 25, 2018. Just the day before, on September 24, 2018, Celestia’s founder Mustafa and Vitalik co-authored the later famous paper “Fraud and Data Availability Proofs: Maximising Light Client Security and Scaling Blockchains with Dishonest Majorities”.

Interestingly, Mustafa is listed as the paper’s first author, ahead of Vitalik (another author is now a researcher at the Sui public blockchain). The paper introduces the concept of fraud proofs and explains the principle of Data Availability Sampling (DAS), roughly sketching a hybrid protocol combining DAS, two-dimensional erasure coding, and fraud proofs. The paper specifically notes that the DA proof system is a necessary complement to fraud proofs.

From Vitalik’s perspective, the protocol works as follows:

Suppose a public blockchain has N consensus nodes (Validators) with high-capacity hardware, which allows for high data throughput and efficiency. Although such a blockchain might have high TPS (Transactions Per Second), the number of consensus nodes, N, is relatively small, making it more centralized with a higher probability of collusion among nodes.

However, the protocol assumes that at least one of the N consensus nodes is honest. As long as this 1/N minority of Validators is able to detect that a block is invalid and broadcast a fraud proof, light clients or other honest Validators will become aware of the security problem and can restore the network through mechanisms such as slashing the malicious nodes and a socially coordinated fork.

As Vitalik mentioned earlier, if an honest full node receives a block, finds parts of it missing, and publishes a fraud proof, it is difficult to determine whether the block proposer failed to publish that part, other nodes withheld it in transit, or the node publishing the fraud proof is making a false accusation. Moreover, if the majority of nodes conspire, the 1/N honest Validators might be isolated and unable to receive new blocks, which is a data withholding attack scenario. In such cases, an honest node cannot tell whether it is suffering from poor network conditions or a deliberate withholding conspiracy, nor can it know whether other nodes are also isolated, making it difficult to judge whether a majority has conspired to withhold data.

Therefore, there needs to be a way to ensure, with very high probability, that honest Validators can obtain the data required to validate blocks, and to identify who is behind a data withholding attack: whether the block proposer failed to publish sufficient data, other nodes withheld it in transit, or a majority has conspired. Clearly, this security model offers far more protection than the “honest majority assumption” common in typical PoS chains, and Data Availability Sampling (DAS) is its concrete implementation.

Assuming there are many light nodes in the network, possibly 10 times N, each connected to multiple Validators (for simplicity, assume each light node is connected to all N Validators). These light nodes would conduct multiple data samplings from Validators, each time randomly requesting a small portion of data (suppose it’s only 1% of a block). Then, they would spread the acquired fragments to Validators lacking this data. As long as there are enough light nodes and the frequency of data sampling is high enough, even if some requests are denied, as long as most are responded to, it can be ensured that all Validators can eventually acquire the necessary amount of data to validate a block. This can negate the impact of data withholding by nodes other than the block proposer.

(Image source: W3 Hitchhiker)
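A minimal simulation of the sampling-and-redistribution idea described above; the fragment count, number of light nodes, and samples per node are purely illustrative, not real protocol parameters:

```python
import random

# Toy model of data availability sampling coverage: many light nodes each
# request a few random fragments, and together they touch nearly every
# fragment, so honest Validators can collect whatever they are missing.
# All parameters below are illustrative.
FRAGMENTS = 10_000        # total fragments in the (erasure-coded) block
LIGHT_NODES = 1_000       # number of light nodes doing the sampling
SAMPLES_PER_NODE = 100    # ~1% of the block per node, as in the text

covered = set()
for _ in range(LIGHT_NODES):
    covered.update(random.sample(range(FRAGMENTS), SAMPLES_PER_NODE))

print(f"Fragments requested at least once: {len(covered)}/{FRAGMENTS} "
      f"({100 * len(covered) / FRAGMENTS:.2f}%)")
# Expected miss rate per fragment is (1 - 1/FRAGMENTS) ** (LIGHT_NODES * SAMPLES_PER_NODE),
# roughly e**-10 here, i.e. about 0.005% of fragments never get sampled.
```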

If the majority of Validators conspire and refuse to respond to most requests from light nodes, people would easily realize there is a problem with the chain (because even if some people have poor internet, it would not result in most light node requests being denied). Thus, the scheme above can with high probability identify majority conspiracy behavior, although such situations are rare in themselves. With this approach, uncertainties coming from anyone other than the block proposer can be resolved. If the block proposer itself withholds data, for example by not publishing enough of the block’s data to validate it (with two-dimensional erasure coding, a block is extended to 2k × 2k fragments, and restoring the block’s original data requires at least about k × k fragments, i.e., about 1/4; to prevent anyone from restoring the original data, the proposer would need to withhold at least (k+1) × (k+1) fragments), it will eventually be detected by honest Validators, who will then broadcast fraud proofs to warn others.
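A small calculation, under the simplifying assumption that each sample is independent, of why that kind of withholding is hard to hide: if at least roughly 1/4 of the extended fragments must be withheld to block reconstruction, each random sample hits a withheld fragment with probability of at least about 1/4:

```python
# Probability that a single light node fails to notice withholding after s
# independent random samples, assuming at least a fraction p of the extended
# fragments must be withheld to prevent reconstruction (p ≈ 1/4 for the
# 2D erasure coding argument above). Sampling without replacement only
# improves these odds, so this is a conservative sketch.
p_withheld = 0.25

for samples in (10, 20, 30):
    p_miss = (1 - p_withheld) ** samples
    print(f"{samples} samples: miss probability ≈ {p_miss:.6f}")
# 30 samples already push a single node's miss probability below 0.02%,
# and across many independent light nodes it becomes negligible.
```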


According to Vitalik and Mustafa, what they did was essentially combine ideas that had already been proposed by others and added their own innovations on top of that. When looking at the concept and implementation method as a whole, it’s clear that “data availability” refers to whether the data needed to verify the latest block has been published by the block proposer and whether it can be received by the verifiers. This is about whether the data is “fully published” rather than whether “historical data can be retrieved.”

How Ethereum Rollup’s Data Availability (DA) is Implemented

With the above established, how Data Availability (DA) is implemented in Ethereum Rollups becomes quite clear: the block proposer in a Rollup is known as the Sequencer, which periodically publishes on Ethereum the data needed to verify Layer 2 state transitions. Specifically, it initiates a transaction to a designated contract, stuffing the DA-related data into the transaction’s input parameters, which are then recorded in an Ethereum block. Given Ethereum’s high degree of decentralization, one can be confident that the data submitted by the sequencer will be received by the “verifiers” without obstruction. However, the entities playing the role of “verifier” differ across Rollup networks.
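A rough sketch of the cost side of posting batch data this way, using Ethereum’s standard calldata gas pricing of 16 gas per non-zero byte and 4 gas per zero byte (EIP-2028); the batch contents below are made up purely for illustration:

```python
# Rough estimate of the gas a sequencer pays to post a batch as calldata,
# excluding the 21,000 base transaction cost and any contract execution.
# Calldata pricing per EIP-2028: 16 gas per non-zero byte, 4 gas per zero byte.
def calldata_gas(data: bytes) -> int:
    return sum(16 if b else 4 for b in data)

sparse_batch = bytes(50_000)            # 50 KB of zero bytes (best case)
dense_batch = bytes([0xAB]) * 50_000    # 50 KB of non-zero bytes (worst case)

print("all-zero batch:    ", calldata_gas(sparse_batch), "gas")  # 200,000
print("all-non-zero batch:", calldata_gas(dense_batch), "gas")   # 800,000
```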

For instance, in the case of Arbitrum, the sequencer posts transaction batches to a certain contract on Ethereum. The contract itself does not verify this data; it merely emits an event for L2 full nodes to listen to, letting them know that the sequencer has published a batch of transactions. ZK Rollups, by contrast, use a Verifier contract on Ethereum as the “verifier.” A ZK Rollup only needs to publish a State Diff + Validity Proof, i.e., the state changes plus a proof of their validity. The Verifier contract checks the validity proof to see whether it matches the State Diff. If the check passes, the L2 Block/Batch published by the sequencer is considered valid.

(Source: Former Polygon Hermez White Paper)
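Conceptually, the Verifier contract’s job can be sketched as follows. This is a minimal illustration in Python rather than an actual contract; `submit_batch`, `verify_snark`, and the data layout are hypothetical placeholders, not any real Rollup’s interface:

```python
# Conceptual sketch of what a ZK Rollup's on-chain Verifier does.
# All names here are hypothetical placeholders, not a real contract ABI.

def verify_snark(proof: bytes, public_inputs: tuple) -> bool:
    # Stub standing in for the real SNARK/STARK verification a Verifier
    # contract performs (e.g., an elliptic-curve pairing check).
    return len(proof) > 0

def submit_batch(prev_state_root: bytes, state_diff: bytes,
                 new_state_root: bytes, validity_proof: bytes) -> bool:
    # The proof must attest that applying state_diff to prev_state_root
    # yields new_state_root.
    public_inputs = (prev_state_root, state_diff, new_state_root)
    if not verify_snark(validity_proof, public_inputs):
        return False  # batch rejected, the L2 state is not advanced
    # On success the contract records new_state_root as the latest L2 state.
    # Because state_diff was published on L1 as part of this call, anyone can
    # reconstruct the L2 state from L1 data alone, which is the DA guarantee.
    return True
```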

Optimistic Rollups need to publish more data on Ethereum because they rely entirely on L2 full nodes downloading the data and verifying the validity of blocks. This means that, at a minimum, the digital signatures of each L2 transaction (now commonly aggregated signatures) must be disclosed; if contract calls are made, the input parameters must be disclosed as well, along with the transaction’s target addresses, the nonce values that prevent replay attacks, and so on. Even so, some trimming is possible compared to the full raw transaction data.

Compared to ZK Rollups, the DA (Data Availability) cost of Optimistic Rollups is higher, because ZK Rollups only need to disclose the final state changes after a batch of transactions is executed, accompanied by a validity proof, leveraging the succinctness of ZK SNARKs/STARKs; whereas Optimistic Rollups can only use the most cumbersome approach, publishing all the transaction data so that other L2 full nodes can re-execute every transaction.
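An illustrative sketch of the difference in what the two families have to publish on L1; the field lists and values are simplified placeholders, and real Rollups compress and batch this data much further:

```python
# Illustrative comparison of the payloads each Rollup family posts to L1.
# Field names and values are simplified placeholders, not a real format.

# Optimistic Rollup: enough per-transaction data for L2 full nodes to
# re-execute everything (target address, nonce, calldata, signature, ...).
op_rollup_payload = [
    {"to": "0xabc...", "nonce": 7, "value": 10**18, "calldata": "0x", "sig": "..."},
    {"to": "0xdef...", "nonce": 3, "value": 0, "calldata": "0xa9059cbb...", "sig": "..."},
    # ...one entry per transaction in the batch
]

# ZK Rollup: only the net state changes for the whole batch plus one
# succinct validity proof covering all of them.
zk_rollup_payload = {
    "state_diff": {"0xabc...": {"balance": "+1 ETH"}, "0xdef...": {"nonce": "+1"}},
    "validity_proof": "0x<succinct proof for the whole batch>",
}
```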

Previously, W3Hitchhiker roughly estimated that, without considering the future development of EIP-4844 and blobs, the scaling effect of ZKR (ZK Rollups) could reach several times that of OPR (Optimistic Rollups). If one also considers EIP-4337-style smart wallets (which use fingerprints or iris data instead of private key signatures), the advantage of ZKR would be even more apparent, because it does not need to post the binary data of fingerprints and irises on Ethereum, whereas Optimistic Rollups do.

As for Validium and Plasma/Optimium, they achieve DA using a DA layer outside of Ethereum. For example, ImmutableX, which adopted a validity proof system, has set up a dedicated set of DAC (Data Availability Committee) nodes for publishing DA-related data; Metis publishes DA data on Memlabs, while Rooch and Manta use Celestia. Currently, thanks to DAS (Data Availability Sampling) and its fraud proof system, Celestia is one of the most trustworthy DA layer projects outside of Ethereum.

Disclaimer:

  1. This article is reprinted from [Geek Web3]. The original title is “Misunderstandings About Data Availability: DA = Data Publishing ≠ Historical Data Retrieval.” All copyrights belong to the original author [Faust, Geek Web3]. If there are objections to this reprint, please contact the Gate Learn team, and they will handle it promptly.
  2. Liability Disclaimer: The views and opinions expressed in this article are solely those of the author and do not constitute any investment advice.
  3. Translations of the article into other languages are done by the Gate Learn team. Unless mentioned, copying, distributing, or plagiarizing the translated articles is prohibited.