Proof of SQL 101
Earlier this month, Space and Time launched Proof of SQL, a novel zk-proof for SQL operations, in one of the most highly anticipated moments of the project’s tenure. The protocol, which is now available in alpha, was handed to the first wave of projects that will use it to connect verifiable queries to their smart contracts. The launch was an exciting moment for Space and Time’s core team, community, and customers, and a monumental leap forward for what we’ve coined the “verify-everything” future.
If you’ve been following along with Space and Time or the Web3 news cycle, you’ve probably heard of Proof of SQL. But unless you’re well-versed in zk or really did your homework, you may still be wondering just what exactly it is. Proof of SQL is a zk-SNARK which cryptographically proves that queries run in Space and Time are accurate and that both the query and the underlying data are verifiably tamperproof. If that sentence made perfect intuitive sense to you, you can learn more about how to start building with Proof of SQL here. If not, stick with me and we’ll break it down.
What it is and why we built it
When explaining a piece of complex tech, I like to start by talking about the problem that it solves. The notion of a tamperproof query doesn’t make sense if you don’t already understand the implications of a tampered query. Verifiability doesn’t mean much if you’re not already aware of the consequences of a system not having it. Proof of SQL has the potential to fundamentally transform not only Web3, but the entire data foundation of the world’s business and economy. But… how?
Problem #1: Data tampering
Material and costly
To answer that, let’s examine a long and financially devastating history of data tampering. Data tampering refers to someone manipulating, changing, or deleting information stored in a database. In many cases, data tampering is executed by an external bad actor (read: hacker). Hacking and the threat it poses to businesses has allowed a $2 trillion cybersecurity industry to flourish as databases have become ubiquitous in every government, industry, and organization. More than ever, companies are investing in tools to secure their data from outside threats, like in-database encryption, role-based access controls, and multi-factor authentication.
But malicious outside actors are not the only thing a company has to worry about. Even with the strictest security measures in place, data is still vulnerable to an arguably even greater threat: internal human error. In fact, an estimated 62% of data tampering incidents are caused by a user with authorized access to the data. Malicious, negligent, and compromised users pose a great (and expensive) threat to the safety, productivity, and reputation of an organization. By some estimates, a single incident of data tampering by an internal source costs on average $15.38 million at face, not including revenue losses resulting from a decline in consumer trust.
Real-life data tampering incidents
Just this year, JPMorgan Chase was fined $4 million by the SEC for a mistaken deletion of over 4.7 million email records, after paying $125 million in penalties in 2021 for failing to preserve other communications records. The company places blame on the archiving vendor that it hired to handle the data storage for the emails, many of which were sought by subpoenas in SEC investigations.
In February, Twitter experienced a major site-wide outage after an employee accidentally deleted data from the internal service that handles rate limits. As a result, masses of users were unable to tweet or retweet and instead received an error message saying they had reached their daily limit.
In April 2021, the City of Dallas fired an IT worker who, by way of recklessness, deleted 20.68 terabytes of Dallas police data in a move that cost the city half a million dollars. The employee, who was trained and authorized to access the data, confessed that he deleted the tables without checking that they were backed up. The incident caused over 8.7 million archives that had been collected as evidence in an estimated 17,500 criminal cases to be lost. Only 3TB of the missing data was able to be recovered.
Proving that data hasn’t been tampered
With accelerating advancements in the cybersecurity industry, there are plenty of ways to ensure that data isn’t manipulated by an outside actor. But what about where those security measures end? How does a company ensure that an authorized database administrator, for example, doesn’t make a mistake that costs them millions?
Well, that’s where Proof of SQL comes in. Proof of SQL was designed to guarantee two things: that, at the time of a query, the data you’re getting is correct and hasn’t been manipulated since the time of ingestion, and the query result you’re getting is an accurate SQL computation of that data. Although this design only provides a proof of the state of data at a single point in time, it can easily be leveraged to create data lineage: proof of the state of data at all times. All you would need to do is set up a periodic health check query for as often as you want to verify your data—every few days, every few hours, or even every few seconds. This creates a tamperproof lineage that ensures your data is traceable at all times.
With Proof of SQL, not only can you catch an incident of tampering immediately after it occurs, you can also easily restore your data to its pre-tampered state. If a DBA accidentally deletes a few tables, the company has a tamperproof audit trail of what data was there before. Proof of SQL allows the database to function essentially as a tamperproof ledger, like a blockchain, but with the storage, compute, and SQL interface required for business operations.
Problem #2: Smart contracts can’t ask questions
There’s a completely separate problem in Web3: smart contracts can’t ask data-driven questions. To understand why, let’s talk about where data storage and compute fit into the Web3 stack.
Limited on-chain storage and compute
A blockchain is materially just a database that contains a single immutable table of data. Each time a new transaction is minted, data is added to the table, where it’s permanently stored. If the blockchain is on-chain storage, a smart contract is on-chain compute: a piece of code designed to execute very basic if/then logic over blockchain data. “If X happens, transact Y.” That works really well for simple value exchange on-chain, but what about something that requires more complex computations, like options trading? On its own, a smart contract can’t even ask a simple question like “what are all the wallets that own two NFTs from this collection?” In order to make smart contracts “smarter,” you have to connect them to off-chain storage and compute; you have to give them a way to ask questions.
Are off-chain databases the solution?
SQL (which stands for Structured Query Language) is the question-asking programming language of the world. It’s how every enterprise, from Microsoft to McDonalds, performs computation over data—how they ask questions like “what is the total revenue generated by this product over the last quarter?” And they all do it using one common tool: a SQL database.
Enterprises leverage databases in a similar way that dapps leverage the blockchain: to store, manage, and retrieve data. But unlike the blockchain, databases are incredibly efficient and can handle more complex computations that are required to build business logic: the rules that determine how data is processed within a system. Data warehouses, on the other hand, are even more robust: they’re specialized enterprise-scale databases built to process analytics. Data warehouses provide a means for enterprises to glean insights from their data, supporting tasks like trend analysis, reporting, and decision-making. Both transactional databases and analytic data warehouses are critical components of the enterprise data management stack, and offer the necessary storage and compute to supplement that of the blockchain and allow smart contracts to ask complex questions.
Centralized systems require trust
Unfortunately, you can’t just connect any old database or data warehouse, like PostgreSQL or Snowflake, to a smart contract. For one, databases don’t come preloaded with indexed blockchain data, and a smart contract has to be able to know what’s happening on-chain. More importantly, off-chain databases don’t natively write back to the chain, so there’s no built-in way to connect the answers to the questions you’re asking to your smart contract. Perhaps most importantly, databases today all require one thing that is fundamentally incompatible with blockchain: trust.
Trustlessness is a cornerstone of blockchain technology and a crucial factor that differentiates it from traditional systems. The term "trustless" doesn't mean that blockchain is untrustworthy; rather, it implies that users don't need to trust a central authority or intermediary to validate and enforce transactions. Trustlessness ensures that smart contracts can execute their predefined rules and conditions with full certainty, without interference or manipulation from external parties. This is essential for automating value-driven processes, like financial transactions, where any deviation from the expected behavior could have significant consequences. By operating in a trustless environment, smart contracts guarantee consistent and transparent outcomes.
In contrast, traditional databases and data warehouses require users to trust the central authority that oversees and manages the system. These centralized systems are vulnerable to potential tampering, fraud, and the risks associated with single points of failure. Connecting such systems to smart contracts compromises the zero-trust model of blockchain technology.
Decentralized database vs. zero-knowledge database
The seemingly obvious answer is a decentralized database. Decentralized databases, like Space and Time’s decentralized HTAP data warehouse, bring the efficient compute capabilities of a SQL database to the blockchain stack without introducing centralized, trust-required points of failure. In a decentralized database, data is distributed across a network of nodes, each of which is responsible for a portion of the data. On the surface, it seems like the perfect supplement to the limited storage and compute of smart contracts, and many companies in the Web3 space are building for this.
But, unless the data and compute are spread across enough nodes for an effective consensus algorithm to guarantee their integrity, trust is still a factor. If your data is managed by a single node operator, or even a handful of them, you’re still required to trust that the operator hasn’t manipulated anything. Consensus works well for a network like Ethereum because transactions are replicated across thousands of nodes, but it’s decreasingly effective as the number of nodes goes down. And this extremely redundant replication that we see in blockchain networks is the reason they’re so storage and compute limited in the first place. So, what do you do? How do you build an efficient decentralized database that can serve query results to smart contracts in a fully trustless way? For that, you need a zero-knowledge proof.
Zk-proven query results and what they enable
Proof of SQL is a zk-proof that provides cryptographic assurance that your data and query results haven’t been manipulated by the node operator running your database cluster. Unlike consensus, which relies on multiple nodes to agree upon the validity of data, zero-knowledge proofs allow one party to prove the correctness of a statement without revealing any additional information. In the case of Proof of SQL, it ensures that the data you're querying is correct and hasn't been tampered with since the time of ingestion, and that the query result you're receiving is an accurate SQL computation of that data.
Proof of SQL is required to guarantee the trustlessness of a decentralized data network without the need for unnecessary node replication or reliance on consensus. This gives smart contracts the ability to ask complex, data-driven questions and receive trustworthy answers, expanding their potential applications and functionality. With Proof of SQL, you can achieve the efficiency of a SQL database with the security and tamperproof lineage of a blockchain. This technology allows enterprises and dapps to confidently utilize off-chain storage and compute to supplement the limited capacity of smart contracts, unlocking new possibilities for Web3 applications and making smart contracts “smarter.”
How it works
Now that we’ve addressed two of the main problems that Proof of SQL solves, we can delve a little further into how it works. We know that Proof of SQL guarantees the following:
- That between the time data was ingested and the time it’s queried, it hasn’t been manipulated.
- That the query result you receive is an accurate computation on the stored data.
- That the query result itself wasn’t tampered with before it’s returned to you.
But let’s take a look at how. Keep in mind, we’re only going to go as deep as is needed for the average person (read: non-cryptographer) to understand what’s going on behind the scenes. For a deeper dive, check out the Proof of SQL documentation authored by Jay White, Space and Time Co-Founder, Head of Research, and creator of the protocol.
The diagram below shows the Proof of SQL architecture at a high level:
Prover and verifier in the Space and Time architecture
Zk-proofs are protocols where an untrusted prover convinces a verifier that a statement is true without sharing any additional information. Like any zk-proof architecture, Proof of SQL is composed of two components: the prover and the verifier. The components are well-named and function exactly as they sound: the prover generates a proof, and the verifier verifies it. In other words, think of the prover as the person who's making a claim, and the verifier as the person who's checking to see if that claim is true.
In Space and Time, when you run a query using Proof of SQL, the data warehouse (where the data is stored) acts as the prover. It generates a proof that shows that the query result is accurate and hasn't been tampered with. This proof is generated using a complex cryptographic algorithm, which requires a lot of computational power. Fortunately, the data warehouse is built on NVIDIA GPUs, which are designed for parallel processing (vs. sequential processing) and are therefore much faster than CPUs for this type of computation. This means that the proof can be generated relatively quickly, even though it's computationally intensive. And because the protocol doesn’t require that the prover is trustless (only the verifier), it only has to be done once per query.
The verifier's job is to check the proof and make sure it's correct. Verification is computationally lightweight enough that it can be done redundantly by the Space and Time validator layer, by a decentralized oracle network, or by a smart contract, all of which ensure the trustlessness of the protocol. The Proof of SQL team is building to make verification so lightweight that it will eventually even be able to be executed by something like an iPhone running a client library.
Tamperproof query process
Let’s walk through the process of a tamperproof query step by step.
Step 1: Ingestion and commitment creation
When data is ingested into Space and Time, whether it’s data we’ve indexed from major blockchains, or data you load from a game, app, or other database, it’s first routed through the validator layer. The validator captures a commitment of the data—let’s call it a “fingerprint”—and stores it for later use. (Note: I mentioned before that verification could also be done by a smart contract. That’s because, unlike the raw data, which might be several gigabytes at a time, the fingerprint is small and can be stored on-chain in an affordable way). Next, the validator sends the raw data to the data warehouse where it’s stored.
Step 2: Query request
When you send a query request through an API, the JDBC driver, or on the Space and Time Studio, the validator routes it to the data warehouse. (The validator is also responsible for determining whether the query is transactional or analytic and routing it to the appropriate OLTP or OLAP engine, which is pretty neat, but you can read more about that here).
Step 3: Proof generation and query result
The data warehouse parses the query, computes the correct result, and produces a cryptographic proof that nothing was tampered as the query was computed. It then sends both the query result and the proof back to the verifier.
Step 4: Proof verification
The verifier then checks the proof, the query result, and the stored digital fingerprint to confirm that the data wasn’t changed between ingestion time and the time of the query. Et voilà! You’ve got yourself a tamperproof query result.
What it enables
Hopefully at this point, you have a thorough—albeit high-level—understanding of what Proof of SQL is, how it works, and what it solves, which should give you better foundational context to understand some use cases for it. We’re going to walk through two of them.
Decentralized options trading
Let’s go back to the example of decentralized options trading. Earlier in this blog post, you read this:
“[A smart contract] works really well for simple value exchange on-chain, but what about something that requires more complex computations, like options trading? On its own, a smart contract can’t even ask a simple question like “what are all the wallets that own two NFTs from this collection?” In order to make smart contracts “smarter,” you have to connect them to off-chain storage and compute; you have to give them a way to ask questions.”
Now you know that Proof of SQL is the missing piece that allows a smart contract to ask and answer data-driven questions about data on its own chain, data on other chains (read more about the Space and Time blockchain indexing service here), and data loaded from any off-chain source. This makes Proof of SQL particularly well-suited for data-driven DeFi protocols, like a decentralized options trading platform.
Options are financial instruments that give traders the right to buy or sell an underlying asset, like crypto, at a predetermined price within a specific time frame. In Web3, options trades are automated by smart contracts. Proof of SQL allows these smart contracts to query pricing information in realtime with the guarantee that the information they’re getting is accurate and hasn’t been manipulated.
Verifiable price information
Calculating an options price is far too computationally heavy to be done by a smart contract; it has to be done off-chain using a complex algorithm like the Black-Scholes model. If a user wants to calculate implied volatility or options prices for derivatives against crypto tokens, they would take the price history of those tokens in Space and Time and then run a Black-Scholes model to calculate the options prices or implied volatility. In this scenario, a smart contract can query the data directly from Space and Time before a trade occurs to ensure the correct, verifiable price at the time of the trade. Here’s how the process works to ensure the price is accurate and tamperproof:
- Store the necessary inputs for the model in Space and Time. These could include parameters like the token price, strike price, volatility, and time to expiration.
- Implement the model and use the stored inputs to calculate the options price.
- Store the resulting options price in Space and Time, along with any intermediate values or calculations used in the model.
- Feed a SQL query that retrieves the realtime options price from Space and Time to your smart contract.
- Use Proof of SQL to verify the integrity of the query and its results, ensuring that the inputs and outputs haven't been tampered with.
Bank consortium ledgers
Proof of SQL also enables enterprises outside of Web3 to leverage verifiable data and compute—which is particularly relevant for industries where financial value is directly tied to data. For example, in a bank consortium ledger, each member bank calculates its own profit and loss (P&L) statement based on its individual transactions, which are stored in a separate database. Once the P&L is calculated, the bank then records it on a shared ledger, which functions somewhat like a blockchain (transactions can be added, but not deleted or changed. The consortium comes to consensus on the added data). This ledger becomes a single source of truth for the consortium.
Accurate and verifiable calculations
The consortium members need confidence that the recorded P&L for each bank is accurate and hasn't been manipulated before being added to the ledger. Proof of SQL allows this process to be done in a verifiable way, so that no one bank has to trust the integrity of the others. When a bank adds its P&L statement to the shared ledger, it provides a proof that guarantees its P&L was calculated correctly according to the transactions in its database… that no one manipulated the result or the intraday data used to calculate it.
Transparency and accountability
This system of mutual accountability fosters trust and transparency among consortium members. Each bank can be confident that the shared ledger contains accurate and reliable financial data, without having to place their trust in another institution.
The verify-everything future
If you’ve stuck with me this far, you’re probably beginning to realize that this is much bigger than smart contracts, or banks, or databases. Proof of SQL is empowering a future where all financial, economic, and business processes are not built on trust… but on verification. The verify-everything future.
Space and Time has pioneered the first zk-data warehouse to kickstart the era of verification, but we envision a future that goes far beyond Space and Time. We envision a future where every database in the world is verified by Proof of SQL—where Proof of SQL is the standard. We’re excited to continue building toward that future, and look forward to sharing more milestones in the coming months.