Exploring Space and Time: Decentralized Data Warehousing

Exploring Space and Time is a deep-dive video interview series on the various components of the Space and Time platform.

In this episode, Content Lead Catherine Hickox interviews Senior Software Engineer Jack Carroll on Space and Time’s decentralized data warehouse.

“Having been a Data Engineer myself, I am well aware of the pain points of existing monolithic data-lake architectures and their support components. Scaling problems are frequent, and data lineage and governance are an afterthought. Space and Time's innovative application of bleeding-edge Web3 technology has the potential to address these problems, amongst others, not as afterthoughts but as core architectural tenets. Such a proposition proved too enticing to pass up, so I have made the decision to join the incredible team at Space and Time. I am humbled to be working with some of the most talented engineers of my career, and am excited for our collective journey ahead.” Jack Carroll, Senior Software Engineer, Space and Time

Catherine: Hi, everyone. I'm Catherine Hickox, Content Lead here at Space and Time, and I'm here with Space and Time Senior Software Engineer Jack Carroll. Thanks for joining me, Jack.

Jack: Thanks, Catherine. Thank you for this introduction. As Cat said, I work on the data warehousing team as Senior Software Engineer. We are working on a number of intractable problems both within the Web3 space and within the data warehousing space and finding synergies every day to drive that direction forward. I'm really excited to talk to you, and for you to learn a bit more about what we do in the data warehousing team, and hopefully I can answer everybody's lingering questions on everything that data warehousing is at Space and Time.

Catherine: Awesome. Yeah, I'm really excited to talk to you, too. I know we're doing some really, really cool stuff with what we're building. I'm excited to dive into that. Can you start just by giving me a brief introduction to data warehousing? Talk a little bit about what that means and what's different about the data warehouse that we're building at Space and Time.

Jack: Awesome. So, a bit of background about myself. I was a data engineer at my previous company. I saw an opportunity with Space and Time to solve some really intractable problems in the data space and to do it with some really, really talented engineers that we've got on board. So that's really the first place that I'd love to start. In the data warehousing space, we are trying to build a solution that will serve everybody's needs. So as a data engineer, I do not think of data warehousing as one solution. I think of it as five discrete steps that have to happen for a user to really get their data, to be able to access it in the manner that they need to access it. And Space and Time, as the first Web3 data platform, has to cover all of those use cases and do it in a way that stays performant, and in a way that all our users can come to the table in an equitable fashion and have the tools necessary to do the job the best way that they know how. And so that's a lot of what we're trying to do here: empower some really, really smart people, give them the right tools they need to solve problems that we couldn't even talk about before Web3 came into the sphere.

And I actually want to dig into that a bit more. I said I was a data engineer. This is my first Web3 company. I did not think that I was going to be jumping into Web3 as quickly as I did, but I really saw issues as a data engineer around governance and security and trust. And that is something that Web3 is going to solve. A Web3 data solution is going to solve that. And that is really, really exciting for a data engineer, for an end user. So that was really what I honed in on and how I decided that I wanted to join Space and Time.

Catherine: Awesome. This is also my first Web3 company to work with, and there's a saying or a quote that says, "The Gold Rush is a good time to be in the picks and shovels business." And I think that's exactly what we're doing here at Space and Time. You referred to some really interesting challenges that we're solving. So, super exciting things are on the horizon, for sure.

Jack: Yeah, we are not pushing ourselves toward blockchain because we have to. Blockchain is a technology that really is solving a lot of problems in the data space. I viewed blockchain previously as a solution looking for a problem. There was a very good problem in the finance space, and so we immediately saw that solution. But people have yet to fully explore the solutions built on blockchain technology that can make a real impact. So yeah, that's a really exciting part of my job, for sure.

Catherine: Totally. Yeah, and I want to explore that a little further. But before we do, can you tell me a little bit about some of the architecture we're employing at Space and Time? For folks who have been following along with the project, they've probably heard us use the term HTAP engine, or hybrid transactional/analytical processing engine, to refer to the query engine that is employed by the Space and Time data warehouse. So why don't we start with that? What is an HTAP engine?

Jack: HTAP is a really interesting concept that's come out of the data market, billed as being able to do everything: a one-size-fits-all solution to the data problem. Us being an end-to-end solution provider, we have to go and provide everything for an end-to-end solution. So that's really how we started going into the HTAP space, and saying, "We want all parties to be able to come together in an equitable fashion, and they need to have all the tools necessary to do their job." And that's really what an HTAP system is: one that can perform all of these operations efficiently, but can also be repurposed at will to perform different operations. And there's a big reason for that within the Web3 and decentralized space: you take trust out of the equation. There's a big reason for that that I'm sure we’ll get to. But within that context, we need to have a data warehouse that can stand on its own, that fits all of our users' use cases. And so that's really where we started exploring HTAP.

Catherine: I know that the dichotomy of that is supporting both transactional and analytic queries. So, can you explain the difference between a transactional query and analytic query, maybe give an example of each, and talk about how queries are handled differently in a transactional engine versus an analytic engine?

Jack: When you're talking about transactions, you are talking about near-real-time operations that are either reads or writes. So you are either looking up a single row, a single data point, or you are going and modifying that single data point. A really good concrete use case of this is looking up the wallet address of a user. So, we index all our data; we want a user to be able to go and find the wallet address in under 150 milliseconds. And so that is the benchmark that we are trying to reach for transactional reads. Transactional writes are also a component of that, and we will put out benchmarks for those as well. But then that brings us to the analytics side of things, which is vastly different. When you talk about analytics, the first thing you have to talk about is aggregations.

So, it's an aggregation operation. Say that you're taking an average over this column or you're taking the sum, the min, the max over this column. Those operations are not optimized in a transactional system. A transactional system is very good at real-time operations. But when you want to go and answer a question, like "What is the average of this column over millions upon millions of rows?" a transactional system is going to fall flat. And so that's really where we are talking about both the transactional side of things and the analytic side of things. We want users to be able to push data in at a massive rate, but then go and pull the data out and get the insights they need to do the work that they need to do.
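The contrast Jack describes can be sketched in a few lines of SQL run through Python's `sqlite3` module. SQLite stands in purely for illustration here (Space and Time's engine is not SQLite, and the `transfers` table and its columns are invented names): an indexed point lookup touches a single row, while an aggregation has to scan all of them.

```python
import sqlite3

# In-memory database purely for illustration; the table and column
# names are hypothetical stand-ins for indexed blockchain data.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE transfers (wallet_address TEXT, amount REAL)")
# The index is what makes the transactional point lookup fast.
cur.execute("CREATE INDEX idx_wallet ON transfers (wallet_address)")

rows = [(f"0x{i:040x}", float(i % 100)) for i in range(10_000)]
cur.executemany("INSERT INTO transfers VALUES (?, ?)", rows)

# Transactional read: an indexed lookup of one wallet's row --
# the kind of query expected back in milliseconds.
cur.execute(
    "SELECT amount FROM transfers WHERE wallet_address = ?",
    (f"0x{42:040x}",),
)
point = cur.fetchall()

# Analytic read: an aggregation that must scan every row --
# the workload an analytic (OLAP) engine is optimized for.
cur.execute("SELECT AVG(amount), MIN(amount), MAX(amount) FROM transfers")
avg_amount, min_amount, max_amount = cur.fetchone()
```

The same SQL text can express both shapes; what differs is how much data the engine must touch to answer, which is why a system tuned for one tends to fall flat on the other.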

Catherine: So how, from an architecture and engineering standpoint, does an HTAP engine support both workloads, and what are some of the challenges of building that?

Jack: Yeah, I think that you hit the nail on the head. The workload management is the crux of the problem. And so what Space and Time has done is said, "There's a Web2 solution that is ready to go, that is very battle-tested within the market, and so let's use it to build our Web3 system to have the primitives necessary for us to do this very advanced workload management, to scale write workloads and read workloads within the context of one cluster." That system is Kubernetes. We are a full Kubernetes shop. We do deep integration with the Kubernetes scheduler. And so Kubernetes is really a workhorse solution that we felt comfortable saying can support our workload management needs. And it's a middle ground where we can meet, where our enterprise customers are very, very familiar with it.

Catherine: Yeah, absolutely. I want to dive a bit more into scaling and balancing loads. So tell me about Kubernetes and a little bit about why this matters.

Jack: I'm going to get into why this matters within the context of a Web3 company, because I think it illustrates the point really well. So, say you are a company that has data sitting in, let's say, warehouses in Australia, halfway around the world, and it is there, but you need to derive insights from that data. Traditionally, what were your options? You could spin up a cluster next to the data that exists, which incurs cost. You could pass all of that raw data over the wire, which would incur serious cost. Space and Time is going to enable, to empower, a third option. The option is that I don't need to own this infrastructure. I can own the data that comes from this infrastructure, but I don't need to trust the infrastructure itself in order to trust the data that is coming from it.

And what that allows us to do is to very quickly repurpose more generic resources to very specific use cases, where they can perform really, really well. Within a matter of minutes, you can repurpose your resources into an OLAP query engine that can work over billions and billions of rows of data. And then the next minute you are pulling in thousands and thousands of gigabytes every single minute. And so we need a system that is able to do that so that our users can go and take advantage of the opportunities when they are not using their own infrastructure. So the analogy that I've been using with engineers is a spot model: we are renting spot resources. What the Web3 data warehouse allows us to do is rent spot resources and repurpose infrastructure within the network without having to spin up outside of the network.

Catherine: Yeah, that's really cool. I know a lot of the thought behind building a decentralized data warehouse is obviously to build it on top of Web3 principles, right? Decentralization, trustlessness. But it's cool to see the actual real world utility that has for our functions. I want to talk a little bit about scaling when resources are limited. You're talking about repurposing resources and what that looks like, but what about when there's a limited number of nodes? And talk a little bit about the difference between scaling two query engines independently within a cluster versus scaling between multiple clusters, and how this compares to scaling with cloud elasticity.

Jack: Cloud elasticity is the point that I would love to start at, because it is the point that our enterprise customers are most familiar with. They understand that we can go add hardware to this data warehouse and get more performance out of it. That will always be an option within the Space and Time platform, but that is not the only option within the Space and Time platform. When you are talking about working within a constrained resource space, that's where you start talking about very efficient workload management and repurposing your friends that are sitting there next to you, that are co-located with you, using their infrastructure that they've already rented, to go and do your work. And so when you are talking about a very constrained resource environment, Space and Time is going to perform better than the existing options. If you're talking about the infinitely elastic environment that we are all used to right now, that is something that Space and Time is absolutely going to support and absolutely will be performant, but it is not the main use case.

Catherine: So, we've touched on the importance of scalability for Space and Time as an end-to-end data platform, but the crux of what we do is we allow that data and the query results being run within the platform to be connected back to smart contracts. So can you talk a little bit about what smart contracts will enable for scaling?

Jack: From a data engineer's perspective, I think about the five distinct workloads that I covered previously: ingestion, movement, storage, refinement, and serving. And to get a full data product, you have to get all the way to the end of that pipeline. Traditionally, enterprises had to do that entire pipeline themselves. Now, what smart contracts are enabling is saying, "You do not have to actually own all of this. You can go and contract out to a third party, somebody that's the expert, the absolute expert, in their field and you can receive the product in an intermediate state and then go and do everything else that you need to do." So, if you're talking about scale, you are talking about democratizing data operations and vastly increasing access to data services for some really, really smart people that can solve problems much better than you and I can, that understand the problem space really well, and we want to empower them to solve those problems.
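As a rough sketch only, the five workloads Jack names can be modeled as composable pipeline stages. Every function and field name below is hypothetical (nothing here comes from the Space and Time codebase); the shape simply shows why any single stage could be contracted out to a specialist, as long as the intermediate data handed between stages is honored.

```python
from typing import Callable, Iterable

# Hypothetical model of the five workloads: ingestion, movement,
# storage, refinement, and serving, each as a list -> list stage.
Stage = Callable[[list], list]

def ingest(raw: Iterable) -> list:
    """Pull raw records into the pipeline."""
    return list(raw)

def move(records: list) -> list:
    """Transport records toward storage (a no-op in this sketch)."""
    return records

def store(records: list) -> list:
    """Persist records; here we just keep them in memory."""
    return records

def refine(records: list) -> list:
    """Clean/transform: drop records missing an amount."""
    return [r for r in records if r.get("amount") is not None]

def serve(records: list) -> list:
    """Expose a queryable result -- here, records sorted by amount."""
    return sorted(records, key=lambda r: r["amount"])

def run_pipeline(raw: Iterable, stages: list) -> list:
    # The smart-contract idea in miniature: any stage in this list
    # could be replaced by one fulfilled by a third-party expert,
    # provided the intermediate data contract between stages holds.
    data = raw
    for stage in stages:
        data = stage(data)
    return data

result = run_pipeline(
    [{"amount": 3}, {"amount": None}, {"amount": 1}],
    [ingest, move, store, refine, serve],
)
```

The enterprise that only wants the final "serving" product can, in this framing, receive the pipeline's intermediate state from a contracted party rather than owning every stage itself.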

Catherine: Amazing. Yeah, it's so cool to see what smart contracts are empowering across all spaces, in all industries. And it's so cool to get in on the ground level of that with what we're building in Space and Time.

Jack: Absolutely. And it's a very, very powerful idea within the data space. As I said, I would not have joined this company if it weren't. I am so very excited to be diving into this with an absolutely amazing team and seeing what we can put out. I think there are questions we are asking ourselves that came about that wouldn't have come about at a Web2 company. We are going to upend the data market. I can definitively say that. And we have the resources to do it. We have the smart people that are just itching to do that. That is what I'm most excited about at Space and Time.

Catherine: We're really excited to have you on board also, Jack. Thank you so much for sitting down and chatting with me and lending a little bit more insight into what we're building here.

Jack: Absolutely. Anytime.