rss-bridge 2024-02-28T19:31:00+00:00

SE Radio 605: Yingjun Wu on Streaming Databases

Yingjun Wu, founder of RisingWave Labs and previously a software engineer at Amazon Web Services and researcher at IBM Almaden Research Center, speaks with SE Radio host Brijesh Ammanath about streaming databases. After considering the benefits and unique challenges, they delve into the architecture and design patterns of streaming databases, as well as the evolution and security considerations. Yingjun also talks about the future of streaming databases, including the potential impact that Amazon S3 Express One Zone will have on the streaming landscape, and how the unified batch and streaming might evolve in the database world. Brought to you by IEEE Computer Society and IEEE Software magazine.

Show Notes

Related Episodes

SE Radio 346 – Stephan Ewen on Streaming Architecture

SE Radio 218 – Udi Dahan on CQRS (Command Query Responsibility Segregation)

Other References

Rethinking Stream Processing and Streaming Databases

RisingWave.com

RisingWave on Slack

Yingjun Wu on Github

Techtarget.com: Amazon S3 Storage Picks Up Speed at ReInvent 2023

Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Brijesh Ammanath 00:00:18 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. Today’s session is about streaming databases: understanding what they are, the benefits of using them, and the challenges unique to them. Our guest today is Yingjun Wu, who is the founder of RisingWave Labs, building the product RisingWave, which is a distributed SQL database for stream processing. Prior to RisingWave Labs, YJ was a software engineer on the Redshift team at Amazon Web Services and a researcher in the database group at IBM Almaden Research Center. Yingjun, welcome to the show.

Yingjun Wu 00:00:50 Hi. Hello everyone. Thanks for having me here. I’m Yingjun. Yes, I’m the founder of RisingWave Labs, and I’m a database guy and a stream processing guy. So I’m really excited to join the show.

Brijesh Ammanath 00:01:02 Let’s start with the fundamentals. Can you provide a high-level overview of what a streaming database is?

Yingjun Wu 00:01:08 Well, a streaming database is about stream processing, but it’s not just a stream processing engine; it is a database system. So people may have heard of, or probably have already used, some stream processing systems like Apache Flink or probably Spark Streaming, right? But RisingWave basically provides people with a database experience for processing streaming data. In other words, it allows people to process streaming data inside of a database. This is the fundamental difference from stream processing engines, because a streaming database actually stores the data, which means that we can persist the input and output of the stream processing jobs and then do data serving inside of the database system. And RisingWave specifically, as a streaming database, is Postgres compatible, which means that if you have a system that can talk to Postgres, that system is highly likely to be able to talk to RisingWave. So RisingWave can connect to a lot of systems in the Postgres ecosystem.
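To make the "store the input, keep the output, and serve it" idea concrete, here is a minimal toy sketch in plain Python. It is not RisingWave's implementation or API; the class name `ToyStreamingTable` and its methods are hypothetical and chosen only for illustration. It assumes a single process and an in-memory store.

```python
from collections import defaultdict


class ToyStreamingTable:
    """Hypothetical toy: keeps the input rows and an always-current per-key count."""

    def __init__(self):
        self.rows = []                   # "persisted" input (in memory, for illustration)
        self.counts = defaultdict(int)   # incrementally maintained result

    def insert(self, key, value):
        # Store the incoming event and update the result right away,
        # instead of recomputing it from scratch at query time.
        self.rows.append((key, value))
        self.counts[key] += 1

    def query(self, key):
        # Serve the precomputed result directly from the same system.
        return self.counts.get(key, 0)


table = ToyStreamingTable()
for user in ["alice", "bob", "alice"]:
    table.insert(user, "click")

print(table.query("alice"))
```

The point of the sketch is only that ingestion, incremental computation, and serving live in one place, which is the contrast drawn above with a standalone stream processing engine.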

Brijesh Ammanath 00:02:20 Right. And what is the problem that we’re trying to solve by using streaming databases?

Yingjun Wu 00:02:25 Well, streaming databases definitely try to solve the stream processing problem. So what’s the problem with stream processing? Well, there are already a lot of stream processing systems, right? Apache Flink and Spark Streaming, for example. But I think the key problem in today’s world is that these types of systems are, first, difficult to use and, second, not very cost efficient. So let’s talk about ease of use first. When we try to use a big data system like Spark or Flink, well, we need to think about how to write Java, right? And people may argue that these systems already provide a Python API, probably a SQL API, so we can directly write SQL code or Python code. But fundamentally you actually need to manage a JVM-based system. And more importantly, you actually need to understand the fundamental concepts inside of these stream processing systems.

Yingjun Wu 00:03:24 Like, you need to understand what checkpointing is, how to do fault tolerance, and how to do scaling, right? You need to know how the system works internally. But in RisingWave, people do not need to think much about these things, right? Think about it: if you’re using Postgres, you never need to worry about how Postgres processes transactions, right? And if you use Postgres, you never need to worry about what a checkpoint is or what failure recovery is, right? You just use it and you just write SQL code. That’s fundamentally different from using, let’s say, a system like Spark, or probably Hadoop, probably Flink, right? And the second thing here is cost efficiency. So when we talk about Spark or Flink, right, in this kind of big data system, people are talking about MapReduce, right?

Yingjun Wu 00:04:15 We try to shuffle the data: partition the data, shuffle it across multiple machines, and ask every single machine to do its local computation. And then we do the aggregation, right? Such an architecture is highly optimized for performance, which means that I can give you 10 machines and this kind of system will achieve fast performance by running on top of these 10 machines. But what we see in today’s world is that people use the cloud. So we never need to worry about how many compute nodes we can get, right? As long as we pay money, we can get as many compute nodes as we want, right? So the thing here is more about cost efficiency, that is, how I can reduce the cost. Why is that important? Because people don’t really want to always spend a lot of money on this data system; they just want to provision a small number of machines that just fit their workload, right?
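The partition, shuffle, local computation, and aggregation steps described here can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not how Spark or Flink is implemented: the "machines" are plain lists, and the function names `partition`, `local_count`, and `aggregate` are hypothetical.

```python
import zlib
from collections import Counter


def partition(records, num_workers):
    """Assign each record to a worker bucket by hashing its key (the 'shuffle')."""
    buckets = [[] for _ in range(num_workers)]
    for key in records:
        # crc32 gives a hash that is stable across runs, unlike built-in hash() on str.
        idx = zlib.crc32(key.encode("utf-8")) % num_workers
        buckets[idx].append(key)
    return buckets


def local_count(bucket):
    """Local computation done independently on each worker."""
    return Counter(bucket)


def aggregate(partials):
    """Combine the per-worker partial results into the final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


events = ["ride", "pay", "ride", "login", "pay", "ride"]
buckets = partition(events, num_workers=3)
result = aggregate(local_count(b) for b in buckets)
print(dict(result))
```

Because every occurrence of a key hashes to the same bucket, each worker can count its keys without talking to the others, which is why this style scales out well when a fixed set of machines is available.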

Yingjun Wu 00:05:14 Let’s say that I have a small workload: I don’t really want to pay for 10 machines, I just want to provision one single machine for the system. And in the stream processing world, the key thing is that the streaming workload can fluctuate. Think about it: if you’re using Uber, right, or you’re using, let’s say, LinkedIn or Twitter, the workload can fluctuate, right? A lot of people will visit the website in the morning, but very few people will visit the website at midnight, right? That’s traffic, and that’s why the workload can fluctuate. In the stream processing world we really want to achieve so-called dynamic scaling, which means that we want to provision just as much resource as needed, right? We do not need to provision a lot of compute nodes; we just want to provision compute nodes that can support my workload. So yeah, to summarize, RisingWave is trying to address the ease-of-use problem as well as the cost efficiency problem.
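As a rough illustration of the dynamic scaling argument, the sketch below compares sizing a cluster per time slot with sizing it once for the peak. All numbers are invented for the example, and `nodes_needed` is a hypothetical helper, not a RisingWave or cloud-provider API. It assumes each node handles a fixed number of events per second.

```python
import math


def nodes_needed(events_per_sec, capacity_per_node, min_nodes=1):
    """Smallest node count that covers the load, never below min_nodes."""
    return max(min_nodes, math.ceil(events_per_sec / capacity_per_node))


# Made-up traffic profile: quiet at night, busy in the morning.
hourly_load = {"03:00": 200, "09:00": 4500, "13:00": 2500}

# Dynamic scaling: provision per time slot.
plan = {
    hour: nodes_needed(load, capacity_per_node=1000)
    for hour, load in hourly_load.items()
}

# Static provisioning: size for the peak and keep it all day.
static_nodes = max(plan.values())

print(plan)
print(static_nodes)
```

Under these assumed numbers, static provisioning pays for the peak node count in every slot, while the dynamic plan drops to a single node during the quiet period.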

Brijesh Ammanath 00:06:14 Right. And what’s the tradeoff for achieving cost efficiency?

[...]
