rss-bridge 2023-12-27T20:12:00+00:00

SE Radio 596: Maxim Fateev on Durable Execution with Temporal

Maxim Fateev, the CEO of Temporal, speaks with SE Radio's Philip Winston about how Temporal implements durable execution. They explore concepts including workflows, activities, timers, event histories, signals, and queries. Maxim also compares deployment using self-hosted clusters or the Temporal Cloud.






Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Philip Winston 00:00:18 Welcome to Software Engineering Radio. This is Philip Winston. I’m here today with Maxim Fateev. Maxim is the co-founder and CEO of Temporal. He has worked in lead roles at Amazon, Microsoft, Google, and Uber. He led the development of AWS Simple Workflow Service at Amazon and the open-source Cadence workflow at Uber. Maxim holds degrees in computer science and physics. We’re going to talk today about automating business processes with distributed systems, but let’s start with a term not everyone might be familiar with: durable execution. Maxim, what is durable execution and why is it a useful abstraction?

Maxim Fateev 00:00:58 Durable execution is a new abstraction, which we kind of introduced, and it’s not very well known, but I think every engineer absolutely should have it in their back pocket of tools. The idea is pretty straightforward. Imagine a runtime which keeps the full state of your code execution durable all the time. It means that if a process crashes or any other infrastructure outage happens, like a network event, power outage, and so on, your code will keep running after the problem is resolved. If the process crashes, the execution will just be migrated to a separate process. And if the whole region is down, as soon as it comes back, your code will continue executing. And I’m not saying just the process, but actually the full state. So for example, if you have a function which executes, let’s say, three API calls, A, B, C, and it’s blocked on B, and then this process crashes, this execution will migrate to another machine and it’ll still be blocked on B. The result of B can come two days later, and then it’ll continue executing. So all local variables and the whole state, including blocking calls, are fully preserved all the time.
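
Editor's sketch: the A, B, C example above can be illustrated with a toy replay loop in Python. This is not Temporal's SDK or its actual implementation. `DurableContext`, `Blocked`, `workflow`, and `run_demo` are hypothetical names invented for this illustration, and a JSON file stands in for a durable event history. The assumption is that each completed step is recorded, so a fresh process can replay the recorded results and resume at the first unfinished step.

```python
import json
import os
import tempfile


class Blocked(Exception):
    """Hypothetical signal: a step has no result yet."""


class DurableContext:
    """Toy context that records each completed step in a JSON history file."""

    def __init__(self, history_path):
        self.history_path = history_path
        self.history = []
        if os.path.exists(history_path):
            with open(history_path) as f:
                self.history = json.load(f)
        self.position = 0  # index of the next step in this run

    def call(self, name, fn):
        # Replay: the step already completed in an earlier run, so return
        # the recorded result without invoking fn again.
        if self.position < len(self.history):
            event = self.history[self.position]
            assert event["name"] == name, "workflow code must be deterministic"
            self.position += 1
            return event["result"]
        result = fn()  # may raise Blocked; nothing is recorded in that case
        self.history.append({"name": name, "result": result})
        with open(self.history_path, "w") as f:
            json.dump(self.history, f)  # persist before moving on
        self.position += 1
        return result


def workflow(ctx, steps):
    # Plain sequential code; a, b, c are ordinary local variables.
    a = ctx.call("A", steps["A"])
    b = ctx.call("B", steps["B"])
    c = ctx.call("C", steps["C"])
    return a + b + c


def run_demo():
    calls = {"A": 0, "B": 0, "C": 0}
    b_ready = {"value": False}

    def step_a():
        calls["A"] += 1
        return 1

    def step_b():
        calls["B"] += 1
        if not b_ready["value"]:
            raise Blocked("B has not answered yet")
        return 2

    def step_c():
        calls["C"] += 1
        return 3

    steps = {"A": step_a, "B": step_b, "C": step_c}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "history.json")
        try:
            workflow(DurableContext(path), steps)  # first process
        except Blocked:
            pass  # simulate the process dying while blocked on B
        b_ready["value"] = True  # B's result arrives later
        # A second process with a fresh context resumes from the history.
        result = workflow(DurableContext(path), steps)
    return result, calls


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the second run does not call A again: its result comes from the history, and execution picks up at B. A real system would deliver B's result when it arrives rather than retrying the call as this toy does.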

Philip Winston 00:02:00 Great. We’re definitely going to get into how that works. But before we do that, we’re going to do a little bit of history and how we got here. But let’s just describe Temporal as it exists today.

Maxim Fateev 00:02:10 So Temporal is an open-source project. It’s MIT licensed, so you can absolutely run it yourself, including both the SDK libraries and the backend server, which keeps the state. We started this project at Uber. It was initially called Cadence, and then four years ago we started a company, we forked the project, and we called it Temporal. It’s been around for seven years in production at high scale and is used by thousands of companies around the world for mission-critical applications.

Philip Winston 00:02:36 Let’s go back even further. Before Cadence, you worked on the Simple Workflow Service at Amazon. Can you tell me a little bit about how that project started and what the goals of the project were?

Maxim Fateev 00:02:48 So I probably want to start even earlier. I joined Amazon for the first time in 2002, and back then Amazon was actively moving from a monolithic architecture to a microservice architecture. It wasn’t called microservices back then. It was called just services. Later, it was called SOA. But the basic idea was the same: the monolith was unsustainable, and just relinking that binary was taking 45 minutes on a high-end desktop. So it wasn’t practical to do any development, and Amazon had to make the move. And we found that, practically, we used what was a standard approach even back then of using queues to connect our multiple services together and execute all these backend business processes. I was tech lead for the pub/sub group. I was leading the group which owned all backend pub/sub of Amazon. I worked on the distributed replicated storage for queues, and I think it’s still used by Simple Queue Service as its backend.

Maxim Fateev 00:03:38 And so I was in the middle of a lot of conversations about design choices. When you build these large-scale systems using queues and event-driven architectures, it quickly becomes apparent that it’s not the right abstraction. Events and queues are not the right abstraction to build such systems. One very important point there is that they’re very good runtime abstractions, right? Event-driven systems and queues give you awesome runtime characteristics because, for example, your service can be slow, but other services are not slowed down by that because things can be buffered in queues. Or your service can even be down and the other services don’t even notice that. They just see messages maybe not coming, or being buffered. So the runtime characteristics of these systems are awesome, but as a design choice, they actually create very brittle systems, because everything is connected to everything, there are no clear APIs, and it’s very hard to create the state machine in every service and support all the business logic.

Maxim Fateev 00:04:32 So our team recognized that and we started to think about an orchestrator. Amazon had an internal orchestrator based on the Petri nets idea, built on top of Oracle, and it was used to actually orchestrate orders, and we started to think about more scalable solutions. We had three different iterations, three versions of the service internally, and only the third version was public. It was released as AWS Simple Workflow Service. So it took us a lot of different attempts to get it right, and even with Simple Workflow I wouldn’t say we got it right, because it’s still not a very popular service, but we actually learned a lot doing that. And then what happened is that we kind of left for other companies. The co-founder of Temporal, Samar Abbas, who worked with me on Simple Workflow, went to Microsoft, and at Microsoft he wrote the Durable Task Framework, which later was adopted by Azure Functions, and now Microsoft has Azure Durable Functions. And then we met at Uber and we took all these ideas and created the next generation of the technology, which we are kind of right now. You can think of this T

[...]

