
High Availability Sequencing for OP Chains Using op-conductor

by Arsh Singh on September 3, 2024

Single point of failure

Similar to L1 Ethereum, a node for an OP Stack chain has an execution client (op-geth) and a consensus client (op-node). The consensus client can optionally run a "sequencer", which is responsible for block production. Unlike L1, however, there aren't multiple validators taking part in block production. Instead, only one permissioned node, run by the chain operator, can act as the sequencer and produce blocks.

The sequencer is a single point of failure for the entire chain. If the sequencer is offline, block production stalls.

Simple example: you're running an OP Stack chain and want to upgrade the node to the latest version. Since there can only be one sequencer running at a time, you'll have to stop the node, upgrade it, and then start it again. This means the sequencer will be offline for the duration of the upgrade.

A bigger problem is the sequencer going offline unexpectedly or for an extended period of time. This can happen for any number of reasons, such as a host machine crash, storage failure, or network issues. Until the sequencer is brought back online, the chain is effectively halted.

Fallback sequencers

Even though there can only be one sequencer running at a time, there can be multiple nodes that can potentially be promoted to be the active sequencer. In this case, an update could be performed like so:

  • stop sequencer on node A
  • start sequencer on node B
  • upgrade node A
  • stop sequencer on node B
  • start sequencer on node A
  • upgrade node B
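The steps above can be sketched as a small script. This is a minimal in-memory sketch, not a real deployment tool: `FakeNode`, `handover`, and `rolling_upgrade` are hypothetical names, and a real script would talk to each node over its admin RPC instead of flipping a flag.

```python
class FakeNode:
    """In-memory stand-in for a node; a real script would call the node's admin RPC."""

    def __init__(self, name, sequencing=False, version="v1"):
        self.name = name
        self.sequencing = sequencing
        self.version = version

    def upgrade(self, version):
        # Upgrading means restarting, so refuse while this node is producing blocks.
        if self.sequencing:
            raise RuntimeError(f"{self.name} is still sequencing")
        self.version = version


def handover(src, dst, log):
    """Move the sequencer role from src to dst, stopping src first."""
    src.sequencing = False
    log.append(f"stop sequencer on node {src.name}")
    dst.sequencing = True
    log.append(f"start sequencer on node {dst.name}")


def rolling_upgrade(a, b, version):
    """Upgrade both nodes while at most one of them sequences at any time."""
    log = []
    handover(a, b, log)
    a.upgrade(version)
    log.append(f"upgrade node {a.name}")
    handover(b, a, log)
    b.upgrade(version)
    log.append(f"upgrade node {b.name}")
    return log


node_a = FakeNode("A", sequencing=True)
node_b = FakeNode("B")
steps = rolling_upgrade(node_a, node_b, "v2")
print("\n".join(steps))
```

Stopping the source before starting the destination keeps the "only one sequencer" rule intact, at the cost of a short gap in block production during each handover.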

This is a simple solution, but it requires manual intervention, is error-prone, and doesn't handle unexpected node failures. We need automatic failover! This might sound simple, but there are a couple of problems that need to be solved first...

Problem #1: The one true sequencer

Let's say we build a simple solution that continuously monitors the health of the nodes, and transfers the sequencer role to a healthy node if the current sequencer goes offline.

Now consider a scenario where node A (the active sequencer) becomes unresponsive, e.g. because of network issues or resource exhaustion. Not only will the chain stop producing blocks, but node A also cannot process admin_stopSequencer calls. This leaves our monitoring solution with two choices: do nothing, i.e. accept downtime :(, or start the sequencer on node B.

If the sequencer on node B is started and node A later becomes responsive again, we will have two active sequencers and the chain will fork!
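The failure mode is easy to reproduce in a toy model. The sketch below is purely illustrative: `Node` and `naive_failover` are hypothetical names, and "unreachable" is modelled as a flag rather than a real network partition.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.sequencing = False
        self.reachable = True

    def stop_sequencer(self):
        # An unreachable node cannot be told to stop, even if it is still running.
        if not self.reachable:
            raise TimeoutError(f"{self.name} did not respond")
        self.sequencing = False


def naive_failover(active, standby):
    """Promote standby when active looks unhealthy; returns the new active node."""
    if active.reachable:
        return active
    try:
        active.stop_sequencer()
    except TimeoutError:
        pass  # cannot confirm the old sequencer stopped, promote anyway
    standby.sequencing = True
    return standby


node_a, node_b = Node("A"), Node("B")
node_a.sequencing = True
node_a.reachable = False         # A is cut off from the monitor but keeps its role
naive_failover(node_a, node_b)
node_a.reachable = True          # the partition heals
active = [n.name for n in (node_a, node_b) if n.sequencing]
print(active)  # ['A', 'B']
```

Nothing in this monitor can tell "A has crashed" apart from "A is temporarily unreachable", which is why promoting B on a health check alone is unsafe.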

Problem #2: Potential re-orgs

The RPC method admin_startSequencer requires a block hash, i.e. the latest block on top of which to build new blocks. Consider the following scenario:

  • block height for the chain is 99 and node A and B are in sync
  • node A produces new block 100
  • sequencer on node A is stopped
  • sequencer on node B is started

When the sequencer on node B starts, it might not have the latest block 100 produced by A (e.g. p2p comms delay). If admin_startSequencer didn't require the hash of the latest block, node B would create a new block 100 on top of block 99, causing the old block 100 to be re-orged.

So the block hash param on admin_startSequencer helps prevent re-orgs, but it also means that in order to start sequencing, node B must have the last block produced by A. Querying A might not be possible if A has crashed, so we need a way to guarantee that all nodes agree on what the latest block is.
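The scenario above can be modelled in a few lines. This is a simplified sketch of the head check, not op-node's actual code: `Node`, `HeadMismatch`, and the `hash-N` strings are made up for illustration.

```python
class HeadMismatch(Exception):
    """Raised when the caller's block hash is not this node's latest block."""


class Node:
    def __init__(self, blocks):
        self.blocks = list(blocks)  # block hashes, oldest first
        self.sequencing = False

    @property
    def head(self):
        return self.blocks[-1]

    def start_sequencer(self, block_hash):
        # Refuse to build unless we already hold the block the caller names as latest.
        if block_hash != self.head:
            raise HeadMismatch(f"head is {self.head}, caller expected {block_hash}")
        self.sequencing = True


shared = [f"hash-{height}" for height in range(98, 100)]  # blocks 98 and 99
node_a, node_b = Node(shared), Node(shared)
node_a.blocks.append("hash-100")      # A produces block 100, then stops sequencing

try:
    node_b.start_sequencer(node_a.head)   # B has not received block 100 yet
    rejected = False
except HeadMismatch:
    rejected = True

node_b.blocks.append("hash-100")      # block 100 arrives over p2p
node_b.start_sequencer("hash-100")    # now B may build block 101 on top of it
print(rejected, node_b.sequencing)    # True True
```

Note that the caller had to read the latest hash from node A. If A is down, that hash has to come from somewhere else.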

op-conductor

OP Labs built a service called op-conductor that aims to solve these problems. It runs alongside op-node and op-geth, and guarantees zero downtime and no re-orgs for unsafe blocks, as long as a majority of the nodes are healthy.

op-conductor uses Raft consensus, an algorithm that helps multiple computers agree on a shared state, such as which actions to take next, even if some of them fail. This allows op-conductor to elect the active sequencer and provide automatic failover by monitoring node health and transferring leadership if the current leader goes offline or becomes unhealthy. Additionally, op-node has been updated to integrate with op-conductor: when the integration is enabled, the sequencer must commit each new block to op-conductor.

How does it guarantee only 1 active sequencer?

If the active sequencer becomes unresponsive, leadership will be transferred to another node. If at a later point the original sequencer comes back online, it will not be able to commit new blocks as it is not the leader anymore.
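A toy model of this leader-gated commit is shown below. It is only a sketch of the idea: `ToyConductor`, `NotLeader`, and the method names are hypothetical and do not mirror op-conductor's real interface, and the single in-process object stands in for what is really a replicated Raft cluster.

```python
class NotLeader(Exception):
    """Raised when a node that is not the current leader tries to commit a block."""


class ToyConductor:
    def __init__(self, nodes, leader):
        self.nodes = set(nodes)
        self.leader = leader
        self.term = 1
        self.log = []  # (term, node, block) entries accepted so far

    def transfer_leadership(self, new_leader):
        if new_leader not in self.nodes:
            raise ValueError(f"unknown node {new_leader}")
        self.leader = new_leader
        self.term += 1

    def commit(self, node, block):
        # Only the current leader may add blocks, whatever the caller believes.
        if node != self.leader:
            raise NotLeader(f"{node} is not the leader (leader is {self.leader})")
        self.log.append((self.term, node, block))


conductor = ToyConductor(["A", "B", "C"], leader="A")
conductor.commit("A", "block-100")
conductor.transfer_leadership("B")     # A became unresponsive
conductor.commit("B", "block-101")

try:
    conductor.commit("A", "block-101-from-A")  # A comes back and tries to sequence
    rejected = False
except NotLeader:
    rejected = True

blocks = [block for _, _, block in conductor.log]
print(rejected, blocks)  # True ['block-100', 'block-101']
```

The stale sequencer never needs to be told to stop: its blocks are simply refused, so a competing fork never gets committed.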

How does it prevent re-orgs?

op-conductor stores new blocks in the Raft log, which means they get replicated across the cluster. This ensures that all nodes have the same view of the chain and that a newly elected sequencer can start from the last block produced by the previous one.
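Why replication helps can be seen in a heavily simplified model of majority commit. This is a sketch under stated assumptions, not Raft itself: `replicate` and `elect` are hypothetical helpers, a block is stored only if a majority is reachable, and "longest log wins" stands in for Raft's rule that a candidate with an out-of-date log cannot be elected.

```python
def replicate(logs, block, reachable):
    """Append block to the reachable nodes' logs; succeeds only with a majority."""
    if 2 * len(reachable) <= len(logs):
        return False  # no quorum, so the block is not committed
    for name in reachable:
        logs[name].append(block)
    return True


def elect(logs, alive):
    """Pick the alive node with the longest log, provided a majority is alive."""
    if 2 * len(alive) <= len(logs):
        raise RuntimeError("no quorum, cannot elect a leader")
    return max(sorted(alive), key=lambda name: len(logs[name]))


logs = {"A": ["block-99"], "B": ["block-99"], "C": ["block-99"]}

# A produces block 100; C is lagging, but A and B form a majority.
committed = replicate(logs, "block-100", ["A", "B"])

# A crashes. Any majority of the survivors includes a node that stored block 100.
new_leader = elect(logs, ["B", "C"])
start_from = logs[new_leader][-1]
print(committed, new_leader, start_from)  # True B block-100
```

Because any two majorities of the same cluster overlap in at least one node, a committed block is always known to some member of the group that elects the next leader, so nobody has to query the crashed node.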

Increasing reliability

op-conductor is a major step towards making OP Stack chains more reliable. It has made rolling upgrades easier and safer, and is now a standard part of our chain deployments.

However, there are still steps that can be taken to make your chain even more reliable, such as setting up proper monitoring and alerting, and having a disaster recovery plan in place. For example, it is important to monitor that your chain is rolling up to L1 (or L2 if your chain is an L3) as expected. Degen and Proof of Play ran into exactly this problem: their batch submitter stopped working because of a misconfiguration and wasn't fixed within the sequencing window, which caused re-orgs, a lot of downtime, and some loss of funds. We'll cover these topics in future posts, so stay tuned!
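As a rough illustration of the kind of check meant here, the sketch below raises an alert as the sequencing window is used up. It assumes a simplified model in which batches for a given L1 origin must land on L1 within `seq_window_size` L1 blocks of that origin. The function names, the 50% warning threshold, and the window size of 3600 are illustrative assumptions; use the values from your own rollup config.

```python
def window_remaining(l1_head, safe_head_l1_origin, seq_window_size):
    """L1 blocks left before the oldest unsubmitted batches miss the window."""
    deadline = safe_head_l1_origin + seq_window_size
    return deadline - l1_head


def batcher_alert(l1_head, safe_head_l1_origin, seq_window_size, warn_fraction=0.5):
    """Return 'ok', 'warning', or 'critical' based on how much of the window is used."""
    remaining = window_remaining(l1_head, safe_head_l1_origin, seq_window_size)
    if remaining <= 0:
        return "critical"  # the window has already elapsed
    used = seq_window_size - remaining
    if used >= warn_fraction * seq_window_size:
        return "warning"   # batches are late enough that someone should look
    return "ok"


SEQ_WINDOW = 3600  # example value, in L1 blocks

print(batcher_alert(1010, 1000, SEQ_WINDOW))  # ok
print(batcher_alert(3000, 1000, SEQ_WINDOW))  # warning
print(batcher_alert(5000, 1000, SEQ_WINDOW))  # critical
```

Alerting well before the deadline matters because fixing a misconfigured batch submitter takes time, and the consequences only appear once the window has fully elapsed.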

About Chaindrop

Running a chain is challenging. That's why many companies choose to work with a Rollup as a Service (RaaS) provider. We at Chaindrop specialize in deploying fully customizable Ethereum rollups directly into your AWS environment. We provide you with full control, ensuring you can launch, scale, and modify your infrastructure with no issues. Chaindrop supports Optimism and Arbitrum stacks.
