Similar to L1 Ethereum, a node for an OP Stack chain has an execution client (op-geth) and a consensus client (op-node). The consensus client can optionally run a "sequencer", which is responsible for block production. Unlike L1, however, multiple validators do not take part in block production. Instead, only one permissioned node, run by the chain operator, can act as the sequencer and produce blocks.
The sequencer is a single point of failure for the entire chain. If the sequencer is offline, block production stalls.
Simple example: You're running an OP Stack chain, and you want to upgrade the node to the latest version. Since there can only be one sequencer running at a time, you'll have to stop the node, upgrade, and then start the node again. This means the sequencer will be offline for the duration of the upgrade.
A bigger problem is if the sequencer goes offline unexpectedly or for an extended period of time. This can happen for any number of reasons, such as host machine crash, storage failure, network issues, etc. This means the chain is effectively halted until the sequencer is brought back online.
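Even though there can only be one sequencer running at a time, there can be multiple nodes that can potentially be promoted to be the active sequencer. In this case, an upgrade could be performed by stopping the sequencer on the active node, starting it on a standby node, and then upgrading the original node while it is idle.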
This is a simple solution, but it requires manual intervention, is error-prone, and doesn't handle unexpected node failures. We need automatic failover! This might sound simple, but there are a couple of problems that will need to be solved...
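The manual handover described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive procedure: it assumes the op-node admin RPC is enabled on both nodes and that admin_stopSequencer returns the hash of the last block the node sequenced. The helper names rpc_call and manual_handover, and any URLs passed to them, are hypothetical.

```python
import json
import urllib.request
from typing import Any, Callable, List

# Signature of a JSON-RPC transport: (url, method, params) -> result.
RpcCall = Callable[[str, str, List[Any]], Any]


def rpc_call(url: str, method: str, params: List[Any]) -> Any:
    """Send one JSON-RPC 2.0 request over HTTP and return its result."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    ).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.load(response)
    if body.get("error"):
        raise RuntimeError(f"{method} failed: {body['error']}")
    return body["result"]


def manual_handover(
    active_url: str, standby_url: str, call: RpcCall = rpc_call
) -> str:
    """Move the sequencer role from the active node to the standby node."""
    # Assumption: stopping the sequencer returns the hash of its last block.
    head_hash = call(active_url, "admin_stopSequencer", [])
    # The standby starts building on top of exactly that block.
    call(standby_url, "admin_startSequencer", [head_hash])
    return head_hash
```

Note that the first call has to succeed before the second can be made: if the active node is unreachable, this procedure has no safe way to continue.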
Let's say we build a simple solution that continuously monitors the health of the nodes, and transfers the sequencer role to a healthy node if the current sequencer goes offline.
Now consider a scenario where node A (the active sequencer) becomes unresponsive, e.g. because of network issues or resource exhaustion. Not only will the chain stop producing blocks, but node A also cannot process admin_stopSequencer calls. This leaves our monitoring solution with two choices: do nothing, i.e. accept downtime :(, or start the sequencer on node B.
If the sequencer on node B is started, and at some point node A becomes responsive again, we will have two sequencers and the chain will be forked!
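To make the fork concrete, here is a toy simulation of that naive failover. It is only an illustrative model, not op-node code: ToyNode, block_hash, and naive_failover_demo are hypothetical names, and a "block" is reduced to a short hash of its parent and its producer.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def block_hash(parent: str, producer: str) -> str:
    """Toy block hash: depends on the parent hash and on who built the block."""
    return hashlib.sha256(f"{parent}:{producer}".encode()).hexdigest()[:12]


@dataclass
class ToyNode:
    name: str
    chain: List[str] = field(default_factory=lambda: ["genesis"])
    sequencing: bool = False

    def produce_block(self) -> None:
        """Append a block, but only if this node believes it is the sequencer."""
        if self.sequencing:
            self.chain.append(block_hash(self.chain[-1], self.name))


def naive_failover_demo() -> bool:
    """Return True if naive failover ends with two diverging chains."""
    a, b = ToyNode("A", sequencing=True), ToyNode("B")
    a.produce_block()
    b.chain = list(a.chain)  # B follows A's blocks over p2p

    # A stops responding, so the monitor cannot stop it and starts B instead.
    b.sequencing = True
    b.produce_block()

    # A recovers, still believes it is the sequencer, and keeps building.
    a.produce_block()

    same_height = len(a.chain) == len(b.chain)
    return same_height and a.chain[-1] != b.chain[-1]
```

Both nodes end up with a different block at the same height, which is exactly the split the monitoring solution was supposed to avoid.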
The RPC method admin_startSequencer requires a block hash, i.e. the hash of the latest block on top of which to build new blocks. Consider a handover from node A to node B right after A has produced block 100.
When the sequencer on node B starts, it might not have the latest block 100 produced by A (e.g. because of a p2p communication delay). If admin_startSequencer didn't require the hash of the latest block, node B would create a new block 100 on top of block 99, causing the old block 100 to be re-orged.
So the block hash param on admin_startSequencer helps us prevent re-orgs, but it also means that in order to start sequencing, node B must have the last block that was produced by A. Querying A might not be possible if A crashed, so we need a way to guarantee that all nodes agree on what the latest block is.
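The effect of the block hash parameter can be illustrated with a small model. This is a sketch of the idea only, assuming the node compares the supplied hash with its own local head; ToySequencer and try_start are hypothetical names and not part of op-node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySequencer:
    chain: List[str] = field(default_factory=list)  # block hashes, oldest first
    active: bool = False

    def start_sequencer(self, expected_head: str) -> None:
        """Start sequencing only if the local head is the expected block."""
        if not self.chain or self.chain[-1] != expected_head:
            raise ValueError("block hash does not match local head")
        self.active = True


def try_start(node: ToySequencer, expected_head: str) -> bool:
    """Return True if the node accepted the start request."""
    try:
        node.start_sequencer(expected_head)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    # Node B has blocks 98 and 99, but block 100 has not arrived yet.
    node_b = ToySequencer(chain=["hash98", "hash99"])
    print(try_start(node_b, "hash100"))  # refused: B would re-org block 100
    node_b.chain.append("hash100")       # block 100 arrives over p2p
    print(try_start(node_b, "hash100"))  # accepted: B builds on block 100
```

The check turns a silent re-org into an explicit refusal, at the cost of requiring the caller to know the correct head hash.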
OP Labs (?) built a service called op-conductor that aims to solve these problems. It runs alongside op-node and op-geth, and guarantees zero downtime and no re-orgs for unsafe blocks, as long as a majority of the nodes are healthy.
op-conductor uses Raft consensus, an algorithm that helps multiple computers agree on a shared state, like which actions to take next, even if some of them fail. This allows op-conductor to elect the active sequencer and provide automatic failover by monitoring node health and transferring leadership if the current leader goes offline or becomes unhealthy. Additionally, op-node has been updated to integrate with op-conductor: when the integration is enabled, the sequencer must commit each new block to op-conductor.
If the active sequencer becomes unresponsive, leadership will be transferred to another node. If at a later point the original sequencer comes back online, it will not be able to commit new blocks as it is not the leader anymore.
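The way leadership fences off a returning sequencer can be sketched with a toy model. This is not op-conductor's implementation or API: ToyCluster and its methods are hypothetical, and the only idea borrowed from Raft is that every leadership change starts a new term.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyCluster:
    leader: str
    term: int = 1
    # Committed entries as (term, node, block) tuples.
    log: List[Tuple[int, str, str]] = field(default_factory=list)

    def transfer_leadership(self, new_leader: str) -> None:
        """Elect a new leader; every leadership change starts a new term."""
        self.leader = new_leader
        self.term += 1

    def commit_block(self, node: str, node_term: int, block: str) -> bool:
        """Accept a block only from the current leader in the current term."""
        if node != self.leader or node_term != self.term:
            return False  # stale sequencer: the block is rejected
        self.log.append((node_term, node, block))
        return True


if __name__ == "__main__":
    cluster = ToyCluster(leader="A")
    cluster.commit_block("A", 1, "block100")         # A is leader: accepted
    cluster.transfer_leadership("B")                 # A became unresponsive
    print(cluster.commit_block("A", 1, "block101"))  # A is back, but rejected
    print(cluster.commit_block("B", 2, "block101"))  # B is leader: accepted
```

Because the old sequencer cannot commit, it cannot publish a competing block, so a late recovery no longer forks the chain.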
op-conductor stores new blocks in the Raft log, which means they are replicated across the cluster. This ensures that all nodes have the same view of the chain and can start sequencing from the last block produced by the previous sequencer.
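The "majority of healthy nodes" requirement and the replicated log can be illustrated with a deliberately simplified model. It is a sketch, not Raft itself: replication is treated as a single atomic step, and eligible_leaders approximates Raft's election restriction by picking the nodes with the longest log. All function names are hypothetical.

```python
from typing import Dict, List


def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def replicate(block: str, logs: Dict[str, List[str]], healthy: List[str]) -> bool:
    """Append the block on the healthy nodes if they form a majority."""
    if len(healthy) < quorum(len(logs)):
        return False  # no majority, so the block is not committed
    for name in healthy:
        logs[name].append(block)
    return True


def eligible_leaders(logs: Dict[str, List[str]]) -> List[str]:
    """Simplification: only nodes with the most complete log may lead."""
    longest = max(len(log) for log in logs.values())
    return sorted(name for name, log in logs.items() if len(log) == longest)


if __name__ == "__main__":
    logs: Dict[str, List[str]] = {"A": [], "B": [], "C": []}
    replicate("block100", logs, healthy=["A", "B"])  # C is lagging behind
    # If A now fails, B can take over because it already holds block100.
    print(eligible_leaders(logs))
```

With three nodes the cluster tolerates one failure, and with five it tolerates two, since a majority must remain to commit blocks and elect a leader.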
op-conductor is a major step towards making OP Stack chains more reliable. It has made rolling upgrades easier and safer, and is now a standard part of our chain deployments.
However, there are still steps that can be taken to make your chain even more reliable, such as setting up proper monitoring and alerting, and having a disaster recovery plan in place. For example, it is important to monitor that your chain is rolling up to L1 (or L2 if your chain is an L3) as expected. This is what went wrong for Degen and Proof of Play: their batch submitter stopped working because of a misconfiguration and wasn't fixed within the sequencing window, which ended up causing re-orgs, a lot of downtime, and some loss of funds. We'll cover these topics in future posts, so stay tuned!
Running a chain is challenging. That's why many companies choose to work with a Rollup as a Service (RaaS) provider. We at Chaindrop specialize in deploying fully customizable Ethereum rollups directly into your AWS environment. We provide you with full control, ensuring you can launch, scale, and modify your infrastructure with no issues. Chaindrop supports Optimism and Arbitrum stacks.