Commit 6b0b8a00 authored by Joshua Gutow

op-service: Harden Transaction Manager

This commit fixes a common source of errors in the transaction manager
that has caused significant problems for the batcher. The transaction
manager is designed to be resilient to errors when resubmitting transactions
or when polling for transaction receipts. It was not designed to be resilient
to errors on the initial transaction submission, and that was the root cause
of several incidents.

This commit fixes the issue by wrapping `craftTx` in a retry (up to 30
attempts, 2 seconds apart). If a problem persists for longer than the retry
window, failures can still happen, but this will significantly reduce how
often they occur.

A failure in craftTx is so harmful because the transaction manager is
wrapped in a txmgr.Queue, which handles multiple in-flight transactions.
The queue uses an errgroup to manage concurrency, and when any single
txmgr.Send fails it cancels the context and, with it, the rest of the in-flight
sends. Because txmgr.Send could fail when creating a transaction, a transient
failure would cancel multiple in-flight transactions. Some of these in-flight
transactions would eventually land on L1, and the batcher would lose track of
which frames it had submitted and thus could submit duplicate frames.

Two examples of this flow are provided in the logs below (newest entries
first). First, a timeout happens during transaction creation, then multiple
transactions are cancelled via the context. Then there is a log for
"aborted transaction sending". This log occurs because a transaction that was
cancelled landed on L1 and the nonce of the transaction is too low. That then
cancels the pending transactions again.

t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=104,419
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="aborted transaction sending"         data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=860
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="failed to create the tx: eth_signTransaction failed: Post \"--snip--\": context deadline exceeded" data_size=120,000

t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=104,670
t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx"                   err="aborted transaction sending"         data_size=120,000
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="context canceled" data_size=120,000
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="context canceled" data_size=110,635
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="context canceled" data_size=120,000
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="context canceled" data_size=120,000
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="failed to create the tx: failed to get gas price info: failed to fetch the suggested gas tip cap: Post \"--snip--\": context deadline exceeded" data_size=120,000
parent b9238afd
@@ -17,6 +17,7 @@ import (
 	"github.com/ethereum/go-ethereum/core/types"
 	"github.com/ethereum/go-ethereum/log"
+	"github.com/ethereum-optimism/optimism/op-service/backoff"
 	"github.com/ethereum-optimism/optimism/op-service/txmgr/metrics"
 )
@@ -175,7 +176,13 @@ func (m *SimpleTxManager) send(ctx context.Context, candidate TxCandidate) (*typ
 		ctx, cancel = context.WithTimeout(ctx, m.cfg.TxSendTimeout)
 		defer cancel()
 	}
-	tx, err := m.craftTx(ctx, candidate)
+	tx, err := backoff.Do(ctx, 30, backoff.Fixed(2*time.Second), func() (*types.Transaction, error) {
+		tx, err := m.craftTx(ctx, candidate)
+		if err != nil {
+			m.l.Warn("Failed to create a transaction, will retry", "err", err)
+		}
+		return tx, err
+	})
 	if err != nil {
 		return nil, fmt.Errorf("failed to create the tx: %w", err)
 	}