    op-service: Harden Transaction Manager · 6b0b8a00
    Joshua Gutow authored
    This commit fixes a common source of errors in the transaction manager
    that causes significant problems for the batcher. The transaction
    manager is designed to be resilient to errors when resubmitting txns or
    when polling for transaction receipts. It was not designed to be resilient
    to errors during the initial transaction submission, and that was the root
    cause of several incidents.
    
    This commit fixes this issue by wrapping `craftTx` in a retry. If a problem
    persists for longer than the retries last, issues could still happen, but
    this will significantly reduce the number of issues.
    
    A failure in craftTx is so harmful because the transaction manager is
    wrapped in a txmgr.Queue which handles multiple in flight transactions.
    The queue uses an errgroup to manage concurrency and when any single
    txmgr.Send fails it will cancel the context & cancel the rest of the in flight
    sends. Because txmgr.Send could fail when creating a transaction, a transient
    failure would cancel multiple in flight transactions. Some of these in flight
    transactions would eventually land on L1 and the batcher would lose track of
    which frames it had submitted & thus could submit duplicate frames.
    
    Two examples of this flow are provided in the logs below, newest entries first.
    First a timeout happens in the transaction creation, then multiple transactions
    are cancelled via the context. Then there is a log for "aborted transaction
    sending". This log occurs because a transaction that was cancelled landed on L1
    and the nonce of the transaction is too low. That then cancels the pending
    transactions again.
    
    t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=104,419
    t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx"                   err="aborted transaction sending"         data_size=120,000
    t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=860
    t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx"                   err="failed to create the tx: eth_signTransaction failed: Post \"--snip--\": context deadline exceeded" data_size=120,000
    
    t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=104,670
    t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx"                   err="context canceled"                    data_size=120,000
    t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx"                   err="aborted transaction sending"         data_size=120,000
    t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="context canceled" data_size=120,000
    t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="context canceled" data_size=110,635
    t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="context canceled" data_size=120,000
    t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="context canceled" data_size=120,000
    t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx"                  err="failed to create the tx: failed to get gas price info: failed to fetch the suggested gas tip cap: Post \"--snip--\": context deadline exceeded" data_size=120,000