op-service: Harden Transaction Manager
This commit fixes a common source of errors in the transaction manager which then cause significant problems for the batcher. The transaction manager is designed to be resilient to errors when resubmitting txns or when polling for transaction receipts. It was not designed to be resilient on the initial transaction submission, and that was the root cause of several incidents.

This commit fixes the issue by wrapping `craftTx` in a retry. If a problem persists for longer than the retries allow, failures can still happen, but this will significantly reduce their number.

A failure in `craftTx` is so harmful because the transaction manager is wrapped in a `txmgr.Queue`, which handles multiple in-flight transactions. The queue uses an errgroup to manage concurrency, and when any single `txmgr.Send` fails it cancels the context and with it the rest of the in-flight sends. Because `txmgr.Send` could fail when creating a transaction, a transient failure would cancel multiple in-flight transactions. Some of these in-flight transactions would eventually land on L1, and the batcher would lose track of which frames it had submitted and could therefore submit duplicate frames.

Two examples of this flow are provided in the logs below (newest entries first). First a timeout happens in the transaction creation, then multiple transactions are cancelled via the context. Then there is a log for "aborted transaction sending". This log occurs because a transaction that was cancelled landed on L1, so the nonce of the transaction being sent is too low. That then cancels the pending transactions again.
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=104,419
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:22:28+0000 lvl=warn msg="unable to publish tx" err="aborted transaction sending" data_size=120,000

t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=860
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:20:24+0000 lvl=warn msg="unable to publish tx" err="failed to create the tx: eth_signTransaction failed: Post \"--snip--\": context deadline exceeded" data_size=120,000

t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=104,670
t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:07:37+0000 lvl=warn msg="unable to publish tx" err="aborted transaction sending" data_size=120,000

t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=110,635
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx" err="context canceled" data_size=120,000
t=2023-07-13T16:05:05+0000 lvl=warn msg="unable to publish tx" err="failed to create the tx: failed to get gas price info: failed to fetch the suggested gas tip cap: Post \"--snip--\": context deadline exceeded" data_size=120,000