Cardano-db-sync: Random "DB lookup fail in insertABlock" error

Created on 28 Sep 2020  ·  6Comments  ·  Source: input-output-hk/cardano-db-sync

As per requested by @erikd on this other issue, I'm reopening this as a fresh issue since I got hit by this in the latest 5.0.1 version packaged in the official hub.docker.com image (inputoutput/cardano-db-sync:5.0.1 [0]).
I must say problem has not reappeared again since the crash from these logs [1][2].

I think it might be caused by some postgresql connectivity problem, but anyways, IMHO, in worst case main process should exit >0 (as it doesn't recover from the db thread shutdown automatically) or db thread should be restarted when these exceptions are raised.

[0] inputoutput/cardano-db-sync@sha256:b09f440d868749135e74c0bfe6154f210d5836bc2d24a44e484c7dbb4b837689
[1]

[db-sync-node:Info:3354] [2020-09-26 04:29:44.87 UTC] insertByronBlock: slot 3025000, block 3023467, hash 43b510c9fa0d1021d4efbfb0a07c1e90f40258c05d7d66b266aee4fb9963f678
[db-sync-node:Error:3354] [2020-09-26 04:29:59.56 UTC] DB lookup fail in insertABlock: block hash f8452c44591e3db7d5534ceaaff56922609d81deed45b953e7220601bfd4ec87
[db-sync-node:Info:3354] [2020-09-26 04:29:59.56 UTC] Shutting down DB thread
[db-sync-node:Error:3357] [2020-09-26 04:29:59.56 UTC] recvMsgRollForward: AsyncCancelled
...
FROZEN FOREVER
...

[2]

Generating PGPASS file
Connecting to network: mainnet
[db-sync-node:Info:4] [2020-09-26 09:01:38.91 UTC] NetworkMagic: 764824073
[db-sync-node:Info:4] [2020-09-26 09:02:11.59 UTC] Initial genesis distribution present and correct
[db-sync-node:Info:4] [2020-09-26 09:02:11.59 UTC] Total genesis supply of Ada: 31112484745.000000
[db-sync-node:Info:4] [2020-09-26 09:02:11.69 UTC] Inserting Shelley Genesis distribution
[db-sync-node:Info:4] [2020-09-26 09:02:11.83 UTC] epochPluginOnStartup: Checking
[db-sync-node:Info:4] [2020-09-26 09:02:11.86 UTC] localInitiatorNetworkApplication: connecting to node via "/node-ipc/node.socket"
[db-sync-node.Handshake:Info:30] [2020-09-26 09:02:11.86 UTC] [String "Send MsgProposeVersions (fromList [(NodeToClientV_1,TInt 764824073),(NodeToClientV_2,TInt 764824073),(NodeToClientV_3,TInt 764824073)])",String "LocalHandshakeTrace",String "ConnectionId {localAddress = LocalAddress {getFilePath = \"\"}, remoteAddress = LocalAddress {getFilePath = \"/ipc/node.socket\"}}"]
[db-sync-node.Handshake:Info:30] [2020-09-26 09:02:11.87 UTC] [String "Recv MsgAcceptVersion NodeToClientV_3 (TInt 764824073)",String "LocalHandshakeTrace",String "ConnectionId {localAddress = LocalAddress {getFilePath = \"\"}, remoteAddress = LocalAddress {getFilePath = \"/ipc/node.socket\"}}"]
[db-sync-node:Info:34] [2020-09-26 09:02:11.87 UTC] Starting chainSyncClient
[db-sync-node:Info:34] [2020-09-26 09:02:13.97 UTC] Cardano.Db tip is at slot 3025325, block 3023792
[db-sync-node:Info:39] [2020-09-26 09:02:13.97 UTC] Running DB thread
[db-sync-node:Info:39] [2020-09-26 09:02:14.45 UTC] Rolling back to slot 3025325, hash f8452c44591e3db7d5534ceaaff56922609d81deed45b953e7220601bfd4ec87
[db-sync-node:Info:39] [2020-09-26 09:02:14.46 UTC] Deleting blocks numbered: []
[db-sync-node:Info:42] [2020-09-26 09:02:14.58 UTC] getHistoryInterpreter: acquired
[db-sync-node:Error:39] [2020-09-26 09:02:42.88 UTC] DB lookup fail in insertABlock: block hash ff6d511e65fb979ac511e8658c20cfe608bdfe7d3e6172f49115e52487812423
[db-sync-node:Info:39] [2020-09-26 09:02:42.88 UTC] Shutting down DB thread
[db-sync-node:Error:42] [2020-09-26 09:02:42.89 UTC] recvMsgRollForward: AsyncCancelled
...
FROZEN FOREVER
...

All 6 comments

I have seen

DB lookup fail in insertABlock: block hash

before, but one was due to postgres running out of disk space and the others were after I did potentially unsafe manipulations of the database during testing.

As for the FROZEN FOREVER part, this is not something I have direct control over. My code calls into the networking code that does the ChainSync protocol which then calls back into another bit of my code. In this instance, my inner code threw an exception that was caught by the network code and not propagated back up to my code. There may be a fix for this, but it is unlikely to be simple.

My suggestion is :

  • Ensure that the machine has enough disk space (fully synced to epoch 220, my db-sync instance is using 6G of disk).
  • Drop and recreate the database (to make sure existing corruption is removed).
  • Manually monitor the sync (should be up to about 3 hours).

Sorry, I forgot to mention that it did not happen anymore since that time from the attached logs; that's why I assumed it could have been some exceptional problem with my postgres instance (I was playing with HA at that time IIRC).

We can keep this open for a while to see if we come up with any more helpful clues.

JFTR, I'm resyncing mainnet from the scratch and found that it actually recovers after a big while (at least on latest version) and that it's probably caused by some cardano-node socket unavailability [0].
This latter issues is due I share the socket via TCP so it's accesible to different services running from different k8s pods (no shared ipc), and I get some instabilities (not sure yet where they come exactly, I need to do more TCP tuning).

[0]

[db-sync-node:Info:217] [2020-10-20 08:09:02.03 UTC] epochPluginInsertBlock: epoch 182
[db-sync-node:Info:217] [2020-10-20 08:11:01.01 UTC] insertByronBlock: slot 3955000, block 3952915, hash d5941f934cf5eb5c3105b584c07aa6c872272dbd0d4fbe78efb36682ed59f211
[db-sync-node:Info:217] [2020-10-20 08:15:22.01 UTC] insertByronBlock: slot 3960000, block 3957915, hash 92494b54140f9db989618f93f11be579b1112d66c173d09cdfb5ac8f9ae6d9ce
[db-sync-node:Error:217] [2020-10-20 08:15:54.86 UTC] DB lookup fail in insertABlock: block hash 55f53f6ddf3546aa29079945eb995244c4a5a205bb287ca76c89a85e2f60ef6d
[db-sync-node:Info:217] [2020-10-20 08:15:54.86 UTC] Shutting down DB thread
[db-sync-node:Error:220] [2020-10-20 08:15:54.86 UTC] recvMsgRollForward: AsyncCancelled
[db-sync-node:Error:211] [2020-10-20 08:34:04.83 UTC] ChainSyncWithBlocksPtcl: AsyncCancelled
[db-sync-node.Subscription:Error:207] [2020-10-20 08:34:04.83 UTC] [String "Application Exception: LocalAddress {getFilePath = \"/node-ipc/node.socket\"} MuxError MuxBearerClosed \"<socket: 12> closed when reading data, waiting on next header True\"",String "SubscriptionTrace"]
[db-sync-node.ErrorPolicy:Warning:4] [2020-10-20 08:34:04.83 UTC] [String "ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError MuxBearerClosed \"<socket: 12> closed when reading data, waiting on next header True\"))) 20s 20s",String "ErrorPolicyTrace",String "LocalAddress {getFilePath = \"/node-ipc/node.socket\"}"]
[db-sync-node.Handshake:Info:1879] [2020-10-20 08:34:05.83 UTC] [String "Send MsgProposeVersions (fromList [(NodeToClientV_1,TInt 764824073),(NodeToClientV_2,TInt 764824073),(NodeToClientV_3,TInt 764824073)])",String "LocalHandshakeTrace",String "ConnectionId {localAddress = LocalAddress {getFilePath = \"\"}, remoteAddress = LocalAddress {getFilePath = \"/ipc/node.socket\"}}"]
[db-sync-node.Handshake:Info:1879] [2020-10-20 08:34:05.85 UTC] [String "Recv MsgAcceptVersion NodeToClientV_3 (TInt 764824073)",String "LocalHandshakeTrace",String "ConnectionId {localAddress = LocalAddress {getFilePath = \"\"}, remoteAddress = LocalAddress {getFilePath = \"/ipc/node.socket\"}}"]
[db-sync-node:Info:1883] [2020-10-20 08:34:05.85 UTC] Starting chainSyncClient
[db-sync-node:Info:1883] [2020-10-20 08:34:07.63 UTC] Cardano.Db tip is at slot 3960564, block 3958479
[db-sync-node:Info:1889] [2020-10-20 08:34:07.63 UTC] Running DB thread
[db-sync-node:Info:1889] [2020-10-20 08:34:08.16 UTC] Rolling back to slot 3960564, hash 55f53f6ddf3546aa29079945eb995244c4a5a205bb287ca76c89a85e2f60ef6d
[db-sync-node:Info:1889] [2020-10-20 08:34:08.17 UTC] Deleting blocks numbered: []
[db-sync-node:Info:1889] [2020-10-20 08:37:42.51 UTC] insertByronBlock: slot 3965000, block 3962915, hash c1fa47cd48391aa3774e61c8830b5d1e8bbf62c4720b856a952fdd6d95dcd5e6
[db-sync-node:Info:1889] [2020-10-20 08:41:59.70 UTC] insertByronBlock: slot 3970000, block 3967915, hash 71dcc988afa8fbf6e2f5c88b165f7ee7dbbe8a1a07141df8a5f2c42986494528
[db-sync-node:Info:1889] [2020-10-20 08:46:19.05 UTC] epochPluginInsertBlock: epoch 183
[db-sync-node:Info:1889] [2020-10-20 08:46:43.59 UTC] insertByronBlock: slot 3975000, block 3972915, hash 9be00c8a13eeaa21b24b070fe9254c6b30c3e1953e8625604c86e8c8d5695213
[db-sync-node:Info:1889] [2020-10-20 08:51:42.20 UTC] insertByronBlock: slot 3980000, block 3977915, hash ec3903b66e2a7638315e9d4b9a19ed904b71eae3e5e4cdd412a9b8ed5875b970
[db-sync-node:Info:1889] [2020-10-20 08:57:17.01 UTC] insertByronBlock: slot 3985000, block 3982915, hash 527077e870a8fa8648fb035107c906cc72e3ea2438a674020dc472ec9e2b8d50
[db-sync-node:Info:1889] [2020-10-20 09:01:33.12 UTC] insertByronBlock: slot 3990000, block 3987915, hash 2d6ed2788d5a694d05866d5c08002e46b9e119cab71c2e56d5b76432f37b8eda
[db-sync-node:Info:1889] [2020-10-20 09:05:17.18 UTC] insertByronBlock: slot 3995000, block 3992915, hash e043d137b267ed16b73b41cb92889b37cad381c53c80390598b547dc9fc58892
[db-sync-node:Info:1889] [2020-10-20 09:06:00.19 UTC] epochPluginInsertBlock: epoch 184
[db-sync-node:Info:1889] [2020-10-20 09:09:43.66 UTC] insertByronBlock: slot 4000000, block 3997915, hash a1f93a150ed43de21591124248b27ba0c830bed8c8e81c9a17d7823b5a6cfc97
[db-sync-node:Info:1889] [2020-10-20 09:14:39.01 UTC] insertByronBlock: slot 4005000, block 4002914, hash 3ad70e79c62b3f0cafac5a6bcce66702cac04ea16b7487ceb4627cd61eb9ca42

it's probably caused by some cardano-node socket unavailability

That by itself would not cause a DB lookup fail.

However, if db-sync was doing a DB lookup when the socket disappeared, the resulting exception could cause the DB lookup to be terminated early, which in turn might be reported as a failed lookup.

You will notice that when the socket returns, it rolls back to block hash 55f53f6d...2f60ef6d (the hash that previously failed lookup) and this time the lookup must have succeeded or the chain sync would have died right there.

Is it worth keeping this open? The problem is that in almost all cases, I cannot fix problems I cannot reproduce.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

erikd picture erikd  ·  7Comments

reliablestaking picture reliablestaking  ·  31Comments

rhyslbw picture rhyslbw  ·  4Comments

dmitrystas picture dmitrystas  ·  28Comments

refi93 picture refi93  ·  15Comments