```
[db-sync-node:Info:64] [2020-11-07 15:28:05.27 UTC] Starting chainSyncClient
[db-sync-node:Info:64] [2020-11-07 15:28:05.31 UTC] Cardano.Db tip is at slot 13132763, block 4917327
[db-sync-node:Info:69] [2020-11-07 15:28:05.31 UTC] Running DB thread
[db-sync-node:Info:69] [2020-11-07 15:28:06.16 UTC] Rolling back to slot 13132763, hash c19c89792973fe2fff25e5b715e785d549da9647c2f9b7940aefcd29759dcd70
[db-sync-node:Info:69] [2020-11-07 15:28:06.17 UTC] Deleting slots numbered: []
[db-sync-node:Error:69] [2020-11-07 15:28:08.93 UTC] runDBThread: Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 804642000000) (Coin 804640000000))))]]
CallStack (from HasCallStack):
error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-inplace:Shelley.Spec.Ledger.API.Validation
[db-sync-node:Error:64] [2020-11-07 15:28:08.93 UTC] ChainSyncWithBlocksPtcl: Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 804642000000) (Coin 804640000000))))]]
CallStack (from HasCallStack):
error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-inplace:Shelley.Spec.Ledger.API.Validation
[db-sync-node.Subscription:Error:60] [2020-11-07 15:28:08.93 UTC] [String "Application Exception: LocalAddress {getFilePath = \"/opt/cardano/cnode/sockets/node0.socket\"} Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 804642000000) (Coin 804640000000))))]]\nCallStack (from HasCallStack):\n error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-inplace:Shelley.Spec.Ledger.API.Validation",String "SubscriptionTrace"]
[db-sync-node.ErrorPolicy:Error:4] [2020-11-07 15:28:08.93 UTC] [String "ErrorPolicyUnhandledApplicationException Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 804642000000) (Coin 804640000000))))]]\nCallStack (from HasCallStack):\n error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-inplace:Shelley.Spec.Ledger.API.Validation",String "ErrorPolicyTrace",String "LocalAddress {getFilePath = \"/opt/cardano/cnode/sockets/node0.socket\"}"]
```
Is this mainnet? Are you upgrading from one version of the software to another?
~The NewEpochFailure part of the error message suggests that the db-sync version is incompatible with the node version.~ Version 6.0.x of db-sync is compatible with 1.21.x of the node.
Yes, mainnet. I built a new DB and it is working now with the same version, so not sure what happened.
Closing this.
I've just hit this on the cardano-graphql CI server, without any change of interest such as version updates. Connecting to mainnet with this config.
```
[db-sync-node:Error:62741] [2020-11-12 06:02:09.50 UTC] runDBThread: Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 831366000000) (Coin 831368000000))))]]
CallStack (from HasCallStack):
error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-3QeazRqhkmeDSfJ73hDh1U:Shelley.Spec.Ledger.API.Validation
[db-sync-node:Error:62736] [2020-11-12 06:02:09.50 UTC] ChainSyncWithBlocksPtcl: Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 831366000000) (Coin 831368000000))))]]
CallStack (from HasCallStack):
error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-3QeazRqhkmeDSfJ73hDh1U:Shelley.Spec.Ledger.API.Validation
[db-sync-node.Subscription:Error:62732] [2020-11-12 06:02:09.50 UTC] [String "Application Exception: LocalAddress {getFilePath = \"/node-ipc/node.socket\"} Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 831366000000) (Coin 831368000000))))]]\nCallStack (from HasCallStack):\n error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-3QeazRqhkmeDSfJ73hDh1U:Shelley.Spec.Ledger.API.Validation",String "SubscriptionTrace"]
[db-sync-node.ErrorPolicy:Error:4] [2020-11-12 06:02:09.50 UTC] [String "ErrorPolicyUnhandledApplicationException Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 831366000000) (Coin 831368000000))))]]\nCallStack (from HasCallStack):\n error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-3QeazRqhkmeDSfJ73hDh1U:Shelley.Spec.Ledger.API.Validation",String "ErrorPolicyTrace",String "LocalAddress {getFilePath = \"/node-ipc/node.socket\"}"]
[db-sync-node.Handshake:Info:62759] [2020-11-12 06:02:17.54 UTC] [String "Send (ClientAgency TokPropose,MsgProposeVersions (fromList [(NodeToClientV_1,TInt 764824073),(NodeToClientV_2,TInt 764824073),(NodeToClientV_3,TInt 764824073)]))",String "LocalHandshakeTrace",String "ConnectionId {localAddress = LocalAddress {getFilePath = \"\"}, remoteAddress = LocalAddress {getFilePath = \"/ipc/node.socket\"}}"]
[db-sync-node.Handshake:Info:62759] [2020-11-12 06:02:17.54 UTC] [String "Recv (ServerAgency TokConfirm,MsgAcceptVersion NodeToClientV_3 (TInt 764824073))",String "LocalHandshakeTrace",String "ConnectionId {localAddress = LocalAddress {getFilePath = \"\"}, remoteAddress = LocalAddress {getFilePath = \"/ipc/node.socket\"}}"]
[db-sync-node:Info:62763] [2020-11-12 06:02:17.54 UTC] Starting chainSyncClient
[db-sync-node:Info:62763] [2020-11-12 06:02:17.55 UTC] Cardano.Db tip is at slot 13564796, block 4938498
[db-sync-node:Info:62768] [2020-11-12 06:02:17.55 UTC] Running DB thread
[db-sync-node:Info:62768] [2020-11-12 06:02:18.01 UTC] Rolling back to slot 13564796, hash 5d8ea0d4cf2d4f46cc91aa48e83c029691f836d7200e11e26402f9a2bcb25987
[db-sync-node:Info:62768] [2020-11-12 06:02:18.01 UTC] Deleting slots numbered: []
[db-sync-node:Error:62768] [2020-11-12 06:02:19.47 UTC] runDBThread: Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 831366000000) (Coin 831368000000))))]]
CallStack (from HasCallStack):
error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-3QeazRqhkmeDSfJ73hDh1U:Shelley.Spec.Ledger.API.Validation
[db-sync-node:Error:62763] [2020-11-12 06:02:19.47 UTC] ChainSyncWithBlocksPtcl: Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 831366000000) (Coin 831368000000))))]]
CallStack (from HasCallStack):
error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-3QeazRqhkmeDSfJ73hDh1U:Shelley.Spec.Ledger.API.Validation
[db-sync-node.Subscription:Error:62759] [2020-11-12 06:02:19.47 UTC] [String "Application Exception: LocalAddress {getFilePath = \"/node-ipc/node.socket\"} Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 831366000000) (Coin 831368000000))))]]\nCallStack (from HasCallStack):\n error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-3QeazRqhkmeDSfJ73hDh1U:Shelley.Spec.Ledger.API.Validation",String "SubscriptionTrace"]
[db-sync-node.ErrorPolicy:Error:4] [2020-11-12 06:02:19.47 UTC] [String "ErrorPolicyUnhandledApplicationException Panic! applyHeaderTransition failed: [[NewEpochFailure (EpochFailure (NewPpFailure (UnexpectedDepositPot (Coin 831366000000) (Coin 831368000000))))]]\nCallStack (from HasCallStack):\n error, called at src/Shelley/Spec/Ledger/API/Validation.hs:92:15 in shelley-spec-ledger-0.1.0.0-3QeazRqhkmeDSfJ73hDh1U:Shelley.Spec.Ledger.API.Validation",String "ErrorPolicyTrace",String "LocalAddress {getFilePath = \"/node-ipc/node.socket\"}"]
```
~After chatting with Rhys on Slack, I suspect that in his case db-sync ran into problems over 1000 blocks before this error, and what he is seeing is a result of it restarting and rotating the logs.~
The problem happens at epoch rollover.
The ones @rhyslbw and @mmahut caught both barfed on the same last applied block, hash 5d8ea0d4cf2d4f46cc91aa48e83c029691f836d7200e11e26402f9a2bcb25987. That is very unlikely to be a coincidence. Highly likely that block is the first block of a new epoch.
@rhyslbw what is the git hash of the db-sync version you are using?
The one @mmahut was using in #404 was commit https://github.com/input-output-hk/cardano-db-sync/commit/6187081a7ea66954c86094578bd37e01bca8aaec which is missing commit https://github.com/input-output-hk/cardano-db-sync/commit/afe68e08cf5f8b3b1b6690e411670908bc0f5942, which contains changes to the ledger-state config. This issue is about it dying in ledger-state related code, but that change should not make a difference on mainnet. The 6.0.0 tag is after the second commit.
This is a HUGE pain in the neck to debug without a fix for https://github.com/input-output-hk/cardano-db-sync/issues/256 .
@erikd I'm using the release tag commit 3e68f3011bb156b9b799ccf056f9a73281479f9c
Did a LOT of work trying to recreate this issue, but it is not deterministic. I am currently running a version of this code that should better catch any errors (and abort immediately). I am hoping there is a chance of triggering this again on the next epoch boundary which happens in about 14 hours from now.
10 nodes, all of them crashed with this specific bug. The MD5 checksums of the lstate files are:
65bce9e6463a324d612d24588afbdecc 13996555.lstate
77b5e894f8a22cb49605b9bfd474588a 13996568.lstate
12c4be3b0fac587d1b6485284e218404 13996582.lstate
f0b29f6768c836e7283f7033799ce146 13996626.lstate
ba72f63cf8185150c8120f3466756479 13996646.lstate
a2b45038665701084196a238b3beb329 13996669.lstate
7e8cccd8f0f1c3ac519ef7471a998ac1 13996713.lstate
ab304c279c8209e4b21a623b1a6dd80f 13996756.lstate
and using git rev 6187081a7ea66954c86094578bd37e01bca8aaec (which is a couple of commits behind the 6.0.0 tag).
Current hypothesis is that ledger state gets corrupted at some point and that the corruption is only noticed at the epoch boundary.
Infinite loop of NewEpochFailure:
99d3e16a319a20ff689ca9582425ddae 13996555.lstate
8205deb9c2b3ad946a99bc7692d4434e 13996568.lstate
8eeb20d372cf5214db7c8287a052707b 13996582.lstate
7133fb72aa8194efa80e95c3fa4af1fb 13996626.lstate
f7199d4a131c6fd4649a76a51167275f 13996646.lstate
faa8d71771e8cc68703fa4f1f08dfce7 13996669.lstate
f6cbf62dad57439dc126f8b56061a863 13996713.lstate
504ea06cb925868c25c100d7d05d6afd 13996756.lstate
cardano-db-sync-extended 6.0.0 - linux-x86_64 - ghc-8.6
git revision 3e68f3011bb156b9b799ccf056f9a73281479f9c
2 of 3 instances threw the NewEpochFailure error.
git revision 3e68f3011bb156b9b799ccf056f9a73281479f9c
0eb144b880dcb07c8347b560ea77db27 13996555.lstate
6ee65fc1f5d47fbb858e92770e109f0f 13996568.lstate
c526b055c731173bb7a94cbf3144855d 13996582.lstate
932f8a4807537c43332a4b9a91c0c4a7 13996626.lstate
95163e7b5351b04ae5909d221a4ee2e2 13996646.lstate
c584485911b8f246d01e37572b0f4175 13996669.lstate
449a7c5b2669288dfec995867507211a 13996713.lstate
ba8c8f7f1657727c826ca07be4f7d2e2 13996756.lstate
The snapshots of the instance not affected have been rolled out.
The fact that the same ledger state file, e.g. 13996756.lstate, has three different hashes is a little unexpected.
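The cross-instance comparison being done by hand above can be sketched as a small script: hash every snapshot file on each instance and flag any file whose hash differs between instances. This is an illustrative sketch only; the helper names and the instance/file layout are made up, not part of db-sync.

```python
# Sketch: spot divergent ledger-state snapshots across instances by
# comparing MD5 checksums, as done manually in the comments above.
# Names and layout here are hypothetical.
import hashlib
from collections import defaultdict

def md5_of(path):
    """MD5 of a file, streamed in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def divergent_snapshots(instances):
    """instances: {instance_name: {filename: md5_hexdigest}}.
    Returns the files whose checksum differs between instances."""
    by_file = defaultdict(dict)
    for inst, sums in instances.items():
        for fname, digest in sums.items():
            by_file[fname][inst] = digest
    # A snapshot file is suspect when instances disagree on its hash.
    return {f: d for f, d in by_file.items() if len(set(d.values())) > 1}
```

Run against the checksum lists posted above, `13996756.lstate` would be flagged, since three instances report three different digests for it.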
If the cause is a corrupted ledger state, this and #405 are probably the same issue.
@SebastienGllmt Yes, that is possible, but I have not even had a chance to look at #405 yet.
2 out of 4 instances went down.
MD5 checksums of a failing instance:
189def79f03972649fdcdcd811def1bb 13996555.lstate
dc6865de6149fdf4879a4659bcf02ef0 13996568.lstate
85bf1965fdadee3ee42c10dfd32e0bdb 13996582.lstate
c452477105dc4041c17d718a32f12056 13996626.lstate
9eb0f1fd0a165a8eff3ec4835f370d6d 13996646.lstate
4d5180ba656234020a71f2d46f1d9d0e 13996669.lstate
521ae28570c1630a18bc721cc4707eaa 13996713.lstate
9a2485e192578c1d3c22059648fba79f 13996756.lstate
(the other failing instance has different hashes)
Healthy instance:
98d46070c972d7b4ec564e4053e29eda 14033709.lstate
5c2443fe558a928a86606136337f3648 14033754.lstate
6c180d350ba7becf0f02d698ac160397 14033800.lstate
d655e560b8a43e064b671266795d262c 14033812.lstate
99bfbf88ec497e7e40865c920f8e8e26 14033839.lstate
bec82d94fe348d843389853cd24a3e5b 14033845.lstate
14faba941a24f02b236d784af85f8d32 14033890.lstate
c96d18f7bdadebed8e3b723b4e6691fc 14033936.lstate
I tried to resync 10 different instances on commit bcd82d0a3eada57fdf7cc71670a46c9b3b80464f (as I need the metadata feature).
2 out of 10 had a corrupted version of the ledger state files (different from the rest). The correct sums are:
/var/lib/csyncdb/14040345.lstate | b1df6bdb2cf6f798d9baf83373a4698f
/var/lib/csyncdb/14040390.lstate | 5694a00b12b47052125178175289ba24
/var/lib/csyncdb/14040394.lstate | b45a5e4b82362def92225a3eec5d1afb
/var/lib/csyncdb/14040412.lstate | 8319589d1b66dffd97f5233c3dcfddd0
/var/lib/csyncdb/14040436.lstate | 9d0fec3f4693bd78f426c34c5aaa5d5d
/var/lib/csyncdb/14040462.lstate | 8afe16d592a57ed5ef79a27adf0803d9
/var/lib/csyncdb/14040481.lstate | bcc2648eb09b0d5503f6397569b33e67
/var/lib/csyncdb/14040527.lstate | 588d787363bfe02cf0fd34ac8f412dd4
/var/lib/csyncdb/14040553.lstate | e1f2f42a1ac49d2ec3a53351d1b267b5
I have also noticed an inconsistency in the files. 14040345.lstate was missing on half of the instances, but these instances had 14040553.lstate instead.
FYI, same issue again on the epoch 230 transition...
Ok, I know what is causing the problem. Fixing this is relatively simple. The fix will not require a db resync unless the ledger state is already corrupt (which will be detected by the fixed version of the software).
The problem is: db-sync is getting blocks that have already been validated by the node, so it seemed sensible to use the fast version.

@erikd Is it possible to make this a config toggle between full checking and fast checking? I'd prefer to run everything in "safe" mode and use extra resources to make sure it stays up.
@CyberCyclone Once the hash is checked, there is nothing else that can go wrong with probability greater than the chance of a 256 bit hash collision. The hash should have been checked. I thought it was being checked. Once it is checked, there is no reason to do more checking.
Awesome, great to hear! The way it was worded sounded like there was a lot more going on. But yeah, hash collisions aren't anything to worry about.
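The integrity argument in the exchange above boils down to a hash-chain check: if each block's previous-hash field is verified against the hash of the block actually applied before it, then re-validating block contents guards only against a hash collision. A minimal sketch of that check, with made-up names and a stand-in hash (the real chain uses a BLAKE2b header hash, and this is not db-sync's actual API):

```python
# Sketch of the chain-integrity check being discussed: verify each
# block's prev_hash against the current tip hash before applying it.
# Illustrative only; block representation and hashing are stand-ins.
import hashlib

def block_hash(block: dict) -> str:
    # Stand-in for the real header hash.
    return hashlib.sha256(repr(sorted(block.items())).encode()).hexdigest()

def apply_blocks(tip_hash: str, blocks: list) -> str:
    """Apply blocks in order, insisting each one extends the tip."""
    for block in blocks:
        if block["prev_hash"] != tip_hash:
            raise ValueError(
                f"hash mismatch: ledger expects {tip_hash} "
                f"but block provides {block['prev_hash']}"
            )
        tip_hash = block_hash(block)
    return tip_hash
```

A block whose `prev_hash` does not match the tip is rejected immediately, which is exactly the kind of mismatch the later `Ledger state hash mismatch` log line reports.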
Since I am using the fast version and my code rolls back to a specified slot, it is possible for it to roll back to the correct slot number, but the wrong block (i.e. slot number correct, but wrong hash and therefore wrong block).
It should _not be possible for it to roll back to the correct slot number, but the wrong block_.
The chain sync instructs to roll back to a specified point on the chain (point being a slot+hash), but this point is guaranteed to exist on the consumer's chain. Yes it's very sensible to check, but if this check were to fail then that indicates a logic bug somewhere.
So I think this will need more investigation before we can call it fixed. Adding an assertion should detect the problem much more promptly at the point where it occurs, rather than much later at the epoch boundary. Adding an assertion is not itself a fix of course.
It should not be possible for it to roll back to the correct slot number, but the wrong block.
It is possible if the rollback only checks the slot number but not the hash.
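In chain-sync terms a rollback target is a point, i.e. a slot number plus a block hash, and the distinction being debated is whether the consumer verifies both fields. A sketch of a rollback that refuses a point whose slot matches but whose hash does not (types and names are illustrative, not the actual db-sync code):

```python
# Sketch: a rollback request carries both slot and hash (a chain
# "point"); verify both before truncating. Checking the slot alone --
# the suspected bug -- would silently accept a rollback onto a
# different block that shares the slot number.
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    slot: int
    hash: str

def rollback(chain: list, point: Point) -> list:
    """Truncate chain (oldest first) to the given point, or refuse."""
    for i, blk in enumerate(chain):
        if blk.slot == point.slot:
            if blk.hash != point.hash:
                # Slot matches but hash does not: wrong block, refuse.
                raise ValueError("rollback point not on our chain")
            return chain[: i + 1]
    raise ValueError("rollback slot not found")
```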
The logging has now produced this:
```
[db-sync-node:Info:39] [2020-11-19 08:47:21.84 UTC] Rolling back to slot 8092720,
hash e1e78605937bb8cfc842d1ee7280b92fa9fce813c26fa66a88eaca74d7af9f05
[db-sync-node:Info:39] [2020-11-19 08:47:21.84 UTC] Deleting slots numbered: [8092760]
Ledger state hash mismatch. Ledger expects 6f1940937d806865a6e96b25a640deb8c1393852fd3d311dbd648e2bfa89056e
but block provides e1e78605937bb8cfc842d1ee7280b92fa9fce813c26fa66a88eaca74d7af9f05.
```
which is a little odd. Restarting it results in:
```
[db-sync-node:Info:34] [2020-11-19 09:06:15.54 UTC] Database tip is at slot 8092720, block 389107
[db-sync-node:Info:39] [2020-11-19 09:06:15.54 UTC] Running DB thread
[db-sync-node:Info:42] [2020-11-19 09:06:15.55 UTC] getHistoryInterpreter: acquired
[db-sync-node:Info:39] [2020-11-19 09:06:15.55 UTC] Rolling back to slot 8092720,
hash e1e78605937bb8cfc842d1ee7280b92fa9fce813c26fa66a88eaca74d7af9f05
[db-sync-node:Info:39] [2020-11-19 09:06:15.56 UTC] Deleting slots numbered: []
```
Need to check the code for this.
I have a temporary workaround for this. The workaround comes from my work-in-progress debugging branch, but has not been fully tested, QAed or released.
If anyone is running the 6.0.0 release and is worried about the epoch rollover taking place in ~12 hours, there is an erikd/tmp-fix-6.0.x branch (commit 3a6e7199c1f2) with the workaround. The workaround detects something going astray and panics; the exception is retried at a higher level and then db-sync continues.
There are no database changes relative to 6.0.0, so no resync is required.
However, running this version may detect an already corrupted ledger state (I am not even sure what that would look like) in which case a resync will be required.
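The control flow of that workaround (detect, panic, catch at a higher level, continue) can be sketched as a simple restart loop. This is purely illustrative of the shape described above; the exception name and the `sync_once` callable are hypothetical, not db-sync's actual code.

```python
# Sketch of the workaround's shape: an inner consistency check raises
# on a ledger/chain mismatch, and an outer loop catches the exception
# and restarts the sync client instead of crashing the whole process.
import time

class LedgerStateMismatch(Exception):
    """Hypothetical stand-in for the panic raised by the inner check."""

def run_with_restart(sync_once, max_restarts=5, delay=0.0):
    restarts = 0
    while True:
        try:
            return sync_once()
        except LedgerStateMismatch:
            restarts += 1
            if restarts > max_restarts:
                raise  # persistent corruption: give up loudly
            time.sleep(delay)  # then re-acquire the tip and continue
```

This matches the restart visible in the logs below: after the `FatalError`, the client logs `Starting chainSyncClient` again and re-reads the database tip.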
After adding a bunch of debug code and then waiting for the problem to be triggered, it turns out this issue is a race condition. From the logs:
```
[2020-11-21 08:27:09.90 UTC] insertShelleyBlock: epoch 230, slot 14380936, block 4978632,
hash f984eead753a149efad752dd58471d0c53c3fcf973d281acf4fdcbc6fda799c7
[2020-11-21 08:27:34.22 UTC] insertShelleyBlock: epoch 230, slot 14380962, block 4978633,
hash bfe35e62b322d397fa6c5080ccd8294c0d2eaca5695e604df59f27f82292227a
[2020-11-21 08:27:36.69 UTC] loadLedgerState: slot 14380962
hash bfe35e62b322d397fa6c5080ccd8294c0d2eaca5695e604df59f27f82292227a
[2020-11-21 08:27:37.15 UTC] insertShelleyBlock: epoch 230, slot 14380964, block 4978634,
hash 1ef4771244b95d35c59371521d19fc145646f89f28bf7a18c4f6c8d7485da2b3
[2020-11-21 08:27:40.01 UTC] Rolling back to slot 14380962,
hash bfe35e62b322d397fa6c5080ccd8294c0d2eaca5695e604df59f27f82292227a
[2020-11-21 08:27:40.02 UTC] Deleting slots numbered: [14380964]
[2020-11-21 08:27:40.35 UTC] ChainSyncWithBlocksPtcl: FatalError {fatalErrorMessage = "Ledger state hash
mismatch. Ledger head is slot 14380964 hash
1ef4771244b95d35c59371521d19fc145646f89f28bf7a18c4f6c8d7485da2b3 but block previous hash is
bfe35e62b322d397fa6c5080ccd8294c0d2eaca5695e604df59f27f82292227a and block current
hash is 136956bd1c6ce536e3c3bb0cef07b3e380441522317c88274f1455a7b11ca2d5."}
[2020-11-21 08:27:41.35 UTC] Starting chainSyncClient
[2020-11-21 08:27:41.36 UTC] Database tip is at slot 14380962, block 4978633
[2020-11-21 08:27:41.36 UTC] Running DB thread
[2020-11-21 08:27:41.54 UTC] Rolling back to slot 14380962,
hash bfe35e62b322d397fa6c5080ccd8294c0d2eaca5695e604df59f27f82292227a
[2020-11-21 08:27:41.54 UTC] Deleting slots numbered: []
[2020-11-21 08:27:42.54 UTC] loadLedgerState: slot 14380962
hash bfe35e62b322d397fa6c5080ccd8294c0d2eaca5695e604df59f27f82292227a
```
Basically what happens is:
The fix is to move the code that rolls back the ledger state from the write end of the queue to the read end.
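The shape of that fix can be sketched with a single-consumer queue: instead of mutating the ledger state at the write end (racing against blocks still sitting in the queue, like block 4978634 in the log above), the rollback is enqueued as an event and applied at the read end, so blocks and rollbacks are processed strictly in arrival order. Event names and the slot-list ledger are illustrative, not the actual db-sync types.

```python
# Sketch of the fix: one consumer drains the queue and applies both
# blocks and rollbacks in order, so a rollback can never race with a
# block that was enqueued before it.
import queue

def db_thread(q: queue.Queue, ledger: list):
    """Single consumer; ledger is a list of applied slot numbers."""
    while True:
        kind, payload = q.get()
        if kind == "block":
            ledger.append(payload)          # apply the next block
        elif kind == "rollback":
            # payload is the slot to roll back to; drop newer slots
            while ledger and ledger[-1] > payload:
                ledger.pop()
        elif kind == "stop":
            return
```

With the buggy ordering, the rollback would run while `("block", 14380964)` was still queued, leaving the ledger head ahead of the chain-sync client's idea of the tip, which is exactly the `Ledger head is slot 14380964 ... but block previous hash is ...` failure above.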
Fixed on master in https://github.com/input-output-hk/cardano-db-sync/pull/413 .
There will also be a 6.0.1 release fixing this.