Cardano-db-sync: Exception due to Unicode on testnet

Created on 19 Sep 2020  ·  15 Comments  ·  Source: input-output-hk/cardano-db-sync

On the testnet network, cardano-db-sync at tags/5.0.0 in extended mode is throwing exceptions at block 1816809 due to Unicode parsing/handling issues. It cannot sync past this block.

[db-sync-node:Info:234667] [2020-09-19 14:23:20.41 UTC] Starting chainSyncClient
[db-sync-node:Info:234667] [2020-09-19 14:23:20.48 UTC] Cardano.Db tip is at slot 6061839, block 1816809
[db-sync-node:Info:234672] [2020-09-19 14:23:20.48 UTC] Running DB thread
[db-sync-node:Info:234672] [2020-09-19 14:23:20.65 UTC] Rolling back to slot 6061839, hash 98414d12d1d1f05210dea6ce4082e1bcbbcfdf56343fd1cb44a8778c4c9ea57a
[db-sync-node:Info:234672] [2020-09-19 14:23:20.66 UTC] Deleting blocks numbered: []
[db-sync-node:Error:234672] [2020-09-19 14:23:21.03 UTC] runDBThread: DbInsertException "TxMetadata" (SqlError {sqlState = "22P05", sqlExecStatus = FatalError, sqlErrorMsg = "unsupported Unicode escape sequence", sqlErrorDetail = "\\u0000 cannot be converted to text.", sqlErrorHint = ""})
[db-sync-node:Error:234667] [2020-09-19 14:23:21.05 UTC] ChainSyncWithBlocksPtcl: DbInsertException "TxMetadata" (SqlError {sqlState = "22P05", sqlExecStatus = FatalError, sqlErrorMsg = "unsupported Unicode escape sequence", sqlErrorDetail = "\\u0000 cannot be converted to text.", sqlErrorHint = ""})
[db-sync-node.Subscription:Error:234663] [2020-09-19 14:23:21.05 UTC] [String "Application Exception: LocalAddress {getFilePath = \"/run/cardano-node/node.socket\"} DbInsertException \"TxMetadata\" (SqlError {sqlState = \"22P05\", sqlExecStatus = FatalError, sqlErrorMsg = \"unsupported Unicode escape sequence\", sqlErrorDetail = \"\\\\u0000 cannot be converted to text.\", sqlErrorHint = \"\"})",String "SubscriptionTrace"]
[db-sync-node.ErrorPolicy:Error:4] [2020-09-19 14:23:21.05 UTC] [String "ErrorPolicyUnhandledApplicationException (DbInsertException \"TxMetadata\" (SqlError {sqlState = \"22P05\", sqlExecStatus = FatalError, sqlErrorMsg = \"unsupported Unicode escape sequence\", sqlErrorDetail = \"\\\\u0000 cannot be converted to text.\", sqlErrorHint = \"\"}))",String "ErrorPolicyTrace",String "LocalAddress {getFilePath = \"/run/cardano-node/node.socket\"}"]

All 15 comments

This is a HUGE pain in the neck.

The tx metadata that is being inserted is:

(1,S "\NUL\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX")

which is being encoded as JSON:

"\u0000\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003"

which is perfectly valid JSON, but the database rejects it because \u0000 cannot be converted to Postgres text.

Even worse, the above is encoded to UTF-8 by the Haskell Text library without complaint, as shown, and it is only PostgreSQL that then rejects it.

Let me reiterate what a HUGE pain in the neck this is.

The string in question is decoded as UTF-8 by the Haskell Text library and then passed to Postgres, which throws an SqlError exception. I have tried catching this exception, but catching it does not seem to solve the problem.

I can catch and log the exception:

[db-sync-node:Warning:37] [2020-09-21 00:26:18.77 UTC] insertTxMetadata: Failed to insert (1,S "\NUL...\ETX")
SqlError {sqlState = "22P05", sqlExecStatus = FatalError, sqlErrorMsg = "unsupported Unicode escape sequence", sqlErrorDetail = "\\u0000 cannot be converted to text.", sqlErrorHint = ""}

but then there is a second exception:

[db-sync-node:Error:37] [2020-09-21 00:26:18.77 UTC] runDBThread: SqlError {sqlState = "25P02", sqlExecStatus = FatalError, sqlErrorMsg = "current transaction is aborted, commands ignored until end of transaction block", sqlErrorDetail = "", sqlErrorHint = ""}

This suggests that I need to more fully validate the JSON before insertion.

We can store invalid UTF-8 in PostgreSQL, but only as bytea data, not as jsonb, and if we go that route Postgres's JSON query support will not be available.
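
For illustration, the bytea route would look roughly like this (a sketch using postgresql-simple and a hypothetical tx_metadata_raw table, not the db-sync schema):

{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Database.PostgreSQL.Simple (Connection, Only (..), execute)
import Database.PostgreSQL.Simple.Types (Binary (..))

-- Insert the raw metadata bytes as bytea; unlike a jsonb column,
-- bytea accepts any byte sequence, including NUL.
insertRawMetadata :: Connection -> ByteString -> IO ()
insertRawMetadata conn bytes = do
  _ <- execute conn "INSERT INTO tx_metadata_raw (bytes) VALUES (?)" (Only (Binary bytes))
  pure ()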

I have tried a manual UTF-8 decode:

decodeUtf8IO :: ByteString -> IO (Either Text.UnicodeException Text)
decodeUtf8IO = try . evaluate . Text.decodeUtf8With Text.strictDecode

and that too has the same problem.
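
For the record, a self-contained version of that experiment (a sketch; the qualified module names are assumed from the bytestring and text packages):

import Control.Exception (evaluate, try)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BS
import Data.Text (Text)
import qualified Data.Text.Encoding as Text
import qualified Data.Text.Encoding.Error as Text

decodeUtf8IO :: ByteString -> IO (Either Text.UnicodeException Text)
decodeUtf8IO = try . evaluate . Text.decodeUtf8With Text.strictDecode

main :: IO ()
main = print =<< decodeUtf8IO (BS.pack "\NUL\ETX")
-- prints: Right "\NUL\ETX"

The decode succeeds: a NUL byte is a perfectly valid UTF-8 encoding of U+0000, so a strict decoder cannot be used to flag this metadata before it reaches Postgres.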

From: https://stackoverflow.com/questions/31671634/handling-unicode-sequences-in-postgresql

A null byte isn't legal in a PostgreSQL string. Neither is a zero code point.

I currently think this is due to the way Haskell/Persistent serializes this value before sending it to Postgres, and I am trying to validate that theory.

Aeson.encode already escapes the NUL char to \u0000, which PostgreSQL then rejects.
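
That escaping is easy to reproduce (a minimal sketch using aeson's ToJSON instance for String):

import qualified Data.Aeson as Aeson
import qualified Data.ByteString.Lazy.Char8 as LBS

main :: IO ()
main = LBS.putStrLn (Aeson.encode ("\NUL\ETX\ETX" :: String))
-- prints: "\u0000\u0003\u0003"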

Coming up with a neat fix for this is not going to be easy.

I was looking into this and it seems these properties hold:

  • "\u0000\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003" is pretty valid as UTF-8, since UTF-8 does allow \u0000 to be represented directly.
  • \u0000 is not allowed for jsonb because it cannot be converted to text.

The only temporary solution is to encode the metadata to JSON, render that to Text, and then detect the bad character sequence, rejecting the JSON object if it contains that sequence. I also wonder whether \u0000 is the only sequence I need to reject.
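
A sketch of that scan-and-reject approach (illustrative names, not the actual cardano-db-sync code; assumes the aeson and text packages):

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Aeson as Aeson
import Data.Text (Text)
import qualified Data.Text as Text
import qualified Data.Text.Lazy as LText
import qualified Data.Text.Lazy.Encoding as LText

-- Render the metadata to JSON text, then reject it outright if the
-- rendered form contains the escape sequence jsonb cannot store.
safeJsonText :: Aeson.Value -> Maybe Text
safeJsonText v =
  let rendered = LText.toStrict (LText.decodeUtf8 (Aeson.encode v))
  in if "\\u0000" `Text.isInfixOf` rendered
       then Nothing -- drop the metadata, as the temporary fix does
       else Just rendered

Note that this naive substring check also fires when a string value merely contains the six literal characters \u0000 (which aeson renders as \\u0000), which is part of why it is only a stopgap.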

Couldn't you store two columns: the bytea, and the JSON when it doesn't contain the non-text character? That way anyone is free to use whichever representation they need.

@erikd The \u0000 restriction is clearly stated here, along with all the other rules that Postgres follows. Bottom line: there are rejections and transformations going on.

To avoid a leaky abstraction, the exact values (as given by the user) must be stored. This roughly means either a blob (= no structure in the DB) or JSON string values that somehow preserve the original meaning: for instance base64-encoded string values in the DB (so recovering the structure at the expense of extra processing in the application in order to translate from base64 to the actual value).
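
A sketch of the base64 option (illustrative only; assumes the aeson and base64-bytestring packages):

import qualified Data.Aeson as Aeson
import qualified Data.ByteString.Base64 as B64
import qualified Data.Text.Encoding as Text

-- Base64-encode every JSON string value so the original bytes survive
-- jsonb's restrictions; consumers must base64-decode to recover them.
base64Strings :: Aeson.Value -> Aeson.Value
base64Strings v = case v of
  Aeson.String s -> Aeson.String (Text.decodeUtf8 (B64.encode (Text.encodeUtf8 s)))
  Aeson.Array xs -> Aeson.Array (fmap base64Strings xs)
  Aeson.Object o -> Aeson.Object (fmap base64Strings o)
  _              -> v

Object keys can contain a NUL too, so they would need the same treatment.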

The temporary workaround fix (simply dropping metadata objects that cannot be inserted in Postgres) has been merged to master.

This is how it looks now with the temporary fix:

[db-sync-node:Info:37] [2020-09-22 10:19:09.66 UTC] insertShelleyBlock: epoch 84, slot 6060000, block 1816721, hash 3ec15354c53deae4eb26a206cc3185f799e80bd09393f279bce7e53a7d633144
[db-sync-node:Warning:37] [2020-09-22 10:19:24.08 UTC] insertTxMetadata: dropped due to a Unicode NUL character.
[db-sync-node:Warning:37] [2020-09-22 10:19:24.08 UTC] insertTxMetadata: dropped due to a Unicode NUL character.
[db-sync-node:Warning:37] [2020-09-22 10:19:24.10 UTC] insertTxMetadata: dropped due to a Unicode NUL character.
[db-sync-node:Warning:37] [2020-09-22 10:19:24.10 UTC] insertTxMetadata: dropped due to a Unicode NUL character.
[db-sync-node:Info:37] [2020-09-22 10:19:24.93 UTC] insertShelleyBlock: epoch 84, slot 6065000, block 1816973, hash 0ab0dd5c36c6eb480b8bbd05508be952c5ed8597a4e422af25bd80a905a9368d

Is the Docker image fixed, and which version contains the fix?

It's not fixed, but 5.0.x (I would recommend 5.0.2) has a workaround that avoids this issue by dropping tx_metadata containing the Unicode NUL character.
