Ctags: Universal Ctags inserts invalid UTF-8 characters for certain files

Created on 30 Jul 2018  ·  7 Comments  ·  Source: universal-ctags/ctags

(
Thank you for contacting us.

If you are reporting an issue with the parsing output, please fill in
the following template. As your custom ctags configuration can
affect results, please always use --options=NONE as the first
option when running ctags.

Otherwise, delete the template and write your issue from scratch.
Examples may help developers understand your issue better.

Use the GitHub web interface and markdown notation.
Replying by mail results in broken text rendering that makes
the developers go crazy.
)


The name of the parser:

The command line you used to run ctags:

$ ctags -R

I don't have any special configuration in .ctags or anywhere else; this test was run on a fresh VM.

The content of input file: https://github.com/pallets/jinja/blob/master/jinja2/_identifier.py

The tags output you are not satisfied with:

Universal Ctags inserts invalid UTF-8 characters under certain circumstances.

The tags output you expect:

Tags output containing only valid UTF-8 characters.

The version of ctags:

$ ctags --version
Universal Ctags 0.0.0(3522685), Copyright (C) 2015 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: July 27 2018, 23:16:36
  URL: https://ctags.io/
  Optional compiled features: +wildcards, +regex, +iconv, +option-directory, +xpath

How do you get ctags binary:

(
The ctags binary is built on an Ubuntu 16.04 VM with no modifications, other than installing the libraries needed to compile ctags (such as automake and autoreconf) and the libraries needed to compile Vim, per https://github.com/Valloric/YouCompleteMe/wiki/Building-Vim-from-source#a-for-a-debian-like-linux-distribution-like-ubuntu-type
)

@lilydjwg pointed out to me that ctags was inserting invalid UTF-8 characters even though the file used to generate the tags contains only valid UTF-8 characters, here:
https://github.com/vim/vim/issues/3213#issuecomment-406961075

The compiled version of ctags works great in general.

I recently found out that this comes down to a ctags bug: the old Exuberant
Ctags installed by sudo apt-get install ctags on Ubuntu 16.04 doesn't insert
any invalid UTF-8 characters, but Universal Ctags compiled from source and
installed per the instructions at
https://github.com/universal-ctags/ctags/blob/master/docs/autotools.rst does.
Here's the evidence:

With Exuberant Ctags installed using just sudo apt-get install ctags:

[screenshot: 2018-07-29_19-03-44]

With Universal Ctags compiled from source (latest commit as of this post),
following the instructions at
https://github.com/universal-ctags/ctags/blob/master/docs/autotools.rst:

[screenshot: 2018-07-29_19-10-22]

This causes a lot of problems in Vim, because if invalid UTF-8 characters are
passed to vim.eval, vim.eval breaks, and this leads to no tags being returned
at all. Currently there is only one way of transferring data contained in a
VimL variable to the Python namespace: vim.eval. So any other plugin in Vim
or elsewhere will have similar problems as well. @ludovicchabant, for
example, had to post-process his tags file to stop such problems:
https://ludovic.chabant.com/devblog/2017/02/25/aaa-gamedev-with-vim/

He also had to change ctrlp-py-matcher to catch this issue:
https://github.com/ludovicchabant/ctrlp-py-matcher/blob/2f6947480203b734b069e5d9f69ba440db6b4698/autoload/pymatcher.py#L22
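
For illustration only, a guard of that sort might look like the following (this is a hypothetical sketch, not the actual pymatcher code):

```python
# Hypothetical guard (not the actual pymatcher code): drop tag entries
# that are not valid UTF-8 before they can reach vim.eval.
def keep_valid_utf8(entries):
    valid = []
    for entry in entries:
        try:
            entry.decode('utf-8')
        except UnicodeDecodeError:
            continue  # a UTF-8-only consumer would choke on this entry
        valid.append(entry)
    return valid

print(keep_valid_utf8([b'good_tag', b'bad_\xd8tag']))  # [b'good_tag']
```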

There are multiple other files I have seen with similar problems, but I have
provided just one here to narrow down the problem.

My guess is this is a bug, and I don't expect that ctags would do this by
design. Can this be rectified, as this used to work fine in Exuberant Ctags
upon which Universal-ctags is based?

Ref: https://github.com/vim/vim/issues/3213#issuecomment-408727629

All 7 comments

Sounds like #1275 to me: the new pattern-length-limit option is cutting at an arbitrary byte position, which happens to be in the middle of a character sequence. See #163, #640 and #1018.

Something like https://github.com/universal-ctags/ctags/issues/1275#issuecomment-274489859 should probably be implemented to fix this.
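
For illustration, here are the failure mode and a boundary-aware cut of the kind that comment proposes, sketched in Python (the real fix would live in ctags' C code; the function name is made up):

```python
# Sketch of the bug and the proposed fix. A naive byte-count cut can
# split a multi-byte character; backing up to a character boundary
# keeps the output valid UTF-8.
def truncate_utf8(buf: bytes, limit: int) -> bytes:
    cut = buf[:limit]
    if not cut:
        return cut
    # Walk back over continuation bytes (0b10xxxxxx) to the lead byte
    # of the final character.
    i = len(cut) - 1
    while i > 0 and cut[i] & 0xC0 == 0x80:
        i -= 1
    lead = cut[i]
    if lead < 0x80:
        expected = 1            # ASCII
    elif lead & 0xE0 == 0xC0:
        expected = 2            # 2-byte sequence
    elif lead & 0xF0 == 0xE0:
        expected = 3            # 3-byte sequence
    else:
        expected = 4            # 4-byte sequence
    # If the cut left the final character incomplete, drop it entirely.
    if len(cut) - i < expected:
        cut = cut[:i]
    return cut

sample = 'größe'.encode('utf-8')  # b'gr\xc3\xb6\xc3\x9fe'
print(sample[:3])                 # b'gr\xc3' -- naive cut, invalid UTF-8
print(truncate_utf8(sample, 3))   # b'gr'     -- still valid UTF-8
```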

@alphaCTzo7G see #1807, does that properly fix it for you?

@b4n, thanks for your quick response...

On the file that I posted here (_identifier.py), using the #1805 commit, ctags no longer inserts invalid characters or cuts at an arbitrary location.

I will try out this PR on my real system over the next few days to see whether it works across my repositories as a whole or emits other errors.

As ctrlp and ctrlp-py-matcher are very popular plugins, it would be great if #1807 is merged so vim and other text-editor users can use ctrlp and ctrlp-py-matcher without having to worry about this issue.

There was another file I found that was causing problems with vim.eval; it contained invalid UTF-8 characters, as determined by grep -axv '.*' misc.html (misc.html is in https://github.com/alphaCTzo7G/test). What I noticed is that ctags will copy the invalid UTF-8 characters from misc.html into the tags file.

Does it make sense for ctags to detect invalid characters in input files and replace them with something like what @tonymec suggested here: https://github.com/vim/vim/issues/3213#issuecomment-405211243 ("replace the invalid sequence by one or more instances of the character � (U+FFFD REPLACEMENT CHARACTER), which is meant for exactly that purpose")?
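
As a sketch of what that suggestion means in practice, Python's built-in 'replace' error handler does exactly this substitution (this only illustrates the idea; it is not current ctags behavior):

```python
# U+FFFD substitution as suggested: each undecodable byte becomes the
# replacement character, so the result is always valid UTF-8.
raw = b'entity \xa0 nbsp'                      # 0xa0 alone is invalid UTF-8
fixed = raw.decode('utf-8', errors='replace')  # bad byte -> U+FFFD
print(fixed)                                   # entity � nbsp

# The grep -axv check above has a direct equivalent: a line is invalid
# exactly when a strict decode raises.
def is_valid_utf8(line: bytes) -> bool:
    try:
        line.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```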

IIUC, ctags (Exuberant Ctags, I mean, which is only one of the ctags programs available) is distributed separately from Vim (even if its author knows Bram, and even if they occasionally work together to make Vim and ctags work better together).

From a ctags point of view, it is legitimate to treat program text as just strings of bytes: regardless of whether it is UTF-8, Latin1, Latin9 or some other ISO 8859 charset, a space is 0x20, a hard tab is 0x09, a line break is 0x0A possibly preceded by 0x0D, etc.; and a null byte, 0x00, should not appear in a text file. Ctags treats every program in the same way regardless of which ASCII-compatible encoding it is written in, and therefore it doesn't need to care which is which.

Only for some outlandish charsets like EBCDIC would it need to treat the text as definitely non-ASCII (in EBCDIC, IIRC, A-I are 0xC1-0xC9, J-R are 0xD1-0xD9, S-Z are 0xE2-0xE9, and 0-9 are 0xF0-0xF9; I don't remember the codes for a space, a tab, a line break, a dash or an underscore, but you can see that from an ASCII viewpoint it is really outlandish).

IMHO, in ctags' case, the good old principle applies: garbage in, garbage out.

Best regards,
Tony.

@tonymec, that makes sense. I realize that there may be other tag-generation programs, but Universal Ctags is the most popular, and my guess is that a large share of the people using it are Vim users.

So I am wondering whether these two might work, or whether you have any other ideas of how to handle files that have illegal UTF-8 characters:

  1. I also noticed that ctags has the +iconv feature, which enables libiconv. Used on the command line, iconv can remove illegal UTF-8 sequences, so I wonder whether passing --input-encoding=utf-8 and --output-encoding=utf-8 would turn all illegal UTF-8 sequences into legal ones (see the sketch after this list).

This is explained in section 1.3.4 of https://media.readthedocs.org/pdf/ctags/latest/ctags.pdf:

Two new options have been introduced (--input-encoding=IN and --output-encoding=OUT). Using the encoding specified with these options, ctags converts input from IN to OUT. ctags uses the converted strings when writing the pattern parts of each tag line; as a result, the tags output is encoded in OUT. In addition, OUT is specified at the top of the tags file as the value for the TAG_FILE_ENCODING pseudo tag. The default value of OUT is UTF-8. NOTE: Converted input is NOT passed to language parsers; the parsers still deal with input as a byte sequence. With --input-encoding-<LANG>=IN, you can specify a specific input encoding for LANG. It overrides the global default value given with --input-encoding.

  2. Leave it up to the editor to handle illegal UTF-8 characters. In that case, either vim.eval has to be fixed, or there has to be a VimL function that can strip illegal UTF-8 characters from the data before passing it to vim.eval.
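
For a concrete picture of what option 1 relies on, here is the conversion step approximated in Python (ctags does this via libiconv in C; the snippet is only an analogy, and whether a utf-8-to-utf-8 round trip actually repairs invalid input depends on how the iconv conversion handles errors, so that part would need testing):

```python
# Analogy for option 1: with +iconv, ctags converts the pattern text
# from --input-encoding to --output-encoding before writing tag lines.
raw = b'caf\xe9'                              # 'café' in Latin-1; invalid as UTF-8
utf8 = raw.decode('latin-1').encode('utf-8')  # the conversion step
print(utf8)                                   # b'caf\xc3\xa9' -- valid UTF-8
```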

@alphaCTzo7G I agree with @tonymec and his conclusion.

Unfortunately, it's a lot of trouble recognizing the proper encoding -- and I insist on proper, because it's easy to find an encoding in which the input would be technically valid (most if not all 8-bit encodings would qualify), but knowing whether it's the right one is tricky or impossible: say, how can one be sure between e.g. ISO 8859-1 and 8859-15? Solutions include complex heuristics about usage frequency and context; a more naive idea, applicable to some languages like HTML, would be extracting the encoding statement inside the file, but that can be incorrect just as well.

Also, ctags stands in a difficult position here: many, if not most, consumers don't handle encodings, and generated tags need to match at the byte level. For example, grepping for a tag pattern or even name won't convert encodings for you, so the tag should match the file at the byte level. It was easy when all we had to care about was ASCII, but we're not so lucky anymore… UTF-8 didn't get adopted early enough.
This also applies to the idea of replacing with placeholder characters: what can the consumer do with such a replacement character? It at least has to handle it in a specific way.

However, if you're happy with replacing invalid UTF-8 sequences with U+FFFD or stripping them, maybe you could simply post-process ctags' output?
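
A minimal sketch of such a post-processing step, assuming the U+FFFD policy from above (the file path is illustrative; a stripping policy would use errors='ignore' instead):

```python
# Minimal tags post-processor: rewrite the file so it is valid UTF-8,
# substituting U+FFFD for undecodable bytes. Path and policy are
# illustrative assumptions.
from pathlib import Path

def sanitize_tags(path: str = 'tags') -> None:
    raw = Path(path).read_bytes()
    cleaned = raw.decode('utf-8', errors='replace').encode('utf-8')
    if cleaned != raw:  # only rewrite when something changed
        Path(path).write_bytes(cleaned)

sanitize_tags()
```

Note that the caveat above still applies: a rewritten pattern no longer matches the source file at the byte level.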

@b4n, appreciate your comment. I actually deal mostly with UTF-8-encoded files and encode the files I create as UTF-8. Unfortunately, as you mentioned, I do use libraries that sometimes have arbitrary encodings.

I use vim-gutentags, and it does provide post-processing functionality. While I could manually post-process the tags file so that it contains only valid UTF-8, when I tried the post-processing functionality in vim-gutentags, it didn't work. So I thought it might be better to find a more robust solution, but if one doesn't exist I will have to look into it again.

To detect the encoding of a file, couldn't you use the libraries underlying one of the options here: https://stackoverflow.com/questions/805418/how-to-find-encoding-of-a-file-in-unix-via-scripts

such as enca, file, uchardet or encguess? These are all command-line utilities, but there must be a library somewhere that could be used internally by ctags. My guess is that, because of the number of encodings, it may never be possible to predict the encoding perfectly, as you mentioned, but a simple solution that covers most cases may be better than nothing.
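
For illustration, the heuristic guessing those tools perform is also available as libraries; in Python, for example, the third-party chardet package (named here only as an example of the approach):

```python
# Illustrative only: heuristic encoding detection with chardet
# (pip install chardet); a C library such as uchardet could fill
# the same role inside ctags.
import chardet

with open('misc.html', 'rb') as f:  # misc.html from the repo above
    raw = f.read()

guess = chardet.detect(raw)
print(guess['encoding'], guess['confidence'])  # e.g. ISO-8859-1 0.73
```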

I will try out the --input-encoding (and/or --input-encoding-<LANG>) and --output-encoding options. I'm not sure it will work all the time, because it's quite possible that files in the same repository will have different encodings, unless ctags figures out the correct encoding for each file individually and emits output in the desired format.
