Linenoise: Add multibyte support

Created on 23 Jan 2012  ·  21Comments  ·  Source: antirez/linenoise

Current code doesn't have support for multibyte strings, e.g. strings having unicode characters beyond ASCII range. The column shifts for refreshLine are calculated using strlen() which returns 2 instead of 1 for a 2-byte character like 'Ş' in Turkish.

The library should use mbstowcs() or other functions to get the number of characters instead of number of bytes for column processing (up, down arrows, erasing a character, etc.).

And also as those functions are LC_CTYPE dependent, either you or the applications using linenoise should call setlocale(LC_ALL, "") to set the application's locale to the system locale.

Thanks.

Most helpful comment

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 11.0 and includes all the recent changes made in antirez/linenoise.

All 21 comments

Take a look at my fork, https://github.com/msteveb/linenoise, which has support for utf-8

Do you really need all those functions? I'm not quite familiar with the stuff but I easily fixed some of the weird problems by using mbstowcs() instead of strlen() where the length of the string is assumed equivalent to the number of characters in the string. But I couldn't find way to fix deleting of wide characters with backspace..

The approach here is to avoid any reliance on system support for utf-8. For example, I have systems running uClibc without locale support which can still happily run a utf-8 console over a serial port. Of course you are welcome to take a different approach.

I have a similar issue; I tried out line-noise for a shell implementation. If I want coloured prompts, the escape-codes end up being included in the length calculation.

A simpler, easier fix is to eirther:

1) allow specifying the length of the prompt yourself.
2) use terminal commands to extract the position of the cursor after outputing the prompt (not sure if this is possible)

I find this from mongo shell's code. I'm always annoyed by more and more CLI tools (mongo, redis-cli, node)) I use whose cursor moves weiredly when there are multibyte characters. I don't know if the others are using linenoise or something else, but I'd like to see this get fixed :-)

I've made a modified linenoise that lets you specify the width yourself, so it's extra work for the application, but at least possible; I've been using it for about 3 months with no problems. I'll turn it into a pull request, perhaps.

'utf-8 support' branch on my fork fixed the following UTF-8 problems that appear in the latest linenoise version 1.0:

  • Multi-byte characters: ö (U+00F6)
  • Multi-code characters: ö (U+006F U+0308)
  • Wide characters: 日本語 ('Japanese')
  • Prompt text including the above characters and ANSI escaped colored text.

I first tried https://github.com/msteveb/linenoise. But it is not based on the latest linenoise which supports the fantastic multiline mode. Also it doesn't support CJK wide characters and multi-code characters...

Hello, I'm thinking about going the following route with this issue:

  1. Use @yhirose as a reference in order to check where the C plain string functions should be substituted by multi-byte aware ones.
  2. Export an API that allows linenoise user to set alternative functions for string length calculations. Set the function to the plain C functions as default.
  3. Include @yhirose code as a separated file that you can add to your application, calling the linenoise new functions to set the length functions, in order to have multi-byte support.

This way we obtain that linenoise simplicity remains almost untouched, but optionally it is both possible to support multi byte chars both with C++ functions, other user provided functions different from standard ones, or the ones included in linenoise itself if your project is in C and you don't want to rewrite what @yhirose already wrote again and again.

Makes sense to you? Thanks.

@antirez, Thanks for paying attention to the multi-byte code users! The idea that you presented totally makes sense to me. I am even happier because if the linenoise library itself can give the extensibility, we could easily add other multi-byte encoding support.

As you can see in my fork, the most important concept for enabling 'multi byte' support is to make a clear distinction between 'byte position/width' in text buffer and 'column position/width' on screen. Here are some examples in UTF-8:

  • (U+3042): E3 81 82 (3 bytes): Wide (2 column width)
  • ö (U+00F6): C3 B6 (2 bytes): Narrow (1 column width)
  • (U+006F U+0308): 6F CC 88 (3 bytes): Narrow (1 column width)

Once we come to know the difference, it's pretty easy to handle multi-byte code correctly. You can grasp the idea from changes in the 1st commit. I applied the same principle to prompt text in the 2nd commit as well.

The only place where we need to be careful is the multiline mode handling code. For instance, when the last character is wide and there is only 1 column left on the current row, that wide character doesn't fit the remaining space. So the wide character must be displayed at the beginning of the next line. This code handles it.

One more thing that I did is to skip all the ANSI escape sequence characters when calculating column position/width in the 3rd commit. This change enables us to use color in the prompt text.

I am really excited to see the new API in the near future. Please let me know if you have any questions on this matter. I am sure that you will do a fantastic job!!

After researching more about dependencies between the linenoise code and the UTF-8 encoding code according to your design goal, I realized that only three functions are needed when adding other encoding support.

Based on the research, I have updated my branch. Here is the diff between the linenoise head and the utf8-support branch. As you could see there, I got rid of all UTF-8 specific code completely from linenoise.c and put them into encodings/utf8.h and encodings/utf8.c. Also I added one experiment API called linenoiseSetEncodingFunctions on linenoise.h, so that users could set their own set of encoding functions. I confirmed all the functionalities still work.

Here is a snippet of my current experimental API:

typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);

void linenoiseSetEncodingFunctions(
    linenoisePrevCharLen *prevCharLenFunc,
    linenoiseNextCharLen *nextCharLenFunc,
    linenoiseReadCode *readCodeFunc);

linenoisePrevCharLen and linenoiseNextCharLen return byte length as the return value, and set column length to col_len parameter. linenoiseReadCode reads bytes into buf, and convert the bytes and set a meaningful character code for the encoding to c parameter.

If users don't call linenoiseSetEncodingFunctions, it'll end up calling _default_ implementations. They simply handle _one byte_ as a character.

Hope the post will be helpful when you design the new encoding API. I am really looking forward to it!!

@yhirose that's a fantastic work!!! :-) I'm going to check the code and merge it. Thank you for this.

Not merged yet?

@antirez any progress on merging it?

I have modified my fork (https://github.com/yhirose/linenoise/tree/utf8-support) to catch up with the recent changes made in the original linenoise such as 'hints' feature.

Thank you very much @yhirose. You have made good code better! and my
job easier!
 
@sonophoto

On Mon, 27 Jun 2016 18:56:45 -0700, yhirose wrote:

   I have modified my fork 

(https://github.com/yhirose/linenoise/tree/utf8-support) to catch up
with the recent changes made in the original linenoise such as 'hints'
feature.

You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

*

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 9.0.

@antirez Will you have free time in the near future to merge @yhirose's multi-byte support? Or should we switch https://github.com/hoelzro/lua-linenoise to use @yhirose's fork until then? ✌️

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 11.0 and includes all the recent changes made in antirez/linenoise.

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 12.1.

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 13.0.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

krux02 picture krux02  ·  8Comments

denisvm picture denisvm  ·  9Comments

ghost picture ghost  ·  4Comments

fatcerberus picture fatcerberus  ·  5Comments

JelteF picture JelteF  ·  8Comments