Linenoise: Add multibyte support

Created on 23 Jan 2012 · 21 Comments · Source: antirez/linenoise

The current code doesn't support multibyte strings, i.e. strings containing Unicode characters beyond the ASCII range. The column shifts for refreshLine are calculated using strlen(), which returns 2 instead of 1 for a 2-byte character like 'Ş' in Turkish.

The library should use mbstowcs() or other functions to get the number of characters instead of number of bytes for column processing (up, down arrows, erasing a character, etc.).

Also, since those functions depend on LC_CTYPE, either linenoise or the applications using it should call setlocale(LC_ALL, "") to set the application's locale to the system locale.
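
A minimal sketch of the locale-based approach described above, assuming a POSIX libc (passing NULL as the destination of mbstowcs() to get only the count is POSIX behavior). The helper names char_count and use_system_locale are hypothetical and are not part of linenoise.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: switch the LC_CTYPE category to the system locale
 * so that the multibyte functions follow the user's encoding.
 * Returns 0 on success, -1 if the locale could not be set. */
int use_system_locale(void) {
    return setlocale(LC_CTYPE, "") != NULL ? 0 : -1;
}

/* Number of characters (not bytes) in s under the current LC_CTYPE
 * locale, or (size_t)-1 if s is not a valid multibyte string there. */
size_t char_count(const char *s) {
    return mbstowcs(NULL, s, 0);
}
```

An application would call use_system_locale() once at startup. Under a UTF-8 locale, char_count() should then report one character for 'Ş' where strlen() reports two bytes. Without the setlocale() call, the default "C" locale applies, and non-ASCII input may be rejected.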

Thanks.

ozancaglayan

👍2

All 21 comments

See: http://www.cl.cam.ac.uk/~mgk25/unicode.html

ozancaglayan on 23 Jan 2012

Take a look at my fork, https://github.com/msteveb/linenoise, which has support for utf-8

msteveb on 23 Jan 2012

Do you really need all those functions? I'm not very familiar with this area, but I easily fixed some of the weird problems by using mbstowcs() instead of strlen() in the places where the length of the string is assumed to equal its number of characters. However, I couldn't find a way to fix deleting wide characters with backspace.

ozancaglayan on 23 Jan 2012

The approach here is to avoid any reliance on system support for utf-8. For example, I have systems running uClibc without locale support which can still happily run a utf-8 console over a serial port. Of course you are welcome to take a different approach.
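
To illustrate what counting characters without any locale support can look like, here is a minimal sketch. It is not the code from the fork; the function name utf8_codepoints is hypothetical, and it assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Counts UTF-8 code points by counting every byte that is not a
 * continuation byte (bit pattern 10xxxxxx). Assumes s is valid UTF-8.
 * No setlocale() call or libc multibyte support is required. */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

Note that this counts code points, not screen columns: a combining mark is counted as a separate code point, and a wide CJK character is counted once even though it occupies two columns.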

msteveb on 24 Jan 2012

I have a similar issue: I tried out linenoise for a shell implementation. If I want coloured prompts, the escape codes end up being included in the length calculation.

A simpler, easier fix is to either:

1) allow specifying the length of the prompt yourself.
2) use terminal commands to extract the position of the cursor after outputting the prompt (not sure if this is possible)
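
On the second option: terminals that implement the ANSI Device Status Report answer the query ESC [ 6 n with a Cursor Position Report of the form ESC [ row ; col R. The sketch below covers only the parsing of that reply. The function name parse_cursor_report is hypothetical, and the surrounding terminal I/O (raw mode, writing the query, reading up to the 'R') is assumed to happen elsewhere.

```c
#include <stdio.h>

/* Parses a Cursor Position Report: ESC '[' <row> ';' <col> 'R'.
 * Rows and columns are 1-based. Returns 0 on success and -1 if the
 * reply is malformed or incomplete. */
int parse_cursor_report(const char *reply, int *row, int *col) {
    char final = 0;
    if (sscanf(reply, "\x1b[%d;%d%c", row, col, &final) != 3) {
        return -1;
    }
    return final == 'R' ? 0 : -1;
}
```

If the prompt starts in column 1 and does not wrap, the reported column minus one is the prompt's on-screen width, regardless of escape codes or multibyte characters in it.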

jasom on 3 Dec 2013

I found this issue via the mongo shell's code. I'm always annoyed that more and more of the CLI tools I use (mongo, redis-cli, node) have a cursor that moves weirdly when there are multibyte characters. I don't know if the others are using linenoise or something else, but I'd like to see this get fixed :-)

lilydjwg on 12 Mar 2014

I've made a modified linenoise that lets you specify the width yourself, so it's extra work for the application, but at least possible; I've been using it for about 3 months with no problems. I'll turn it into a pull request, perhaps.

jasom on 14 Mar 2014

The 'utf-8 support' branch on my fork fixes the following UTF-8 problems that appear in the latest linenoise version 1.0:

Multi-byte characters: ö (U+00F6)
Multi-code characters: ö (U+006F U+0308)
Wide characters: 日本語 ('Japanese')
Prompt text including the above characters and ANSI escaped colored text.

I first tried https://github.com/msteveb/linenoise. But it is not based on the latest linenoise which supports the fantastic multiline mode. Also it doesn't support CJK wide characters and multi-code characters...

yhirose on 26 Oct 2015

Hello, I'm thinking about going the following route with this issue:

1) Use @yhirose's fork as a reference to check where the plain C string functions should be replaced by multi-byte aware ones.
2) Export an API that allows the linenoise user to set alternative functions for string length calculations, with the plain C functions as the default.
3) Include @yhirose's code as a separate file that you can add to your application, calling the new linenoise functions to set the length functions, in order to have multi-byte support.

This way linenoise's simplicity remains almost untouched, while multi-byte support becomes optional: it can be provided by C++ functions, by other user-provided functions that differ from the standard ones, or by the ones included in linenoise itself if your project is in C and you don't want to rewrite what @yhirose already wrote.

Does this make sense to you? Thanks.

antirez on 26 Oct 2015

@antirez, thanks for paying attention to users of multi-byte encodings! The idea you presented totally makes sense to me. I am even happier because, if the linenoise library itself provides this extensibility, we can easily add support for other multi-byte encodings.

As you can see in my fork, the most important concept for enabling 'multi byte' support is to make a clear distinction between 'byte position/width' in text buffer and 'column position/width' on screen. Here are some examples in UTF-8:

あ (U+3042): E3 81 82 (3 bytes): Wide (2 column width)
ö (U+00F6): C3 B6 (2 bytes): Narrow (1 column width)
ö (U+006F U+0308): 6F CC 88 (3 bytes): Narrow (1 column width)

Once we come to know the difference, it's pretty easy to handle multi-byte code correctly. You can grasp the idea from changes in the 1st commit. I applied the same principle to prompt text in the 2nd commit as well.
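
The byte/column distinction above can be sketched as follows. This is an illustration, not the code from the fork: all names are hypothetical, the input is assumed to be valid UTF-8, and the width table covers only a few ranges (combining diacritical marks, kana, CJK unified ideographs), whereas a real implementation needs the full Unicode width data.

```c
#include <stddef.h>

/* Byte length of the UTF-8 sequence introduced by lead byte c
 * (1 for ASCII and for invalid lead bytes). */
static size_t utf8_seq_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Decodes the code point stored in the len bytes at s (valid UTF-8 assumed). */
static unsigned long utf8_decode(const char *s, size_t len) {
    static const unsigned char mask[] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    unsigned long cp = (unsigned char)s[0] & mask[len];
    for (size_t i = 1; i < len; i++) {
        cp = (cp << 6) | ((unsigned char)s[i] & 0x3F);
    }
    return cp;
}

/* Illustrative width table only; deliberately incomplete. */
static int cp_width(unsigned long cp) {
    if (cp >= 0x0300 && cp <= 0x036F) return 0;   /* combining diacritical marks */
    if ((cp >= 0x3040 && cp <= 0x30FF) ||         /* hiragana, katakana */
        (cp >= 0x4E00 && cp <= 0x9FFF)) return 2; /* CJK unified ideographs */
    return 1;
}

/* Returns the byte length of the character at buf[pos], including any
 * zero-width combining marks that follow it, and stores its on-screen
 * width in *cols. */
size_t next_char_len(const char *buf, size_t buf_len, size_t pos, size_t *cols) {
    size_t start = pos;
    *cols = 0;
    while (pos < buf_len) {
        size_t len = utf8_seq_len((unsigned char)buf[pos]);
        if (pos + len > buf_len) len = buf_len - pos; /* truncated input */
        int w = cp_width(utf8_decode(buf + pos, len));
        if (pos > start && w != 0) break; /* the next base character starts here */
        *cols += (size_t)w;
        pos += len;
    }
    return pos - start;
}
```

For the three examples listed above, this yields 3 bytes and 2 columns, 2 bytes and 1 column, and 3 bytes and 1 column respectively.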

The only place where we need to be careful is the multiline mode handling code. For instance, when the last character is wide and there is only 1 column left on the current row, that wide character doesn't fit the remaining space. So the wide character must be displayed at the beginning of the next line. This code handles it.
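
The wrapping rule described above can be sketched like this. The type cursor_pos and the function advance_cursor are hypothetical names, and the sketch assumes the terminal moves a wide character that does not fit onto the next row as a whole.

```c
#include <stddef.h>

/* Zero-based cursor position; col counts the columns already used in the row. */
typedef struct {
    size_t row;
    size_t col;
} cursor_pos;

/* Advances the cursor past one character that is `width` columns wide on
 * a terminal that is `term_cols` columns wide. A character that does not
 * fit in the remaining space of the row is placed at the start of the
 * next row instead. */
cursor_pos advance_cursor(cursor_pos p, size_t width, size_t term_cols) {
    if (p.col + width > term_cols) {
        p.row++;
        p.col = 0;
    }
    p.col += width;
    return p;
}
```

With 80 columns and 79 already used, a narrow character still fits on the current row, while a wide one ends up occupying columns 0 and 1 of the next row, leaving the last column of the previous row empty.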

One more thing that I did is to skip all the ANSI escape sequence characters when calculating column position/width in the 3rd commit. This change enables us to use color in the prompt text.
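
A minimal sketch of skipping escape sequences when measuring a prompt. It is not the code from the commit: the function name visible_len is hypothetical, only CSI sequences (ESC '[' followed by parameter bytes and a final byte in the range 0x40 to 0x7E) are recognized, and for brevity it counts the remaining bytes rather than columns.

```c
#include <stddef.h>

/* Counts the bytes of s that are not part of a CSI escape sequence.
 * For non-ASCII prompts this would have to be combined with a
 * per-character column width calculation. */
size_t visible_len(const char *s) {
    size_t n = 0;
    while (*s != '\0') {
        if (s[0] == '\x1b' && s[1] == '[') {
            s += 2;
            /* Skip parameter and intermediate bytes. */
            while (*s != '\0' && !(*s >= 0x40 && *s <= 0x7E)) s++;
            /* Skip the final byte, e.g. 'm' for colors. */
            if (*s != '\0') s++;
        } else {
            n++;
            s++;
        }
    }
    return n;
}
```

For a prompt such as a green "hi" followed by a color reset and "> ", only the four visible characters are counted.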

I am really excited to see the new API in the near future. Please let me know if you have any questions on this matter. I am sure that you will do a fantastic job!!

yhirose on 27 Oct 2015

After researching more about dependencies between the linenoise code and the UTF-8 encoding code according to your design goal, I realized that only three functions are needed when adding other encoding support.

Based on the research, I have updated my branch. Here is the diff between the linenoise head and the utf8-support branch. As you can see there, I removed all UTF-8 specific code from linenoise.c and moved it into encodings/utf8.h and encodings/utf8.c. I also added one experimental API called linenoiseSetEncodingFunctions to linenoise.h, so that users can set their own set of encoding functions. I confirmed that all the functionality still works.

Here is a snippet of my current experimental API:

typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);

void linenoiseSetEncodingFunctions(
    linenoisePrevCharLen *prevCharLenFunc,
    linenoiseNextCharLen *nextCharLenFunc,
    linenoiseReadCode *readCodeFunc);

linenoisePrevCharLen and linenoiseNextCharLen return the byte length of the character as the return value, and store its column width in the col_len parameter. linenoiseReadCode reads bytes into buf, converts them, and stores a character code that is meaningful for the encoding in the c parameter.

If users don't call linenoiseSetEncodingFunctions, it'll end up calling _default_ implementations. They simply handle _one byte_ as a character.
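
As an illustration of what one-byte-per-character defaults could look like, here is a sketch written against the typedefs quoted above. The function names are hypothetical, and two details are assumptions rather than facts from the fork: that every byte is reported as one column wide, and that the read function returns 1 on success and 0 otherwise.

```c
#include <stddef.h>
#include <unistd.h>

/* Typedefs as quoted in the post above. */
typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int *c);

/* Declaring the functions through the typedefs makes the compiler check
 * that the definitions below match the quoted signatures. */
linenoisePrevCharLen oneBytePrevCharLen;
linenoiseNextCharLen oneByteNextCharLen;
linenoiseReadCode oneByteReadCode;

/* The character before pos is always 1 byte and (by assumption) 1 column. */
size_t oneBytePrevCharLen(const char *buf, size_t buf_len, size_t pos, size_t *col_len) {
    (void)buf; (void)buf_len; (void)pos;
    if (col_len != NULL) *col_len = 1;
    return 1;
}

/* The character at pos is always 1 byte and (by assumption) 1 column. */
size_t oneByteNextCharLen(const char *buf, size_t buf_len, size_t pos, size_t *col_len) {
    (void)buf; (void)buf_len; (void)pos;
    if (col_len != NULL) *col_len = 1;
    return 1;
}

/* Reads a single byte from fd and reports it as the character code.
 * Returns 1 on success and 0 on end of file or error (assumed convention). */
size_t oneByteReadCode(int fd, char *buf, size_t buf_len, int *c) {
    if (buf_len < 1) return 0;
    if (read(fd, buf, 1) != 1) return 0;
    *c = (unsigned char)buf[0];
    return 1;
}
```

An encoding such as UTF-8 would supply three functions with the same signatures that return multi-byte lengths and real column widths, and register them through linenoiseSetEncodingFunctions.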

Hope the post will be helpful when you design the new encoding API. I am really looking forward to it!!

yhirose on 29 Oct 2015

👍2

@yhirose that's fantastic work!!! :-) I'm going to check the code and merge it. Thank you for this.

antirez on 8 Nov 2015

Not merged yet?

henriqueleng on 28 Jan 2016

@antirez any progress on merging it?

dumblob on 25 Jun 2016

I have modified my fork (https://github.com/yhirose/linenoise/tree/utf8-support) to catch up with the recent changes made in the original linenoise, such as the 'hints' feature.

yhirose on 28 Jun 2016

Thank you very much @yhirose. You have made good code better! and my job easier!

@sonophoto

Sonophoto on 28 Jun 2016

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 9.0.

yhirose on 25 Oct 2016

@antirez Will you have free time in the near future to merge @yhirose's multi-byte support? Or should we switch https://github.com/hoelzro/lua-linenoise to use @yhirose's fork until then? ✌️

aleclarson on 21 Feb 2018

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 11.0 and includes all the recent changes made in antirez/linenoise.

yhirose on 15 Oct 2018

👍5

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 12.1.

yhirose on 10 Jul 2019

👍4

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 13.0.

yhirose on 24 Apr 2020

❤3
