Xterm.js: Support RTL languages

Created on 13 Jun 2017  ·  17 Comments  ·  Source: xtermjs/xterm.js

Downstream issue: https://github.com/Microsoft/vscode/issues/28571

When we enforced Unicode character width (https://github.com/sourcelair/xterm.js/issues/467), this broke RTL language characters, which are now rendered in reverse (LTR). We could revert that for RTL character ranges only, but we should do the right fix and reverse the strings so they actually sit on the character grid, because the new selection model relies on all characters lining up perfectly on the grid (https://github.com/sourcelair/xterm.js/pull/670).

Ideally line reflow https://github.com/sourcelair/xterm.js/issues/622 would be done before this so it's easier to change the contents of multiple lines.

Terminal.app:

image

VS Code 1.13 (notice sentences are reversed):

image

@mostafa69d @CherryDT a little info on the languages in question would be handy:

  1. Where should the strings be flipped? For Hebrew/Arabic/Persian, do I reverse entire contiguous sequences of characters in between ASCII characters?
  2. How are the characters meant to interact with characters like 0-9 or punctuation?

Useful references:

Labels: area/i18n, area/renderer, type/enhancement

Most helpful comment

@Tyriar
First of all, I'll give you a very brief overview of the Arabic and Persian scripts; maybe it will help (I'm not sure whether Hebrew works the same way).
In Arabic and Persian, the letters of the alphabet are like "آ", "ب", "س" and so on. Words are made from these letters (obviously), but by very different rules compared with, for example, English.
The difference is that some letters have more than one shape. Take "س": the first shape is "س", the second is "سـ", the third is "ـسـ" and the last is "ـس". What are these shapes for? The shape we use depends on where the letter appears in a word. For example, for the letter "س" we use the shape "سـ" when a word starts with it, as in "سلام". This is the real difference between a language like English and Persian or Arabic: we build words by concatenating the different shapes of the letters (in some cases the letters are joined together). To stress the rule again: words are built by concatenating shapes, not bare letters (whereas English always concatenates plain letters). You can see some examples below:
Take the letters "ک", "ن", "ا", "د", "ی".
From just these letters I can make the words: نادان, یاد, دکان
So, to wrap up and explain what happened in the screenshots I posted: the terminal breaks the words into individual letters and reverses them (so it's not just about reversing). Look at the words I made and the letters I listed above: the VS Code terminal now shows them "separated" and "reversed".

Correct format: نادان Terminal: ن ا د ا ن
Correct format: یاد Terminal: د ا ی
Correct format: دکان Terminal: ن ا ک د
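
The point about shapes can be made concrete in code. This is a minimal sketch, assuming the legacy Arabic Presentation Forms-B code points for "س" (U+FEB1 to U+FEB4) as stand-ins for the shapes a font would normally pick by itself; the names `SEEN`, `SEEN_SHAPES` and `baseLetter` are made up for this illustration. Text is stored with the base letter only, and the positional shape is chosen when the neighbouring letters are drawn together.

```typescript
// Base letter as it is stored in text: ARABIC LETTER SEEN.
const SEEN = "\u0633";

// Legacy presentation-form code points, one per positional shape:
// isolated, final, initial, medial.
const SEEN_SHAPES: string[] = ["\uFEB1", "\uFEB2", "\uFEB3", "\uFEB4"];

// NFKC normalization folds a positional form back to its base letter.
function baseLetter(form: string): string {
  return form.normalize("NFKC");
}

// All four shapes are the same letter as far as storage is concerned.
const allShapesFold = SEEN_SHAPES.every((shape) => baseLetter(shape) === SEEN);

// The word "سلام" is stored as four base letters; nothing in the string
// says "use the initial shape", that decision is made at render time.
const salam = "\u0633\u0644\u0627\u0645";
const startsWithBaseSeen = salam[0] === SEEN;

console.log(allShapesFold, startsWithBaseSeen, salam.length);
```

Because the shape depends on the neighbours, drawing each stored letter in isolation (one cell at a time) yields the "separated" output shown above.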

Now your questions:
Where should the strings be flipped? For Hebrew/Arabic/Persian, do I reverse entire contiguous sequences of characters in between ASCII characters?
I don't know about Hebrew, but in Arabic and Persian a sequence of characters should be flipped when a space character is encountered (the word separator is the space), like this: "من در حال نوشتن هستم". It must still keep the "shapes" and the necessary joining, though.

How are the characters meant to interact with characters like 0-9 or punctuation?
For numbers and punctuation the rules are the same as in English: the numbers and punctuation marks follow the characters, like this:
?من در سال "۱۳۶۹" به دنیا آمدم.
من در سال "1369" به دنیا آمدم.
Actually, a sequence of characters containing both RTL and non-RTL characters is a whole different story; if you need more information, I can elaborate on that.

P.S 1:
This link is source code written to solve the same problem in PHP (for old versions, for sure); you can take a look:
https://github.com/slashmili/php-gd-persian/blob/master/phpgd/fagd.php

P.S 2:
Here is a resource on Wikipedia about the Persian characters:
https://en.wikipedia.org/wiki/Persian_alphabet

P.S 3:
Again, I have to mention that in the previous version of VS Code, everything was fine.

P.S 4:
About the problem with selecting a word containing some LTR characters, like
<p>اینجا را بخوانید</p>, which @CherryDT mentioned: there are some minor bugs that I don't have a problem with, and I found quick workarounds for them. (But if you need some elaboration on those, let me know.)

All 17 comments

It is actually a whole lot more complicated and includes statefulness and even mirroring certain characters. I'd say it's a science of its own. (And I have the deepest respect for those people who wrote robust text rendering libraries that handle all the BiDi issues properly, so _I_ don't have to mess around with it, to be honest.)

See also:
https://en.wikipedia.org/wiki/Bi-directional_text (good overview)
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
https://www.w3.org/International/tutorials/svg-tiny-bidi/ (the initial premise is not related but it explains a few things better than the previous link)
https://github.com/fevangelou/doctype-mirror/tree/master/bidihowto/bidi-support-in-a-ui

EDIT: I think the way the new selection works may actually be unexpected because it is going to behave differently than VSCode itself. For example, given the text "The song מדינת קומבינה makes me think", when I start selecting at "The" and end between the two Hebrew words, I will have selected "The song מדינת", while in the console I will have selected "The song קומבינה".

See example:
Image

However it will still be better than how Sublime Text "works" last time I checked, because there you will see one thing selected but copy another, which is very annoying.
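
The selection difference described above can be sketched in code. This is only an illustration under loud assumptions: the reordering is a naive stand-in that reverses each run of Hebrew words (it is not the Unicode BiDi algorithm), and `visualToLogical` and `selectVisualColumns` are hypothetical helper names, not xterm.js or VS Code functions.

```typescript
const sample = "The song מדינת קומבינה makes me think";

// For every visual column, returns the logical index of the character shown
// there. Naive assumption: a run of Hebrew words (with the spaces between
// them) is simply displayed reversed; everything else stays in place.
function visualToLogical(text: string): number[] {
  const map: number[] = [];
  for (let i = 0; i < text.length; i++) map.push(i);
  const hebrewRun = /[\u0590-\u05FF]+(?:\s+[\u0590-\u05FF]+)*/g;
  let match: RegExpExecArray | null;
  while ((match = hebrewRun.exec(text)) !== null) {
    const start = match.index;
    const end = start + match[0].length;
    for (let k = 0; k < end - start; k++) map[start + k] = end - 1 - k;
  }
  return map;
}

// Grid-style selection: pick the characters shown in columns [from, to),
// then return them in logical (storage) order.
function selectVisualColumns(text: string, from: number, to: number): string {
  const picked = visualToLogical(text).slice(from, to).sort((a, b) => a - b);
  return picked.map((i) => text[i]).join("");
}

// Editor-style selection: a contiguous range of the stored string.
const logicalPick = sample.slice(0, 14);
// Terminal-style selection: the first 16 cells on screen.
const visualPick = selectVisualColumns(sample, 0, 16);

console.log(logicalPick);
console.log(visualPick);
```

Under these assumptions the logical selection ends with the first Hebrew word of the stored text, while the column-based selection ends with the second one, which is the mismatch the comment describes.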

After updating my VS Code, everything is reversed. That is very bad; please solve this problem.
I want to downgrade. Which version is okay?

@mostafa69d Luckily, in Hebrew that barely exists. Hebrew letters mostly stay the same in any position inside a word, apart from a few letters: כ which turns into ך, מ which turns into ם, נ which turns into ן, פ which turns into ף and finally צ which turns into ץ. This makes Hebrew easier to format, I guess.

However these are still separate characters (in terms of character encoding) and always display the same. They do not change appearance when moved around. (It's the writer's job to use the right letter - sofit or not - at the right position.)

The problem with splitting the characters is that when each one is wrapped in its own span, the connection to its neighbours is lost and the shape is misrepresented (Arabic letters).

To fix the problem, these characters must be within one span, or not be wrapped at all.

The Unicode blocks covering all of these letters are:
Arabic (0600–06FF, 255 characters)
Arabic Supplement (0750–077F, 48 characters)
Arabic Extended-A (08A0–08FF, 73 characters)
Arabic Presentation Forms-A (FB50–FDFF, 611 characters)
Arabic Presentation Forms-B (FE70–FEFF, 141 characters)
Rumi Numeral Symbols (10E60–10E7F, 31 characters)
Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF, 143 characters)
screen shot 2017-11-29 at 11 45 00 pm
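
The block list and the "one span" suggestion above can be combined in a short sketch. It assumes the ranges exactly as listed in this comment; `isArabic`, `splitRuns` and the `Run` type are hypothetical names for illustration, not part of xterm.js.

```typescript
// Inclusive code point ranges, copied from the block list above.
const ARABIC_RANGES: ReadonlyArray<[number, number]> = [
  [0x0600, 0x06ff], // Arabic
  [0x0750, 0x077f], // Arabic Supplement
  [0x08a0, 0x08ff], // Arabic Extended-A
  [0xfb50, 0xfdff], // Arabic Presentation Forms-A
  [0xfe70, 0xfeff], // Arabic Presentation Forms-B
  [0x10e60, 0x10e7f], // Rumi Numeral Symbols
  [0x1ee00, 0x1eeff], // Arabic Mathematical Alphabetic Symbols
];

function isArabic(codePoint: number): boolean {
  return ARABIC_RANGES.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}

interface Run {
  text: string;
  arabic: boolean;
}

// Groups consecutive characters of the same kind, so that each Arabic run
// can be emitted as one unit (for example a single span) instead of one
// element per character.
function splitRuns(text: string): Run[] {
  const runs: Run[] = [];
  // Array.from iterates by code point, so astral characters stay intact.
  for (const ch of Array.from(text)) {
    const arabic = isArabic(ch.codePointAt(0)!);
    const last = runs[runs.length - 1];
    if (last && last.arabic === arabic) {
      last.text += ch;
    } else {
      runs.push({ text: ch, arabic });
    }
  }
  return runs;
}

const demoRuns = splitRuns("ab \u0633\u0644\u0627\u0645 cd");
console.log(demoRuns.map((r) => r.arabic));
```

Note that in this sketch a space ends an Arabic run, so two Arabic words separated by a space become separate runs; whether spaces should be pulled into the run is a design choice the thread leaves open.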

required reading: https://opensource.com/life/16/3/twisted-road-right-left-language-support

from https://github.com/Microsoft/vscode/issues/28571#issuecomment-307991443

do you have an example of another terminal that handles this well?

mlterm seems to be better than the average (non-web based) terminal.
2018-11-15-023232_577x981_scrot
It is cursive, but cut off in some cases; I think that can be solved by changing the font. This paragraph was copied from Wikipedia. The blue characters are the RTL marks: that's how vim outputs them, and mlterm renders them in blue.

The character joiner API might be able to solve this: we could probably make all adjacent Arabic/Hebrew/etc. Unicode characters join and be drawn in the same glyph.
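
A minimal sketch of what such a joiner handler could look like. It assumes the handler shape used by xterm.js's `registerCharacterJoiner` (row text in, `[start, end)` index pairs out); check the API documentation for your version before relying on that. `rtlJoinRanges` is a hypothetical name, and only the Hebrew block and the BMP Arabic blocks are covered to keep the example short.

```typescript
// Returns [start, end) index pairs for every contiguous run of Hebrew or
// Arabic characters, so that each run can be drawn as one unit.
function rtlJoinRanges(text: string): [number, number][] {
  // Hebrew, Arabic, Arabic Supplement, Arabic Extended-A and the two
  // Arabic Presentation Forms blocks (all in the BMP).
  const rtlRun =
    /[\u0590-\u05FF\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]+/g;
  const ranges: [number, number][] = [];
  let match: RegExpExecArray | null;
  while ((match = rtlRun.exec(text)) !== null) {
    ranges.push([match.index, match.index + match[0].length]);
  }
  return ranges;
}

// Assumed wiring, not verified here:
//   term.registerCharacterJoiner(rtlJoinRanges);

const demoRanges = rtlJoinRanges("ls \u0633\u0644\u0627\u0645 ok");
console.log(demoRanges);
```

This only makes the renderer draw each run together so the browser can shape it; it does not change which cells the characters occupy or how the cursor and selection map onto them.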

For what it's worth, the debug console works well with RTL text. This is what I've tried:
[screenshot: code]
And this is the output in the debug console:
[screenshot: debug console output]
But the terminal is still the same:
[screenshot: terminal output]

I'm using VS Code - Insiders v1.31.0.

@babakks As far as I know, only two terminals on Linux can output RTL correctly, konsole and mlterm; both are available in all the distros' repos.

@elieobeid7 @babakks The macOS Terminal outputs RTL correctly.

I put out a PR to fix this. If anyone wants to test out the branch, that would be useful, as I don't speak these languages: https://github.com/xtermjs/xterm.js/pull/1899

To test:

git clone https://github.com/Tyriar/xterm.js
cd xterm.js
git checkout 701_rtl_support
yarn
yarn watch

# in another terminal
yarn start

You may need to install some dependencies first: https://github.com/Microsoft/node-pty#dependencies

Please hold off for a little bit :)

I've recently been studying and evaluating existing docs and implementations of RTL in terminals, and have come up with a (draft) recommendation. I'll release it real soon now.

It's way more complicated than one would first think. A bit of a spoiler: if you start shuffling the characters around according to the BiDi algorithm, it becomes literally, mathematically provably impossible to have a proper BiDi-aware text editing and viewing experience (e.g. vim, emacs...) on top of that platform. (And to respond to the previous few comments: no, konsole, mlterm and macOS Terminal don't get it right either.)

@egmontkob Does this take into account the fact that we get to leverage the browser's bidi support? All my change does is force related Unicode sequences to be drawn together, not as separate characters. This is probably wrong when the cursor is over the character, but it seems to work other than that.

@Tyriar Sorry Tyriar, but it's still wrong. I commented under the pull request.
https://github.com/xtermjs/xterm.js/pull/1899#issuecomment-455333377

The spec defines what the canvas needs to look like after receiving some data. The spec doesn't care what the backend of the terminal emulator is (e.g. a graphical canvas, a browser (HTML DOM), or another terminal emulator (tmux)); it's the terminal emulator's task to implement the specified behavior by whatever means.

And one aspect of the specified behavior is that in some circumstances the character cells need to be shuffled according to the BiDi algorithm (for display purposes only, not affecting the actual storage), because that's the only reasonable way to get simple utilities like "cat" to produce the desired output; and in some other circumstances the cells mustn't be rearranged, because that's the only way vim/emacs/whoever can do their own BiDi. There are escape sequences controlling this behavior. And there's much, much more to the story than this.
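
The "display only, not affecting storage" idea can be sketched as follows. Everything here is an assumption made for illustration: the mode names `"reorder"` and `"passthrough"`, the `Row` class and `naiveReorder` are hypothetical, the escape sequences that would switch modes are not modelled, and the reordering is a naive run reversal standing in for the real BiDi algorithm.

```typescript
// "reorder": the terminal shuffles cells for display (e.g. output of cat).
// "passthrough": cells are shown as stored (the application does its own BiDi).
type DisplayMode = "reorder" | "passthrough";

function isRtlCell(ch: string): boolean {
  return /[\u0590-\u05FF\u0600-\u06FF]/.test(ch);
}

// Naive stand-in for the BiDi algorithm: reverses each contiguous run of
// Hebrew/Arabic cells and leaves everything else where it is.
function naiveReorder(cells: readonly string[]): string[] {
  const out = cells.slice();
  let i = 0;
  while (i < out.length) {
    if (!isRtlCell(out[i])) {
      i++;
      continue;
    }
    let j = i;
    while (j < out.length && isRtlCell(out[j])) j++;
    const reversed = out.slice(i, j).reverse();
    for (let k = 0; k < reversed.length; k++) out[i + k] = reversed[k];
    i = j;
  }
  return out;
}

class Row {
  // Cells are always kept in the order the application sent them.
  private readonly cells: string[];

  constructor(text: string) {
    this.cells = Array.from(text);
  }

  stored(): string {
    return this.cells.join("");
  }

  // Computes the display order on the fly; storage is never modified.
  display(mode: DisplayMode): string {
    const shown = mode === "reorder" ? naiveReorder(this.cells) : this.cells;
    return shown.join("");
  }
}

const demoRow = new Row("cat \u05D0\u05D1\u05D2");
console.log(demoRow.display("reorder") !== demoRow.display("passthrough"));
```

Keeping storage untouched is what lets the same row be shown either way depending on the current mode.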

Please see the published draft BiDi specification at https://terminal-wg.pages.freedesktop.org/bidi/ . Comments, improvement ideas etc. are welcome over there in its issue tracker.
