Mathjax: Complex text layout, in particular with TeX input [was: MathJax does not support Complex text layout.]

Created on 19 May 2013  ·  23 Comments  ·  Source: mathjax/MathJax

Because MathJax looks at individual code points, it has trouble dealing with scripts that require bidirectionality, contextual shaping, etc. This is visible whenever you try to use Hebrew or Arabic, for instance.

It would be good if MathJax could identify these ranges and keep them as blocks instead of dividing them into individual characters, at the very least in \text mode.

http://en.wikipedia.org/wiki/Complex_text_layout

Accepted

All 23 comments

Note that if you set mtextFontInherit to true in the HTML-CSS and SVG sections of your configuration, then MathJax will process \text{} as a single <span>, and so that should do as you request. You are right that MathJax could do better when mtextFontInherit is false. It should group "unknown" characters into a single collection, rather than putting each into a separate <span>.

PS, I saw the report on the Wikimedia bugzilla and was planning to add it to the list of things to fix. Thanks for starting the issue here to track that.
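For readers who want to try the mtextFontInherit setting, here is a minimal sketch, assuming the MathJax v2 configuration API (MathJax.Hub.Config) that was current when this thread started. The variable name mathjaxConfig is only illustrative, and the guard exists so the snippet does nothing when MathJax is not loaded.

```javascript
// Sketch of a MathJax v2 configuration that enables mtextFontInherit
// for both the HTML-CSS and SVG output processors.
// "mathjaxConfig" is an illustrative name, not something MathJax requires.
const mathjaxConfig = {
  "HTML-CSS": { mtextFontInherit: true },
  SVG: { mtextFontInherit: true }
};

// Apply the configuration only if MathJax v2 is present on the page.
if (typeof MathJax !== "undefined" && MathJax.Hub) {
  MathJax.Hub.Config(mathjaxConfig);
}
```

With this in place, the contents of \text{} should be handed to the browser as one run of text, so the browser's own shaping applies.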

Thanks for the mtextFontInherit tip. I was going to enable that anyways, but this is one more reason to do that.

Some support for RTL was added in v2.3, but the issue of multiple-character sequences being treated as a unit remains. For \text{}, these characters should already be grouped into a single <span>, so that would be one way to handle it, though not very convenient.

Ideally, MathJax would put each sequence that forms one group into a single <mi> or <mo>, just as it does for single Latin letters now. I've looked into this to some degree, and there are some difficulties handling it. It is possible to have combining characters grouped with their preceding characters, but it is not clear to me how some characters work. For example, it seems that the virama (U+0D4D) combines not just with the character on its left, but also with the one on its right, though I might be misunderstanding it. It also seems that some of these groupings are handled by ligatures within the fonts, not by combining characters. Unfortunately, MathJax does not have access to ligature information from the fonts. While it would be possible to add ligature data to MathJax's font tables, this could be a significant amount of data, very little of which would be used by any one page.

I'm really not familiar enough with the languages that use these features to know if what I'm trying out would be sufficient or not. I'm wondering if it is possible to get some examples from a variety of languages that show the range of situations that need to be accommodated.

One approach might be to put the data needed for each language's script into an individual extension that gets loaded for those pages that need it (either explicitly in the MathJax configuration, or via \require{} within the math on the page). Do you think that would be acceptable?

Perhaps @amire80 of our WMF language engineering is able to help out a bit here...

@hartman do you think you could poke @amire80 some time? We'd love to improve this, especially if Wikipedia wants to roll out the SVG output more widely.

I'm right here :)

How can I help?

Testing? - Gladly, just tell me what to test exactly.

Examples of how non-Latin scripts work in formulas? - It's not used in Hebrew textbooks, but it is used in textbooks in Arabic and Persian. Maybe @ebraminio can chime in here.

Anything else?

Thanks for stopping by @amire80 :-)

> How can I help?

I'm hoping we can improve handling of combined characters in non-Latin scripts. This has come up on WMF bugzilla/phabricator repeatedly. To quote Davide from https://github.com/mathjax/MathJax/issues/474#issuecomment-38324717 :

> Ideally, MathJax would put each sequence that forms one group into a single <mi> or <mo>, just as it does for single Latin letters now. I've looked into this to some degree, and there are some difficulties handling it. It is possible to have combining characters grouped with their preceding characters, but it is not clear to me how some characters work. For example, it seems that the virama (U+0D4D) combines not just the character on its left, but also on the right, though I might be misunderstanding it. It also seems that some of these grouping are handled by ligatures within the fonts, not by combining characters. Unfortunately, MathJax does not have access to ligature information from the fonts. While it would be possible to add ligature data to MathJax's font tables, this could be a significant amount of data very little of which would be used by any one page.

> I'm really not familiar enough with the languages that use these features to know if what I'm trying out would be sufficient or not. I'm wondering if it is possible to get some examples from a variety of languages that show the range of situations that need to be accommodated.

So our question would be: does anyone have expertise they can share with us? @hartman was kind enough to point to you ;-)

(Perhaps we should split this out into a separate issue.)

The (very) basic idea of virama is that the sequence of consonant + virama + consonant has three Unicode characters, which appear as occupying the space of one glyph (but it can get far more complicated).

More generally, I'd love to understand MathJax's current situation. What should I do to test the current rendering? Install my own instance? Or is there an online instance where a current version can be tested?

> consonant + virama + consonant has three Unicode characters, which appear as occupying the space of one glyph

Right. Combined characters are common enough in mathematical layout so we understand the situation in general.

> (but it can get far more complicated).

That's our problem. We lack the specifics for most natural language, non-Latin scripts.

> Or is there an online instance where a current version can be tested?

You can do this on MediaWiki (using the MathML/SVG mode of the math extension), in the browser (this sample or this codepen) or use a local copy of MathJax -- whichever you like.

A basic example: ത്ര will be converted to &#xD24;&#xD4D;&#xD30; and since we don't have any routines to identify these kinds of combined characters, the TeX input converts this internally to MathML as

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow class="MJX-TeXAtom-ORD">
    <mo>&#xD24;</mo>
  </mrow>
  <mrow class="MJX-TeXAtom-ORD">
    <mo>&#xD4D;</mo>
  </mrow>
  <mrow class="MJX-TeXAtom-ORD">
    <mo>&#xD30;</mo>
  </mrow>
</math>

The MathJax output will in turn split this across three <span> elements (in the HTML outputs) or three <g> elements (in the SVG output), and of course this breaks the rendering of the combined character.

(I just noticed that Firefox sometimes combines the spans in the HTML outputs e.g., ത്ര but not the subscript in കു_ശ. Chrome is more "consistent" in that nothing is combined)

So for us the problem is: is there a concise set of data (or some efficient heuristic) that we could use to identify all relevant situations where we need to re-combine into one mi/mo element in the MathML? Once we have that, the rendering will work as well.

> So for us the problem is: is there a concise set of data (or some efficient heuristic) that we could use to identify all relevant situations where we need to re-combine into one mi/mo element in the MathML?

Sorry for the long comment, bringing a bit of off site discussion back to the issue tracker.

How feasible/expensive would it be to make the Unicode UCD database combining class available to MathJax for each character? Basically (or at least as a good first approximation) any character with non-zero combining class (field 4 in UnicodeData.txt) needs to stay with the preceding one, and in addition, if it's class 9 (virama), the following character needs to be kept together as well.
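The rule above can be sketched in a few lines. This is only an illustration, not MathJax code: groupClusters, VIRAMAS and MARK are hypothetical names. JavaScript does not expose the canonical combining class, so the sketch substitutes the general category Mark (\p{M}) for "non-zero combining class" and lists a few class 9 viramas by hand. That substitution is somewhat broader than the rule as stated, since it also attaches marks whose combining class is 0, such as many Indic vowel signs.

```javascript
// Hypothetical helper, not part of MathJax.
// Approximation: general category Mark stands in for "non-zero combining class".
const MARK = /^\p{M}$/u;

// A hand-picked sample of combining class 9 characters (viramas):
// Devanagari, Bengali, Malayalam. A real table would come from UnicodeData.txt.
const VIRAMAS = new Set([0x094D, 0x09CD, 0x0D4D]);

function groupClusters(text) {
  const clusters = [];
  let joinNext = false; // true when the previous code point was a virama
  for (const ch of text) { // for...of iterates by code point, not UTF-16 unit
    const attach = clusters.length > 0 && (joinNext || MARK.test(ch));
    if (attach) {
      clusters[clusters.length - 1] += ch; // stay with the preceding character
    } else {
      clusters.push(ch); // start a new group
    }
    joinNext = VIRAMAS.has(ch.codePointAt(0));
  }
  return clusters;
}

// U+0D24 U+0D4D U+0D30 stays together as one group.
console.log(groupClusters("\u0D24\u0D4D\u0D30").length);
```

Each resulting group could then be emitted as a single mi or mo element instead of one element per code point.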

It's probably also worth noting that TeX, even a Unicode TeX like xetex or luatex, is almost certainly _not_ going to get this right without markup. That is, you will need \text{abc} or \mathit{abc} or some other such command to force a string of characters to be typeset as text with a single font, rather than TeX's normal habit of splitting things up character by character, even if the construct _looks_ like a single character to the author.

In classic TeX it is not an issue, as fonts can only have 256 characters, and while composed characters can be supported with various macro remapping tricks, composing characters following the base are basically not supportable, even for simple composing accents like acute.

Support in Unicode TeX variants such as xetex and luatex seems a bit variable. In text, xetex hands things over to the HarfBuzz library, so it does pretty well. luatex handles it internally and currently does less well with the virama. In math, both require a font with an OpenType MATH table to do anything very useful, and I couldn't find such a font that had a virama.

The following LaTeX document uses Kartika in text and Latin Modern Math in math. You will note that even European accents typically fail in math, but even the virama example works if you add some markup: \mbox here, or equivalently mi or mtext in MathML.

The image shows xetex at the top and luatex at the bottom.

So while not requiring something like \text{..} or \mbox{...} around such character strings would be desirable, it would put your Unicode support a long way ahead of what TeX can currently achieve. It therefore depends a bit on what the specification of the "tex-like syntax" is: how far beyond what TeX can do is it reasonable to push it?

\documentclass{article}

\usepackage{fontspec}
\usepackage{unicode-math}
\setmainfont{kartika.ttf}


\begin{document}

U+0d24 U+0d4d U+0d30 outputs e.g., ത്ര but 

abc $abc \mbox{ത്ര} $  U+0063

abç $abç \mbox{ത്ര} $ U+00e7

abç $abç \mbox{ത്ര} $  U+0063 U+0327

\end{document}

[Image: virama]

I'm not really sure if I understand what the discussion is about, but if the idea is to identify which sequences of characters constitute a single unit, then Unicode grapheme clustering should provide the needed information.

Yes - what @khaledhosny says sounds like the right thing to me, although I'm not very experienced with it. Maybe @santhoshtr can contribute more details.

Santhosh, I think that what @pkra wrote three comments above explains the problem best.

On 3 March 2015 at 12:05, Khaled Hosny wrote:

> I'm not really sure if I understand what the discussion is about, but if the idea is to identify what sequence of characters constitute a single unit, then Unicode Grapheme clustering (http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) should provide the needed information..

Yes, but I suppose the question is how far it makes sense for a JavaScript library to do that by hand if the underlying platform doesn't make the Unicode properties available. And if it's emulating TeX syntax, how far would TeX go? You know as much about the TeX support as anyone. How far would it be reasonable in xetex to have such a cluster do anything sensible in _math_ without escaping to text with \text{..} or some such command, given that you can't assign a \mathclass to such a cluster?

I found a CoffeeScript implementation for graphemes.
https://github.com/devongovett/grapheme-breaker

Might be useful.

Thanks for all the useful comments. To summarize,

  • xetex/luatex do not handle input the way requested in this issue, i.e., without extra markup such as \text
  • it's not clear (to me at least) if there are plans to handle it this way
  • a solution could start with the simple approach David C outlined or potentially build on grapheme-breaker (thanks @hartman!)

To add to that,

  • On the other hand, a quick test with LaTeXML and pandoc indicates that they do handle such characters as requested here, i.e., not like xetex/luatex.

So it seems to me that a solution can't be in the core TeX input but needs to be an extension. That's not a problem, of course, since it probably would have ended up an extension anyway.

It would be good to hear from the MediaWiki/WMF communities whether they actually want to deviate from the TeX engines here.

Again it would be good to get more feedback.

  • At TeX folks: is handling characters in math mode without extra markup the future direction of xetex/luatex/etc.?
  • At MediaWiki / WMF folks: is non-standard TeX behavior actually desired by the relevant communities?

Without more feedback, I think we should punt on this / move it out of the 2.6 milestone.

Let me understand the issue here: people want to do things like $x+y=<complex character>$ where <complex character> is possibly a multi-code point grapheme, and have <complex character> treated as a math identifier, right? If so, then I think that is a reasonable expectation and if current Unicode TeX engines do not handle it correctly (they probably don’t) it is likely a bug or a missing feature, not something by design.

Or is it that people want to do things like $<complex text string>$, where <complex text string> is a multi-character text string that possibly needs complex text layout, and get proper text layout (bidi, shaping etc.)? I don't think that is a reasonable expectation and some kind of markup is needed here to indicate that this is a regular text string that needs to be treated as such.

Thanks, @khaledhosny!

> [...] people want to do things like $x+y=<complex character>$ where <complex character> is possibly a multi-code point grapheme, and have <complex character> treated as a math identifier, right?

Yes, that's how I understand it as well. (It's a bit difficult to say since this is originally a request from the Wikipedia end).

> I think that is a reasonable expectation

Thanks!

> if current Unicode TeX engines do not handle it correctly (they probably don’t) it is likely a bug or a missing feature, not something by design.

Thanks for that, too. The "they probably don’t" part worries me slightly but if you and @davidcarlisle agree that it's the desired behavior in Unicode TeX engines, then that's enough for us, I think.


Still hoping the MediaWiki/WMF/Wikipedia side will chime in.

As per F2F, we're removing this from the v2.6 Milestone (i.e., the upcoming release).

It's not clear what the right approach is, in particular, in terms of compatibility with TeX/LaTeX (or rather XeTeX/LuaTeX). It's also not clear what the WMF and the Wikipedia community really want here.

To be clear, we're not closing this issue and we are still interested in figuring out how complex layout might work in the TeX input.

Blast from the future: there's a TC39 proposal, "Unicode segmentation", to allow (among other things) splitting strings by grapheme: https://github.com/tc39/proposal-intl-segmenter. The repository includes a link to a polyfill (and there's also a non-standard Chrome feature, apparently).

Cool. Thanks, @pkra.

No problem. The polyfill is unfortunately useless -- it only covers English. But for those who want to try it out, the Chrome built-in might be useful.
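In runtimes that ship Intl.Segmenter, the segmentation API that grew out of that proposal, grapheme splitting can be tried without a polyfill. The following is a minimal sketch; the function name graphemes is only illustrative. Whether consonant + virama + consonant stays in one cluster depends on the Unicode version the runtime implements, so no result is claimed for that case.

```javascript
// Split a string into extended grapheme clusters using the standard
// Intl.Segmenter API. No locale is passed, so the runtime's default
// locale is used.
function graphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } records.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// "c" followed by a combining cedilla (U+0327) is a single cluster,
// so this string has three clusters: "a", "b", "c" + cedilla.
console.log(graphemes("abc\u0327").length);

// The Malayalam example from this thread; inspect the result in your runtime.
console.log(graphemes("\u0D24\u0D4D\u0D30"));
```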
