Latex3: Case changing for Cyrillic

Created on 17 Feb 2020  ·  31 Comments  ·  Source: latex3/latex3

As noted in https://github.com/latex3/latex3/issues/671, at present

\documentclass{article}
\usepackage[T1,T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{expl3}

\ExplSyntaxOn
\def\test{\text_lowercase:n}
\ExplSyntaxOff

\begin{document}
\test{\.I İ \CYRI И}
\end{document}

gives at best an 'odd' result.

It should be possible to carry out case-changing here as it is not dependent on \lccode changes but rather on expanding И to

\u8:И ->\IeC {\CYRI }

and then doing the work.
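A minimal sketch of that idea, assuming a hypothetical mapping table (the names \g_my_lowercase_prop and \my_lowercase_licr:N are invented for illustration, not expl3 internals): once И has been expanded to its LICR \CYRI, the case change reduces to a lookup on the control-sequence name.

```latex
\ExplSyntaxOn
% Hypothetical uppercase-LICR -> lowercase-LICR name table.
\prop_new:N \g_my_lowercase_prop
\prop_gput:Nnn \g_my_lowercase_prop { CYRI } { cyri }
\cs_generate_variant:Nn \prop_get:NnNTF { Ne }
% Look up the lowercase partner of a LICR command; leave the
% command unchanged if no mapping is known.
\cs_new_protected:Npn \my_lowercase_licr:N #1
  {
    \prop_get:NeNTF \g_my_lowercase_prop { \cs_to_str:N #1 } \l_tmpa_tl
      { \use:c { \l_tmpa_tl } } % e.g. \CYRI -> \cyri
      { #1 }
  }
\ExplSyntaxOff
```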

expl3 feature-request


All 31 comments

\u8:И ->\IeC {\CYRI }

Wouldn't it make more sense to extract И from \u8:И, and look up case
information in some intarray?

@blefloch
Yes!

What are these \u8:... commands anyway? Are they needed?

@blefloch
Yes!

or maybe not, Chris. One may have to deal with ^^-notation in that place instead of И, but on the whole I agree that looks like the better starting point

What are these \u8:... commands anyway? Are they needed?

you should know :-) your name is on the file that contains that code. Yes, they are needed: in pdfTeX, LaTeX sees bytes, analyzes them, and constructs a single csname from them, \u8:..., which holds the LICR for that UTF-8 char (in the above case \IeC {\CYRI }), or, if the \u8:... is not defined, responds with "no Unicode representation for ...".

you should know :-) your name is on the file that contains that code.
But not everything I may be responsible for is needed :-).

I agree I should look at the original code! At least to find out where the : came from.

But I should stop now in case I anger a certain person by displaying my opinions in such a public place :-).

@blefloch There are a couple of things needed. The first is to spot a UTF-8 pair/triplet/quartet and grab it whole rather than token-by-token. That's easy enough: check for active char tokens equal to the inputenc starting point. The second phase is to know how to case change them. The reason I mentioned taking the \IeC{...} approach is then we don't need _new_ data: it's the same way that \MakeUppercase handles them and so uses the \@uclclist data we're already collecting.
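For reference, \@uclclist works by string matching on LICR pairs; encoding files extend it with the classic 2e prepend-yourself idiom. The \cyri/\CYRI pair below is illustrative, simplified from what the Cyrillic encoding files actually set up:

```latex
% Append a lowercase/uppercase LICR pair to \@uclclist so that
% \MakeUppercase and \MakeLowercase can translate between them.
\expandafter\def\expandafter\@uclclist\expandafter{%
  \@uclclist
  \cyri\CYRI
}
```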

The reason I mentioned taking the \IeC{...} approach is then we don't need new data:
Well, you may need a bit more if you want to cover absolutely every character that changes case (They may not all yet have LICRs.)

Using numbers and Unicode tables is aesthetically more appealing, of course. But if ‘tables of names’ works for now . . .

For Cyrillic, Greek, Armenian, etc. etc., is it possible to use new LICRs of the form \cyr{...}, a bit like accents?

@car222222 The issue came up as there are places where the current \MakeUppercase will work that \text_uppercase:n won't, which come down to things that go via \u8:.... That's why I was starting with this. If we want the full Unicode range in pdfTeX (doable), we'll need to store the data manually in an integer array.

If we want the full Unicode range in pdfTeX (doable), we'll need to store the data manually in an integer array.

Given that pdfTeX deliberately only provides UTF-8 chars if supported by the loaded font encodings, it is questionable to first case-change and then find that the result is an unsupported character. Of course, if the whole data is inside the format then there is no extra payload (other than the size taken up by it) and the initial preparation.

it is questionable to first case change and then find that the result is an unsupported character.

I don't find this very problematic. Lowercase and uppercase are in the same encoding, so you would only get an error on a capital alpha if you start with the unsupported lowercase alpha.

On 2/18/20 3:49 PM, Ulrike Fischer wrote:

it is questionable to first case change and then find that the
result is an unsupported character.

I don't find this very problematic. Lowercase and uppercase are in the
same encoding, so you only would get an error on a capital alpha if you
start with the unsupported lowercase alpha.

Even if there exists an encoding with lowercase alpha but not uppercase alpha (this might plausibly be the case for some of the rarer accents), getting an error of 'Unicode char not set up' seems better than accidentally getting the lowercase char.

I agree with Ulrike and Bruno. But I am failing to imagine a realistic case (pun intended) where the upper and lower case characters are not both available/unavailable simultaneously.

Given that pdfTeX deliberately only provides utf8 chars if supported by the loaded font encodings

Meaning what? pdfTeX does not ‘provide chars’ at all, does it? And ‘loaded font encodings’ is a LaTeX concept, not an engine one.

Maybe it means that, in the way we originally set up the utf8 stuff for LaTeX, LICRs (and mappings) were provided only 'for known encodings' and then only loaded for loaded encodings.

True, but there is no need to keep such restrictions these days, is there?
We can certainly now easily provide them for any subset of Unicode we wish to, and in this context we only need to cover all ‘casable characters’.

Disclaimer: I was never very keen on that restriction to known encodings :-).

    Given that pdfTeX deliberately only provides utf8 chars if
    supported by the loaded font encodings

Meaning what? pdfTeX does not ‘provide chars’ at all, does it? And
‘loaded font encodings’ is a LaTeX concept, not an engine one.

meaning pdfLaTeX, even though I wrote pdfTeX

Maybe it means that, in the way we originally set up the utf8 stuff for
LaTeX, LICRs (and mappings) were provided only 'for known encodings'
and then only loaded for loaded encodings.

yes, which was a Good Thing™ because that kept the LaTeX world free of
tofu and missing characters

True, but there is no need to keep such restrictions these days, is there?
We can certainly now easily provide them for any subset of Unicode we
wish to, and in this context we only need to cover all ‘casable characters’.

yes, there is. If you don't have the glyphs to typeset the characters it
is pointless to do so, which is why claiming that you can do Unicode as
XeTeX or LuaTeX (LaTeX) does, and then just generating holes and 'No
char XXX' warnings in the log, is a step backwards compared to the
pdfLaTeX solution, imho

Disclaimer: I was never very keen on that restriction to known encodings:-).

well, as long as you write English it usually doesn't matter; if you
write in other languages and your document gets corrupted without
warning you, it does

There may well be reasons for not loading LICRs for unrepresentable characters.

But here we are talking only about defining these LICRs and uppercasing characters, note ‘characters‘.
Nothing to do with typesetting them, so the encodings/fonts that are available are not relevant.
Use case: the uppercased form is only for use in a PDF bookmark, never to be typeset (by TeX, at least!)

After looking at the problem a little more, it seemed easier to handle it using a fixed list of mappings rather than trying to do things by looking inside active chars. I had a quick look at how many codepoints there are with case-changing data: about 2000. That's possibly a bit much to do all of them, so for the present I've picked up Greek and Cyrillic ones that are covered by T2/LGR. Thoughts welcome.
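A fixed list of mappings could look something like the sketch below (the \my_... names are hypothetical, not the actual l3text internals): one constant token list per direction, keyed by the LICR command name.

```latex
\ExplSyntaxOn
% Record a case pair: #1 = uppercase LICR, #2 = lowercase LICR.
\cs_new_protected:Npn \my_case_pair:NN #1#2
  {
    \tl_const:cn { c_my_lowercase_ \cs_to_str:N #1 _tl } {#2}
    \tl_const:cn { c_my_uppercase_ \cs_to_str:N #2 _tl } {#1}
  }
\my_case_pair:NN \CYRI \cyri % И / и
\my_case_pair:NN \CYRA \cyra % А / а
% ... one line per covered Cyrillic/Greek codepoint ...
\ExplSyntaxOff
```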

what about the idea to store all of them in an intarray?

The thing with using an intarray is we can't make it sparse, so the size would depend on the codepoint of the final value to be stored. There's also a bit of a performance hit at point-of-use, as we'd have to extract, convert to bytes, and construct the active chars then, rather than doing it once at load time.
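To make that trade-off concrete, here is roughly what the intarray route would involve (all names invented for illustration; the two-byte arithmetic only covers U+0080 to U+07FF, which includes Cyrillic): a dense array indexed by codepoint, plus byte reconstruction at point-of-use.

```latex
\ExplSyntaxOn
% Dense array up to U+04FF: unset entries are 0, so it cannot be sparse.
\intarray_new:Nn \g_my_case_intarray { "04FF }
\intarray_gset:Nnn \g_my_case_intarray { "0418 } { "0438 } % И -> и
% Rebuild the two UTF-8 bytes of a looked-up codepoint as active chars.
\cs_new:Npn \my_bytes:n #1
  {
    \char_generate:nn { "C0 + \int_div_truncate:nn {#1} { 64 } } { 13 }
    \char_generate:nn { "80 + \int_mod:nn {#1} { 64 } } { 13 }
  }
\ExplSyntaxOff
```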

Also, back with the 'what codepoints have glyphs' business: as far as I know, the Greek and Cyrillic ones, plus the Latin ones already covered, are by far the most useful

Well, to the Greeks and Cyrills they are the most useful, yes! But not to the rest of the world?
That is: how did you measure this utility?

I guess the total gets so large due to the many Latin derivatives around, or not?
2000 is approx. 30+ typical alphabets, I guess.

'Utility' here was just starting with 'what works currently in pdfTeX', so 'what encodings are available'. I'm not sure what exactly all the mappings cover: it's possible there are false positives. Presumably there are, for a start, all of the math variants (italic, sans-serif, ...).

A lot of it is accented Latin/Cyrillic/Greek, then there is Coptic, Armenian, Old Hungarian, Cherokee, etc. Certainly not 30 alphabets, but probably at least 10.

Full list of scripts:

  • Latin (>700 codepoints!) incl. full-width versions
  • Greek
  • Coptic
  • Cyrillic
  • Armenian
  • Georgian
  • Cherokee
  • Glagolitic
  • Deseret
  • Osage
  • Old Hungarian
  • Warang
  • Medefaidrin
  • Adlam

!! Latin (>700 codepoints!) incl. full-width versions
Ah yes, not to mention 'circled superscript' versions,
and I am sure there must be lowercase emojis in Unicode by now :-).

@car222222 Luckily no circled letters ;) It's mainly lots and lots of combining accent versions.

@josephwright but you really should implement \text_lowercase:n{\emoji{Man}} = \emoji{Boy} ;-)

Thoughts on further coverage? Or do we go with what I've set up for the present?

The handling of \.I İ in the MWE above is different in pdfLaTeX (also compared to the Unicode engines), but I admit that İ is probably a tricky case in the generic case change code.

So I tried the Turkish case changer

\documentclass{article}
\usepackage{fontspec}
\usepackage{libertinus}
\usepackage{expl3}

\ExplSyntaxOn
\def\test{\text_lowercase:nn{tr}}
\ExplSyntaxOff

\begin{document}
\test{\.I İ \CYRI И}
\end{document}

(L3 programming layer <2020-02-25>) and LuaLaTeX and XeLaTeX are not happy:

! Undefined control sequence.
<inserted text> ı

@moewew Hmm, that's a bit odd: I'll get it sorted

@moewew Specific issue with Turkish: now fixed

Thoughts on further coverage? Or do we go with what I've set up for the present?

I would start with the present set and extend when the need arises

OK, I think that's the best position, and also means we can keep issues moving. I'll close here and specific additions can be addressed in new issues.
