Latex3: Case changing for Cyrillic

Created on 17 Feb 2020  ·  31 Comments  ·  Source: latex3/latex3

As noted in https://github.com/latex3/latex3/issues/671, at present

\documentclass{article}
\usepackage[T1,T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{expl3}

\ExplSyntaxOn
\def\test{\text_lowercase:n}
\ExplSyntaxOff

\begin{document}
\test{\.I İ \CYRI И}
\end{document}

gives at best an 'odd' result.

It should be possible to carry out case-changing here as it is not dependent on \lccode changes but rather on expanding И to

\u8:И ->\IeC {\CYRI }

and then doing the work.
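A minimal sketch of that idea, assuming a hypothetical mapping table (the names \g_my_lowercase_prop and \my_lowercase_licr:N are invented for illustration, not expl3 internals): once И has been expanded to its LICR \CYRI, the case change reduces to a lookup on the control-sequence name.

```latex
\ExplSyntaxOn
% Hypothetical uppercase-LICR -> lowercase-LICR name table.
\prop_new:N \g_my_lowercase_prop
\prop_gput:Nnn \g_my_lowercase_prop { CYRI } { cyri }
\cs_generate_variant:Nn \prop_get:NnNTF { Ne }
% Look up the lowercase partner of a LICR command; leave the
% command unchanged if no mapping is known.
\cs_new_protected:Npn \my_lowercase_licr:N #1
  {
    \prop_get:NeNTF \g_my_lowercase_prop { \cs_to_str:N #1 } \l_tmpa_tl
      { \use:c { \l_tmpa_tl } } % e.g. \CYRI -> \cyri
      { #1 }
  }
\ExplSyntaxOff
```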

expl3 feature-request


All 31 comments

\u8:И ->\IeC {\CYRI }

Wouldn't it make more sense to extract И from \u8:И, and look up case
information in some intarray?

@blefloch
Yes!

What are these \u8:... commands anyway? Are they needed?

@blefloch
Yes!

or maybe not, Chris. One may have to deal with ^^-notation in that place instead of И, but on the whole I agree that looks like the better starting point

What are these \u8:... commands anyway? Are they needed?

you should know :-) your name is on the file that contains that code. Yes, they are needed: in pdfTeX, LaTeX sees bytes, analyzes them, and constructs a single csname from them, \u8:..., which holds the LICR for that UTF-8 char (in the above case \IeC {\CYRI }), or, if the \u8:... is not defined, responds with "no Unicode representation for ...".

you should know :-) your name is on the file that contains that code.
But not everything I may be responsible for is needed :-).

I agree I should look at the original code! At least to find out where the : came from.

But I should stop now in case I anger a certain person by displaying my opinions in such a public place :-).

@blefloch There are a couple of things needed. The first is to spot a UTF-8 pair/triplet/quartet and grab it whole rather than token-by-token. That's easy enough: check for active char tokens equal to the inputenc starting point. The second phase is to know how to case change them. The reason I mentioned taking the \IeC{...} approach is then we don't need _new_ data: it's the same way that \MakeUppercase handles them and so uses the \@uclclist data we're already collecting.
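For reference, \@uclclist works by string matching on LICR pairs; encoding files extend it with the classic 2e prepend-yourself idiom. The \cyri/\CYRI pair below is illustrative, simplified from what the Cyrillic encoding files actually set up:

```latex
% Append a lowercase/uppercase LICR pair to \@uclclist so that
% \MakeUppercase and \MakeLowercase can translate between them.
\expandafter\def\expandafter\@uclclist\expandafter{%
  \@uclclist
  \cyri\CYRI
}
```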

The reason I mentioned taking the \IeC{...} approach is then we don't need new data:
Well, you may need a bit more if you want to cover absolutely every character that changes case (They may not all yet have LICRs.)

Using numbers and Unicode tables is aesthetically more appealing, of course. But if ‘tables of names’ works for now . . .

For Cyrillic, Greek, Armenian, etc. etc., is it possible to use new LICRs of the form \cyr{...}, a bit like accents?

@car222222 The issue came up as there are places where the current \MakeUppercase will work that \text_uppercase:n won't, which come down to things that go via \u8:.... That's why I was starting with this. If we want the full Unicode range in pdfTeX (doable), we'll need to store the data manually in an integer array.

If we want the full Unicode range in pdfTeX (doable), we'll need to store the data manually in an integer array.

Given that pdfTeX deliberately only provides UTF-8 chars if supported by the loaded font encodings, it is questionable to first case-change and then find that the result is an unsupported character. Of course, if the whole data is inside the format then there is no extra payload (other than the size taken up by it) and the initial preparation.

it is questionable to first case change and then find that the result is an unsupported character.

I don't find this very problematic. Lowercase and uppercase are in the same encoding, so you would only get an error on a capital alpha if you start with the unsupported lowercase alpha.

On 2/18/20 3:49 PM, Ulrike Fischer wrote:

it is questionable to first case change and then find that the
result is an unsupported character.

I don't find this very problematic. Lowercase and uppercase are in the
same encoding, so you only would get an error on a capital alpha if you
start with the unsupported lowercase alpha.

Even if there exists an encoding with lowercase alpha but not uppercase alpha (this might plausibly be the case for some of the rarer accents), getting an error of 'Unicode char not set up' seems better than accidentally getting the lowercase char.

I agree with Ulrike and Bruno. But I am failing to imagine a realistic case (pun intended) where the upper and lower case characters are not both available/unavailable simultaneously.

Given that pdfTeX deliberately only provides utf8 chars if supported by the loaded font encodings

Meaning what? pdfTeX does not ‘provide chars’ at all, does it? And ‘loaded font encodings’ is a LaTeX concept, not an engine one.

Maybe it means that, in the way we originally set up the utf8 stuff for LaTeX, LICRs (and mappings) were provided only 'for known encodings' and then only loaded for loaded encodings.

True, but there is no need to keep such restrictions these days, is there?
We can certainly now easily provide them for any subset of Unicode we wish to, and in this context we only need to cover all ‘casable characters’.

Disclaimer: I was never very keen on that restriction to known encodings :-).

    Given that pdfTeX deliberately only provides utf8 chars if
    supported by the loaded font encodings

Meaning what? pdfTeX does not ‘provide chars’ at all, does it? And
‘loaded font encodings’ is a LaTeX concept, not an engine one.

meaning pdfLaTeX, even though I wrote pdfTeX

Maybe it means that, in the way we originally set up the utf8 stuff for
LaTeX, LICRs (and mappings) were provided only 'for known encodings'
and then only loaded for loaded encodings.

yes, which was a Good Thing™ because that kept the LaTeX world free of
tofu and missing characters

True, but there is no need to keep such restrictions these days, is there?
We can certainly now easily provide them for any subset of Unicode we
wish to, and in this context we only need to cover all ‘casable characters’.

yes, there is. If you don't have the glyphs to typeset the characters it
is pointless to do so, which is why claiming that you can do Unicode as
XeTeX or LuaTeX (LaTeX) does, and then just generating holes and 'No
char XXX' warnings in the log, is a step backwards compared to the
pdfLaTeX solution, imho

Disclaimer: I was never very keen on that restriction to known encodings:-).

well, as long as you write English it usually doesn't matter; if you
write in other languages and your document gets corrupted without
warning you, it does

There may well be reasons for not loading LICRs for unrepresentable characters.

But here we are talking only about defining these LICRs and uppercasing characters, note ‘characters‘.
Nothing to do with typesetting them, so the encodings/fonts that are available are not relevant.
Use case: the uppercased form is only for use in a PDF bookmark, never to be typeset (by TeX, at least!)

After looking at the problem a little more, it seemed easier to handle it using a fixed list of mappings rather than trying to do things by looking inside active chars. I had a quick look at how many codepoints there are with case-changing data: about 2000. That's possibly a bit much to do all of them, so for the present I've picked up Greek and Cyrillic ones that are covered by T2/LGR. Thoughts welcome.
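A fixed list of mappings could look something like the sketch below (the \my_... names are hypothetical, not the actual l3text internals): one constant token list per direction, keyed by the LICR command name.

```latex
\ExplSyntaxOn
% Record a case pair: #1 = uppercase LICR, #2 = lowercase LICR.
\cs_new_protected:Npn \my_case_pair:NN #1#2
  {
    \tl_const:cn { c_my_lowercase_ \cs_to_str:N #1 _tl } {#2}
    \tl_const:cn { c_my_uppercase_ \cs_to_str:N #2 _tl } {#1}
  }
\my_case_pair:NN \CYRI \cyri % И / и
\my_case_pair:NN \CYRA \cyra % А / а
% ... one line per covered Cyrillic/Greek codepoint ...
\ExplSyntaxOff
```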

what about the idea to store all of them in an intarray?

The thing with using an intarray is we can't make it sparse, so the size would depend on the codepoint of the final value to be stored. There's also a bit of a performance hit at point-of-use, as we'd have to extract, convert to bytes, and construct the active chars then, rather than doing it once at load time.
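To make that trade-off concrete, here is roughly what the intarray route would involve (all names invented for illustration; the two-byte arithmetic only covers U+0080 to U+07FF, which includes Cyrillic): a dense array indexed by codepoint, plus byte reconstruction at point-of-use.

```latex
\ExplSyntaxOn
% Dense array up to U+04FF: unset entries are 0, so it cannot be sparse.
\intarray_new:Nn \g_my_case_intarray { "04FF }
\intarray_gset:Nnn \g_my_case_intarray { "0418 } { "0438 } % И -> и
% Rebuild the two UTF-8 bytes of a looked-up codepoint as active chars.
\cs_new:Npn \my_bytes:n #1
  {
    \char_generate:nn { "C0 + \int_div_truncate:nn {#1} { 64 } } { 13 }
    \char_generate:nn { "80 + \int_mod:nn {#1} { 64 } } { 13 }
  }
\ExplSyntaxOff
```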

Also, back with the 'what codepoints have glyphs' business: as far as I know, the Greek and Cyrillic ones, plus the Latin ones already covered, are by far the most useful

Well, to the Greeks and Cyrills they are the most useful, yes! But not to the rest of the world?
That is: how did you measure this utility?

I guess the total gets so large due to the many Latin derivatives around, or not?
2000 is approx. 30+ typical alphabets, I guess.

'Utility' here was just starting with 'what works currently in pdfTeX', so 'what encodings are available'. I'm not sure what exactly all the mappings cover: it's possible there are false positives. Presumably there are, for a start, all of the math variants (italic, sans-serif, ...).

A lot of it is accented Latin/Cyrillic/Greek, then there is Coptic, Armenian, Old Hungarian, Cherokee, etc. Certainly not 30 alphabets, but probably at least 10.

Full list of scripts:

  • Latin (>700 codepoints!) incl. full-width versions
  • Greek
  • Coptic
  • Cyrillic
  • Armenian
  • Georgian
  • Cherokee
  • Glagolitic
  • Deseret
  • Osage
  • Old Hungarian
  • Warang
  • Medefaidrin
  • Adlam

!! Latin (>700 codepoints!) incl. full-width versions
Ah yes, not to mention 'circled superscript' versions,
and I am sure there must be lowercase emojis in Unicode by now :-).

@car222222 Luckily no circled letters ;) It's mainly lots and lots of combining accent versions.

@josephwright but you really should implement \text_lowercase:n{\emoji{Man}} = \emoji{Boy} ;-)

Thoughts on further coverage? Or do we go with what I've set up for the present?

The handling of \.I İ in the MWE above is different in pdfLaTeX (also compared to the Unicode engines), but I admit that İ is probably a tricky case in the generic case change code.

So I tried the Turkish case changer

\documentclass{article}
\usepackage{fontspec}
\usepackage{libertinus}
\usepackage{expl3}

\ExplSyntaxOn
\def\test{\text_lowercase:nn{tr}}
\ExplSyntaxOff

\begin{document}
\test{\.I İ \CYRI И}
\end{document}

(L3 programming layer <2020-02-25>) and LuaLaTeX and XeLaTeX are not happy:

! Undefined control sequence.
<inserted text> ı

@moewew Hmm, that's a bit odd: I'll get it sorted

@moewew Specific issue with Turkish: now fixed

Thoughts on further coverage? Or do we go with what I've set up for the present?

I would start with the present set and extend when the need arises

OK, I think that's the best position, and also means we can keep issues moving. I'll close here and specific additions can be addressed in new issues.
