Rust: Tracking issue for non-ASCII identifiers (feature "non_ascii_idents")

Created on 12 Oct 2015  ·  54 Comments  ·  Source: rust-lang/rust

Non-ASCII identifiers are currently feature gated. Handling of them should be fixed and the feature gate removed.

B-unstable C-tracking-issue P-low T-lang

All 54 comments

/cc @rust-lang/lang

nominating

cc @SimonSapin

Apparently we implement this: http://www.unicode.org/reports/tr31/ or something like it.

I would like to see this stabilised, but it will take some work to persuade ourselves that we are doing the right thing.

I have no idea what the right thing is here. In addition to Unicode recommendations, we might want to look at what other languages actually do, and what related bug reports or criticism they get. Or was this already done when the feature was first introduced?

@SimonSapin
C and C++ use http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax (with some minor restrictions) and I haven't seen any complaints about it on isocpp forums or issue lists :)
Overview of the problem: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
Implementation in Clang: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Lex/UnicodeCharSets.h?view=markup
cc https://github.com/rust-lang/rust/issues/4928

There's also a problem with normalization of identifiers and with mapping Unicode mod names to filesystem names (on OS X, IIRC); see https://github.com/rust-lang/rust/issues/2253. (In the worst case, non-inline mods and extern crates can be forced to be ASCII.)

Yes, #2253 is the big issue I know of that makes me worry about premature stabilization of non-ASCII identifiers.

(The discussion there is more broad and arguably could be forked off into two threads; e.g. we _could_ take one normalization path for identifiers and another for string literal contents.)

We may want to migrate this discussion to the RFCs repo, e.g. at https://github.com/rust-lang/rfcs/issues/802

I agree that this is a feature that deserves to be put through the RFC process.

I've repurposed this issue to track stabilization (or deprecation, etc) of the non_ascii_idents feature gate.

After discussion in the lang team meeting, we decided that yes, an RFC would be the proper way forward here. We need something that collects the solutions from other languages, analyzes their pros/cons, and suggests the appropriate choice for Rust. This is controversial and complex enough that it should be brought to the community at large -- especially as many of us hacking on Rust on a daily basis don't have a lot of experience with non-ASCII anyhow.

triage: P-low

Marking as low as there is no RFC at present and hence no actionable content.

In JavaScript, Perl 5 and Perl 6 this feature is available.
JavaScript (Firefox 50)

function Слово(стойност) {
  this.стойност = стойност;
}
var здрасти = new Слово("Здравей, свят");
console.log(здрасти.стойност) //Здравей, свят

Perl >=5.12

use v5.12;  # enables say
use utf8;
{
  package Слово;
  sub new {
    my $self = bless {}, shift;
    $self->{стойност} = shift;
    $self
  }
};
my $здрасти = Слово->new("здравей, свят");
say ucfirst($здрасти->{стойност}); #Здравей, свят

Perl6 (this is not just next version of Perl. This is a new language)

class Слово {
  has $.стойност;
}

my $здрасти = Слово.new(стойност => 'здравей, свят');
say $здрасти.стойност.tc; #Здравей, свят

I would be happy to see it in Rust too.

For what it’s worth identifiers in ECMAScript 2015 are based on the Default Identifier Syntax from Unicode Standard Annex #31.

Perl with use utf8; uses the regexp below, with XID_Start and XID_Continue presumably also from UAX #31.

/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
        (?[ ( \p{Word} & \p{XID_Continue} ) ]) *    /x

Yes! Thanks @SimonSapin!

For Python it’s <XID_Start> <XID_Continue>*.

So it looks like many programming languages that allow non-ASCII identifiers are based on the same standard, but in the details they each do something slightly different…

I would personally love to see support for math-related identifiers. For example, ∅ (and set operators, like ∩ and ∪). Translating equations from research papers/specifications into code is often a terrible process resulting in verbose and difficult to read code. Being able to use the same identifiers in the code that are in the paper's math equations would simplify implementation and would make the code easier to check and compare against the paper's equations.

What's the point of this feature exactly? Aside from adding the possibility of creating a truly ugly mix of different languages in your code (English is the only truly international language), it gives no benefit to the language functionality-wise. Or is it support of Unicode for the sake of supporting Unicode?

@DoumanAsh Not every program is international, and English fluency doesn’t have to be a requirement for programming.

It’s fine for the maintainers of any project to decide that variable names and comments in their code should be in English. This is what happens for many open-source projects, including rustc itself. But that doesn’t mean that the language should be restricted to that.

The use case that I see is not writing production code, but teaching. It really sucks to tell people that they have to be fluent in English to become programmers. The other situation is when you are writing a foreign-language UI: if your UI has a text box labeled "příjmení" but you end up putting the value in a variable named "last name", that is weird. Even weirder is if you have a field named "rodné_číslo" (Czech national ID number). There is no analogous English word for that. So if I were writing a Czech tax app or banking app, I'd have to use a weird name for no good reason. It's not like such an app would be portable to other languages anyway.

Another good reason to allow this is that linguists often need to use IPA notation in variable names. The English names of IPA symbols can be ridiculously long. For example, the American English r sound in the word red is transcribed as one character ɹ̠ but is named post-alveolar retroflexive approximant. https://en.wikipedia.org/wiki/Alveolar_and_postalveolar_approximants So if I'm writing a text-to-speech program I might reasonably want to write fn say_ɹ̠() instead of fn say_post_alveolar_retroflexive_approximant().

On the non-opinion side of things, I think that there is an interesting discussion to be had here about which Unicode code points are allowed to be part of a variable name. For example: can I name a variable price€? Probably not; I don't think price$ works, does it? Can I create a →![] macro for generating vectors? I know someone might want to do so, but → is a "math symbol" http://www.fileformat.info/info/unicode/char/2192/index.htm . So when lexing, we need to make a decision as to which code points are acceptable and which are not, and maybe Rust shouldn't simply dumbly ask the Unicode standard whether something is a letter or not.

@timthelion in the current implementation Rust does not simply dumbly ask the Unicode standard whether something is a letter or not: it relies on the XID_Start and XID_Continue Unicode properties, which have correct and intuitive behavior in all of your examples.

  • say_ɹ̠ is allowed because both 'ɹ' and '̠' are XID_Continue.
  • price€ and price$ are not allowed because '€' and '$' are not XID_Continue.
  • →![] is not allowed because '→' is not XID_Start.
  • příjmení and rodné_číslo are allowed.

@dtolnay thank you for the explanation. I hope you weren't offended by my use of the word "dumbly", perhaps that was a poorly chosen word.

Nope, just pointing out that great minds think alike and the fine folks of the unicode technical committee had the same concerns that you did.

I can come up with other in-production use cases.

Some words in a specific field are hard to translate into English, but some programs (e.g. games, local online-to-offline services) may have to deal with, e.g., Chinese dish names, hero names, place names. Programmers working for companies don't need to know what the English translations are, but they still have to name their variables and functions. If they have to use English they will come up with strange names, usually very hard for their co-workers to understand.

At this point I think there is no doubt that there are many use cases. What remains to be done is to figure out the details:

  • What characters exactly should be allowed. For example non-ASCII punctuation should probably be excluded.
  • How much normalization should be done: two identifiers can be represented with different code points (different UTF-8 bytes in source files) but still be considered equivalent.
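
To make the normalization question concrete, here is a minimal sketch using only the Rust standard library. It shows that a precomposed "é" (U+00E9) and an "e" followed by a combining acute accent (U+0301) look alike but are different code point sequences with different UTF-8 bytes, so a plain byte-for-byte comparison treats them as two names. The helper names `describe` and `same_without_normalization` are hypothetical and exist only for this illustration; they say nothing about how rustc compares identifiers.

```rust
/// Returns the code points of `s` together with its UTF-8 byte length.
/// (Hypothetical helper, for illustration only.)
fn describe(s: &str) -> (Vec<u32>, usize) {
    (s.chars().map(|c| c as u32).collect(), s.len())
}

/// Compares two spellings exactly as written.
/// No normalization is applied: this is a plain string comparison.
fn same_without_normalization(a: &str, b: &str) -> bool {
    a == b
}
```

Calling `same_without_normalization("\u{00E9}", "e\u{0301}")` returns `false`; deciding whether such pairs should count as the same identifier is exactly the open question in the second bullet.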

Several other languages agree on Unicode Standard Annex #31, but have small differences in the details. Ideally, we’d find out what motivated these differences in order to decide what’s best for Rust.

https://rosettacode.org/wiki/Unicode_variable_names has some info for many languages.

I agree with @SimonSapin -- no one doubts this would be useful. The problem is that there is no standard solution and many of us (e.g., myself) are in a poor position to evaluate the tradeoffs. What we're missing is someone to collect the constraints and make a recommendation, I suspect. I suspect at this point any decision would be preferable to no decision -- though I'd definitely prefer to follow some precedent (ideally, a unicode spec or annex, but maybe also another lang) than just adopt Yet Another set of rules.

@nikomatsakis It would be nice to research exactly what motivated the small differences between various languages, but if no one steps up to do that research and we want to proceed anyway, then I think following UAX #31 exactly (which I believe is what our current implementation does) is a fine default.

It may still be worth going through the RFC process with a detailed design, even if it happens to match the current implementation. (Which characters can be used, how they’re normalized / compared for equivalence, how we deal with future Unicode versions, etc.) I suggest whoever writes this RFC reads UAX 31 top to bottom at least once.

We may also want to consider creating a new (or, more likely, using a restricted subset of one of the existing profiles) PRECIS profile [1] for identifiers. This would allow us to normalize identifiers that should be considered the same even if they're slightly different (eg. for locales that have keyboards which output text that looks the same, but differs slightly in its Unicode representation) as well as provide a clear and concise set of rules to determine what is a valid Rust identifier.

I am not aware of any existing Rust implementations of the PRECIS framework (a lot of the Unicode infrastructure required to create one is still missing I think, but this would probably have to be fixed somewhat either way).

I wouldn't call myself an expert, but I have helped build one PRECIS implementation and am generally familiar with the RFCs and some of the pitfalls and gotchas, so I'd be happy to help (or bug the PRECIS working group for help) where needed.

[1] [RFC 7564](https://tools.ietf.org/html/rfc7564): PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols

Good point about the characters that look the same. Here is the Wikipedia article on the issue: https://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode

Here is an article which explains that the duplicate characters from Asian scripts are mostly unified: https://people.w3.org/rishida/scripts/chinese/

@SamWhited Why PRECIS over Unicode’s NFC or NFKC?

> Why PRECIS over Unicode’s NFC or NFKC?

TL;DR: Normalization is just one step we'd want to do when determining if something is a valid identifier. Other operations may (or may not) also need to be performed.

@SimonSapin Unicode normalization is just one step of a PRECIS profile (so we would in fact be using normalization; probably NFC at a guess); however, PRECIS covers a wider range of things. For instance, normalization forms don't do width mapping (I don't think?), so ＦｕｌｌＷｉｄｔｈ won't be the same identifier as FullWidth. If you're on a keyboard that wants to type full-width text this may be an issue (this is probably more of a problem with East Asian characters than it is with Latin chars, but maybe someone from a locale that uses full-width text could chime in and tell me if I'm misrepresenting the issue in any way). Other things a PRECIS profile can do include defining a subset of character properties that are allowed (e.g. letters, numbers, dashes, and starts with a letter, or something like that).

_Disclaimer:_ I haven't actually thought through whether mapping full width text would be desirable or not; it's just an example. It may very well be that normalization is all that matters, or maybe we don't care to do any mapping at all; Go only checks if identifiers have the letter or number property, I think, so if they get by with only that, maybe it's fine for us too. More thought is certainly needed.
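
As a rough illustration of what "width mapping" means, the sketch below folds the fullwidth ASCII variants (U+FF01 to U+FF5E) onto their ordinary ASCII counterparts, which sit exactly 0xFEE0 lower. This is an assumption-laden toy: the function name `fold_fullwidth` is hypothetical, it covers only that one block, and it is neither a PRECIS profile nor a full Unicode normalization form.

```rust
/// Maps fullwidth ASCII variants (U+FF01..=U+FF5E) to the matching
/// ASCII characters (U+0021..=U+007E); everything else passes through.
/// (Hypothetical helper; not a complete width-mapping implementation.)
fn fold_fullwidth(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            // The two ranges are laid out in the same order, 0xFEE0 apart.
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            _ => c,
        })
        .collect()
}
```

With this mapping applied before comparison, a fullwidth spelling of `FullWidth` and the plain ASCII spelling would compare equal; without it, they are two unrelated strings.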

Further reading: this is what the Go spec does (which is much simpler than what I suggested, which may or may not be a good thing): https://golang.org/ref/spec#Source_code_representation

What uses PRECIS? Does any programming language?

> What uses PRECIS? Does any programming language?

I'm not sure what any language but Go does.

Related Go 2 issue: golang/go#16033

On Tue, Apr 11, 2017 at 02:07:49PM -0700, Sam Whited wrote:

> For instance, normalization forms don't do width mapping, so ＦｕｌｌＷｉｄｔｈ won't be the same identifier as FullWidth.

NFKC does that:

Python 3.6.0 (default, Jan 16 2017, 12:12:55)
[GCC 6.3.1 20170109] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ＦｕｌｌＷｉｄｔｈ = 1
>>> FullWidth
1

--
Best regards,
lilydjwg

@SamWhited, in your first link I find:

identifier = letter { letter | unicode_digit } .
letter        = unicode_letter | "_" .

But as far as I can tell Go currently doesn’t do any normalization and using PRECIS is a proposal. Is that correct?
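
The two grammar productions quoted above can be approximated in a few lines of Rust. This is only a sketch under stated assumptions: `char::is_alphabetic` and `char::is_numeric` from the standard library stand in for Go's `unicode_letter` and `unicode_digit`, and both are somewhat broader than the categories the Go spec names. The function names are hypothetical, and no normalization is performed.

```rust
/// Approximates Go's `letter`: a Unicode letter or an underscore.
/// (Assumption: `is_alphabetic` is used in place of Go's letter categories.)
fn is_letter(c: char) -> bool {
    c.is_alphabetic() || c == '_'
}

/// Approximates `identifier = letter { letter | unicode_digit }`.
fn is_go_style_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // The first character must be a letter; the rest may also be digits.
        Some(first) if is_letter(first) => chars.all(|c| is_letter(c) || c.is_numeric()),
        // An empty string, or one starting with anything else, is rejected.
        _ => false,
    }
}
```

Under this rule `příjmení` and `rodné_číslo` are accepted, while `price$`, `price€` and names starting with a digit are rejected.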

> But as far as I can tell Go currently doesn’t do any normalization and using PRECIS is a proposal. Is that correct?

@SimonSapin that's correct; well, not even really a proposal, just an idea to be thought through like this issue (sorry, reread that sentence and my link and it was poorly worded; didn't mean to suggest that it does use it right now, just that I don't know what anything other than Go actually does to handle non-ASCII identifiers).

@SimonSapin

> It may still be worth going through the RFC process with a detailed design, even if it happens to match the current implementation.

👍

I was just reading through UAX #31 to see what they did, and another benefit of using a PRECIS profile stood out to me: just like deprecating stringprep and using PRECIS instead, it provides a way to be future compatible and agile across Unicode versions (by operating on derived properties of code points instead of individual code points themselves).

While TR31 does have a concept of "Immutable Identifiers" to help address this, it is effectively a slightly less restrictive version of a PRECIS protocol derived from the freeform class, but without the considerations PRECIS has given to the order in which rules need to be applied (I don't think?). It also doesn't cover edge cases covered by the PRECIS framework, such as use of Greek final sigma, or some of the edge cases around Hangul Jamo (again, I am no expert in either of these, but that's why PRECIS exists; the experts have done the work already).

> it provides a way to be future compatible and agile across Unicode versions (by operating on derived properties of code points instead of individual code points themselves).

I don’t understand this point. XID_Start and XID_Continue are derived properties.

> I don’t understand this point. XID_Start and XID_Continue are derived properties.

I might have misunderstood UAX 31 then; it looked to me like it required a specific Unicode version. Re-reading I can't see where I got that from though.

Not sure if this is the right place to post this, but some interesting issues are likely to appear with linting of mathematical symbols. They are easily avoided by writing out variable names, but could be important if better correlation with real equations is a goal.

For example, Δ (uppercase) vs. δ (lowercase) in the following screenshot. The linter is not /wrong/, but it also imo doesn't really make sense to apply the snake case requirement here.

[Image: screenshot, 2017-06-27]

Would it be possible to allow emoji in variable names even though they aren't XID_Start/XID_Continue, like in Swift?

@fwrs, Emojis are way more complicated now than non-Emoji characters.

Thanks to some vendors, now you can have Emoji joining (ZWJ) sequences that just keep changing their colors and small details, many of which are not necessarily visible to the naked eye.

Also, the definition of Emoji is expanding fast, every single year, which is not something a systems programming language that wants to be stable and reliable needs.

So, although it's cute, I don't think it sits well with Rust's goals. But Rust-based scripting/educational languages may benefit from allowing Emojis, depending on their goals.

@ryankurte There's a semantic problem in your example—you're transcribing mathematical formulae, but you used U+0394 GREEK CAPITAL LETTER DELTA rather than U+2206 INCREMENT. The former is a letter of the Greek alphabet, and as such has casemapping; the latter is a mathematical symbol and does not.

I'd like to cross-link this comment: https://github.com/rust-lang/rust/issues/4928#issuecomment-343137316

I haven't seen the possibility of enabling homoglyph-based attacks mentioned here (if somebody did mention them, please ignore the noise), but I just filed a Clippy issue to request a lint that warns on code like this:

#![feature(non_ascii_idents)]
fn main() {
    let a = 2;
    let а = 3;
    assert_eq!(a, 2);  // OK
    assert_eq!(а, 3);  // OK
}

In a nutshell, those two `a`s are different Unicode characters, so the second let binding does not shadow the first one, and both asserts pass (the playground doesn't seem to support Unicode identifiers, though, so the only way to try this is locally; works for me).
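
The difference between the two names can be made visible without any feature gate by printing their code points. The sketch below uses only the standard library; the helper name `code_points` is hypothetical. It assumes the second binding above is CYRILLIC SMALL LETTER A (U+0430), written here as an escape so that the two characters cannot be mistaken for each other.

```rust
/// Lists every character of `name` as a "U+XXXX" code point label.
/// (Hypothetical helper, for illustration only.)
fn code_points(name: &str) -> Vec<String> {
    name.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}
```

`code_points("a")` yields `["U+0061"]` while `code_points("\u{0430}")` yields `["U+0430"]`, which is why the compiler sees two distinct bindings rather than a shadowed one.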

This "feature" can be used to introduce exploits in Rust programs that are harder to detect, in particular given that shadowing let bindings is considered idiomatic Rust by many, myself included.

P.S.: this "feature" might be useful in underhanded Rust contests, although that #![feature(non_ascii_idents)] should raise some eyebrows :)

@gnzlbg I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers. If it does, then that solves that problem; if it doesn't, at least we have the tooling to do it ready to go.

I'm a little concerned that this is a candidate for being closed and the code removed from the compiler because it's not had significant movement for a while and requires an RFC. I care a fair amount about Rust being a language of the 21st century, which means Unicode, and about Rust being friendly to non-English-speaking programmers. What I lack is the ability to actually write an RFC.

@Ketsuban

> I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers.

Yes. As suggested by @oli-obk in the Clippy issue, I think that if the Rust implementation instead just used the latest official confusables list:

http://www.unicode.org/Public/security/revision-06/confusables.txt

then homoglyph-based attacks could be prevented. This list would need to be kept in sync, but that is something that can be automated as part of the build system.

@Ketsuban

If you care about this, there are other languages that support unicode in their identifiers, and these languages have processes similar to the RFC process. You could start by checking those. Who knows, maybe you can just merge them together with the feedback in this issue, and get a pre-RFC in the internals forum going? From that point on, it is just about incorporating/arguing feedback with others, and before you know it you will have an RFC ready.

In a way I hope we stick with ASCII identifiers forever. Handling Unicode identifiers is such a massive interoperability pain. One of the more bizarre consequences of NFKC mapping is that names like these map to the same identifier:

>>> ℌ = 1
>>> H
1
>>> Ⅸ = 42
>>> IX
42
>>> ℕ = 23
>>> N
23
>>> import math
>>> ℯ = math.e
>>> e
2.718281828459045
>>> ℨ = 2
>>> Z
2

@mitsuhiko The real world has that kind of pain. We can't just ignore this problem because it's hard to deal with and involves a feature that you _personally_ have no use for.

Also, the current RFC explicitly proposes NFC over NFKC, after a _lot_ of discussion about examples very similar to those.
