OpenRefine: Text facet sort by name should use case- and diacritic-insensitive collation

Created on 15 Oct 2012  ·  3 comments  ·  Source: OpenRefine/OpenRefine

_Original author: tfmorris (November 12, 2011 19:51:37)_

Currently, lowercase characters sort after all uppercase characters, so 'T' and 't' end up in wildly different spots. International characters collate at the very end, so 'Österreichische' is miles from the 'O's.

We should fold both case and diacritics to their base forms.

_Original issue: http://code.google.com/p/google-refine/issues/detail?id=482_

Labels: bug, facets, imported from old code repo, localization, Medium

All 3 comments

_From tfmorris on November 12, 2011 20:33:31:_
r2371 makes the sort order case-insensitive, but JavaScript doesn't appear to have a built-in diacritic folding method, so that'll be a little more work.

After I committed the "fix" I discovered that this may actually be a browser-specific bug/difference, but it doesn't appear that there's been much progress in fixing it, so we probably should assume that the current state is going to exist for a while.
http://code.google.com/p/v8/issues/detail?id=459

There's a code snippet here that can be used to scrub diacritics: http://lehelk.com/2011/05/06/script-to-remove-diacritics/
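
For comparison, on engines that support `String.prototype.normalize` (ES2015 and later, so not an option when this comment was written), folding can be sketched without a lookup table. This is not the linked script, just a minimal illustration; `foldKey`, `compareFolded`, and the sample values are hypothetical. It only handles letters that decompose into a base letter plus a combining mark, so characters such as 'ø' or 'ß' pass through unchanged.

```javascript
// Hypothetical helper: build a sort key with case and diacritics folded away.
function foldKey(s) {
  return s
    .normalize("NFD")                 // split 'Ö' into 'O' + combining diaeresis
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining diacritical marks
    .toLowerCase();                   // fold case
}

// Hypothetical comparator: order two strings by their folded keys.
function compareFolded(a, b) {
  const ka = foldKey(a);
  const kb = foldKey(b);
  return ka < kb ? -1 : ka > kb ? 1 : 0;
}

// Made-up facet values for illustration.
const facetNames = ["Tirol", "Zagreb", "Österreichische", "Oslo", "trento", "apple"];
const foldedSorted = [...facetNames].sort(compareFolded);
console.log(foldedSorted);
```

With these sample values, 'Österreichische' should land next to 'Oslo', and 'Tirol' next to 'trento', rather than at opposite ends of the list.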

Waiting 8 years has its advantages - there's now ECMAScript support for Intl.Collator, which collates letter case and diacritic forms together (according to locale-specific rules).
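
A minimal sketch of what that could look like, assuming an "en" locale and made-up facet values (neither comes from the OpenRefine code). The `sensitivity: "base"` option asks the collator to treat strings that differ only in case or accents as equal; other locales may apply different rules.

```javascript
// Locale "en" and the sample values are assumptions for illustration.
const baseCollator = new Intl.Collator("en", { sensitivity: "base" });

// With sensitivity "base", case and diacritic differences compare as equal.
const caseIgnored = baseCollator.compare("T", "t") === 0;
const accentIgnored = baseCollator.compare("Ö", "O") === 0;

const sampleNames = ["Zagreb", "trento", "Österreichische", "Oslo", "Tirol"];
// collator.compare is a bound function, so it can be passed to sort directly.
const collatorSorted = [...sampleNames].sort(baseCollator.compare);
console.log(caseIgnored, accentIgnored, collatorSorted);
```

Reusing one `Intl.Collator` instance for the whole sort is generally cheaper than calling `localeCompare` with options on every comparison.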

The default localeCompare() implementation already collates diacritics together, at least for the en-US locale in Chrome, and it presumably collates things the way users expect in other locales too, so I think we can close this.
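
A quick way to check that behaviour, as a sketch with made-up values. The locale is pinned to "en-US" here only so the result does not depend on the machine's default; calling `localeCompare` with no locale argument uses the runtime's default locale, and a locale such as Swedish may legitimately place 'Ö' after 'Z'.

```javascript
// Sample values are made up; "en-US" is pinned so the order is reproducible.
const countryNames = ["Zagreb", "Österreichische", "Oslo"];
const localeSorted = [...countryNames].sort((a, b) => a.localeCompare(b, "en-US"));
console.log(localeSorted);
```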
