Only a tiny part of the Unicode database, the normalization tables. Problem is t...

cryptonector · on May 16, 2023

> Problem is these tables have to be updated every year, and they dont with similar tables, such as the glibc.

The language standard can avoid this by committing to a single Unicode version for each language version.

> And Unicode identifiers are entirely insecure, ...

Eh, it depends on what we're using them for. If it's symbols in object files, it's not really a problem. More importantly, confusables are unavoidable -- even ASCII by itself has confusable characters (1 and l, for example). The thing to do is to forbid arbitrary mixing of scripts and allow only the scripts that are relevant in a given context. For DNS, for example, this means that TLD registries should come up with per-registry rules.
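
A rough sketch of what "no arbitrary script mixing" could look like, in Python for brevity. Note the assumptions: `scripts_used` and `is_allowed` are hypothetical names, and the first word of a character's Unicode name is used as a stand-in for its Script property, which is only an approximation. A real implementation would read the Script data from the Unicode database and follow the restriction levels in the Unicode security mechanisms spec.

```python
import unicodedata

def scripts_used(identifier):
    """Approximate the set of scripts appearing in an identifier.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", "GREEK", ...) stands in for the Script property.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():  # skip digits, underscores, combining marks
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_allowed(identifier, allowed=frozenset({"LATIN"})):
    """Accept an identifier only if it uses a single script from `allowed`."""
    used = scripts_used(identifier)
    return len(used) <= 1 and used <= allowed

latin = "paypal"
mixed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(is_allowed(latin))  # accepted: Latin only
print(is_allowed(mixed))  # rejected: Latin mixed with Cyrillic
```

The `allowed` set is where the per-context policy would live: a registry or a project picks the scripts it cares about and rejects everything else. This does nothing about confusables within one script (1 and l again).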

rurban · on May 17, 2023

See my https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2932.htm

cryptonector · on May 17, 2023

Can you expand on what the security concerns are as to confusables in symbol names in C source code? Clearly there's a security concern with cut-n-paste, but that's true regardless of what the rules might be for C identifiers.

It's not obvious that UTS #39 applies literally everywhere that there are "identifiers".

Also, can you speak to what is the security concern with form-insensitivity (rather than confusables) as to symbols in input source files? I just don't see a concern at all there, but maybe I'm missing something.
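
To be concrete about what I mean by form-insensitivity, here is a minimal sketch in Python (the helper name `same_identifier` is made up for illustration, and picking NFC is an assumption; any one form works as long as both sides use it). Two spellings of the same name differ as raw code point sequences but compare equal once both are normalized:

```python
import unicodedata

def same_identifier(a, b, form="NFC"):
    """Compare two identifiers form-insensitively by normalizing both sides."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"     # e-acute as one precomposed code point
decomposed = "cafe\u0301"  # plain e followed by a combining acute accent

print(composed == decomposed)                 # raw comparison
print(same_identifier(composed, decomposed))  # form-insensitive comparison
```

The raw comparison fails and the normalized one succeeds. A compiler that compares symbols this way treats both spellings as the same identifier, which is what a reader looking at the source would expect anyway.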

Lastly, I think `#include` is the most important place to get this right, since that is where the compiler interfaces with the world outside it (specifically: the filesystem). But as you note, most filesystems just store whatever bytes they are given -- very few filesystems normalize on create (HFS+) or are form-insensitive (ZFS). The other place to get this right is on the object file output side, where symbols definitely must be normalized.
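
A small Python illustration of why these two boundaries matter, under the assumption that names are encoded as UTF-8 (`emit_symbol` is a hypothetical helper, not anything a real toolchain exposes). The same header name in two normalization forms yields different byte strings, so a byte-oriented filesystem sees two different files, while normalizing before writing a symbol gives one canonical byte sequence:

```python
import unicodedata

def emit_symbol(name):
    """Hypothetical output step: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", name).encode("utf-8")

nfc_name = unicodedata.normalize("NFC", "caf\u00e9.h")
nfd_name = unicodedata.normalize("NFD", "caf\u00e9.h")

# What a just-use-8bit filesystem would be asked to look up:
print(nfc_name.encode("utf-8"))  # b'caf\xc3\xa9.h'
print(nfd_name.encode("utf-8"))  # b'cafe\xcc\x81.h'

# What would land in the object file after normalizing on output:
print(emit_symbol(nfc_name) == emit_symbol(nfd_name))  # True
```

So whether an `#include` resolves can depend on which form the editor happened to save, unless the filesystem or the compiler compensates.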

Oh, one more thing: the platform might impose some rules regarding symbols in ELF and any other object file formats. Are they known to? I suppose C can't necessarily cater to all platform-imposed limitations on symbol naming, but it'd be useful to know about them.