C++ - How to test a u32string for letters only (with locale)
I'm writing a compiler (for my own programming language) and I want to allow users to use any of the characters in the Unicode letter categories to define identifiers (modern languages, like Go, allow such syntax already). I've read a lot about character encoding in C++11 and, based on the information I've found, it should be fine to use UTF-32 encoding (it is fast to iterate over in a lexer and it has better support than UTF-8 in C++).
In C++ there is the isalpha function. How can I test whether a char32_t is a letter (a Unicode code point classified as a "letter" in any language)?
Is this even possible?
Use ICU to iterate over the string and check whether the appropriate Unicode properties are fulfilled. Here is an example in C that checks whether the UTF-8 command-line argument is a valid identifier:
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unicode/uchar.h>
    #include <unicode/utf8.h>

    int main(int argc, char **argv) {
      if (argc != 2) return EXIT_FAILURE;
      const char *const str = argv[1];
      int32_t off = 0;
      // U8_NEXT has a bug causing length < 0 to not work for characters in [U+0080, U+07FF]
      const size_t actual_len = strlen(str);
      if (actual_len > INT32_MAX) return EXIT_FAILURE;
      const int32_t len = actual_len;
      if (!len) return EXIT_FAILURE;  // an empty string is not an identifier
      UChar32 ch = -1;
      // The first code point must be allowed to start an identifier.
      U8_NEXT(str, off, len, ch);
      if (ch < 0 || !u_isIDStart(ch)) return EXIT_FAILURE;
      // Every following code point must be allowed inside an identifier.
      while (off < len) {
        U8_NEXT(str, off, len, ch);  // sets ch < 0 on malformed UTF-8
        if (ch < 0 || !u_isIDPart(ch)) return EXIT_FAILURE;
      }
      return EXIT_SUCCESS;
    }

Note that ICU here uses the Java definitions, which differ from those in UAX #31. In a real application you might also want to normalize to NFC first.
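Since the question is about a u32string rather than UTF-8 input, the same check can be sketched directly over UTF-32 code points. This is only a minimal sketch: `is_scalar_value` and `is_identifier` are hypothetical helper names, not ICU functions, and the two classifiers are passed in as parameters so that the choice between the Java-style and the UAX #31 definitions stays with the caller.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ICU): true if c is a Unicode scalar value,
// i.e. no larger than U+10FFFF and not a surrogate (U+D800..U+DFFF).
inline bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical helper: true if s is non-empty, its first code point satisfies
// is_start, and every later code point satisfies is_part. Both predicates are
// supplied by the caller and must be callable with a char32_t.
template <class StartPred, class PartPred>
bool is_identifier(const std::u32string& s, StartPred is_start, PartPred is_part) {
    if (s.empty()) return false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char32_t c = s[i];
        // Reject values that are not valid UTF-32 before classifying them.
        if (!is_scalar_value(c)) return false;
        const bool ok = (i == 0) ? static_cast<bool>(is_start(c))
                                 : static_cast<bool>(is_part(c));
        if (!ok) return false;
    }
    return true;
}
```

With ICU, the predicates could be small lambdas that cast each `char32_t` to `UChar32` and call `u_isIDStart` and `u_isIDPart`, as in the C example. For the UAX #31 definitions, `u_hasBinaryProperty` with `UCHAR_XID_START` and `UCHAR_XID_CONTINUE` should be a closer match.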