C++ - How to test a u32string for letters only (with locale)


I'm writing a compiler (for my own programming language) and I want to allow users to use any of the characters in the Unicode letter categories to define identifiers (modern languages, like Go, allow such syntax already). I've read a lot about character encoding in C++11, and based on the information I've found, it should be fine to use UTF-32 encoding (it is fast to iterate over in a lexer and it has better support than UTF-8 in C++).

In C++ there is the isalpha function. How can I test whether a char32_t is a letter (a Unicode code point classified as "letter" in any language)?

Is it even possible?

Use ICU to iterate over the string and check whether the appropriate Unicode properties are fulfilled. Here is an example in C that checks whether the UTF-8 command line argument is a valid identifier:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #include <unicode/uchar.h>
    #include <unicode/utf8.h>

    int main(int argc, char **argv) {
      if (argc != 2) return EXIT_FAILURE;
      const char *const str = argv[1];
      int32_t off = 0;
      // U8_NEXT has a bug causing length < 0 to not work for characters in [U+0080, U+07FF]
      const size_t actual_len = strlen(str);
      if (actual_len > INT32_MAX) return EXIT_FAILURE;
      const int32_t len = (int32_t)actual_len;
      // An empty string is not an identifier.
      if (!len) return EXIT_FAILURE;
      UChar32 ch = -1;
      // The first code point must be allowed at the start of an identifier.
      U8_NEXT(str, off, len, ch);
      if (ch < 0 || !u_isIDStart(ch)) return EXIT_FAILURE;
      // Every following code point must be allowed inside an identifier.
      while (off < len) {
        U8_NEXT(str, off, len, ch);
        if (ch < 0 || !u_isIDPart(ch)) return EXIT_FAILURE;
      }
      return EXIT_SUCCESS;
    }

Note that ICU here uses the Java definitions, which are slightly different from those in UAX #31. In a real application you might also want to normalize to NFC first.
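Since the question is about a u32string rather than UTF-8, the same start/continue check can be sketched in C++ directly over UTF-32 code points. This is only a minimal sketch, not part of the original answer: the names `is_identifier`, `is_scalar_value`, `ascii_id_start` and `ascii_id_part` are made up for illustration, and the two ASCII predicates are stand-ins so that the block compiles with the standard library alone. In a real lexer you would pass predicates backed by ICU instead.

```cpp
#include <cstddef>
#include <string>

// True if c is a valid Unicode scalar value: at most U+10FFFF and not a
// surrogate (U+D800..U+DFFF). Anything else is not valid UTF-32.
inline bool is_scalar_value(char32_t c) {
  return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical helper: checks that `text` is non-empty, that its first code
// point satisfies `is_start`, and that every later one satisfies `is_part`.
// The predicates are parameters so the property lookup can be swapped out.
template <class StartPred, class PartPred>
bool is_identifier(const std::u32string &text, StartPred is_start,
                   PartPred is_part) {
  if (text.empty()) return false;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const char32_t c = text[i];
    if (!is_scalar_value(c)) return false;
    const bool ok = (i == 0) ? is_start(c) : is_part(c);
    if (!ok) return false;
  }
  return true;
}

// Stand-in predicates covering ASCII only (an assumption for this sketch;
// they do NOT implement any Unicode property).
inline bool ascii_id_start(char32_t c) {
  return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_';
}

inline bool ascii_id_part(char32_t c) {
  return ascii_id_start(c) || (c >= U'0' && c <= U'9');
}
```

With ICU available, the stand-ins could be replaced by small lambdas such as `[](char32_t c) { return u_isIDStart(static_cast<UChar32>(c)) != 0; }` and the equivalent for `u_isIDPart`, which is the same pair of calls the C example uses. To accept letters only, as in the title, `u_isalpha` could be passed for both predicates.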

