C++ - How to read UTF-8 characters from the Windows console? It seems that ReadConsoleOutputCharacter() can't handle them.
Here is code that isolates the problem:
    #include <iostream>
    #include <windows.h>
    using namespace std;

    int main()
    {
        SetConsoleOutputCP(CP_UTF8);
        _wsystem(L"echo pure ascii, naïveté");
        COORD pos = {0,0};

        TCHAR* attempt1 = new TCHAR[14];
        DWORD charnum1;
        ReadConsoleOutputCharacter(GetStdHandle(STD_OUTPUT_HANDLE), attempt1, 14, pos, &charnum1);
        wcout << endl << "charnum1: " << charnum1 << ", attempt1: " << attempt1 << endl;
        wcout << "getlasterror: " << GetLastError();

        TCHAR* attempt2 = new TCHAR[16];
        DWORD charnum2;
        ReadConsoleOutputCharacter(GetStdHandle(STD_OUTPUT_HANDLE), attempt2, 16, pos, &charnum2);
        wcout << endl << "charnum2: " << charnum2 << ", attempt2: " << attempt2 << endl;
        wcout << "getlasterror: " << GetLastError();

        system("pause > nul");
    }

The output is:
    pure ascii, naïveté
    charnum1: 14, attempt1: pure ascii, na
    getlasterror: 0
    charnum2: 0, attempt2: x >
    getlasterror: 0

The first attempt works fine, but when the function tries to read at the position of a non-ASCII character, it returns nothing, and no error is indicated. Why?
Caveat: on my system, CP_UTF8 is not available, and when I run your code the echo command results in "The system cannot write to the specified device."
However, if I remove the SetConsoleOutputCP() call and leave the console at the default codepage, 437, the string is displayed correctly.
Note that there are separate read and write codepages. I tried various combinations of 437, 850, 1252, and 28591 (the latter two more-or-less map to Unicode's first 255 codepoints). If CP_UTF8 is working for you, re-try your code with a call to SetConsoleCP(CP_UTF8).
Note that ReadConsoleOutputCharacter() does not place a null after the last character it reads, so you've got a problem in your code when you output the TCHAR array: you have no guarantee that it's null-terminated, and it may crash. (Also, you're not deleting the allocated TCHAR arrays.) So, I changed the allocation lines to this:
    TCHAR attempt1[] = L"____________________"; // 20 underscores

which (with no call to SetConsoleOutputCP()) yielded this:
    charnum1: 14, attempt1: pure ascii, na______
    charnum2: 16, attempt2: pure ascii, na∩v____

That next-to-last glyph in the second line isn't an "n"; it's character 0xEF in codepage 437. "ï" is character 0xEF in Unicode. What's happening here is that the correct codepoint (0xEF) is read from the console, but the stream output continues to use the 437 codepage. The stream output selects the character based on the locale setting of the stream, not on the codepage that has been set in the console.
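To make the 0xEF confusion concrete, here is a small portable sketch (no Windows API involved). The helper names `to_utf8`, `decode_as_unicode`, and `decode_as_cp437` are made up for this illustration, and the CP437 decoder is deliberately tiny: it only knows the two high bytes that show up in this thread.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Only handles code points up to U+FFFF, which is all this example needs.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Read the value as a Unicode code point (same as ISO-8859-1 for 0x00..0xFF).
char32_t decode_as_unicode(unsigned char b) {
    return b;
}

// Hypothetical, partial CP437 decoder: ASCII plus the two bytes from this thread.
char32_t decode_as_cp437(unsigned char b) {
    switch (b) {
        case 0xEF: return 0x2229;  // ∩ (INTERSECTION)
        case 0xE9: return 0x0398;  // Θ (GREEK CAPITAL LETTER THETA)
        default:   return b < 0x80 ? b : 0xFFFD;  // other high bytes not covered
    }
}
```

The same value 0xEF comes out as "ï" (U+00EF) under the first interpretation and as "∩" (U+2229) under the second, which matches the glyph seen in the output above. Likewise 0xE9 is "é" in Unicode but "Θ" in codepage 437.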
I don't know why the desired codepoint value is read from the console when the console's read codepage is still 437. I'm also puzzled why, if I call SetConsoleOutputCP(1252) (or 28591), the output of the echo command looks like it's using CP 437:

    pure ascii, na∩vitΘ
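As for the missing null terminator mentioned earlier, one way to sidestep it is to build the string from the character count that the function reports rather than relying on a terminator. The following is only a sketch under assumptions: `from_counted` and `read_console_cells` are names invented here, the wide variant ReadConsoleOutputCharacterW is called explicitly instead of going through the TCHAR macro, and the Windows-specific part is guarded so the helper also compiles elsewhere. It does not by itself explain or fix the codepage behaviour described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Build a string from a buffer that is NOT null-terminated,
// using an explicit character count (clamped to the buffer size).
std::wstring from_counted(const std::vector<wchar_t>& buf, std::size_t count) {
    if (count > buf.size()) count = buf.size();
    return std::wstring(buf.data(), count);
}

#ifdef _WIN32
// Read up to `length` character cells starting at column x, row y of the
// console screen buffer. Returns an empty string if the call fails.
std::wstring read_console_cells(short x, short y, DWORD length) {
    std::vector<wchar_t> buf(length);  // freed automatically, no delete[] needed
    DWORD read = 0;
    COORD pos = { x, y };
    if (!ReadConsoleOutputCharacterW(GetStdHandle(STD_OUTPUT_HANDLE),
                                     buf.data(), length, pos, &read)) {
        return std::wstring();
    }
    return from_counted(buf, read);
}
#endif
```

With this shape, something like `read_console_cells(0, 0, 16)` would take the place of the `new TCHAR[16]` allocation and the raw pointer output in the question's code. How the resulting wide string is then displayed still depends on the locale of the output stream, as described above.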