In general, the encoding of a `char`, and thus the encoding of a `std::string`, which is just a wrapper around a sequence of `char`s, is simply undefined. It stores whatever "bytes" you put into it! Most string functions are agnostic of any specific encoding. For example, `strlen()` just counts the number of bytes before the first NUL (`0x00`) byte, and `std::string::length()` simply returns the number of stored `char` elements. Therefore, for multi-byte character encodings, such as UTF-8, these functions return the length in bytes, rather than the actual number of encoded characters.
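For illustration, here's a minimal sketch (the string literal is made up) showing the byte-counting behaviour on a UTF-8 string:

```cpp
#include <cstring>
#include <iostream>
#include <string>

int main() {
    // "héllo" in UTF-8: the 'é' occupies two bytes (0xC3 0xA9)
    std::string s = "h\xC3\xA9llo";

    std::cout << std::strlen(s.c_str()) << '\n'; // prints 6 (bytes before the NUL)
    std::cout << s.length() << '\n';             // prints 6 (stored char elements)
    // ...even though the string contains only 5 encoded characters!
    return 0;
}
```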
Now, things get interesting (messy) when you receive strings from an "external" source, or when you pass strings to an "external" destination. That's because, at this point, you need to agree on a specific encoding with the "external" entity. One such situation is when you read strings from a file, or when you write strings to a file: here you need to know which encoding is stored in the file, or which encoding will be expected by whoever is going to read your file. Another important situation is when you call OS functions that deal with strings!
On Windows, the Win32 API has two variants of most functions that deal with strings: an "ANSI" (`char*`) variant and a "Unicode" (`wchar_t*`) variant. The "ANSI" variants expect or return strings in whatever multi-byte character encoding (ANSI codepage) happens to be configured on your system. It's usually something like Windows-1252 (a superset of Latin-1) on systems in the "Western" world, but it could be something entirely different, even UTF-8. Note that support for UTF-8 in the "ANSI" APIs is a relatively new invention in Windows! Meanwhile, the "Unicode" Win32 API functions expect or return Unicode strings, always using the UTF-16 encoding (historically UCS-2).
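As a small illustration, here's a sketch using the `MessageBoxA`/`MessageBoxW` pair (any A/W pair of Win32 functions would do):

```cpp
#include <windows.h>

int main() {
    // "ANSI" variant: bytes are interpreted in the current ANSI codepage,
    // so 0xE9 shows up as 'é' only if that codepage is Windows-1252-like.
    MessageBoxA(nullptr, "H\xE9llo", "ANSI", MB_OK);

    // "Unicode" variant: the string is UTF-16; note the L prefix (wchar_t)
    MessageBoxW(nullptr, L"H\u00E9llo", L"Unicode", MB_OK);

    return 0;
}
```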
In Linux, OS (kernel) functions generally use the `char*` type for passing around strings. But the Linux kernel developers are very reluctant to assume or enforce any specific character encoding. So, in Linux, most, if not all, OS (kernel) functions that deal with strings in some way are again agnostic of a particular character encoding! For example, in Linux, a file name is simply defined as a sequence of non-NUL bytes. The Linux kernel therefore leaves it up to applications or the specific file-system implementation to deal with the details... 🙄
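For instance, this sketch (file name made up) creates a file whose name is not valid UTF-8, and the kernel accepts it without complaint:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Any sequence of non-NUL bytes is a valid Linux file name. These are
    // the Latin-1 bytes for "héllo.txt"; the lone 0xE9 is NOT valid UTF-8.
    const char *name = "h\xE9llo.txt";

    int fd = open(name, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        std::perror("open");
        return 1;
    }
    close(fd);

    // The file now exists; tools that assume UTF-8 names (ls, file managers)
    // will typically render the 0xE9 byte as '?' or a replacement character.
    std::puts("file created");
    return 0;
}
```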
To make things even more complicated, many locale-aware programs or libraries use the so-called "locale" to deal with text input/output. It is configured with environment variables, like `LANGUAGE`, `LC_xxx` and `LANG`. The "locale" covers a bunch of other things, such as the formatting of numbers and the time/date format to be used, but it also includes the character set. Most commonly, UTF-8 is used these days.
https://www.gnu.org/software/gettext/manual/html_node/Locale-Environment-Variables.html
https://www.baeldung.com/linux/locale-environment-variables
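A program can query the character encoding implied by the active locale with POSIX `nl_langinfo()`, after adopting the locale from the environment; a minimal sketch:

```cpp
#include <clocale>
#include <cstdio>
#include <langinfo.h>

int main() {
    // Adopt the locale configured via the environment (LANG / LC_* / LANGUAGE)
    std::setlocale(LC_ALL, "");

    // Report the character encoding implied by the active locale,
    // e.g. "UTF-8" on most modern Linux systems
    std::printf("locale codeset: %s\n", nl_langinfo(CODESET));
    return 0;
}
```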
> You could obviously use it to store other encodings than UTF-8 even on Linux (some libraries might have special functions for different encodings), but in general, when dealing with the file system or printing something to the console etc., it will be treated as UTF-8.
Not necessarily. As pointed out above, the Linux kernel and the syscalls that it provides are agnostic of a particular character encoding as much as possible. Meanwhile, most locale-aware programs or libraries, including the terminal emulator, will probably use or assume the character encoding that is indicated by the active "locale" – most commonly UTF-8 these days, but this cannot be relied upon.
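To see this in action, here's a sketch that writes raw Latin-1 bytes to stdout; on a terminal configured for UTF-8, the output gets mangled:

```cpp
#include <cstdio>

int main() {
    // Raw Latin-1 bytes for "héllo" plus a newline: the lone 0xE9 byte is
    // not a valid UTF-8 sequence, so a UTF-8 terminal renders it as '�'.
    std::fwrite("h\xE9llo\n", 1, 6, stdout);
    return 0;
}
```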