Section 0) Introduction
This article goes over how to be Unicode friendly in WinAPI.
I don't normally encourage programming in WinAPI, since you're typically better off with a crossplatform widgetry lib such as wxWidgets or Qt or whatever. But a lot of people still like to use WinAPI directly... so I should at least point them in the right direction. Besides, a lot of the stuff here applies to wx as well (and possibly to Qt, though I've never used Qt so I can't say for sure).
I didn't put a lot of work into formatting or proofreading this article. So my apologies there. I still think it gets the idea across pretty well, even if my thoughts are unorganized.
Unicode forever! Spread the love!
Section 1) The UNICODE macro
The UNICODE macro (and/or _UNICODE macro -- usually both) is scattered throughout all of WinAPI. It redefines some types and functions to use either char* strings (if it's not defined) or wchar_t* Unicode strings (if it is defined).
If you use MSVS, these macros are often automatically defined by the compiler before it begins compiling if you set your project settings to make the program a Unicode program. Otherwise you can do it yourself by #defining them before you include Windows.h:
|
#ifndef UNICODE
#define UNICODE
#endif
#ifndef _UNICODE
#define _UNICODE
#endif
#include <Windows.h>
|
You don't need to #define either of them to use Unicode in your program. It just changes around some types to make it easier to use the Unicode parts of WinAPI.
Further in this article, "Unicode build" refers to UNICODE and _UNICODE being defined, whereas "ANSI build" refers to neither of them being defined.
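One thing worth guarding against is defining only one of the two macros, since the Windows headers look at UNICODE while the C runtime's tchar.h looks at _UNICODE. Here is a minimal sketch of a compile-time check. The function name build_kind is made up for illustration and is not part of WinAPI.

```cpp
#include <string>

// Refuse to compile if only one of the two macros is defined.
#if defined(UNICODE) != defined(_UNICODE)
#error "Define both UNICODE and _UNICODE, or neither"
#endif

// build_kind() is a hypothetical helper, not a WinAPI function.
// It reports which kind of build the compiler is producing.
inline std::string build_kind()
{
#ifdef UNICODE
    return "Unicode build";
#else
    return "ANSI build";
#endif
}
```

The #error fires before anything else is compiled, so a half-configured project fails early instead of producing confusing type mismatches later.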
Section 2) LPSTR, LPCSTR, LPTSTR, LPCTSTR, LPWTFISALLTHIS
Anybody who's looked at WinAPI has probably seen the above types... but what exactly are they?
An inexperienced C/C++ coder might think they're strings, like std::string. It can certainly look that way from the documentation and examples. And since the WinAPI pages don't ever really seem to tell you exactly what they are, it's a logical conclusion.
However, this is not the case. All of the above are just aliases for plain pointer types (the Windows headers declare them as typedefs).
Now you might look at "LPCTSTR" and see the "STR" in there, but the rest might look like random letter combinations that make no sense. Rest assured there's a method to the madness.
- The starting 'LP' stands for "Long Pointer". Without getting too much into what a Long Pointer is (or really what it used to be, it doesn't have as much meaning in modern computing), we'll just say that this is basically a pointer. This means that the LP is telling you that this type is not a string by itself, but is a POINTER to a string (or really, a C-style string).
- The 'C' means that the string is constant
- The 'W' means the string is wide (Unicode)
- The 'T' means the string is TCHARs (see section on TCHAR below)
So really, the definitions boil down to the following:
|
typedef char* LPSTR;
typedef const char* LPCSTR;
typedef wchar_t* LPWSTR;
typedef const wchar_t* LPCWSTR;
typedef TCHAR* LPTSTR;
typedef const TCHAR* LPCTSTR;
|
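Because these types are pointers rather than string objects, operators such as == compare addresses, not text. The sketch below illustrates this with stand-in typedefs so it builds without Windows.h. The namespace demo and the functions same_text and make_string are hypothetical names used only here.

```cpp
#include <cstring>
#include <string>

namespace demo
{
    // Stand-ins mirroring the Windows aliases, so this builds anywhere.
    typedef char*       LPSTR;
    typedef const char* LPCSTR;

    // Compares the text two LPCSTRs point at, not the pointers themselves.
    inline bool same_text(LPCSTR a, LPCSTR b)
    {
        return std::strcmp(a, b) == 0;
    }

    // Copying into a std::string gives you a real string object to work with.
    inline std::string make_string(LPCSTR s)
    {
        return std::string(s);
    }
}
```

Given two separate char arrays that both hold "Foo", comparing the pointers with == is false because the arrays live at different addresses, while demo::same_text reports that the text matches.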
Section 3) TCHAR, _T(), _TEXT(), TEXT()
TCHAR is defined as either char or wchar_t depending on whether or not the UNICODE macro was defined.
By using TCHARs properly, you can create both ANSI and Unicode builds of your program. All you have to do is #define UNICODE if you want a Unicode build, or don't define it if you want an ANSI build.
This presents a bit of a problem, though. String literals in C++ can take 2 forms, either char or wchar_t:
|
const char* a = "Foo";
const wchar_t* b = L"Bar"; // <-- note the L. That makes it wide.
|
The compiler doesn't auto-detect... so things like this would throw compiler errors:
|
const char* a = L"Foo"; // <-- error, can't point char* to a wide string
const wchar_t* b = "Bar"; // <-- error, can't point wchar_t* to a non-wide string
|
So what about assigning a plain string literal to a TCHAR pointer, as in   const TCHAR* c = "Foo";   ?
Remember that TCHAR is char or wchar_t depending on the build. So that line will compile only if you are not building Unicode. If you are building Unicode you'll get an error.
Likewise, the wide version,   const TCHAR* c = L"Foo";   won't compile unless you're building Unicode.
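To get around this problem... WinAPI provides some other macros, _T(), _TEXT(), and TEXT(), all of which do the same thing. (Strictly speaking, _T() and _TEXT() come from tchar.h and key off _UNICODE, while TEXT() comes from the Windows headers and keys off UNICODE, which is one more reason to define both.) In a Unicode build, they put the L before the string literal to make it wide, and in a non-Unicode build, they do nothing. Therefore they will always work hand in hand with TCHARs: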
|
const TCHAR* d = _T("foo"); // works in both Unicode and ANSI builds
|
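To make the mechanism concrete, here is a rough sketch of how such a macro can be built. DEMO_UNICODE, DemoTCHAR, DEMO_T and demo_tcslen are stand-in names invented for this example so it compiles without the Windows headers. The real headers are more involved, but the idea is the same.

```cpp
#include <cstddef>

#define DEMO_UNICODE            // comment this out to get the "ANSI" flavour

#ifdef DEMO_UNICODE
    typedef wchar_t DemoTCHAR;
    #define DEMO_T_IMPL(x) L##x // paste an L in front of the literal
#else
    typedef char DemoTCHAR;
    #define DEMO_T_IMPL(x) x    // leave the literal alone
#endif
// The extra level lets macro arguments expand before the L is pasted on.
#define DEMO_T(x) DEMO_T_IMPL(x)

// Counts characters up to the null terminator, whatever DemoTCHAR is.
inline std::size_t demo_tcslen(const DemoTCHAR* s)
{
    std::size_t n = 0;
    while (s[n]) ++n;
    return n;
}

static const DemoTCHAR* greeting = DEMO_T("foo"); // compiles in both flavours
```

Flipping the single DEMO_UNICODE line changes both the character type and the literal prefix together, which is exactly why TCHAR and _T() have to be used as a pair.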
Section 4) Function and Structure Name Aliases
A lot of Windows functions take strings as parameters. But because char and wchar_t strings are two distinctly different types, the same function can't be used for both of them.
Take for example, the WinAPI function "DeleteFile" which takes a single parameter. Let's say you want to delete "myfile.txt":
|
DeleteFile( _T("myfile.txt") ); // notice _T because DeleteFile takes a LPCTSTR
|
The trick here is that the function DeleteFile doesn't really exist! There are actually two different functions:
|
BOOL DeleteFileA( LPCSTR lpFileName ); // ANSI version, taking a LPCSTR
BOOL DeleteFileW( LPCWSTR lpFileName ); // Unicode version, taking a LPCWSTR
|
DeleteFile is actually a macro defined as either DeleteFileA or DeleteFileW, depending on whether or not this is a Unicode build.
As such... for WinAPI functions that take a C style string... there are, in a sense, 3 different versions, each taking a different type of C string:
|
DeleteFile <- Takes a TCHAR string (LPCTSTR)
DeleteFileA <- Takes a char string (LPCSTR)
DeleteFileW <- Takes a wchar_t string (LPCWSTR)
|
This is true of virtually all WinAPI functions that take a C string as a param.
But it doesn't stop there! There are also some structs that have strings in them. For instance, the OPENFILENAME structure contains various C strings for use with the open file dialog box. As you might expect, there are 3 versions of this struct as well:
|
OPENFILENAME <- has TCHAR strings
OPENFILENAMEA <- has char strings
OPENFILENAMEW <- has wchar_t strings
|
And again... note that OPENFILENAME doesn't really exist as its own struct, but is just an alias for one of the other two depending on the build.
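The aliasing scheme can be imitated in a few lines. Everything below (CountCharsA, CountCharsW, LabelA, LabelW, label_length, and the SKETCH_UNICODE and SKETCH_T macros) is invented for illustration and is not part of WinAPI. It only mirrors how the headers pick between the A and W versions.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

#define SKETCH_UNICODE   // stand-in for UNICODE; remove for the "ANSI" flavour

// Two real functions, one per character type.
inline std::size_t CountCharsA(const char* s)    { return std::strlen(s); }
inline std::size_t CountCharsW(const wchar_t* s) { return std::wcslen(s); }

// Two real structs, one per character type.
struct LabelA { const char*    text; };
struct LabelW { const wchar_t* text; };

// The "generic" names are only aliases chosen at compile time.
#ifdef SKETCH_UNICODE
    #define CountChars CountCharsW
    typedef LabelW Label;
    #define SKETCH_T(x) L##x
#else
    #define CountChars CountCharsA
    typedef LabelA Label;
    #define SKETCH_T(x) x
#endif

// Works in either flavour because every piece switches together.
inline std::size_t label_length(const Label& lbl)
{
    return CountChars(lbl.text);
}
```

Note that the A and W versions stay callable by their explicit names no matter which flavour is selected. Only the unsuffixed names move.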
Section 5) Being Unicode friendly
So what does it take to be Unicode friendly in WinAPI?
For most programs... not very much. Just stick to the following and you'll be fine:
-) Use TCHAR for characters and C strings instead of char
-) Use std::basic_string<TCHAR> instead of std::string. You can even typedef your own kind of tstring:
typedef std::basic_string<TCHAR> tstring;
-) Don't use std::string, as this is a char string.
-) Put all string literals in _T() macros, UNLESS you are dealing with libs other than WinAPI. For example, standard lib functions like fstream ctors take char* strings -- so don't put those strings in _T() macros. Really, though, if you're using WinAPI, you shouldn't be using standard lib file I/O because the standard lib is not Unicode friendly.
-) Don't use standard lib C string functions like strcpy, strcat, sprintf, etc. These all work with char -- they don't work with wchar_t or TCHAR. Instead, use 'tstring' member functions and Windows specific TCHAR functions like _tcscpy, _tcscat, etc.
-) Never ever ever C style cast C strings from one type to another. C style casts mask very important compiler errors. Avoid C++ style casts also. Basically if you're getting type errors with your strings -- it's because you're doing something wrong. Don't try to cast around the problem.
-) Switch between ANSI builds and Unicode builds often to make sure that your program will compile in both. If that's too much of a hassle, make Unicode builds all the time and forget about ANSI builds.
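Putting the first few rules together might look like the sketch below. On Windows it uses the real TCHAR and _T() from tchar.h. Elsewhere it falls back to char-based stand-ins purely so the example builds. The function make_backup_name is a hypothetical helper, not a WinAPI call.

```cpp
#include <string>

#ifdef _WIN32
    #include <windows.h>
    #include <tchar.h>
#else
    // Fallback stand-ins so the sketch builds off Windows (ANSI flavour).
    typedef char TCHAR;
    #define _T(x) x
#endif

typedef std::basic_string<TCHAR> tstring;

// Appends ".bak" to a file name using tstring and _T() literals only,
// so the same source compiles in both ANSI and Unicode builds.
inline tstring make_backup_name(const tstring& filename)
{
    tstring result = filename;
    result += _T(".bak");
    return result;
}
```

Calling c_str() on the result gives a const TCHAR*, which is what functions taking a LPCTSTR expect.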
For other programs where you do a lot of text manipulation, it gets a little trickier....
-) Be careful when reading or writing text to a file. Don't use TCHAR for this, since its size is variable. Use char if you're reading 8-bit characters from a file, and wchar_t if reading 16-bit characters.
-) Ideally if text is going in an output file, you should use a Unicode encoding, such as UTF-8 or UTF-16. However, that is beyond the scope of this article (perhaps another day!).
-) If you need to use char or wchar_t directly (for instance the above situation), be very careful about how you move those strings to a TCHAR string. For plain ASCII text you can copy the string over 1 character at a time with your own string copy function. For anything beyond ASCII, WinAPI's MultiByteToWideChar and WideCharToMultiByte do a proper conversion between char and wchar_t strings, though neither of them targets TCHAR directly.
For example:
|
// this function copies a char C string to a TCHAR C string.
// it is only meaningful for plain ASCII text, and dst must be large
// enough to hold src plus the null terminator.
void ustrcpy(TCHAR* dst, const char* src)
{
    while(*src)
    {
        // go through unsigned char so bytes above 127 don't get sign-extended
        *dst = static_cast<TCHAR>(static_cast<unsigned char>(*src));
        ++dst;
        ++src;
    }
    *dst = 0;  // write the null terminator
}
//---------
// then, say you need to read a string from a file and put it in a text box with SetWindowText:
// (this part needs #include <fstream> and #include <iomanip>)
char str[500] = {0};  // note I'm using char because I specifically want 8-bit characters
std::ifstream myfile("myfile.txt");  // note no _T() macro because I'm dealing with std lib
    // ideally you'd open the file with WinAPI's CreateFile and read
    // that way because that is Unicode friendly. However I'm trying
    // to keep this example simple
myfile >> std::setw(500) >> str;  // read one word, never more than fits in str
TCHAR buffer[500];  // need to copy to a TCHAR buffer in order to give it to SetWindowText
ustrcpy( buffer, str );
// give it to WinAPI
SetWindowText( hMyTextBox, buffer );
|
A better approach would be to make template functions for ustrcpy and similar so you can convert to/from all sorts of different types and sizes:
|
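template <typename T, typename TT>
void ustrcpy( T* dst, const TT* src )
{
    // same loop as above; still only meaningful for plain ASCII text
    while(*src)
        *dst++ = static_cast<T>(*src++);
    *dst = 0;  // write the null terminator
}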
|
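Neither version above checks the size of the destination. A bounded variant could look like this sketch. The name ustrcpy_n and the choice to truncate silently are assumptions made for this example, and like the original it is only meaningful for plain ASCII text.

```cpp
#include <cstddef>

// Copies src into dst, writing at most dst_size elements including the
// null terminator. Truncates if src is too long. Returns the number of
// characters copied, not counting the terminator.
template <typename T, typename TT>
std::size_t ustrcpy_n(T* dst, std::size_t dst_size, const TT* src)
{
    if (dst_size == 0)
        return 0;               // no room even for the terminator

    std::size_t i = 0;
    while (i + 1 < dst_size && src[i])
    {
        dst[i] = static_cast<T>(src[i]);
        ++i;
    }
    dst[i] = 0;                 // always terminate
    return i;
}

// Convenience overload that picks up the size of a destination array.
template <typename T, std::size_t N, typename TT>
std::size_t ustrcpy_n(T (&dst)[N], const TT* src)
{
    return ustrcpy_n(dst, N, src);
}
```

With a destination declared as an array, the second overload deduces the size for you, so a call looks just like the original ustrcpy but can't run off the end of the buffer.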
Alternatively... you can avoid the TCHAR version of the WinAPI function and use the ANSI version directly. This lets Windows take care of the conversion:
|
char str[500] = {0};
myfile >> str;
// note here we specifically call SetWindowTextA, not SetWindowText.
// this is because we're giving a char string and not a TCHAR string.
SetWindowTextA( hMyTextBox, str );
|
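If you'd rather do the conversion yourself and keep the wide string around, something along these lines is one option. On Windows the sketch calls MultiByteToWideChar with the system ANSI code page (CP_ACP). Elsewhere it falls back to the standard mbstowcs so the example still builds. The function name widen is hypothetical, and the choice of CP_ACP is an assumption: CP_UTF8 would be the one to use if your text is UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <cstdlib>
#endif

// Converts a char string to a wchar_t string.
// Returns an empty string if the input is empty or the conversion fails.
inline std::wstring widen(const std::string& src)
{
    if (src.empty())
        return std::wstring();

#ifdef _WIN32
    // First call asks how many wide characters are needed.
    int len = MultiByteToWideChar(CP_ACP, 0, src.c_str(),
                                  static_cast<int>(src.size()), NULL, 0);
    if (len <= 0)
        return std::wstring();

    // Second call does the actual conversion into our buffer.
    std::vector<wchar_t> buf(len);
    MultiByteToWideChar(CP_ACP, 0, src.c_str(),
                        static_cast<int>(src.size()), &buf[0], len);
    return std::wstring(buf.begin(), buf.end());
#else
    // A wide string never needs more elements than the source has bytes.
    std::vector<wchar_t> buf(src.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], src.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    return std::wstring(&buf[0], n);
#endif
}
```

The result can then go straight to the W version of a function, for example SetWindowTextW( hMyTextBox, widen(str).c_str() ), regardless of whether the build is ANSI or Unicode.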
More to come?