Professional C++ - Marc Gregoire [246]
Chapter 14
Using Strings and Regular Expressions
WHAT’S IN THIS CHAPTER?
The differences between C-style strings and C++ strings
How you can localize your applications to reach a worldwide audience
How to use regular expressions to do powerful pattern matching
Every program that you write will use strings of some kind. With the old C language there is not much choice but to use a dumb null-terminated character array to represent an ASCII string. Unfortunately, doing so can cause a lot of problems, such as buffer overflows, which can result in security vulnerabilities. The C++ STL includes a safe and easy-to-use string class that does not have these disadvantages.
The first section of this chapter discusses strings in more detail. It starts with a discussion of the old C-style strings, explains their disadvantages, and ends with the C++ string class. It also mentions raw string literals, which are new in C++11.
The second section discusses localization, which is increasingly important because it lets you write software that can be adapted to different regions around the world.
The last section introduces the new C++11 regular expressions library, which makes it easy to perform pattern matching on strings. Regular expressions allow you to search for sub-strings matching a given pattern, and also to validate, parse, and transform strings. They are powerful, and it’s recommended that you use them instead of manually writing your own string processing code.
DYNAMIC STRINGS
Strings in languages that have supported them as first-class objects tend to have a number of attractive features, such as being able to expand to any size, or have sub-strings extracted or replaced. In other languages, such as C, strings were almost an afterthought; there was no really good “string” data type, just fixed arrays of bytes. The “string library” was nothing more than a collection of rather primitive functions without even bounds checking. C++ provides a string type as a first-class data type, and the strings are implemented using templates and operator overloading.
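As a rough illustration of the features just described, the following sketch uses the standard C++ string class to grow a string on demand and to replace part of it. The helper names buildGreeting() and replaceFirst() are hypothetical and exist only for this example; they are not part of the standard library or the book's code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): shows that a std::string
// expands automatically as text is appended with operator+=.
std::string buildGreeting(const std::string& name)
{
    std::string result = "Hello";
    result += ", ";   // the string resizes itself as needed
    result += name;
    result += "!";
    return result;
}

// Hypothetical helper (illustration only): replaces the first occurrence
// of 'from' in 'text' with 'to'. The text is returned unchanged if
// 'from' does not occur.
std::string replaceFirst(std::string text, const std::string& from,
                         const std::string& to)
{
    std::string::size_type pos = text.find(from);
    if (pos != std::string::npos) {
        text.replace(pos, from.length(), to);
    }
    return text;
}
```

With these helpers, buildGreeting("World") yields "Hello, World!", calling substr(7, 5) on that result extracts "World", and replaceFirst() can swap "World" for another word without any manual memory management.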
C-Style Strings
In the C language, a string is represented as an array of characters. The last character of a string is a null character ('\0') so that code operating on the string can determine where it ends. This null character is officially known as NUL, spelled with one L, not two. NUL is not the same as the NULL pointer. Even though C++ provides a better string abstraction, it is important to understand the C technique for strings because C-style strings still arise in C++ programming. One of the most common situations is where a C++ program has to call a C-based interface in some third-party library or as part of interfacing to the operating system.
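The following sketch shows how code that receives only a character pointer relies on the terminating '\0' to find the end of the string, and how a C++ string can be handed to such an interface. The function countChars() is a hypothetical stand-in for a C-based library routine, and countCharsOf() is a hypothetical wrapper; neither comes from a real library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a C-based interface (illustration only).
// It receives just a pointer, so it scans for '\0' to find the end.
std::size_t countChars(const char* str)
{
    std::size_t count = 0;
    while (str[count] != '\0') {
        ++count;
    }
    return count;
}

// Hypothetical C++ wrapper (illustration only): c_str() provides a
// null-terminated view of the string's contents for the C-style call.
std::size_t countCharsOf(const std::string& text)
{
    return countChars(text.c_str());
}
```

If the terminating '\0' were missing, the loop in countChars() would run past the end of the array, which is why every C-style string must include it.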
By far, the most common mistake that programmers make with C strings is forgetting to allocate space for the '\0' character. For example, the string "hello" appears to be five characters long, but six characters' worth of space is needed in memory to store the value, as shown in Figure 14-1.
FIGURE 14-1
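The difference between the visible length and the required storage can be checked directly. In this minimal sketch, storageNeeded() is a hypothetical helper introduced only for illustration; strlen() and sizeof are standard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (illustration only): number of bytes needed to
// hold a copy of a C-style string, including the terminating '\0'.
std::size_t storageNeeded(const char* str)
{
    return std::strlen(str) + 1;   // strlen() does not count the '\0'
}
```

For the literal "hello", strlen() reports 5, while sizeof("hello") and storageNeeded("hello") both report 6, because the literal is an array of six characters including the '\0'.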
C++ contains several functions from the C language that operate on strings. As a general rule of thumb, these functions do not handle memory allocation. For example, the strcpy() function takes two strings as parameters. It copies the second string onto the first, whether it fits or not. The following code attempts to build a wrapper around strcpy() that allocates the correct amount of memory and returns the result, instead of taking in an already allocated string. It uses the strlen() function to obtain the length of the string.
char* copyString(const char* inString)
{
    char* result = new char[strlen(inString)]; // BUG! Off by one!
    strcpy(result, inString);
    return result;
}
Code snippet from CStrings\strcpy.cpp
The copyString() function as written