MindView Inc.
[ Viewing Hints ] [ Revision History ] [ Book Home Page ] [ Free Newsletter ]
[ Seminars ] [ Seminars on CD ROM ] [ Consulting ]

Thinking in C++, 2nd edition, Volume 2
Revision 4.0

by Bruce Eckel & Chuck Allison
©2001 MindView, Inc.

[ Previous Chapter ] [ Short TOC ] [ Table of Contents ] [ Index ] [ Next Chapter ]

4: Strings in Depth

[8]One of the biggest time-wasters in C is character arrays: keeping track of the difference between static quoted strings and arrays created on the stack and the heap, and the fact that sometimes you’re passing around a char* and sometimes you must copy the whole array.

(This is the general problem of shallow copy vs. deep copy.) Especially because string manipulation is so common, character arrays are a great source of misunderstandings and bugs. [ Comment ]

Despite this, creating string classes remained a common exercise for beginning C++ programmers for many years. The Standard C++ library string class solves the problem of character array manipulation once and for all, keeping track of memory even during assignments and copy-constructions. You simply don’t need to think about it. [ Comment ]

This chapter examines the Standard C++ string class, beginning with a look at what constitutes a C++ string and how the C++ version differs from a traditional C character array. You’ll learn about operations and manipulations using string objects, and see how C++ strings accommodate variation in character sets and string data conversion. [ Comment ]

Handling text is perhaps one of the oldest of all programming applications, so it’s not surprising that the C++ string draws heavily on the ideas and terminology that have long been used for this purpose in C and other languages. As you begin to acquaint yourself with C++ strings this fact should be reassuring, in the respect that no matter what programming idiom you choose, there are really only about three things you can do with a string: create or modify the sequence of characters stored in the string, detect the presence or absence of elements within the string, and translate between various schemes for representing string characters. [ Comment ]

You’ll see how each of these jobs is accomplished using C++ string objects. [ Comment ]

What’s in a string

In C, a string is simply an array of characters that always includes a binary zero (often called the null terminator) as its final array element. There are two significant differences between C++ strings and their C progenitors. First, C++ string objects associate the array of characters which constitute the string with methods useful for managing and operating on it. A string also contains certain “housekeeping” information about the size and storage location of its data. Specifically, a C++ string object knows its starting location in memory, its content, its length in characters, and the length in characters to which it can grow before the string object must resize its internal data buffer. This gives rise to the second big difference between C char arrays and C++ strings. C++ strings do not include a null terminator, nor do the C++ string handling member functions rely on the existence of a null terminator to perform their jobs. C++ strings greatly reduce the likelihood of making three of the most common and destructive C programming errors: overwriting array bounds, trying to access arrays through uninitialized or incorrectly valued pointers, and leaving pointers “dangling” after an array ceases to occupy the storage that was once allocated to it. [ Comment ]

The exact implementation of memory layout for the string class is not defined by the C++ Standard. This architecture is intended to be flexible enough to allow differing implementations by compiler vendors, yet guarantee predictable behavior for users. In particular, the exact conditions under which storage is allocated to hold data for a string object are not defined. String allocation rules were formulated to allow but not require a reference-counted implementation, but whether or not the implementation uses reference counting, the semantics must be the same. To put this a bit differently, in C, every char array occupies a unique physical region of memory. In C++, individual string objects may or may not occupy unique physical regions of memory, but if reference counting is used to avoid storing duplicate copies of data, the individual objects must look and act as though they do exclusively own unique regions of storage. For example: [ Comment ]

//: C04:StringStorage.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
#include "../TestSuite/Test.h"
using namespace std;

class StringStorageTest : public Test {
public:
  void run() {
    string s1("12345");
    // Set the iterator indicate the first element
    string::iterator it = s1.begin();
    // This may copy the first to the second or 
    // use reference counting to simulate a copy 
    string s2 = s1;
    test_(s1 == s2);
    // Either way, this statement may ONLY modify first
    *it = '0';
    cout << "s1 = " << s1 << endl;
    cout << "s2 = " << s2 << endl;
    test_(s1 != s2);
  }
};

int main() {
  StringStorageTest sst;
  sst.setStream(&cout);
  sst.run();
  return sst.report();
} ///:~

Reference counting may serve to make an implementation more memory efficient, but it is transparent to users of the string class. [ Comment ]

Creating and initializing C++ strings

Creating and initializing strings is a straightforward proposition, and fairly flexible as well. In the example shown below, the first string, imBlank, is declared but contains no initial value. Unlike a C char array, which would contain a random and meaningless bit pattern until initialization, imBlank does contain meaningful information. This string object has been initialized to hold “no characters,” and can properly report its 0 length and absence of data elements through the use of class member functions. [ Comment ]

The next string, heyMom, is initialized by the literal argument "Where are my socks?". This form of initialization uses a quoted character array as a parameter to the string constructor. By contrast, standardReply is simply initialized with an assignment. The last string of the group, useThisOneAgain, is initialized using an existing C++ string object. Put another way, this example illustrates that string objects let you: [ Comment ]

//: C04:SmallString.cpp
//{L} ../TestSuite/Test
//{-msc} Execution error
#include <string>
using namespace std;

int main() {
  string imBlank;
  string heyMom("Where are my socks?");
  string standardReply = "Beamed into deep "
    "space on wide angle dispersion?";
  string useThisOneAgain(standardReply);
} ///:~

These are the simplest forms of string initialization, but there are other variations which offer more flexibility and control. You can : [ Comment ]

//: C04:SmallString2.cpp
//{L} ../TestSuite/Test
//{-msc} Execution error
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s1
    ("What is the sound of one clam napping?");
  string s2
    ("Anything worth doing is worth overdoing.");
  string s3("I saw Elvis in a UFO.");
  // Copy the first 8 chars
  string s4(s1, 0, 8);
  // Copy 6 chars from the middle of the source
  string s5(s2, 15, 6);
  // Copy from middle to end
  string s6(s3, 6, 15);
  // Copy all sorts of stuff
  string quoteMe = s4 + "that" +  
  // substr() copies 10 chars at element 20
  s1.substr(20, 10) + s5 +
  // substr() copies up to either 100 char
  // or eos starting at element 5 
  "with" + s3.substr(5, 100) +
  // OK to copy a single char this way 
  s1.substr(37, 1);
  cout << quoteMe << endl;
} ///:~

The string member function substr( ) takes a starting position as its first argument and the number of characters to select as the second argument. Both of these arguments have default values and if you say substr( ) with an empty argument list you produce a copy of the entire string, so this is a convenient way to duplicate a string. [ Comment ]

Here’s what the string quoteMe contains after the initialization shown above : [ Comment ]

"What is that one clam doing with Elvis in a UFO.?"

Notice the final line of example above. C++ allows string initialization techniques to be mixed in a single statement, a flexible and convenient feature. Also note that the last initializer copies just one character from the source string. [ Comment ]

Another slightly more subtle initialization technique involves the use of the string iterators string.begin( ) and string.end( ). This treats a string like a container object (which you’ve seen primarily in the form of vector so far in this book – you’ll see many more containers soon) which has iterators indicating the start and end of the “container.” This way you can hand a string constructor two iterators and it will copy from one to the other into the new string: [ Comment ]

//: C04:StringIterators.cpp
//{L} ../TestSuite/Test
//{-msc} Execution error
#include <string>
#include <iostream>
using namespace std;

int main() {
  string source("xxx");
  string s(source.begin(), source.end());
  cout << s << endl;
} ///:~

The iterators are not restricted to begin( ) and end( ), so you can choose a subset of characters from the source string. [ Comment ]

Initialization limitations

C++ strings may not be initialized with single characters or with ASCII or other integer values. [ Comment ]

//: C04:UhOh.cpp
//{L} ../TestSuite/Test
//{-msc} Execution error
#include <string>
using namespace std;

int main() {
  // Error: no single char inits
  //! string nothingDoing1('a');
  // Error: no integer inits
  //! string nothingDoing2(0x37);
} ///:~

This is true both for initialization by assignment and by copy constructor. [ Comment ]

Operating on strings

If you’ve programmed in C, you are accustomed to the convenience of a large family of functions for writing, searching, rearranging, and copying char arrays. However, there are two unfortunate aspects of the Standard C library functions for handling char arrays. First, there are three loosely organized families of them: the “plain” group, the group that manipulates the characters without respect to case, and the ones which require you to supply a count of the number of characters to be considered in the operation at hand. The roster of function names in the C char array handling library literally runs to several pages, and though the kind and number of arguments to the functions are somewhat consistent within each of the three groups, to use them properly you must be very attentive to details of function naming and parameter passing. [ Comment ]

The second inherent trap of the standard C char array tools is that they all rely explicitly on the assumption that the character array includes a null terminator. If by oversight or error the null is omitted or overwritten, there’s very little to keep the C char array handling functions from manipulating the memory beyond the limits of the allocated space, sometimes with disastrous results. [ Comment ]

C++ provides a vast improvement in the convenience and safety of string objects. For purposes of actual string handling operations, there are a modest two or three dozen member function names. It’s worth your while to become acquainted with these. Each function is overloaded, so you don’t have to learn a new string member function name simply because of small differences in their parameters. [ Comment ]

Appending, inserting and concatenating strings

One of the most valuable and convenient aspects of C++ strings is that they grow as needed, without intervention on the part of the programmer. Not only does this make string handling code inherently more trustworthy, it also almost entirely eliminates a tedious “housekeeping” chore – keeping track of the bounds of the storage in which your strings live. For example, if you create a string object and initialize it with a string of 50 copies of ‘X’, and later store in it 50 copies of “Zowie”, the object itself will reallocate sufficient storage to accommodate the growth of the data. Perhaps nowhere is this property more appreciated than when the strings manipulated in your code change in size, and you don’t know how big the change is. Appending, concatenating, and inserting strings often give rise to this circumstance, but the string member functions append( ) and insert( ) transparently reallocate storage when a string grows. [ Comment ]

//: C04:StrSize.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main() {
  string bigNews("I saw Elvis in a UFO. ");
  cout << bigNews << endl;
  // How much data have we actually got?
  cout << "Size = " << bigNews.size() << endl;
  // How much can we store without reallocating
  cout << "Capacity = " 
    << bigNews.capacity() << endl;
  // Insert this string in bigNews immediately
  // before bigNews[1]
  bigNews.insert(1, " thought I ");
  cout << bigNews << endl;
  cout << "Size = " << bigNews.size() << endl;
  cout << "Capacity = " 
    << bigNews.capacity() << endl;
  // Make sure that there will be this much space
  bigNews.reserve(500);
  // Add this to the end of the string
  bigNews.append("I've been working too hard.");
  cout << bigNews << endl;
  cout << "Size = " << bigNews.size() << endl;
  cout << "Capacity = " 
    << bigNews.capacity() << endl;
} ///:~

Here is the output: [ Comment ]

I saw Elvis in a UFO.
Size = 21
Capacity = 31
I thought I saw Elvis in a UFO.
Size = 32
Capacity = 63
I thought I saw Elvis in a UFO. I've been 
working too hard.
Size = 66
Capacity = 511

This example demonstrates that even though you can safely relinquish much of the responsibility for allocating and managing the memory your strings occupy, C++ strings provide you with several tools to monitor and manage their size. The size( ), resize( ), capacity( ), and reserve( ) member functions can be very useful when its necessary to work back and forth between data contained in C++ style strings and traditional null terminated C char arrays. Note the ease with which we changed the size of the storage allocated to the string. [ Comment ]

The exact fashion in which the string member functions will allocate space for your data is dependent on the implementation of the library. When one implementation was tested with the example above, it appeared that reallocations occurred on even word boundaries, with one byte held back. The architects of the string class have endeavored to make it possible to mix the use of C char arrays and C++ string objects, so it is likely that figures reported by StrSize.cpp for capacity reflect that in this particular implementation, a byte is set aside to easily accommodate the insertion of a null terminator. [ Comment ]

Replacing string characters

insert( ) is particularly nice because it absolves you of making sure the insertion of characters in a string won’t overrun the storage space or overwrite the characters immediately following the insertion point. Space grows and existing characters politely move over to accommodate the new elements. Sometimes, however, this might not be what you want to happen. If the data in string needs to retain the ordering of the original characters relative to one another or must be a specific constant size, use the replace( ) function to overwrite a particular sequence of characters with another group of characters. There are quite a number of overloaded versions of replace( ), but the simplest one takes three arguments: an integer telling where to start in the string, an integer telling how many characters to eliminate from the original string, and the replacement string (which can be a different number of characters than the eliminated quantity). Here’s a very simple example: [ Comment ]

//: C04:StringReplace.cpp
// Simple find-and-replace in strings
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s("A piece of text");
  string tag("$tag$");
  s.insert(8, tag + ' ');
  cout << s << endl;
  int start = s.find(tag);
  cout << "start = " << start << endl;
  cout << "size = " << tag.size() << endl;
  s.replace(start, tag.size(), "hello there");
  cout << s << endl;
} ///:~

The tag is first inserted into s (notice that the insert happens before the value indicating the insert point, and that an extra space was added after tag), then it is found and replaced. [ Comment ]

You should actually check to see if you’ve found anything before you perform a replace( ). The above example replaces with a char*, but there’s an overloaded version that replaces with a string. Here’s a more complete demonstration replace( )Comment ]

//: C04:Replace.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

void replaceChars(string& modifyMe, 
  string findMe, string newChars){
  // Look in modifyMe for the "find string"
  // starting at position 0
  int i = modifyMe.find(findMe, 0);
  // Did we find the string to replace?
  if(i != string::npos)
    // Replace the find string with newChars
    modifyMe.replace(i,newChars.size(),newChars);
}

int main() {
  string bigNews = 
   "I thought I saw Elvis in a UFO. "
   "I have been working too hard.";
  string replacement("wig");
  string findMe("UFO");
  // Find "UFO" in bigNews and overwrite it:
  replaceChars(bigNews, findMe,  replacement);
  cout << bigNews << endl;
} ///:~

Now the last line of output from replace.cpp looks like this: [ Comment ]

I thought I saw Elvis in a wig. I have been
working too hard.

If replace doesn’t find the search string, it returns npos. npos is a static constant member of the basic_string class. [ Comment ]

Unlike insert( ), replace( ) won’t grow the string’s storage space if you copy new characters into the middle of an existing series of array elements. However, it will grow the storage space if you make a “replacement” that writes beyond the end of an existing array. Here’s an example: [ Comment ]

//: C04:ReplaceAndGrow.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main() {
  string bigNews("I saw Elvis in a UFO. "
    "I have Been working too hard.");
  string replacement("wig");
  // The first arg says "replace chars 
  // beyond the end of the existing string":
  bigNews.replace(bigNews.size(), 
    replacement.size(), replacement);
  cout << bigNews << endl;
} ///:~

The call to replace( ) begins “replacing” beyond the end of the existing array. The output looks like this: [ Comment ]

I saw Elvis in a UFO. I have 
been working too hard.wig

Notice that replace( ) expands the array to accommodate the growth of the string due to “replacement” beyond the bounds of the existing array. [ Comment ]

Simple character replacement using the STL replace( ) algorithm

You may have been hunting through this chapter trying to do something relatively simple like replace all the instances of one character with a different character. Upon finding the above section on replacing, you thought you found the answer but then you started seeing groups of characters and counts and other things that looked a bit too complex. Doesn’t string have a way to just replace one character with another everywhere? [ Comment ]

The string class by itself doesn’t solve all possible problems. The remainder are relegated to the STL algorithms, because the string class can look just like an STL container (the STL algorithms work with anything that looks like an STL container). All the STL algorithms work on a “range” of elements within a container. Usually that range is just “from the beginning of the container to the end.” A string object looks like a container of characters: to get the beginning of the range you use string::begin( ) and to get the end of the range you use string::end( ). The following example shows the use of the STL replace( ) algorithm to replace all the instances of ‘X’ with ‘Y’: [ Comment ]

//: C04:StringCharReplace.cpp
//{L} ../TestSuite/Test
#include <string>
#include <algorithm>
#include <iostream>
using namespace std;

int main() {
  string s("aaaXaaaXXaaXXXaXXXXaaa");
  cout << s << endl;
  replace(s.begin(), s.end(), 'X', 'Y');
  cout << s << endl;
} ///:~

Notice that this replace( ) is not called as a member function of string. Also, unlike the string::replace( ) functions which only perform one replacement, the STL replace is replacing all instances of one character with another. [ Comment ]

The STL replace( ) algorithm only works with single objects (in this case, char objects), and will not perform replacements of quoted char arrays or of string objects. [ Comment ]

Since a string looks like an STL container, there are a number of other STL algorithms that can be applied to it, which may solve other problems you have that are not directly addressed by the string member functions. See Chapter XX for more information on the STL algorithms. [ Comment ]

Concatenation using non-member overloaded operators

One of the most delightful discoveries awaiting a C programmer learning about C++ string handling is how simply strings can be combined and appended using operator+ and operator+=. These operators make combining strings syntactically equivalent to adding numeric data. [ Comment ]

//: C04:AddStrings.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s1("This ");
  string s2("That ");
  string s3("The other ");
  // operator+ concatenates strings
  s1 = s1 + s2;
  cout << s1 << endl;
  // Another way to concatenates strings
  s1 += s3;
  cout << s1 << endl;
  // You can index the string on the right
  s1 += s3 + s3[4] + "oh lala";
  cout << s1 << endl;
} ///:~

The output looks like this: [ Comment ]

This
This That
This That The other
This That The other ooh lala

operator+ and operator+= are a very flexible and convenient means of combining string data. On the right hand side of the statement, you can use almost any type that evaluates to a group of one or more characters. [ Comment ]

Searching in strings

The find family of string member functions allows you to locate a character or group of characters within a given string. Here are the members of the find family and their general usage:

string find member function

What/how it finds

find( )

Searches a string for a specified character or group of characters and returns the starting position of the first occurrence found or npos if no match is found. (npos is a const of –1 and indicates that a search failed.)

find_first_of( )

Searches a target string and returns the position of the first match of any character in a specified group. If no match is found, it returns npos.

find_last_of( )

Searches a target string and returns the position of the last match of any character in a specified group. If no match is found, it returns npos.

find_first_not_of( )

Searches a target string and returns the position of the first element that doesn’t match any character in a specified group. If no such element is found, it returns npos.

find_last_not_of( )

Searches a target string and returns the position of the element with the largest subscript that doesn’t match of any character in a specified group. If no such element is found, it returns npos.

rfind( )

Searches a string from end to beginning for a specified character or group of characters and returns the starting position of the match if one is found. If no match is found, it returns npos.

String searching member functions and their general uses

The simplest use of find( ) searches for one or more characters in a string. This overloaded version of find( ) takes a parameter that specifies the character(s) for which to search, and optionally one that tells it where in the string to begin searching for the occurrence of a substring. (The default position at which to begin searching is 0.) By setting the call to find inside a loop, you can easily move through a string, repeating a search in order to find all of the occurrences of a given character or group of characters within the string. [ Comment ]

Notice that we define the string object sieveChars using a constructor idiom which sets the initial size of the character array and writes the value ‘P’ to each of its member. [ Comment ]

//: C04:Sieve.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main() {
  // Create a 50 char string and set each 
  // element to 'P' for Prime
  string sieveChars(50, 'P');
  // By definition neither 0 nor 1 is prime.
  // Change these elements to "N" for Not Prime
  sieveChars.replace(0, 2, "NN");
  // Walk through the array:
  for(int i = 2;  
    i <= (sieveChars.size() / 2) - 1; i++)
    // Find all the factors:
    for(int factor = 2;
      factor * i < sieveChars.size();factor++)
      sieveChars[factor * i] = 'N';
     
  cout << "Prime:" << endl;
  // Return the index of the first 'P' element:
  int j = sieveChars.find('P');
  // While not at the end of the string:
  while(j != sieveChars.npos) {
    // If the element is P, the index is a prime
    cout << j << " ";
    // Move past the last prime
    j++;
    // Find the next prime
    j = sieveChars.find('P', j);
  }
  cout << "\n Not prime:" << endl;
  // Find the first element value not equal P:
  j = sieveChars.find_first_not_of('P');
  while(j != sieveChars.npos) {
    cout << j << " ";
    j++;
    j = sieveChars.find_first_not_of('P', j);
  }
} ///:~

The output from Sieve.cpp looks like this: [ Comment ]

Prime:
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47
Not prime:
0 1 4 6 8 9 10 12 14 15 16 18 20 21 22 
24 25 26 27 28 30 32 33 34 35 36 38 39 
40 42 44 45 46 48 49

find( ) allows you to walk forward through a string, detecting multiple occurrences of a character or group of characters, while find_first_not_of( ) allows you to test for the absence of a character or group. [ Comment ]

The find member is also useful for detecting the occurrence of a sequence of characters in a string: [ Comment ]

//: C04:Find.cpp
// Find a group of characters in a string
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main() {
  string chooseOne("Eenie, meenie, miney, mo");
  int i = chooseOne.find("een");
  while(i != string::npos) {
    cout << i << endl;
    i++;
    i = chooseOne.find("een", i);
  }
} ///:~

Find.cpp produces a single line of output : [ Comment ]

 8 

This tells us that the first ‘e’ of the search group “een” was found in the word “meenie,” and is the eighth element in the string. Notice that find passed over the “Een” group of characters in the word “Eenie”. The find member function performs a case sensitive search. [ Comment ]

There are no functions in the string class to change the case of a string, but these functions can be easily created using the Standard C library functions toupper( ) and tolower( ), which change the case of one character at a time. A few small changes will make Find.cpp perform a case insensitive search: [ Comment ]

//: C04:NewFind.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

// Make an uppercase copy of s:
string upperCase(string& s) {
  char* buf = new char[s.length()];
  s.copy(buf, s.length());
  for(int i = 0; i < s.length(); i++)
    buf[i] = toupper(buf[i]);
  string r(buf, s.length());
  delete buf;
  return r;
}

// Make a lowercase copy of s:
string lowerCase(string& s) {
  char* buf = new char[s.length()];
  s.copy(buf, s.length());
  for(int i = 0; i < s.length(); i++)
    buf[i] = tolower(buf[i]);
  string r(buf, s.length());
  delete buf;
  return r;
}

int main() {
  string chooseOne("Eenie, meenie, miney, mo");
  cout << chooseOne << endl;
  cout << upperCase(chooseOne) << endl;
  cout << lowerCase(chooseOne) << endl;
  // Case sensitive search
  int i = chooseOne.find("een");
  while(i != string::npos) {
    cout << i << endl;
    i++;
    i = chooseOne.find("een", i);
  }
  // Search lowercase:
  string lcase = lowerCase(chooseOne);
  cout << lcase << endl;
  i = lcase.find("een");
  while(i != lcase.npos) {
    cout << i << endl;
    i++;
    i = lcase.find("een", i);
  }
  // Search uppercase:
  string ucase = upperCase(chooseOne);
  cout << ucase << endl;
  i = ucase.find("EEN");
  while(i != ucase.npos) {
    cout << i << endl;
    i++;
    i = ucase.find("EEN", i);
  }
} ///:~

Both the upperCase( ) and lowerCase( ) functions follow the same form: they allocate storage to hold the data in the argument string, copy the data and change the case. Then they create a new string with the new data, release the buffer and return the result string. The c_str( ) function cannot be used to produce a pointer to directly manipulate the data in the string because c_str( ) returns a pointer to const. That is, you’re not allowed to manipulate string data with a pointer, only with member functions. If you need to use the more primitive char array manipulation, you should use the technique shown above. [ Comment ]

The output looks like this: [ Comment ]

Eenie, meenie, miney, mo
EENIE, MEENIE, MINEY, MO
eenie, meenie, miney, mo
8
eenie, meenie, miney, mo
0
8
EENIE, MEENIE, MINEY, MO
0
8

The case insensitive searches found both occurrences on the “een” group. [ Comment ]

NewFind.cpp isn’t the best solution to the case sensitivity problem, so we’ll revisit it when we examine string comparisons. [ Comment ]

Finding in reverse

Sometimes it’s necessary to search through a string from end to beginning, if you need to find the data in “last in / first out “ order. The string member function rfind( ) handles this job. [ Comment ]

//: C04:Rparse.cpp
// Reverse the order of words in a string
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
#include <vector>
using namespace std;

int main() {
  // The ';' characters will be delimiters
  string s("now.;sense;make;to;going;is;This");
  cout << s << endl;
  // To store the words:
  vector<string> strings;
  // The last element of the string:
  int last = s.size();
  // The beginning of the current word:
  int current = s.rfind(';');
  // Walk backward through the string:
  while(current != string::npos){
    // Push each word into the vector.
    // Current is incremented before copying to 
    // avoid copying the delimiter:
    ++current;
    strings.push_back(
      s.substr(current, last - current));
    // Back over the delimiter we just found, 
    // and set last to the end of the next word:
    current -= 2;
    last = current;
    // Find the next delimiter
    current = s.rfind(';', current);
  }
  // Pick up the first word - it's not 
  // preceded by a delimiter
  strings.push_back(s.substr(0, last - current));
  // Print them in the new order:
  for(int j = 0; j < strings.size(); j++)
    cout << strings[j] << " ";
} ///:~

Here’s how the output from Rparse.cpp looks: [ Comment ]

now.;sense;make;to;going;is;This
This is going to make sense now.

rfind( ) backs through the string looking for tokens, reporting the array index of matching characters or string::npos if it is unsuccessful. [ Comment ]

Finding first/last of a set

The find_first_of( ) and find_last_of( ) member functions can be conveniently put to work to create a little utility that will strip whitespace characters off of both ends of a string. Notice it doesn’t touch the original string, but instead returns a new string: [ Comment ]

//: C04:trim.h
#ifndef TRIM_H
#define TRIM_H
#include <string>
// General tool to strip spaces from both ends:
inline std::string trim(const std::string& s) {
  if(s.length() == 0)
    return s;
  int b = s.find_first_not_of(" \t");
  int e = s.find_last_not_of(" \t");
  if(b == -1) // No non-spaces
    return "";
  return std::string(s, b, e - b + 1);
}
#endif // TRIM_H ///:~

The first test checks for an empty string; in that case no tests are made and a copy is returned. Notice that once the end points are found, the string constructor is used to build a new string from the old one, giving the starting count and the length. This form also utilizes the “return value optimization” (see the index for more details). [ Comment ]

Testing such a general-purpose tool needs to be thorough: [ Comment ]

//: C04:TrimTest.cpp
//{L} ../TestSuite/Test
#include "trim.h"
#include <iostream>
using namespace std;

string s[] = {
  " \t abcdefghijklmnop \t ",
  "abcdefghijklmnop \t ",
  " \t abcdefghijklmnop",
  "a", "ab", "abc", "a b c",
  " \t a b c \t ", " \t a \t b \t c \t ",
  "", // Must also test the empty string
};

void test(string s) {
  cout << "[" << trim(s) << "]" << endl;
}

int main() {
  for(int i = 0; i < sizeof s / sizeof *s; i++)
    test(s[i]);
} ///:~

In the array of string s, you can see that the character arrays are automatically converted to string objects. This array provides cases to check the removal of spaces and tabs from both ends, as well as ensuring that spaces and tabs do not get removed from the middle of a string. [ Comment ]

Removing characters from strings

My word processor/page layout program (Microsoft Word) will save a document in HTML, but it doesn’t recognize that the code listings in this book should be tagged with the HTML “preformatted” tag (<PRE>), and it puts paragraph marks (<P> and </P>) around every listing line. This means that all the indentation in the code listings is lost. In addition, Word saves HTML with reduced font sizes for body text, which makes it hard to read. [ Comment ]

To convert the book to HTML form[9], then, the original output must be reprocessed, watching for the tags that mark the start and end of code listings, inserting the <PRE> and </PRE> tags at the appropriate places, removing all the <P> and </P> tags within the listings, and adjusting the font sizes. Removal is accomplished with the erase( ) member function, but you must correctly determine the starting and ending points of the substring you wish to erase. Here’s the program that reprocesses the generated HTML file: [ Comment ]

//: C04:ReprocessHTML.cpp
// Take Word's html output and fix up 
// the code listings and html tags
#include "../require.h"
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Produce a new string which is the original
// string with the html paragraph break marks
// stripped off:
string stripPBreaks(string s) {
  int br;
  while((br = s.find("<P>")) != string::npos)
    s.erase(br, strlen("<P>"));
  while((br = s.find("</P>")) != string::npos)
    s.erase(br, strlen("</P>"));
  return s;
}

// After the beginning of a code listing is
// detected, this function cleans up the listing
// until the end marker is found. The first line
// of the listing is passed in by the caller, 
// which detects the start marker in the line.
void fixupCodeListing(istream& in, 
  ostream& out, string& line, int tag) {
  out << line.substr(0, tag)
    << "<PRE>" // Means "preformatted" in html
    << stripPBreaks(line.substr(tag)) << endl;
  string s;
  while(getline(in, s)) {
    int endtag = s.find("/""/""/"":~");
    if(endtag != string::npos) {
      endtag += strlen("/""/""/"":~");
      string before = s.substr(0, endtag);
      string after = s.substr(endtag);
      out << stripPBreaks(before) << "</PRE>" 
        << after << endl;
      return;
    }
    out << stripPBreaks(s) << endl;
  }
}

string removals[] = {
  "<FONT SIZE=2>",
  "<FONT SIZE=1>",
  "<FONT FACE=\"Times\" SIZE=1>",
  "<FONT FACE=\"Times\" SIZE=2>",
  "<FONT FACE=\"Courier\" SIZE=1>",
  "SIZE=1", // Eliminate all other '1' & '2' size
  "SIZE=2",
};
const int rmsz = 
  sizeof(removals)/sizeof(*removals);

int main(int argc, char* argv[]) {
  requireArgs(argc, 2);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  ofstream out(argv[2]);
  string line;
  while(getline(in, line)) {
    // The "Body" tag only appears once:
    if(line.find("<BODY") != string::npos) {
      out << "<BODY BGCOLOR=\"#FFFFFF\" "
      "TEXT=\"#000000\">" << endl;
      continue; // Get next line
    }
    // Eliminate each of the removals strings:
    for(int i = 0; i < rmsz; i++) {
      int find = line.find(removals[i]);
      if(find != string::npos)
        line.erase(find, removals[i].size());
    }
    int tag1 = line.find("/""/"":");
    int tag2 = line.find("/""*"":");
    if(tag1 != string::npos)
      fixupCodeListing(in, out, line, tag1);
    else if(tag2 != string::npos)
      fixupCodeListing(in, out, line, tag2);
    else
      out << line << endl;
  }
} ///:~

Notice the lines that detect the start and end listing tags by indicating them with each character in quotes. These tags are treated in a special way by the logic in the Extractcode.cpp tool for extracting code listings. To present the code for the tool in the text of the book, the tag sequence itself must not occur in the listing. This was accomplished by taking advantage of a C++ preprocessor feature that causes text strings delimited by adjacent pairs of double quotes to be merged into a single string during the preprocessor pass of the build. [ Comment ]

int tag1 = line.find("/""/"":");

The effect of the sequence of char arrays is to produce the starting tag for code listings. [ Comment ]

Stripping HTML tags

Sometimes it’s useful to take an HTML file and strip its tags so you have something approximating the text that would be displayed in the Web browser, only as an ASCII text file. The string class once again comes in handy. The following has some variation on the theme of the previous example: [ Comment ]

//: C04:HTMLStripper.cpp
// Filter to remove html tags and markers
#include "../require.h"
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

string replaceAll(string s, string f, string r) {
  unsigned int found = s.find(f);
  while(found != string::npos) {
    s.replace(found, f.length(), r);
    found = s.find(f);
  }
  return s;
}

string stripHTMLTags(string s) {
  while(true) {
    unsigned int left = s.find('<');
    unsigned int right = s.find('>');
    if(left==string::npos || right==string::npos)
      break;
    s = s.erase(left, right - left + 1);
  }
  s = replaceAll(s, "&lt;", "<");
  s = replaceAll(s, "&gt;", ">");
  s = replaceAll(s, "&amp;", "&");
  s = replaceAll(s, "&nbsp;", " ");
  // Etc...
  return s;
}

int main(int argc, char* argv[]) {
  requireArgs(argc, 1, 
    "usage: HTMLStripper InputFile");
  ifstream in(argv[1]);
  assure(in, argv[1]);
  const int sz = 4096;
  char buf[sz];
  while(in.getline(buf, sz)) {
    string s(buf);
    cout << stripHTMLTags(s) << endl;
  }
} ///:~

The string class can replace one string with another but there’s no facility for replacing all the strings of one type with another, so the replaceAll( ) function does this for you, inside a while loop that keeps finding the next instance of the find string f. That function is used inside stripHTMLTags after it uses erase( ) to remove everything that appears inside angle braces (‘<‘ and ‘>‘). Note that I probably haven’t gotten all the necessary replacement values, but you can see what to do (you might even put all the find-replace pairs in a table...). In main( ) the arguments are checked, and the file is read and converted. It is sent to standard output so you must redirect it with ‘>‘ if you want to write it to a file. [ Comment ]

Comparing strings

Comparing strings is inherently different than comparing numbers. Numbers have constant, universally meaningful values. To evaluate the relationship between the magnitude of two strings, you must make a lexical comparison. Lexical comparison means that when you test a character to see if it is “greater than” or “less than” another character, you are actually comparing the numeric representation of those characters as specified in the collating sequence of the character set being used. Most often, this will be the ASCII collating sequence, which assigns the printable characters for the English language numbers in the range from 32 to 127 decimal. In the ASCII collating sequence, the first “character” in the list is the space, followed by several common punctuation marks, and then uppercase and lowercase letters. With respect to the alphabet, this means that the letters nearer the front have lower ASCII values than those nearer the end. With these details in mind, it becomes easier to remember that when a lexical comparison that reports s1 is “greater than” s2, it simply means that when the two were compared, the first differing character in s1 came later in the alphabet than the character in that same position in s2. [ Comment ]

C++ provides several ways to compare strings, and each has their advantages. The simplest to use are the non member overloaded operator functions operator ==, operator != operator >, operator <, operator >=, and operator <=. Comment ]

//: C04:CompStr.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main() {
  // Strings to compare
  string s1("This ");
  string s2("That ");
  for(int i = 0; i< s1.size() &&
    i < s2.size(); i++)
    // See if the string elements are the same:
    if(s1[i] == s2[i])
      cout << s1[i] << "  " << i << endl;
  // Use the string inequality operators
  if(s1 != s2) { 
    cout << "Strings aren't the same:" << " ";
    if(s1 > s2)
      cout << "s1 is > s2" << endl;
    else
      cout << "s2 is > s1" << endl;
  }
} ///:~

Here’s the output from CompStr.cpp: [ Comment ]

T 0
h 1
  4
Strings aren’t the same: s1 is > s2 

The overloaded comparison operators are useful for comparing both full strings and individual string elements. [ Comment ]

Notice in the code fragment below the flexibility of argument types on both the left and right hand side of the comparison operators. The overloaded operator set allows the direct comparison of string objects, quoted literals, and pointers to C style strings. [ Comment ]

// The lvalue is a quoted literal and 
// the rvalue is a string
if("That " == s2)
  cout << "A match" << endl;
// The lvalue is a string and the rvalue is a
// pointer to a c style null terminated string
if(s1 != s2.c_str())
  cout << "No match" << endl;

You won’t find the logical not (!) or the logical comparison operators (&& and ||) among operators for string. (Neither will you find overloaded versions of the bitwise C operators &, |, ^, or ~.) The overloaded non member comparison operators for the string class are limited to the subset which has clear, unambiguous application to single characters or groups of characters. [ Comment ]

The compare( ) member function offers you a great deal more sophisticated and precise comparison than the non member operator set, because it returns a lexical comparison value, and provides for comparisons that consider subsets of the string data. It provides overloaded versions that allow you to compare two complete strings, part of either string to a complete string, and subsets of two strings. This example compares complete strings: [ Comment ]

//: C04:Compare.cpp
// Demonstrates compare(), swap()
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main() {
  string first("This");
  string second("That");
  // Which is lexically greater?
  switch(first.compare(second)) {
    case 0: // The same
      cout << first << " and " << second <<
        " are lexically equal" << endl;
      break;
    case -1: // Less than
      first.swap(second);
      // Fall through this case...
    case 1: // Greater than
      cout << first <<
        " is lexically greater than " <<
        second << endl;
  }
} ///:~

The output from Compare.cpp looks like this: [ Comment ]

This is lexically greater than That

To compare a subset of the characters in one or both strings, you add arguments that define where to start the comparison and how many characters to consider. For example, we can use the overloaded version of compare( ): [ Comment ]

s1.compare(s1StartPos, s1NumberChars, s2, s2StartPos, s2NumberChars);Comment ]

If we substitute the above version of compare( ) in the previous program so that it only looks at the first two characters of each string, the program becomes: [ Comment ]

//: C04:Compare2.cpp
// Overloaded compare()
//{L} ../TestSuite/Test
//{-g++295}
#include <string>
#include <iostream>
using namespace std;

int main() {
  string first("This");
  string second("That");
  // Compare first two characters of each string:
  switch(first.compare(0, 2, second, 0, 2)) {
    case 0: // The same
      cout << first << " and " << second <<
        " are lexically equal" << endl;
      break;
    case -1: // Less than
      first.swap(second);
      // Fall through this case...
    case 1: // Greater than
      cout << first <<
        " is lexically greater than " <<
        second << endl;
  }
} ///:~

The output is: [ Comment ]

This and That are lexically equal

which is true, for the first two characters of “This” and “That.” [ Comment ]

Indexing with [ ] vs. at( )

In the examples so far, we have used C style array indexing syntax to refer to an individual character in a string. C++ strings provide an alternative to the s[n] notation: the at( ) member. These two idioms produce the same result in C++ if all goes well: [ Comment ]

//: C04:StringIndexing.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;
int main(){
  string s("1234");
  cout << s[1] << " ";
  cout << s.at(1) << endl;
} ///:~

The output from this code looks like this: [ Comment ]

2 2

However, there is one important difference between [ ] and at( ). When you try to reference an array element that is out of bounds, at( ) will do you the kindness of throwing an exception, while ordinary [ ] subscripting syntax will leave you to your own devices: [ Comment ]

//: C04:BadStringIndexing.cpp
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

int main(){
  string s("1234");
  // Runtime problem: goes beyond array bounds:
  //! s[5];
  // Saves you by throwing an exception:
  try {
    s.at(5);
  } catch(...) {
    cerr << "Went beyond array bounds!" << endl;
  }
} ///:~

Using at( ) in place of [ ] will give you a chance to gracefully recover from references to array elements that don’t exist. at( ) throws an object of class out_of_range. By catching this object in an exception handler, you can take appropriate remedial actions such as recalculating the offending subscript or growing the array. (You can read more about Exception Handling in Chapter XX) [ Comment ]

Using iterators

In the example program NewFind.cpp, we used a lot of messy and rather tedious C char array handling code to change the case of the characters in a string and then search for the occurrence of matches to a substring. Sometimes the “quick and dirty” method is justifiable, but in general, you won’t want to sacrifice the advantages of having your string data safely and securely encapsulated in the C++ object where it lives. [ Comment ]

Here is a better, safer way to handle case insensitive comparison of two C++ string objects. Because no data is copied out of the objects and into C style strings, you don’t have to use pointers and you don’t have to risk overwriting the bounds of an ordinary character array. In this example, we use the string iterator. Iterators are themselves objects which move through a collection or container of other objects, selecting them one at a time, but never providing direct access to the implementation of the container. Iterators are not pointers, but they are useful for many of the same jobs. [ Comment ]

//: C04:CmpIter.cpp
// Find a group of characters in a string
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;

// Case insensitive compare function:
int 
stringCmpi(const string& s1, const string& s2) {
  // Select the first element of each string:
  string::const_iterator 
    p1 = s1.begin(), p2 = s2.begin();
  // Don’t run past the end:
  while(p1 != s1.end() && p2 != s2.end()) {
    // Compare upper-cased chars:
    if(toupper(*p1) != toupper(*p2))
      // Report which was lexically  greater:
      return (toupper(*p1)<toupper(*p2))? -1 : 1;
    p1++;
    p2++;
  }
  // If they match up to the detected eos, say 
  // which was longer. Return 0 if the same.
  return(s2.size() - s1.size());
}

int main() {
  string s1("Mozart");
  string s2("Modigliani");
  cout << stringCmpi(s1, s2) << endl;
} ///:~

Notice that the iterators p1 and p2 use the same syntax as C pointers – the ‘*’ operator makes the value of element at the location given by the iterators available to the toupper( ) function. toupper( ) doesn’t actually change the content of the element in the string. In fact, it can’t. This definition of p1 tells us that we can only use the elements p1 points to as constants. [ Comment ]

string::const_iterator p1 = s1.begin();

The way toupper( ) and the iterators are used in this example is called a case preserving case insensitive comparison. This means that the string didn’t have to be copied or rewritten to accommodate case insensitive comparison. Both of the strings retain their original data, unmodified. [ Comment ]

Iterating in reverse

Just as the standard C pointer gives us the increment (++) and decrement (--) operators to make pointer arithmetic a bit more convenient, C++ string iterators come in two basic varieties. You’ve seen end( ) and begin( ), which are the tools for moving forward through a string one element at a time. The reverse iterators rend( ) and rbegin( ) allow you to step backwards through a string. Here’s how they work: [ Comment ]

//: C04:RevStr.cpp
// Print a string in reverse
//{L} ../TestSuite/Test
#include <string>
#include <iostream>
using namespace std;
int main() {
  string s("987654321");
  // Use this iterator to walk backwards:
  string::reverse_iterator rev;
  // "Incrementing" the reverse iterator moves 
  // it to successively lower string elements:
  for(rev = s.rbegin(); rev != s.rend(); rev++)
    cout << *rev << " ";
} ///:~

The output from RevStr.cpp looks like this: [ Comment ]

1 2 3 4 5 6 7 8 9

Reverse iterators act like pointers to elements of the string’s character array, except that when you apply the increment operator to them, they move backward rather than forward. rbegin( ) and rend( ) supply string locations that are consistent with this behavior, to wit, rbegin( ) locates the position just beyond the end of the string, and rend( ) locates the beginning. Aside from this, the main thing to remember about reverse iterators is that they aren’t type equivalent to ordinary iterators. For example, if a member function parameter list includes an iterator as an argument, you can’t substitute a reverse iterator to get the function to perform it’s job walking backward through the string. Here’s an illustration: [ Comment ]

// The compiler won’t accept this
string sBackwards(s.rbegin(), s.rend());

The string constructor won’t accept reverse iterators in place of forward iterators in its parameter list. This is also true of string members such as copy( ), insert( ), and assign( ).Comment ]

Strings and character traits

We seem to have worked our way around the margins of case insensitive string comparisons using C++ string objects, so maybe it’s time to ask the obvious question: “Why isn’t case-insensitive comparison part of the standard string class ?” The answer provides interesting background on the true nature of C++ string objects. [ Comment ]

Consider what it means for a character to have “case.” Written Hebrew, Farsi, and Kanji don’t use the concept of upper and lower case, so for those languages this idea has no meaning at all. This the first impediment to built-in C++ support for case-insensitive character search and comparison: the idea of case sensitivity is not universal, and therefore not portable. [ Comment ]

It would seem that if there were a way of designating that some languages were “all uppercase” or “all lowercase” we could design a generalized solution. However, some languages which employ the concept of “case” also change the meaning of particular characters with diacritical marks: the cedilla in Spanish, the circumflex in French, and the umlaut in German. For this reason, any case-sensitive collating scheme that attempts to be comprehensive will be nightmarishly complex to use. [ Comment ]

Although we usually treat the C++ string as a class, this is really not the case. string is a typedef of a more general constituent, the basic_string< > template. Observe how string is declared in the standard C++ header file: [ Comment ]

typedef basic_string<char> string;

To really understand the nature of strings, it’s helpful to delve a bit deeper and look at the template on which it is based. Here’s the declaration of the basic_string< > template: [ Comment ]

template<class charT,
  class traits = char_traits<charT>,
  class allocator = allocator<charT> >
  class basic_string;

Earlier in this book, templates were examined in a great deal of detail. The main thing to notice about the two declarations above are that the string type is created when the basic_string template is instantiated with char. Inside the basic_string< > template declaration, the line [ Comment ]

class traits = char_traits<charT>,

tells us that the behavior of the class made from the basic_string< > template is specified by a class based on the template char_traits< >. Thus, the basic_string< > template provides for cases where you need string oriented classes that manipulate types other than char (wide characters or unicode, for example). To do this, the char_traits< > template controls the content and collating behaviors of a variety of character sets using the character comparison functions eq( ) (equal), ne( ) (not equal), and lt( ) (less than) upon which the basic_string< > string comparison functions rely. [ Comment ]

This is why the string class doesn’t include case insensitive member functions: That’s not in its job description. To change the way the string class treats character comparison, you must supply a different char_traits< > template, because that defines the behavior of the individual character comparison member functions. [ Comment ]

This information can be used to make a new type of string class that ignores case. First, we’ll define a new case insensitive char_traits< > template that inherits the existing one. Next, we’ll override only the members we need to change in order to make character-by-character comparison case insensitive. (In addition to the three lexical character comparison members mentioned above, we’ll also have to supply new implementation of find( ) and compare( ).) Finally, we’ll typedef a new class based on basic_string, but using the case insensitive ichar_traits template for its second argument. [ Comment ]

//: C04:ichar_traits.h
// Creating your own character traits
#ifndef ICHAR_TRAITS_H
#define ICHAR_TRAITS_H
#include <string>
#include <cctype>

struct ichar_traits : std::char_traits<char> {
  // We'll only change character by 
  // character comparison functions
  static bool eq(char c1st, char c2nd) {
    return 
      std::toupper(c1st) == std::toupper(c2nd);
  }
  static bool ne(char c1st, char c2nd) {
    return 
      std::toupper(c1st) != std::toupper(c2nd);
  }
  static bool lt(char c1st, char c2nd) {
    return 
      std::toupper(c1st) < std::toupper(c2nd);
  }
  static int compare(const char* str1, 
    const char* str2, size_t n) {
    for(int i = 0; i < n; i++) {
      if(std::tolower(*str1)>std::tolower(*str2))
        return 1;
      if(std::tolower(*str1)<std::tolower(*str2))
        return -1;
      if(*str1 == 0 || *str2 == 0)
        return 0;
      str1++; str2++; // Compare the other chars
    }
    return 0;
  }
  static const char* find(const char* s1, 
    int  n, char c) {
    while(n-- > 0 &&  
      std::toupper(*s1) != std::toupper(c))
      s1++;
    return s1;
  }
};
#endif // ICHAR_TRAITS_H  ///:~

If we typedef an istring class like this: [ Comment ]

typedef basic_string<char, ichar_traits, 
  allocator<char> > istring;

Then this istring will act like an ordinary string in every way, except that it will make all comparisons without respect to case. Here’s an example: [ Comment ]

[ Something’s wrong with this, so it’s currently left out of the compile by removing the ‘:’] [ Comment ]

// C04:ICompare.cpp
//{L} ../TestSuite/Test
#include "ichar_traits.h"
#include <string>
#include <iostream>
using namespace std;

typedef basic_string<char, ichar_traits, 
  allocator<char> > istring;

int main() {
  // The same letters except for case:
  istring first = "tHis";
  istring second = "ThIS";
  cout << first.compare(second) << endl;
} ///:~

The output from the program is “0”, indicating that the strings compare as equal. This is just a simple example – in order to make istring fully equivalent to string, we’d have to create the other functions necessary to support the new istring type. [ Comment ]

A string application

My friend Daniel (who designed the cover and page layout for this book) does a lot of work with Web pages. One tool he uses creates a “site map” consisting of a Java applet to display the map and an HTML tag that invoked the applet and provided it with the necessary data to create the map. Daniel wanted to use this data to create an ordinary HTML page (sans applet) that would contain regular links as the site map. The resulting program turns out to be a nice practical application of the string class, so it is presented here. [ Comment ]

The input is an HTML file that contains the usual stuff along with an applet tag with a parameter that begins like this: [ Comment ]

<param name="source_file" value="

The rest of the line contains encoded information about the site map, all combined into a single line (it’s rather long, but fortunately string objects don’t care). Each entry may or may not begin with a number of ‘#’ signs; each of these indicates one level of depth. If no ‘#’ sign is present the entry will be considered to be at level one. After the ‘#’ is the text to be displayed on the page, followed by a ‘%’ and the URL to use as the link. Each entry is terminated by a ‘*’. Thus, a single entry in the line might look like this: [ Comment ]

###|Useful Art%./Build/useful_art.html*

The ‘|’ at the beginning is an artifact that needs to be removed. [ Comment ]

My solution was to create an Item class whose constructor would take input text and create an object that contains the text to be displayed, the URL and the level. The objects essentially parse themselves, and at that point you can read any value you want. In main( ), the input file is opened and read until the line contains the parameter that we’re interested in. Everything but the site map codes are stripped away from this string, and then it is parsed into Item objects: [ Comment ]

//: C04:SiteMapConvert.cpp
// Using strings to create a custom conversion
// program that generates HTML output
#include "../require.h"
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
using namespace std;

class Item {
  string id, url;
  int depth;
  string removeBar(string s) {
    if(s[0] == '|')
      return s.substr(1);
    else return s;
  }
public:
  Item(string in, int& index) : depth(0) {
    while(in[index] == '#' && index < in.size()){
      depth++;
      index++;
    }
    // 0 means no '#' marks were found:
    if(depth == 0) depth = 1;
    while(in[index] != '%' && index < in.size())
      id += in[index++];
    id = removeBar(id);
    index++; // Move past '%'
    while(in[index] != '*' && index < in.size())
      url += in[index++];
    url = removeBar(url);
    index++; // To move past '*'
  }
  string identifier() { return id; }
  string path() { return url; }
  int level() { return depth; }
};

int main(int argc, char* argv[]) {
  requireArgs(argc, 1,
    "usage: SiteMapConvert inputfilename");
  ifstream in(argv[1]);
  assure(in, argv[1]);
  ofstream out("plainmap.html");
  string line;
  while(getline(in, line)) {
    if(line.find("<param name=\"source_file\"")
       != string::npos) {
      // Extract data from start of sequence
      // until the terminating quote mark:
      line = line.substr(line.find("value=\"") 
             + string("value=\"").size());
      line = line.substr(0, 
               line.find_last_of("\""));
      int index = 0;
      while(index < line.size()) {
        Item item(line, index);
        string startLevel, endLevel;
        if(item.level() == 1) {
          startLevel = "<h1>";
          endLevel = "</h1>";
        } else
          for(int i = 0; i < item.level(); i++)
            for(int j = 0; j < 5; j++)
              out << "&nbsp;";
        string htmlLine = "<a href=\""
          + item.path() + "\">"
          + item.identifier() + "</a><br>";
        out << startLevel << htmlLine 
            << endLevel << endl;
      }
      break; // Out of while loop
    }
  }
} ///:~

Item contains a private member function removeBar( ) that is used internally to strip off the leading bars, if they appear. [ Comment ]

The constructor for Item initializes depth to 0 to indicate that no ‘#’ signs were found yet; if none are found then it is assumed the Item should be displayed at level one. Each character in the string is examined using operator[ ] to find the depth, id and url values. The other member functions simply return these values. [ Comment ]

After opening the files, main( ) uses string::find( ) to locate the line containing the site map data. At this point, substr( ) is used in order to strip off the information before and after the site map data. The subsequent while loop performs the parsing, but notice that the value index is passed by reference into the Item constructor, and that constructor increments index as it parses each new Item, thus moving forward in the sequence. [ Comment ]

If an Item is at level one, then an HTML h1 tag is used, otherwise the elements are indented using HTML non-breaking spaces. Note in the initialization of htmlLine how easy it is to construct a string – you can just combine quoted character arrays and other string objects using operator+. [ Comment ]

When the output is written to the destination file, startLevel and endLevel will only produce results if they have been given any value other than their default initialization values. [ Comment ]

Summary

C++ string objects provide developers with a number of great advantages over their C counterparts. For the most part, the string class makes referring to strings through the use of character pointers unnecessary. This eliminates an entire class of software defects that arise from the use of uninitialized and incorrectly valued pointers. C++ strings dynamically and transparently grow their internal data storage space to accommodate increases in the size of the string data. This means that when the data in a string grows beyond the limits of the memory initially allocated to it, the string object will make the memory management calls that take space from and return space to the heap. Consistent allocation schemes prevent memory leaks and have the potential to be much more efficient than “roll your own” memory management. [ Comment ]

The string class member functions provide a fairly comprehensive set of tools for creating, modifying, and searching in strings. string comparisons are always case sensitive, but you can work around this by copying string data to C style null terminated strings and using case insensitive string comparison functions, temporarily converting the data held in sting objects to a single case, or by creating a case insensitive string class which overrides the character traits used to create the basic_string object. [ Comment ]

Exercises

  1. A palindrome is a word or group of words that read the same forward and backward. For example “madam” or “wow”. Write a program that takes a string argument from the command line and returns TRUE if the string was a palindrome. [ Comment ]
  2. Sometimes the input from a file stream contains a two character sequence to represent a newline. These two characters (0x0a 0x0d) produce extra blank lines when the stream is printed to standard out. Write a program that finds the character 0x0d (ASCII carriage return) and deletes it from the string. [ Comment ]
  3. Write a program that reverses the order of the characters in a string. [ Comment ]

[8] Much of the material in this chapter was originally created by Nancy Nicolaisen

[9] I subsequently found better tools to accomplish this task, but the program is still interesting.

[ Previous Chapter ] [ Short TOC ] [ Table of Contents ] [ Index ] [ Next Chapter ]
Last Update:08/19/2001