
//: C03:Trim.h

#ifndef TRIM_H

#define TRIM_H

#include <string>

// General tool to strip spaces from both ends:

inline std::string trim(const std::string& s) {

  if(s.length() == 0)

    return s;

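  // Use size_type, not int: npos may not survive conversion to int

  std::string::size_type beg = s.find_first_not_of(" \a\b\f\n\r\t\v");

  std::string::size_type end = s.find_last_not_of(" \a\b\f\n\r\t\v");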

  if(beg == std::string::npos) // No non-spaces

    return "";

  return std::string(s, beg, end - beg + 1);

}

#endif // TRIM_H ///:~

The first test checks for an empty string; in that case, no tests are made, and a copy is returned. Notice that once the end points are found, the string constructor builds a new string from the old one, giving the starting count and the length.
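To make the end points concrete, here is a minimal sketch that exposes the two indices trim( ) computes along with the resulting substring. The TrimBounds struct and the trimBounds( ) function are hypothetical names introduced only for this illustration; they are not part of Trim.h.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Trim.h): records the two end points
// that trim() searches for, plus the substring between them.
struct TrimBounds {
  std::string::size_type beg;  // index of first non-whitespace character
  std::string::size_type end;  // index of last non-whitespace character
  std::string text;            // characters from beg through end
};

inline TrimBounds trimBounds(const std::string& s) {
  const char* ws = " \a\b\f\n\r\t\v";
  TrimBounds r;
  r.beg = s.find_first_not_of(ws);
  r.end = s.find_last_not_of(ws);
  if(r.beg == std::string::npos)  // empty or all whitespace
    return r;                     // text stays empty
  assert(r.beg <= r.end);
  // Constructor arguments: source string, starting index, character count.
  r.text = std::string(s, r.beg, r.end - r.beg + 1);
  return r;
}
```

For the input " \t abc \t ", the first non-whitespace character is at index 3 and the last at index 5, so the constructor receives a starting index of 3 and a count of 5 - 3 + 1 = 3.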

Testing such a general-purpose tool needs to be thorough:

//: C03:TrimTest.cpp

//{L} ../TestSuite/Test

#include <iostream>

#include "Trim.h"

#include "../TestSuite/Test.h"

using namespace std;

string s[] = {

  " \t abcdefghijklmnop \t ",

  "abcdefghijklmnop \t ",

  " \t abcdefghijklmnop",

  "a", "ab", "abc", "a b c",

  " \t a b c \t ", " \t a \t b \t c \t ",

  "\t \n \r \v \f",

  "" // Must also test the empty string

};

class TrimTest : public TestSuite::Test {

public:

  void testTrim() {

    test_(trim(s[0]) == "abcdefghijklmnop");

    test_(trim(s[1]) == "abcdefghijklmnop");

    test_(trim(s[2]) == "abcdefghijklmnop");

    test_(trim(s[3]) == "a");

    test_(trim(s[4]) == "ab");

    test_(trim(s[5]) == "abc");

    test_(trim(s[6]) == "a b c");

    test_(trim(s[7]) == "a b c");

    test_(trim(s[8]) == "a \t b \t c");

    test_(trim(s[9]) == "");

    test_(trim(s[10]) == "");

  }

  void run() {

    testTrim();

  }

};

int main() {

  TrimTest t;

  t.run();

  return t.report();

} ///:~

In the array of strings, you can see that the character arrays are automatically converted to string objects. This array provides cases to check the removal of spaces and tabs from both ends, as well as ensuring that spaces and tabs are not removed from the middle of a string.

Removing characters from strings

Removing characters is easy and efficient with the erase( ) member function, which takes two arguments: where to start removing characters (which defaults to 0), and how many to remove (which defaults to string::npos). If you specify more characters than remain in the string, the remaining characters are all erased anyway (so calling erase( ) without any arguments removes all characters from a string). Sometimes it’s useful to take an HTML file and strip its tags and special characters so that you have something approximating the text that would be displayed in the Web browser, only as a plain text file. The following uses erase( ) to do the job:

//: C03:HTMLStripper.cpp

//{L} ReplaceAll

// Filter to remove html tags and markers

#include <cassert>

#include <cmath>

#include <cstddef>

#include <fstream>

#include <iostream>

#include <string>

#include "../require.h"

using namespace std;

string& replaceAll(string& context, const string& from,

  const string& to);

string& stripHTMLTags(string& s) {

  static bool inTag = false;

  bool done = false;

  while (!done) {

    if (inTag) {

      // The previous line started an HTML tag

      // but didn't finish. Must search for '>'.

      size_t rightPos = s.find('>');

      if (rightPos != string::npos) {

        inTag = false;

        s.erase(0, rightPos + 1);

      }

      else {

        done = true;

        s.erase();

      }

    }

    else {

      // Look for start of tag:

      size_t leftPos = s.find('<');

      if (leftPos != string::npos) {

        // See if tag close is in this line

        size_t rightPos = s.find('>', leftPos + 1);

        if (rightPos == string::npos) {

          inTag = done = true;

          s.erase(leftPos);

        }

        else

          s.erase(leftPos, rightPos - leftPos + 1);

      }

      else

        done = true;

    }

  }

  // Remove all special HTML characters

  replaceAll(s, "&lt;", "<");

  replaceAll(s, "&gt;", ">");

  replaceAll(s, "&amp;", "&");

  replaceAll(s, "&nbsp;", " ");

  // Etc...

  return s;

}

int main(int argc, char* argv[]) {

  requireArgs(argc, 1,

    "usage: HTMLStripper InputFile");

  ifstream in(argv[1]);

  assure(in, argv[1]);

  string s;

  while(getline(in, s))

    if (!stripHTMLTags(s).empty())

      cout << s << endl;

} ///:~

This example will even strip HTML tags that span multiple lines.[32] This is accomplished with the static flag, inTag, which is true whenever the start of a tag is found, but the accompanying tag end is not found in the same line. All forms of erase( ) appear in the stripHTMLTags( ) function.[33] The version of getline( ) we use here is a global function declared in the <string> header and is handy because it stores an arbitrarily long line in its string argument. You don’t have to worry about the dimension of a character array as you do with istream::getline( ). Notice that this program uses the replaceAll( ) function from earlier in this chapter. In the next chapter, we’ll use string streams to create a more elegant solution.
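The three forms of erase( ) used by the stripper can be seen in isolation in the following sketch. The helper names (eraseRange, eraseFrom, eraseAll, nposPlusOneWraps) are hypothetical and exist only for this illustration. The last helper shows why arithmetic on string::npos is risky: because npos is the largest value of an unsigned type, adding 1 wraps around to 0.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers, named only for this illustration.

// erase(pos, n): remove n characters starting at pos. If n exceeds
// what remains, everything from pos to the end is removed.
inline std::string eraseRange(std::string s,
    std::string::size_type pos, std::string::size_type n) {
  assert(pos <= s.size());  // erase() throws std::out_of_range otherwise
  s.erase(pos, n);
  return s;
}

// erase(pos): the count defaults to npos, so the tail is removed.
inline std::string eraseFrom(std::string s,
    std::string::size_type pos) {
  assert(pos <= s.size());
  s.erase(pos);
  return s;
}

// erase(): pos defaults to 0 and the count to npos; the string is emptied.
inline std::string eraseAll(std::string s) {
  s.erase();
  return s;
}

// npos is the largest value of an unsigned type, so npos + 1 wraps to 0.
inline bool nposPlusOneWraps() {
  return static_cast<std::string::size_type>(std::string::npos + 1) == 0;
}
```

For example, eraseRange("<b>bold</b>", 0, 3) yields "bold</b>", eraseFrom("text <a hr", 5) yields "text ", and eraseAll( ) always yields an empty string. Because a "not found" result plus 1 silently becomes 0, the stripper tests each find( ) result against npos before using it in a calculation.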

Comparing strings

Comparing strings is inherently different from comparing numbers. Numbers have constant, universally meaningful values. To evaluate the relationship between the magnitudes of two strings, you must make a lexical comparison. Lexical comparison means that when you test a character to see if it is "greater than" or "less than" another character, you are actually comparing the numeric representation of those characters as specified in the collating sequence of the character set being used. Most often this will be the ASCII collating sequence, which assigns the printable characters for the English language numbers in the range 32 through 126 decimal. In the ASCII collating sequence, the first "character" in the list is the space, followed by several common punctuation marks, the digits, and then the uppercase letters and the lowercase letters, in that order. With respect to the alphabet, this means that the letters nearer the front have lower ASCII values than those nearer the end. With these details in mind, it becomes easier to remember that when a lexical comparison reports that s1 is "greater than" s2, it simply means that when the two were compared, the first differing character in s1 came later in the collating sequence than the character in that same position in s2.
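The idea of "the first differing character decides" can be sketched as follows. Both function names, firstDifference and lexicalOrder, are hypothetical and introduced only for this illustration. The sketch assumes an ASCII-compatible character set, where 'Z' (90) precedes 'a' (97).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: index of the first position where a and b
// differ, or npos if one string is a prefix of the other (or equal).
inline std::string::size_type firstDifference(const std::string& a,
    const std::string& b) {
  std::string::size_type n = a.size() < b.size() ? a.size() : b.size();
  for(std::string::size_type i = 0; i < n; ++i)
    if(a[i] != b[i])
      return i;
  return std::string::npos;
}

// Hypothetical helper: normalizes the result of string::compare()
// to -1 (a sorts first), 0 (equal), or 1 (a sorts last).
inline int lexicalOrder(const std::string& a, const std::string& b) {
  int r = a.compare(b);
  assert((r < 0) == (a < b));  // operator< agrees with compare()
  return (r > 0) - (r < 0);
}
```

With these helpers, "apple" sorts before "apricot" because the strings first differ at index 2, where 'p' precedes 'r'. "Zebra" sorts before "apple" because every uppercase letter precedes every lowercase letter in ASCII. When one string is a prefix of the other, as with "abc" and "abcd", the shorter string sorts first.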

[32] To keep the exposition simple, this version does not handle nested tags, such as comments.

[33] It is tempting to use mathematics here to factor out some of these calls to erase( ), but since in some cases one of the operands is string::npos (the largest unsigned integer available), integer overflow occurs and wrecks the algorithm.