Need a regular expression to extract only letters and whitespace from a string

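I'm building a small utility method that parses a line (a string) and returns a vector of all its words. The istringstream code below works fine except when the line contains punctuation, so my fix is to "sanitize" the line before running it through the while loop.

I would appreciate some help using the regex library in C++. My initial solution was to use substr() and go to town, but that seems complicated, because I would have to iterate over the string, test each character to see what it is, and then perform some operation.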

vector<string> lineParser(Line * ln) 
{ 
    vector<string> result; 
    string word; 
    string line = ln->getLine(); 
    istringstream iss(line); 
    while(iss) 
    { 
     iss >> word; 
     result.push_back(word); 
    } 
    return result; 
}


2011-04-04 Pete

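You should specify _which_ regex library for C++ you are using. There is no regex in the STL. Are you using this one: http://www.boost.org/doc/libs/1_46_1/libs/regex/doc/html/index.html? – 2011-04-04 14:35:43

I'm using #include <regex> in Visual Studio 2010. I didn't install anything special and assumed it was part of the STL. If that's not the case, I wasn't aware of it. – Pete 2011-04-04 14:39:33

<regex> is new in C++11. C++11 was voted out only a few weeks ago and has only just become official. VC++ 2010 supports this new C++11 feature. – 2011-04-04 14:44:57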

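[^A-Za-z\s] should do what you need, if you replace the characters it matches with nothing. It removes every character that is not a letter or whitespace. Or use [^A-Za-z0-9\s] if you want to keep digits.

You can test your pattern with an online tool like this one: http://gskinner.com/RegExr/ (use the Replace tab). Depending on the regex library you use, the pattern may need some modification.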
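As a rough sketch of how that pattern could be applied with the C++11 <regex> header mentioned in the comments (the helper name stripNonLetters is made up for illustration, it is not from the question):

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name is an assumption): removes every character
// that is neither an ASCII letter nor whitespace.
std::string stripNonLetters(const std::string& line)
{
    // The backslash is doubled so that the regex engine receives \s.
    static const std::regex unwanted("[^A-Za-z\\s]");
    // Replace each match with the empty string, i.e. delete it.
    return std::regex_replace(line, unwanted, "");
}
```

Calling stripNonLetters(line) before constructing the istringstream should leave only words separated by whitespace. Swap in "[^A-Za-z0-9\\s]" if digits should survive.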


2011-04-04 14:59:04 Valkea

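I ended up using a variation of this regex ([^A-Za-z0-9\\s\\']), which also keeps the apostrophe in words like "I've" and "didn't". The link was very helpful too. – Pete 2011-04-12 05:20:12

With '[^A-Za-z\s]', the compiler warns me 'unknown escape sequence: '\s' [enabled by default]'. Escaping the backslash twice fixed the problem. – 2015-08-21 15:43:56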
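One way to sidestep the double escaping discussed in the comments above is a C++11 raw string literal. This is only a sketch: it assumes a compiler that supports raw string literals (VC++ 2010 does not), and the helper name keepWordChars is invented for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name is an assumption): keeps ASCII letters, digits,
// whitespace, and apostrophes, and deletes everything else.
std::string keepWordChars(const std::string& line)
{
    // Inside R"(...)" a backslash is not an escape character for the
    // compiler, so \s reaches the regex engine unchanged.
    static const std::regex unwanted(R"([^A-Za-z0-9\s'])");
    return std::regex_replace(line, unwanted, "");
}
```

The pattern is the same as the escaped variant "[^A-Za-z0-9\\s']"; only the spelling in the source file differs.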

I can't be sure yet, but I think this is what you're looking for:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

int
main()
{
    std::string line("some words: with some punctuation.");
    // Matches each run of letters, digits, and underscores.
    std::regex words("[\\w]+");
    // Iterates over the matches themselves, one token per word.
    std::sregex_token_iterator i(line.begin(), line.end(), words);
    std::vector<std::string> list(i, std::sregex_token_iterator());
    for (auto j = list.begin(), e = list.end(); j != e; ++j)
        std::cout << *j << '\n';
}

some 
words 
with 
some 
punctuation


2011-04-04 14:57:41

You don't need a regex just to deal with punctuation:

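// Replace all punctuation with space character.
// The lambda takes unsigned char because passing a negative char
// value to std::ispunct is undefined behaviour.
std::replace_if(line.begin(), line.end(),
       [](unsigned char c) { return std::ispunct(c) != 0; },
       ' '
       );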

Or, if you want everything except letters and digits turned into spaces:

// Replace every non-alphanumeric character with a space.
std::replace_if(line.begin(), line.end(),
       [](unsigned char c) { return !std::isalnum(c); },
       ' '
       );

While we're here:
Your while loop is broken and pushes the last value into the vector twice.

It should be:

while(iss) 
{ 
    iss >> word; 
    if (iss)     // If the read of a word failed, iss is in a bad state.
    {
        result.push_back(word); // Only push_back() if the state is not bad.
    }
}

Or the more common version:

while(iss >> word) // Loop is only entered if the read of the word worked. 
{ 
    result.push_back(word); 
}

Or you can use the STL:

std::copy(std::istream_iterator<std::string>(iss), 
      std::istream_iterator<std::string>(), 
      std::back_inserter(result) 
     );
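Putting the pieces of this answer together, a corrected lineParser might look roughly like the sketch below. It assumes the line is passed in as a std::string rather than through the question's Line* (that class isn't shown), and it uses a lambda as the predicate.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Assumption: takes the text directly instead of the question's Line*.
std::vector<std::string> lineParser(std::string line)
{
    // Turn every non-alphanumeric character into a space.
    std::replace_if(line.begin(), line.end(),
                    [](unsigned char c) { return !std::isalnum(c); },
                    ' ');

    std::vector<std::string> result;
    std::istringstream iss(line);
    std::string word;
    // The loop body runs only when a word was actually read,
    // so the last word is not pushed twice.
    while (iss >> word)
        result.push_back(word);
    return result;
}
```

Note that this variant splits "I've" into "I" and "ve", since the apostrophe is not alphanumeric; the predicate would need an extra check for '\'' to keep such words intact.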


2011-04-04 15:15:13

By the way, the 'ispunct' approach is 10 times faster than 'boost::regex' on my machine – 2015-08-21 16:49:58

That's because ispunct is just an array lookup. – 2015-08-21 17:08:13

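The simplest solution is probably to create a filtering streambuf that converts all non-alphanumeric characters to spaces, then read the words through std::istream_iterator (with std::copy or, as here, the vector's range constructor):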

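#include <ctype.h>   // ::isalnum
#include <cstdio>    // EOF
#include <istream>
#include <iterator>
#include <streambuf>
#include <string>
#include <vector>

class StripPunct : public std::streambuf 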
{ 
    std::streambuf* mySource; 
    char   myBuffer; 

protected: 
    virtual int underflow() 
    { 
     int result = mySource->sbumpc(); 
     if (result != EOF) { 
      if (!::isalnum(result)) 
       result = ' '; 
      myBuffer = result; 
      setg(&myBuffer, &myBuffer, &myBuffer + 1); 
     } 
     return result; 
    } 

public: 
    explicit StripPunct(std::streambuf* source) 
     : mySource(source) 
    { 
    } 
}; 

std::vector<std::string> 
LineParser(std::istream& source) 
{ 
    StripPunct    sb(source.rdbuf()); 
    std::istream    src(&sb); 
    return std::vector<std::string>(
     (std::istream_iterator<std::string>(src)), 
     (std::istream_iterator<std::string>())); 
}


2011-04-04 16:30:20
