2017-10-17 47 views

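I have a tab-separated (TSV) file produced by a program, but the program puts several pieces of information into a single cell, separated by the pipe symbol. I want to iterate through the tabs, and then through the pipes, using C++.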

XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818 
XP_017347145.1 GO:0003677|GO:0004003|GO:0005524 
XP_017347145.1 GO:0005524 
XP_017347145.1 GO:0004003|GO:0016818 
XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818 
XP_017350967.1 GO:0005515 

I want to convert it into just two columns, as shown below, but it seems I don't understand how to use the getline() function in C++.

I don't have much experience, but the output should look like this:

XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017347145.1 = GO:0003677 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0016818 
XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017350967.1 = GO:0005515 

My current C++ code fails: in some places it leaves out the equals sign and returns a tab instead.

#include <fstream> 
#include <iostream> 
#include <sstream> 
#include <string> 

int main() { 

    using namespace std; 
    string stringIn; 
    string stringOut; 
    string value; 
    string value2; 

    cout << "Input the name of the file: " << endl; 
    getline(cin, stringIn); 
    cout << "The output file name is " << endl; 
    getline(cin, stringOut); 

    ifstream inputFile(stringIn); 
    ofstream outputFile(stringOut); 

    // Let the user know if the file exists 
    if (!inputFile) { 
     cout << "Cannot open input file" << endl; 
    } 

    if (!outputFile) { 
     cout << "Can not save output file" << endl; 
    } 

    // It should iterate through the values using column 
    // and column2 delimited by the pipe sign. 
    // For example, GO:0005524|GO:0008026 and this could be of unknown length. 
    while (getline(inputFile,value,'\t')) { 
     while (getline(inputFile,value2,'|')) { 
      outputFile << value + " = " + value2 << endl; 
     } 
    } 

    outputFile.close(); 
    inputFile.close(); 
    cin.get(); 

    return 0; 
} 

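My current code returns the output below for the data shown above. Any suggestions would be greatly appreciated.

XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017347145.1 GO:0003677
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0005524
XP_017347145.1 GO:0005524
XP_017347145.1 GO:0004003
XP_017347145.1 = GO:0016818
XP_017347145.1 GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017350967.1 GO:0005515

What is the problem? –

Answers

1

The problem is that getline(inputFile, value2, '|') does not stop at the end of a line. The last term on a line is followed by a newline rather than a pipe, so the call captures the following:

GO:0016818\nXP_017347145.1\tGO:0003677
          ^
          |
          |
   newline captured

It then prints the whole next record without the equals sign, because that record is part of the previously captured value2.

It would be better to call getline(inputFile, line) once per line, with the default \n newline delimiter. Then create a std::stringstream ss{line} from line, and finally run getline(ss, value2, '|') on it.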


By the way, I played around with regular expressions, and I think the following may be a more elegant and general solution:

#include <iostream> 
#include <regex> 
#include <sstream> 
#include <string> 
#include <algorithm> 
#include <vector> 

std::stringstream input{R"(XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818 
XP_017347145.1 GO:0003677|GO:0004003|GO:0005524 
XP_017347145.1 GO:0005524 
XP_017347145.1 GO:0004003|GO:0016818 
XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818 
XP_017350967.1 GO:0005515)"}; 

struct Record{ 
    std::string xp; 
    std::string go; 
}; 

std::ostream& operator<<(std::ostream& os, const Record& r) 
{ 
    return os << "XP_" << r.xp << " = GO:" << r.go << '\n'; 
} 

int main() 
{ 
    std::vector<Record> records; 
    // Compile both patterns once, outside the loop.
    const std::regex xp_r{R"(^XP_(\d*\.\d))"}; // capture the digits after "XP_"
    const std::regex go_r{R"(GO:(\d*)\|?)"}; // capture the digits after each "GO:"
    for(std::string line; getline(input, line);) {
     std::smatch m;
     if(std::regex_search(line, m, xp_r)){
      auto xp = m[1].str();
      // Visit every GO term on the line and store one Record per term.
      auto begin = std::sregex_iterator{line.begin(), line.end(), go_r};
      auto end = std::sregex_iterator{};
      std::for_each(begin, end, [&records, &xp](const auto& i){records.emplace_back(Record{xp, i[1].str()}); });
     }
    }
    for(const auto& i : records) 
     std::cout << i; 
} 

Output:

XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017347145.1 = GO:0003677 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0004003 
XP_017347145.1 = GO:0016818 
XP_017347145.1 = GO:0003676 
XP_017347145.1 = GO:0005524 
XP_017347145.1 = GO:0006139 
XP_017347145.1 = GO:0008026 
XP_017347145.1 = GO:0016818 
XP_017350967.1 = GO:0005515 

Thanks for your help – user1238097

2

You can solve the problem by using sregex_token_iterator, like this:

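    std::regex re("\\s+|\\|"); // split on runs of whitespace or on a pipe
    std::sregex_token_iterator reg_end;
    while (getline(inputFile, value)) {
     // -1 selects the text between the matches, i.e. the fields themselves
     std::sregex_token_iterator it(value.begin(), value.end(), re, -1);
     std::string p1 = (it++)->str(); // first field: the XP identifier
     for (; it != reg_end; ++it) {
      outputFile << p1 << " = " << it->str() << endl;
     }
    }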

A question: the regular expression "\s" should match whitespace, correct? But what does the extra "\" mean? – user1238097