Search HTML lines and remove lines that do not start with </form></td><a

I have an HTML file with very badly formatted code that I get from a website, and I want to extract a few very small pieces of information from it.

I am only interested in lines that start like this:

</form></td><td><a href="http://www.mysite.com/users/user897" class="username"> <b>user897</b></a></td></tr><tr><td>HouseA</td><td>2</td><td class="entriesTableRow-gamename">HouseA Type12 <span class="entriesTableRow-moredetails"></span></td><td>1 of 2</td><td>user123</td><td>10</td><td>

and I want to extract four fields:

    A: HouseA
    B: HouseA Type12
    C: user123
    D: 10

I know I've seen people recommend HTML Agility Pack and libxml2, but I really don't think I need all that. My app is in C/C++.

I am already using getline to read the file line by line; I am just not sure what the best way to proceed is. Thanks!

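    #include <cstdio>
    #include <cstring>
    #include <fstream>
    #include <string>

    int main()
    {
        // Wanted lines start with this exact prefix.
        const char* prefix = "</form></td><td>";
        std::ifstream data("Home.html");
        std::string line;
        int linenum = 0;  // 1-based number of the line just read
        while (std::getline(data, line))
        {
            linenum++;
            if (strncmp(line.c_str(), prefix, strlen(prefix)) == 0)
            {
                printf("found a wanted line in line:%d\n", linenum);
            }
        }
        return 0;
    }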

Source

2011-02-17 emge

Have you tried parsing your HTML with regular expressions? :-p – 2011-02-17 22:48:43

Which libraries can you use in addition to the C++ stdlib? What platform are you targeting? – Macke 2011-02-17 22:51:14

In the general case, an XML/HTML parser is likely the best way here, as it will be robust against differing input. (Whatever you do, don't start with regexps!)

Update

However, if you are targeting one specific input, as it seems you are, you can scan it manually with sscanf (as you suggested), cin.read(), or a regexp.
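For illustration only, here is a minimal sketch of such a manual scan using nothing but std::string. The names Entry, textRuns and parseEntry are made up for this example, and the field positions (1, 3, 5, 6) are an assumption read off the single sample line in the question. It also assumes no attribute value contains a '>' character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for the four values the question asks for.
struct Entry {
    std::string a, b, c, d;
};

// Collect the non-blank text runs that sit between tags, trimmed.
// Assumption: no '>' appears inside an attribute value.
std::vector<std::string> textRuns(const std::string& line) {
    std::vector<std::string> runs;
    std::size_t pos = 0;
    while (pos < line.size()) {
        if (line[pos] == '<') {
            std::size_t close = line.find('>', pos);
            if (close == std::string::npos) break;  // unterminated tag
            pos = close + 1;                        // skip the whole tag
            continue;
        }
        std::size_t next = line.find('<', pos);
        if (next == std::string::npos) next = line.size();
        std::string text = line.substr(pos, next - pos);
        std::size_t first = text.find_first_not_of(" \t\r\n");
        if (first != std::string::npos) {           // ignore blank runs
            std::size_t last = text.find_last_not_of(" \t\r\n");
            runs.push_back(text.substr(first, last - first + 1));
        }
        pos = next;
    }
    return runs;
}

// Returns true and fills 'out' if 'line' looks like a wanted line.
// Assumption: the text runs appear in the same order as in the sample
// (user, A, count, B, "1 of 2", C, D).
bool parseEntry(const std::string& line, Entry& out) {
    const std::string prefix = "</form></td><td>";
    if (line.compare(0, prefix.size(), prefix) != 0) return false;
    std::vector<std::string> runs = textRuns(line);
    if (runs.size() < 7) return false;
    out.a = runs[1];
    out.b = runs[3];
    out.c = runs[5];
    out.d = runs[6];
    return true;
}
```

Inside the getline loop from the question you would call parseEntry(line, entry) and print the fields whenever it returns true.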

Just be aware that such code can break whenever the HTML changes, even if only the whitespace changes.
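To make that fragility concrete, here is a small sketch with two made-up helpers, startsWithStrict and startsWithLenient. The strict one behaves like the strncmp check in the question and stops matching as soon as the site indents the line. The lenient one skips leading blanks first, which covers that one case but still nothing else (extra spaces inside the tags, for instance).

```cpp
#include <cstddef>
#include <string>

// Strict test, equivalent to the strncmp check in the question.
bool startsWithStrict(const std::string& line, const std::string& prefix) {
    return line.compare(0, prefix.size(), prefix) == 0;
}

// Slightly more tolerant variant: skip leading spaces and tabs first.
bool startsWithLenient(const std::string& line, const std::string& prefix) {
    std::size_t first = line.find_first_not_of(" \t");
    if (first == std::string::npos) return false;  // empty or blank line
    return line.compare(first, prefix.size(), prefix) == 0;
}
```

Neither helper makes the scan robust; they only show how little it takes to break it.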

So my/our recommendation is to use the proper tool for the job. XML/HTML is not raw text and should not be treated as such.

How about writing a Python script instead? :)

Source

2011-02-17 22:50:43 Macke
