Find all instances of a substring in a string

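In my last question I asked about parsing links out of an HTML page. Since I haven't found a solution yet, I thought I'd try something else in the meantime: search for every <a href= and copy everything until I hit </a>.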

Now, my C is a bit rusty. I remember that I can use strstr() to get the first instance of that string, but how do I get the rest?

Any help is appreciated.

PS: No, this is not school homework or anything like that. Just so you know.

2011-03-02 Mr Aleph

Bad, bad idea, doomed to fail. What happens when you hit a '' tag? Use an XML parser. – meagar 2011-03-02 15:24:23

Thanks. I know it's a bad idea, but I haven't found an XML parser that isn't uber complicated and that has a good example of how to do this. If you know of one (plus example code), please do send it my way. – 2011-03-02 15:35:50

You can use a loop:

char *ptr = haystack;
size_t nlen = strlen(needle);   // needle must not be empty, or ptr never advances

while (ptr != NULL) {
    ptr = strstr(ptr, needle);
    if (ptr != NULL) {
        // do whatever with ptr
        ptr += nlen; // hat tip to @larsman
    }
}


2011-03-02 15:22:55 chrisaycock

Loops infinitely if 'needle' is found at least once. You have to move past the match in every iteration. Also, you have to check for 'NULL' *after* 'strstr'. – 2011-03-02 15:25:12

@larsman Ah, thanks. Corrected. – chrisaycock 2011-03-02 15:30:15

Given the OP's pattern, I'd do 'ptr += strlen(needle)' (or better, 'size_t nlen = strlen(needle)' before the loop). – 2011-03-02 15:31:13

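A C string is just a pointer to its first character; to get the next match, simply call strstr() again, passing it a pointer to the end of the previous match.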
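
As a rough illustration of that idea, here is a minimal sketch that records the offset of every non-overlapping match. The helper name find_all and its out/max parameters are made up for this example; only strlen() and strstr() come from the standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores the offset of each non-overlapping match of
 * needle in haystack into out[] (at most max entries) and returns the
 * total number of matches found. */
size_t find_all(const char *haystack, const char *needle,
                size_t *out, size_t max)
{
    size_t nlen = strlen(needle);
    size_t count = 0;

    if (nlen == 0)
        return 0;   /* an empty needle would never advance the search */

    /* Each new search starts just past the end of the previous match. */
    for (const char *p = strstr(haystack, needle); p != NULL;
         p = strstr(p + nlen, needle)) {
        if (count < max)
            out[count] = (size_t)(p - haystack);
        count++;
    }
    return count;
}
```

Called as find_all(page, "<a href=", offsets, 16), it fills offsets with the position of each match in page. To find overlapping matches as well, advance by 1 instead of nlen.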


2011-03-02 15:21:11 Arkku

Why not use libxml, which has a very good HTML parser built in?


2011-03-02 15:22:37

I'm trying not to use external libs, especially if they are GPL, but I did already check that lib. However, I cannot find a good example of how to do this. If you have a good example of how to parse links out of an HTML page using libxml, I am willing to use it. Thanks – 2011-03-02 15:34:47

Here are examples: http://xmlsoft.org/tutorial/index.html What I would do personally is use libxml's XPath, because it is the easiest way to get an array of ALL <a> elements in the document with one query. I am a bit rusty on XPath, but I think the query was simply "/a" or something like that, to find all <a> elements in the document. I would consider all the strstr examples as 19th century. This is not how things should be done nowadays. – Gnudiff 2011-03-02 15:37:54

@Mr Aleph: If you don't want GPL, try [Apache Xerces](http://xerces.apache.org/). – chrisaycock 2011-03-02 15:40:47

Here is what I would do (untested, just my idea):

const char *hRef_start = "<a href=";
const char *hRef_end = "</a>";

Assuming your text is

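void copy_link(char *link, const char *first, const char *last);

char text[1000];    /* assumed to already hold the NUL-terminated HTML */
char *first = strstr(text, hRef_start);
if (first)
{
    char *last = strstr(first, hRef_end);
    if (last)
    {
        /* last points at "</a>"; copy [first, last) plus a terminator */
        char *link = malloc((size_t)(last - first) + 1);
        if (link)
        {
            copy_link(link, first, last);
            /* use link here */
            free(link);
        }
    }
    else
    {
        /* Error here: no closing tag found. */
    }
}

/* Copies the characters in [first, last) to link and NUL-terminates it. */
void copy_link(char *link, const char *first, const char *last)
{
    while (first < last)
    {
        *link = *first;
        ++link;     /* the destination has to advance as well */
        ++first;
    }
    *link = 0;
}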

You should check that malloc() succeeded, make sure you free() the result, and verify that none of the arguments to copy_link() is null.
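
To get every link rather than just the first, the same idea can be wrapped in a function that remembers where it stopped. This is only a sketch under the question's assumptions (each link starts with exactly <a href= and ends with </a>); the name next_link and the cursor parameter are invented for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc()ed copy of the next span that
 * starts at "<a href=" and stops just before "</a>", searching from
 * *cursor, and advances *cursor past the closing tag. Returns NULL when
 * no complete span remains or malloc() fails. The caller frees the result. */
char *next_link(const char **cursor)
{
    static const char open_tag[] = "<a href=";
    static const char close_tag[] = "</a>";

    const char *first = strstr(*cursor, open_tag);
    if (first == NULL)
        return NULL;

    const char *last = strstr(first, close_tag);
    if (last == NULL)
        return NULL;                    /* opening tag without a close */

    size_t len = (size_t)(last - first);
    char *link = malloc(len + 1);
    if (link == NULL)
        return NULL;

    memcpy(link, first, len);
    link[len] = '\0';
    *cursor = last + strlen(close_tag); /* resume after this element */
    return link;
}
```

Starting with a cursor that points at the beginning of the page, call next_link(&cursor) in a loop until it returns NULL, freeing each result after use.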


2011-03-02 15:28:12 Muggen

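OK, the original answer and my comments seem to need more information than fits comfortably in the comment section, so I decided to write a new answer.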

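First, what you are trying to do IS already a programming task, and it WILL require a certain programming aptitude, depending on your exact needs. Second, some of the answers given suggest loops built on character searches and regular expressions. Those are horribly wrong ways to do this, as discussed, for example, here.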

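The normal way to parse HTML/XML these days is to use an external library designed for that purpose. In fact, such libraries are now fairly standard and come built into many programming languages.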

For your particular need, I am rusty in both C and XPath, but it should work roughly like this:

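Start an XML/HTML parser.
Load the HTML document into it as a string.
Tell the parser to find all instances of the <a> tag (using XPath).
It will return a "node set" to you.
Process the node set in a loop, doing whatever you need with each node.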

I found some other examples, which may be a better place to start: http://xmlsoft.org/example.html

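As you can see there, the examples use an XML document (that is not very important: libxml also includes an HTML parser, so your HTML document should work as well).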

In Python or a similar language this would be very easy; in some pseudocode it would look like this:

p=new HTMLParser 
p->load(my html document) 
resultset=p->XPath_Search("//a") # this will find all A elements in the HTML document 
for each result of resultset: 
    write(result.href) 
end for

This would write out the HREF part of every A element in the document. A decent tutorial with XPath examples you can use is here.

I'm afraid it will be somewhat more complicated in C, but the idea is the same, and it is a programming task.

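If this is some quick-and-dirty job, you can use the suggested strstr() or regexp search without any external library. Keep in mind, though, that depending on your exact task you will most likely miss many outgoing links or misread their content.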
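
To make that caveat concrete, here is a small sketch of how an exact-match search undercounts. The function name count_naive_links and the sample markup are invented for the illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts occurrences of the exact byte sequence
 * "<a href=" and nothing else. */
size_t count_naive_links(const char *html)
{
    static const char pattern[] = "<a href=";
    size_t count = 0;

    /* sizeof pattern - 1 is the pattern length without the terminator. */
    for (const char *p = strstr(html, pattern); p != NULL;
         p = strstr(p + sizeof pattern - 1, pattern))
        count++;
    return count;
}
```

Given a page containing <a href="1">, <A HREF="2">, <a  href="3"> (two spaces) and <a class="x" href="4">, this counts only the first link: the other three differ in case, whitespace or attribute order, all of which a real HTML parser handles for you.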


2011-03-02 16:28:16 Gnudiff
