How to search for a string pattern in HTML, using C?

I need to search for titles (strings) inside an HTML file. To do that, I am using strstr to find the "li" tag, which contains the attribute title=", which in turn contains the string I want.

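For example: given the line below, I need to get the book title held in the title attribute. However, I need all the titles in the HTML body, and there are hundreds of them.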

<li><i><a href="/wiki/Animal_Farm" title="Animal Farm">A Revolução dos Bichos</a></i> (<a href="/wiki/1945" title="1945">1945</a>), de <a href="/wiki/George_Orwell" title="George Orwell">George Orwell</a>.</li>

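I tried running a "for" loop that uses strlen for its loop condition (the line length). Inside it, I use strstr to find the title=" string and then copy everything up to the closing quote.

Something like this: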

for (i=0, i<len, i++){ 
    if(strstr(array[i] == " title=\""){ 
     do{ 
    temp[i] = array[i]; 
      }while((strcmp(array[i], "\"")); 
    } 
}

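This is where I am struggling: how do I get a string from inside another string, using a pattern (condition)? Any suggestions?

Thanks in advance! Regards.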

2014-11-24 renan_c

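What you really need is a front end, as in compiler construction. But I guess that kind of work is beyond your skills at the moment. Can't you use an existing HTML parser library? – bash0r 2014-11-24 15:06:21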

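strstr takes a string pointer as its first argument, and it also returns a string pointer. So just load the entire file into a char array, look for title=" and set a pointer, say "foundtitle", to the start of the match, then do a strstr for " using "foundtitle" as the starting pointer. Use pointer arithmetic to get the size of the found title, and either copy it into a char * array or store the starting point and length. Then repeat, using the end of the found title as the new starting point. – Vorsprung 2014-11-24 15:11:47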

You are absolutely right, @bash0r. From what I can see, it is too complex for me. But I will certainly look into it in more detail. Thanks! – 2014-11-24 16:22:52

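Parsing HTML "the right way" is far more complicated than checking one string at a time. My code below does more things the wrong way than the right way, but that is partly due to lack of information.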

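Is your HTML well-formed? Can the title attribute contain the strings li or title, or contain < or > characters? Do you need to allow for spaces inside the tags, such as < li >? Are all attributes written with double quotes ", or can there also be single quotes '?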

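My code shows the general idea behind HTML parsing: jump from one < to the next, and inspect the HTML command that follows. But as you can see, it is truly ugly, and while it "does the job", it is close to unmaintainable.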

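For a quick job within well-defined parameters, it may do; for everything else, look for a general-purpose HTML parsing library, which will shield you from the caveats above and provide a user-friendly interface to elements and attributes.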

#include <stdio.h> 
#include <string.h> 
#include <strings.h> /* strncasecmp is declared here on POSIX systems */ 
#include <ctype.h> 

int main() 
{ 
    char str[] = "<li><i><a href=\"/wiki/Animal_Farm\" title=\"Animal Farm\">A Revolução dos Bichos</a></i> (<a href=\"/wiki/1945\" title=\"1945\">1945</a>), de <a href=\"/wiki/George_Orwell\" title=\"George Orwell\">George Orwell</a>.</li>" 
       "<li><i><a href=\"/wiki/Animal_Farm_II\" title=\"Animal Farm II: Return of the Hog\">A Revolução dos Bichos</a></i> (<a href=\"/wiki/1945\" title=\"1945\">1945</a>), de <a href=\"/wiki/George_Orwell\" title=\"George Orwell\">George Orwell</a>.</li>"; 
    char *html_walker; 
    html_walker = str; 
    do 
    { 
     html_walker = strstr(html_walker, "<"); 
     if (!html_walker) 
      break; 
     /* Is this "LI"? */ 
     if (!strncasecmp(html_walker+1, "LI", 2) && 
      !isalnum(html_walker[3])) 
     { 
      /* Yes. Scan following HTML entries for 'title' until we find an "</LI>" */ 
      do 
      { 
       /* an "</LI>" code. Bye. */ 
       if (*html_walker == '<') 
       { 
        html_walker++; 
        /* html_walker already points just past '<', so compare from here */ 
        if (!strncasecmp(html_walker, "/LI", 3) && 
         !isalnum(html_walker[3])) 
        { 
         while (*html_walker && *html_walker != '>') 
          html_walker++; 
         if (*html_walker == '>') 
          html_walker++; 
         break; 
        } 
        /* Not an "</LI>" code. Look for 'title' */ 
        while (*html_walker && *html_walker != '>') 
        { 
         if (isspace (*html_walker) && 
          !strncasecmp(html_walker+1, "TITLE=\"", 7)) 
         { 
          html_walker += 8; 
          printf ("title ["); 
          while (*html_walker && *html_walker != '"') 
          { 
           printf ("%c", *html_walker); 
           html_walker++; 
          } 
          printf ("]\n"); fflush (stdout); 
          /* We found a title, so skip to next </LI> */ 
          do 
          { 
           html_walker = strstr(html_walker, "<"); 
           if (!html_walker) 
            break; 
           /* Is this "/LI"? */ 
           if (!strncasecmp(html_walker+1, "/LI", 3) && 
            !isalnum(html_walker[4])) 
            break; 
           html_walker++; 
          } while (html_walker && *html_walker); 
          break; 
         } 
         html_walker++; 
        } 
        /* html_walker is NULL if no further '<' was found after a title */ 
        if (html_walker && *html_walker == '>') 
         html_walker++; 
       } else 
       { 
        html_walker++; 
       } 
      } while (html_walker && *html_walker); 
     } else 
     { 
      /* Skip forward to next '<' */ 
      html_walker++; 
     } 
    } while (html_walker && *html_walker); 
    return 0; 
}

2014-11-24 15:53:49 usr2564301
