Trouble parsing an HTML page with C#

I have been trying to accomplish something using HtmlAgilityPack, Fizzler and regular expressions, but with no luck.

The page I want to scrape and parse into items is here: http://www.sczg.unizg.hr/student-servis/vijest/2015-04-14-poslovi-u-administraciji/

Example of an item in item list:

<p>
    <b>1628/ SomeBoldedTitle
    </b>
    Some Description.
    Some price 20,00kuna.
    <strong>Contact somenumber
     098/1234-567 some mail
    </strong>
</p>

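I would like to parse this item into:

4/5-digit ID > 1628, at the start of the <b> element
Title > SomeBoldedTitle, in the <b> element
Description > the text after </b> and before the <strong> element (sometimes a <b> element)
Contact numbers and links > in the <strong> element (sometimes a <b> element)

Here is some code I tried, just to get at least some output. I expected it to print all p elements that contain a b, but nothing came out.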

using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

namespace Sample
{
    class Program
    {
        static void Main(string[] args)
        {
            var web = new HtmlWeb();
            var document = web.Load("http://www.sczg.unizg.hr/student-servis/vijest/2015-04-14-poslovi-u-administraciji/");
            var page = document.DocumentNode;
            foreach (var item in page.QuerySelectorAll("p.item"))
            {
                Console.WriteLine(item.QuerySelector("p:has(b)").InnerHtml);
            }
        }
    }
}

Here is the link to the Fizzler "documentation" I used to get this code: https://fizzlerex.codeplex.com/

Source

2016-05-03 J.Skeet

"I have been trying to accomplish something..." What have you been trying? – ClasG

I edited the question to add the code I am trying –

Foreword

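I recommend using an HTML parsing module, because HTML can produce some crazy edge cases that will seriously skew your data. But if you control the source text and still need or want to use a regular expression, I offer this possible solution.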
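For comparison, here is a minimal sketch of the parser-based route, assuming the HtmlAgilityPack package is referenced and that every item is a p element with a b child, as in the sample markup. The `Item` class and `ParseItems` method are hypothetical names used only for illustration. One thing worth checking in the question's code: the selector p.item matches only p elements carrying class="item", and the sample markup has no such class, which may be why nothing was printed.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

// Hypothetical container for one parsed item; field names are illustrative.
sealed class Item
{
    public string Id = "", Title = "", Description = "", Contact = "";
}

static class ItemParserSketch
{
    public static List<Item> ParseItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var items = new List<Item>();

        // XPath: every <p> that has a <b> child. SelectNodes returns null when nothing matches.
        var paragraphs = doc.DocumentNode.SelectNodes("//p[b]");
        if (paragraphs == null) return items;

        foreach (HtmlNode p in paragraphs)
        {
            var item = new Item();
            var description = new StringBuilder();
            bool titleSeen = false;

            foreach (HtmlNode child in p.ChildNodes)
            {
                string text = HtmlEntity.DeEntitize(child.InnerText).Trim();
                if (!titleSeen && child.Name == "b")
                {
                    // "1628/ SomeBoldedTitle": id before the first '/', title after it.
                    string[] parts = text.Split(new[] { '/' }, 2);
                    item.Id = parts[0].Trim();
                    item.Title = parts.Length > 1 ? parts[1].Trim() : "";
                    titleSeen = true;
                }
                else if (titleSeen && (child.Name == "strong" || child.Name == "b"))
                {
                    // Assumption: the contact block is the <strong> (or later <b>) after the title.
                    item.Contact = text;
                }
                else if (titleSeen && child.NodeType == HtmlNodeType.Text)
                {
                    // Loose text between the title and the contact block is the description.
                    description.Append(text);
                }
            }

            item.Description = description.ToString();
            items.Add(item);
        }
        return items;
    }

    static void Main()
    {
        string html = "<p><b>1628/ SomeBoldedTitle</b> Some Description. Some price 20,00kuna. " +
                      "<strong>Contact somenumber 098/1234-567 some mail</strong></p>";
        foreach (Item item in ParseItems(html))
        {
            Console.WriteLine(item.Id + " | " + item.Title + " | " + item.Description + " | " + item.Contact);
        }
    }
}
```

Walking the child nodes in order avoids relying on class names, which the sample markup does not provide. Real pages may nest elements differently, so the branch conditions would need adjusting against the actual markup.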

Description

Given the following text

Example of an item in item list: 
<p> 
    <b>1628/ SomeBoldedTitle 
    </b> 
    Some Description. 
    Some price 20,00kuna. 
    <strong>Contact somenumber 
     098/1234-567 some mail 
    </strong> 
</p>

this expression

<p>(?:(?!<p>).)*<b>([0-9]+)/\s*((?:(?!</b>).)*?)\s*</b>\s*((?:(?!<strong>|<b>).)*?)\s*<(?:strong|b)>\s*((?:(?!</).)*?)\s*</

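will parse your text into the following capture groups:

Group 0 will be the entire match
Group 1 will be the multi-digit code
Group 2 will be the title
Group 3 will be the description
Group 4 will be the phone number

Capture groups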

[0][0] = <p> 
    <b>1628/ SomeBoldedTitle 
    </b> 
    Some Description. 
    Some price 20,00kuna. 
    <strong>Contact somenumber 
     098/1234-567 some mail 
    </ 
[0][1] = 1628 
[0][2] = SomeBoldedTitle 
[0][3] = Some Description. 
    Some price 20,00kuna. 
[0][4] = Contact somenumber 
     098/1234-567 some mail
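
The expression above can be tried from C# roughly as follows. This is a sketch under assumptions: the regex options are not stated in the text, so `RegexOptions.Singleline` is assumed (the captured groups span several lines, which requires '.' to match line breaks), and `RegexOptions.IgnoreCase` is added as a guess to tolerate upper-case tags. `ItemRegexSketch` and `ParseFirst` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class ItemRegexSketch
{
    // The expression from the text, unchanged, split across two verbatim strings.
    const string Pattern =
        @"<p>(?:(?!<p>).)*<b>([0-9]+)/\s*((?:(?!</b>).)*?)\s*</b>\s*" +
        @"((?:(?!<strong>|<b>).)*?)\s*<(?:strong|b)>\s*((?:(?!</).)*?)\s*</";

    // Assumed options: Singleline lets '.' cross line breaks; IgnoreCase tolerates <P>, <B>.
    static readonly Regex ItemRegex =
        new Regex(Pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns { code, title, description, contact } for the first item, or an empty array.
    public static string[] ParseFirst(string html)
    {
        Match m = ItemRegex.Match(html);
        if (!m.Success) return new string[0];
        return new[]
        {
            m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value, m.Groups[4].Value
        };
    }

    static void Main()
    {
        // The sample item from the text, joined with explicit "\n" line breaks.
        string html = string.Join("\n",
            "<p>",
            "    <b>1628/ SomeBoldedTitle",
            "    </b>",
            "    Some Description.",
            "    Some price 20,00kuna.",
            "    <strong>Contact somenumber",
            "     098/1234-567 some mail",
            "    </strong>",
            "</p>");

        string[] groups = ParseFirst(html);
        for (int i = 0; i < groups.Length; i++)
        {
            Console.WriteLine("Group " + (i + 1) + ": " + groups[i]);
        }
    }
}
```

To collect every item on a page rather than the first, `ItemRegex.Matches(html)` could be iterated instead. Note that the literal `<p>` in the expression will not match paragraph tags that carry attributes.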

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
<p>                      '<p>'
----------------------------------------------------------------------
(?:                      group, but do not capture (0 or more times
                         (matching the most amount possible)):
----------------------------------------------------------------------
  (?!                    look ahead to see if there is not:
----------------------------------------------------------------------
    <p>                  '<p>'
----------------------------------------------------------------------
  )                      end of look-ahead
----------------------------------------------------------------------
  .                      any character
----------------------------------------------------------------------
)*                       end of grouping
----------------------------------------------------------------------
<b>                      '<b>'
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
  [0-9]+                 any character of: '0' to '9' (1 or more
                         times (matching the most amount
                         possible))
----------------------------------------------------------------------
)                        end of \1
----------------------------------------------------------------------
/                        '/'
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
(                        group and capture to \2:
----------------------------------------------------------------------
  (?:                    group, but do not capture (0 or more
                         times (matching the least amount
                         possible)):
----------------------------------------------------------------------
    (?!                  look ahead to see if there is not:
----------------------------------------------------------------------
      </b>               '</b>'
----------------------------------------------------------------------
    )                    end of look-ahead
----------------------------------------------------------------------
    .                    any character
----------------------------------------------------------------------
  )*?                    end of grouping
----------------------------------------------------------------------
)                        end of \2
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
</b>                     '</b>'
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
(                        group and capture to \3:
----------------------------------------------------------------------
  (?:                    group, but do not capture (0 or more
                         times (matching the least amount
                         possible)):
----------------------------------------------------------------------
    (?!                  look ahead to see if there is not:
----------------------------------------------------------------------
      <strong>           '<strong>'
----------------------------------------------------------------------
     |                   OR
----------------------------------------------------------------------
      <b>                '<b>'
----------------------------------------------------------------------
    )                    end of look-ahead
----------------------------------------------------------------------
    .                    any character
----------------------------------------------------------------------
  )*?                    end of grouping
----------------------------------------------------------------------
)                        end of \3
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
<                        '<'
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  strong                 'strong'
----------------------------------------------------------------------
 |                       OR
----------------------------------------------------------------------
  b                      'b'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
>                        '>'
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
(                        group and capture to \4:
----------------------------------------------------------------------
  (?:                    group, but do not capture (0 or more
                         times (matching the least amount
                         possible)):
----------------------------------------------------------------------
    (?!                  look ahead to see if there is not:
----------------------------------------------------------------------
      </                 '</'
----------------------------------------------------------------------
    )                    end of look-ahead
----------------------------------------------------------------------
    .                    any character
----------------------------------------------------------------------
  )*?                    end of grouping
----------------------------------------------------------------------
)                        end of \4
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
</                       '</'

Source

2016-05-04 01:29:58