2016-05-03 19 views

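Trouble parsing an HTML page with C#

I have been trying to accomplish something with HtmlAgilityPack, Fizzler and regular expressions, but with no luck.

The page I want to scrape and parse into elements is here: http://www.sczg.unizg.hr/student-servis/vijest/2015-04-14-poslovi-u-administraciji/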

Example of an item in item list:

    <p>
        <b>1628/ SomeBoldedTitle
        </b>
        Some Description.
        Some price 20,00kuna.
        <strong>Contact somenumber
         098/1234-567 some mail
        </strong>
    </p>

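I want to parse this item into:

  • 4- or 5-digit ID > 1628, in the <b> element
  • Title > SomeBoldedTitle, in the <b> element
  • Description > the text after </b>, up to the <strong> element (sometimes a <b> element instead)
  • Contact number and link > in the <strong> element (sometimes a <b> element instead)

Here is some code I tried in order to get at least some output. I expected it to print every p element that contains a b, but nothing came out.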

    using System;
    using HtmlAgilityPack;
    using Fizzler.Systems.HtmlAgilityPack;

    namespace Sample
    {
        class Program
        {
            static void Main(string[] args)
            {
                var web = new HtmlWeb();
                var document = web.Load("http://www.sczg.unizg.hr/student-servis/vijest/2015-04-14-poslovi-u-administraciji/");
                var page = document.DocumentNode;
                foreach (var item in page.QuerySelectorAll("p.item"))
                {
                    Console.WriteLine(item.QuerySelector("p:has(b)").InnerHtml);
                }
            }
        }
    }
    

Here is a link to the Fizzler "documentation" that I got this code from: https://fizzlerex.codeplex.com/

"I have been trying to accomplish something..." What have you been trying? – ClasG

I edited the question to include the code I am trying.

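Answer

Foreword

I recommend using an HTML parsing module, because HTML can produce some crazy edge cases that will seriously skew your data. But if you control the source text and still need or want to use a regular expression, I offer this possible solution.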

Description

Given the following text

    Example of an item in item list: 
    <p> 
        <b>1628/ SomeBoldedTitle 
        </b> 
        Some Description. 
        Some price 20,00kuna. 
        <strong>Contact somenumber 
         098/1234-567 some mail 
        </strong> 
    </p> 
    

This expression

    <p>(?:(?!<p>).)*<b>([0-9]+)/\s*((?:(?!</b>).)*?)\s*</b>\s*((?:(?!<strong>|<b>).)*?)\s*<(?:strong|b)>\s*((?:(?!</).)*?)\s*</ 
    

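will parse your text into the following capture groups:

  • Group 0 is the entire match, from <p> through the final </
  • Group 1 is the multi-digit code
  • Group 2 is the title
  • Group 3 is the description
  • Group 4 is the contact text, including the phone number

Capture groups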

    [0][0] = <p> 
        <b>1628/ SomeBoldedTitle 
        </b> 
        Some Description. 
        Some price 20,00kuna. 
        <strong>Contact somenumber 
         098/1234-567 some mail 
        </ 
    [0][1] = 1628 
    [0][2] = SomeBoldedTitle 
    [0][3] = Some Description. 
        Some price 20,00kuna. 
    [0][4] = Contact somenumber 
         098/1234-567 some mail 
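
As a rough illustration of how the expression above might be applied from C#, here is a minimal sketch. The `JobItem` and `ItemParser` names and the whitespace-collapsing `Squash` helper are hypothetical and not part of the original answer. `RegexOptions.Singleline` is an assumption inferred from the captures above spanning several lines: without it, `.` does not match a newline in .NET and the expression would not match the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for the four capture groups.
public sealed class JobItem
{
    public string Id = "";
    public string Title = "";
    public string Description = "";
    public string Contact = "";
}

public static class ItemParser
{
    // The expression from this answer, unchanged.
    // Assumption: Singleline, so that '.' also matches newlines.
    private static readonly Regex ItemRegex = new Regex(
        @"<p>(?:(?!<p>).)*<b>([0-9]+)/\s*((?:(?!</b>).)*?)\s*</b>\s*((?:(?!<strong>|<b>).)*?)\s*<(?:strong|b)>\s*((?:(?!</).)*?)\s*</",
        RegexOptions.Singleline);

    // Hypothetical helper: collapse runs of whitespace (including newlines) into one space.
    private static string Squash(string s)
    {
        return Regex.Replace(s, @"\s+", " ").Trim();
    }

    public static List<JobItem> Parse(string html)
    {
        var items = new List<JobItem>();
        foreach (Match m in ItemRegex.Matches(html))
        {
            items.Add(new JobItem
            {
                Id = m.Groups[1].Value,
                Title = Squash(m.Groups[2].Value),
                Description = Squash(m.Groups[3].Value),
                Contact = Squash(m.Groups[4].Value)
            });
        }
        return items;
    }
}

public static class Program
{
    public static void Main()
    {
        // The sample item from the question, joined with explicit newlines.
        string html = string.Join("\n",
            "<p>",
            "    <b>1628/ SomeBoldedTitle",
            "    </b>",
            "    Some Description.",
            "    Some price 20,00kuna.",
            "    <strong>Contact somenumber",
            "     098/1234-567 some mail",
            "    </strong>",
            "</p>");

        foreach (JobItem item in ItemParser.Parse(html))
        {
            Console.WriteLine("{0} | {1} | {2} | {3}",
                item.Id, item.Title, item.Description, item.Contact);
        }
    }
}
```

For the sample item this yields the ID 1628 and the title SomeBoldedTitle. Because the leading `(?:(?!<p>).)*` cannot cross another `<p>`, each match stays inside a single paragraph when the input contains several items.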
    

Explanation


    NODE                EXPLANATION
    ----------------------------------------------------------------------
    <p>                 '<p>'
    ----------------------------------------------------------------------
    (?:                 group, but do not capture (0 or more times
                          (matching the most amount possible)):
    ----------------------------------------------------------------------
      (?!               look ahead to see if there is not:
    ----------------------------------------------------------------------
        <p>             '<p>'
    ----------------------------------------------------------------------
      )                 end of look-ahead
    ----------------------------------------------------------------------
      .                 any character
    ----------------------------------------------------------------------
    )*                  end of grouping
    ----------------------------------------------------------------------
    <b>                 '<b>'
    ----------------------------------------------------------------------
    (                   group and capture to \1:
    ----------------------------------------------------------------------
      [0-9]+            any character of: '0' to '9' (1 or more
                          times (matching the most amount
                          possible))
    ----------------------------------------------------------------------
    )                   end of \1
    ----------------------------------------------------------------------
    /                   '/'
    ----------------------------------------------------------------------
    \s*                 whitespace (\n, \r, \t, \f, and " ") (0 or
                          more times (matching the most amount
                          possible))
    ----------------------------------------------------------------------
    (                   group and capture to \2:
    ----------------------------------------------------------------------
      (?:               group, but do not capture (0 or more
                          times (matching the least amount
                          possible)):
    ----------------------------------------------------------------------
        (?!             look ahead to see if there is not:
    ----------------------------------------------------------------------
          </b>          '</b>'
    ----------------------------------------------------------------------
        )               end of look-ahead
    ----------------------------------------------------------------------
        .               any character
    ----------------------------------------------------------------------
      )*?               end of grouping
    ----------------------------------------------------------------------
    )                   end of \2
    ----------------------------------------------------------------------
    \s*                 whitespace (\n, \r, \t, \f, and " ") (0 or
                          more times (matching the most amount
                          possible))
    ----------------------------------------------------------------------
    </b>                '</b>'
    ----------------------------------------------------------------------
    \s*                 whitespace (\n, \r, \t, \f, and " ") (0 or
                          more times (matching the most amount
                          possible))
    ----------------------------------------------------------------------
    (                   group and capture to \3:
    ----------------------------------------------------------------------
      (?:               group, but do not capture (0 or more
                          times (matching the least amount
                          possible)):
    ----------------------------------------------------------------------
        (?!             look ahead to see if there is not:
    ----------------------------------------------------------------------
          <strong>      '<strong>'
    ----------------------------------------------------------------------
        |               OR
    ----------------------------------------------------------------------
          <b>           '<b>'
    ----------------------------------------------------------------------
        )               end of look-ahead
    ----------------------------------------------------------------------
        .               any character
    ----------------------------------------------------------------------
      )*?               end of grouping
    ----------------------------------------------------------------------
    )                   end of \3
    ----------------------------------------------------------------------
    \s*                 whitespace (\n, \r, \t, \f, and " ") (0 or
                          more times (matching the most amount
                          possible))
    ----------------------------------------------------------------------
    <                   '<'
    ----------------------------------------------------------------------
    (?:                 group, but do not capture:
    ----------------------------------------------------------------------
      strong            'strong'
    ----------------------------------------------------------------------
    |                   OR
    ----------------------------------------------------------------------
      b                 'b'
    ----------------------------------------------------------------------
    )                   end of grouping
    ----------------------------------------------------------------------
    >                   '>'
    ----------------------------------------------------------------------
    \s*                 whitespace (\n, \r, \t, \f, and " ") (0 or
                          more times (matching the most amount
                          possible))
    ----------------------------------------------------------------------
    (                   group and capture to \4:
    ----------------------------------------------------------------------
      (?:               group, but do not capture (0 or more
                          times (matching the least amount
                          possible)):
    ----------------------------------------------------------------------
        (?!             look ahead to see if there is not:
    ----------------------------------------------------------------------
          </            '</'
    ----------------------------------------------------------------------
        )               end of look-ahead
    ----------------------------------------------------------------------
        .               any character
    ----------------------------------------------------------------------
      )*?               end of grouping
    ----------------------------------------------------------------------
    )                   end of \4
    ----------------------------------------------------------------------
    \s*                 whitespace (\n, \r, \t, \f, and " ") (0 or
                          more times (matching the most amount
                          possible))
    ----------------------------------------------------------------------
    </                  '</'
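
The recurring `(?:(?!X).)*` construct in the table (a negative look-ahead followed by `.`, repeated) consumes characters one at a time but refuses to step onto the start of `X`. The sketch below isolates that idea on a made-up input and contrasts it with a plain greedy `.*`. The class and method names are hypothetical, the greedy quantifier is used here only to show that the look-ahead alone bounds the match, and `RegexOptions.Singleline` is an assumption carried over from the table's "any character" entries.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemperedDotDemo
{
    // Captures the text of the first <b> element: the look-ahead stops
    // the repetition before it can run across a closing </b>.
    public static string Tempered(string html)
    {
        Match m = Regex.Match(html, @"<b>((?:(?!</b>).)*)</b>", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : "";
    }

    // Plain greedy dot: runs to the LAST </b> in the input.
    public static string Greedy(string html)
    {
        Match m = Regex.Match(html, @"<b>(.*)</b>", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : "";
    }

    public static void Main()
    {
        string html = "<b>one</b> text <b>two</b>";
        Console.WriteLine(Tempered(html)); // one
        Console.WriteLine(Greedy(html));   // one</b> text <b>two
    }
}
```

In the full expression the same construct is additionally made lazy (`*?`) so that the surrounding `\s*` can absorb trailing whitespace instead of the capture group.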