2016-02-05 40 views
1

Jsoup tagName() gives the wrong tag

I have the following piece of HTML:

<p>       
    <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> 
    <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a> 
    <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> 
    <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>  
    </p>       

This piece is taken from the web page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861

And here is the code:

Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
String tag = null;
for (Element element : document.select("*")) {
    tag = element.tagName();

    if ("a".equalsIgnoreCase(tag)) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
    }

    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
        LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling());
        LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling());
    }
}

The output I get:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null 
    tag : h2; nextNodeSibling: 
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null 

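There are a number of problems:

  1. The main HTML source contains many elements tagged a, but none of those logged come from the small piece of HTML I checked.
  2. It seems <a> is captured as <h2>.
  3. element.nextElementSibling() is null in most cases.

However, if the small piece is tested on its own, the problems disappear. So it looks like Jsoup cannot identify tags correctly when they appear inside a larger HTML source.

Any idea why?

Thanks.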

EDIT 2

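The intention behind the exercise is to clean up web pages. That is why I iterate over the whole HTML rather than a specific part, as @Stephan suggested. I only picked out one specific part that looked problematic.

But after checking @luksch's response, I went back to the original HTML and found the anomaly the snippet came from. The code goes through all tags but makes an exception for a. In the main source we have article, followed by a, figure (containing i, img, img, small, small) and h2. The problem seems to be that all tags except a are removed (working as required), but their text is left behind. That is why I ended up with a fragment that is not in the original HTML source.

"Jill Martin rescues Savannah Guthrie from her guest room mess" is the <h2> text, but the <h2> is removed, leaving its text behind. Interestingly, Jsoup still thinks the text comes from h2, even though the final output has no h2.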
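The effect described above (a tag is dropped but its text stays, and the text then counts as the parent's own text) can be reproduced in isolation. This is only a minimal sketch, assuming jsoup is on the classpath: the HTML is a reduced, hypothetical stand-in for the nested link on the real page, the class name UnwrapDemo is made up for illustration, and unwrap() stands in for whatever the cleaning code actually does to remove tags.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class UnwrapDemo {
    // Reduced, hypothetical stand-in for the nested link on the real page.
    static final String HTML =
        "<a href=\"http://example.com/video\">"
      + "<figure><small>Now Playing</small></figure>"
      + "<h2>Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
      + "</a>";

    /** Returns the link's own text before and after its h2 child is unwrapped. */
    static String[] ownTextBeforeAndAfter() {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("a").first();
        String before = link.ownText();     // the headline still belongs to h2
        doc.select("h2").first().unwrap();  // drop the h2 tag, keep its children
        String after = link.ownText();      // the headline is now a direct child of a
        return new String[] {before, after};
    }

    public static void main(String[] args) {
        String[] result = ownTextBeforeAndAfter();
        System.out.println("before: [" + result[0] + "]");
        System.out.println("after : [" + result[1] + "]");
    }
}
```

Before the unwrap the link has no own text; afterwards the headline is reported as the link's own text, which would match the kind of output seen after cleaning.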

+0

The snippet is part of a larger piece of code. The original link is "http:// www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861". So the larger document should be 'Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares- tips-living-luxury-less-t70861").get();' –

+0

The URL gives me a 404 – luksch

+0

@luksch, something went wrong when I copy-pasted. This is the call: Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get(); The word after 'living' is 'luxury', but the copy-paste got it wrong. –

Answers

0

I think the selector needs to be more specific.

Instead of document.select("*"), try document.select("a").

0

This is not reproducible for me. The following program prints exactly what you would expect:

String html = "" 
     +"<p>" 
     +" <a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> " 
     +" <a href=\"http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a> " 
     +" <a href=\"http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814\" rel=\"nofollow\"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> " 
     +" <a href=\"http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749\" rel=\"nofollow\"> Here's how to set a functional Christmas table </a> " 
     +"</p>"; 

Document doc = Jsoup.parse(html); 

String tag = null; 
for (Element element : doc.select("*")) { 
    tag = element.tagName(); 

    if ("a".equalsIgnoreCase(tag)) { 
     System.out.println("element : "+element.ownText()+"; nextElementSibling: "+element.nextElementSibling()+""); 

    } 
    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) { 
     System.out.println("element : "+element.ownText()+"; nextElementSibling: "+element.nextElementSibling()+""); 
     System.out.println("tag : "+tag+"; nextNodeSibling: "+element.nextSibling()+""); 
     System.out.println("element : "+element.ownText()+"; previousElementSibling: "+element.previousElementSibling()+""); 
    } 
} 

The result is:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a> 
tag : a; nextNodeSibling: 
element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null 
element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> 
element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a> 
element : Here's how to set a functional Christmas table; nextElementSibling: null 

Maybe you are using a faulty JSoup version? The above was run with version 1.8.3.

+0

This code is part of a larger program. I just extracted the part I thought was not working. In general, I am trying to parse the content at 'http:// www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861' (which contains the snippet I posted). Instead of 'Document doc = Jsoup.parse(html);' try 'Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living- luxury-less-t70861").get();' –

+0

The previous copy-paste had a problem. The correct call is 'Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();' –

1

The page at the URL you gave contains this element:

<a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959"> 
<figure class="player-tease"> 
    <i class="player-tease-icon icon-video-play"></i> 
    <img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play"> 
    <img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess"> 
    <small class="tease-sponsored">Sponsored Content</small> 
    <small class="tease-playing">Now Playing</small> 
</figure> 
<h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2> 
</a> 

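So it seems you are comparing apples with oranges: the HTML fragment you posted is not part of the original HTML. I guess you extracted it with some tool that changed the HTML. Note that the a element does not contain any text of its own!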
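To make the point about own text concrete, here is a small sketch, assuming jsoup is on the classpath. The HTML is a trimmed, hypothetical version of the element above (shortened URL, most attributes dropped), and the class name OwnTextDemo is made up for illustration. It compares ownText(), which only covers an element's direct text nodes, with text(), which also includes descendants.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class OwnTextDemo {
    // Trimmed, hypothetical version of the element found on the page.
    static final String HTML =
        "<a class=\"player-tease-link\" href=\"http://example.com/video\">"
      + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
      + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
      + "</a>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element link = doc.select("a").first();
        Element headline = doc.select("h2").first();

        // ownText(): only text nodes that are direct children of the element.
        System.out.println("a.ownText()  : [" + link.ownText() + "]");
        // text(): the combined text of the element and all its descendants.
        System.out.println("a.text()     : [" + link.text() + "]");
        System.out.println("h2.ownText() : [" + headline.ownText() + "]");
        // h2 is the last child of a, so it has no following element sibling.
        System.out.println("h2.nextElementSibling(): " + headline.nextElementSibling());
    }
}
```

In this sketch the headline is the own text of h2, not of a, so a loop that filters on ownText() reports the tag as h2, and nextElementSibling() is null because h2 is the last child of the link.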

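A good idea would be to follow @Stephan's advice and learn how to use CSS selectors properly. That should be more efficient than selecting everything and then filtering manually in program code. Here is an example of what you can do: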

Elements interestingAs = document.select("a:matches(^Jill Martin)"); 

This selects all a elements whose text starts with "Jill Martin".
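A few selector variants can be compared side by side. This is a sketch under stated assumptions, not a definitive recipe: jsoup is assumed to be on the classpath, both HTML strings are shortened, hypothetical stand-ins (the flat fragment from the question and the nested element from the page), and SelectorDemo and its select helper are names made up for illustration. The pseudo-selectors :matches, :matchesOwn and :has are part of jsoup's selector syntax.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

final class SelectorDemo {
    // Flat fragment as posted in the question (URLs shortened, hypothetical).
    static final String FLAT = "<p>"
        + "<a href=\"http://example.com/1\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> "
        + "<a href=\"http://example.com/2\"> 4 simple ways to clear your clutter this year </a>"
        + "</p>";

    // Nested element as found on the page (trimmed, hypothetical).
    static final String NESTED =
        "<a href=\"http://example.com/video\">"
      + "<figure><small>Now Playing</small></figure>"
      + "<h2>Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
      + "</a>";

    /** Parses the given HTML and runs a CSS query against the whole document. */
    static Elements select(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery);
    }

    public static void main(String[] args) {
        // :matches tests the regex against the element's full text, descendants included.
        System.out.println(select(FLAT, "a:matches(^Jill Martin)").size());
        System.out.println(select(NESTED, "a:matches(^Jill Martin)").size());
        // :matchesOwn tests only the element's own text.
        System.out.println(select(NESTED, "h2:matchesOwn(^Jill Martin)").size());
        // :has selects elements that contain a matching descendant.
        System.out.println(select(NESTED, "a:has(h2:matchesOwn(^Jill Martin))").size());
    }
}
```

On the flat fragment the anchored pattern finds the first link. On the nested element the link's combined text begins with the figure's text, so the anchored pattern on a finds nothing there, while targeting the h2 (or the a that has such an h2) does.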

+0

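I reviewed the HTML source, compared it with the final output, and found the anomaly. In short, some tags are removed but their text is left behind. If the parent is not removed, the leftover text is attributed to that tag (the parent). We end up with the wrong tag for the text in the output. –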