Jsoup tagName() gives the wrong tag

I have the following piece of HTML:

<p>       
    <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> 
    <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a> 
    <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> 
    <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>  
    </p>

This piece comes from the web page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861

And this is the code:

Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
String tag = null;
for (Element element : document.select("*")) {
    tag = element.tagName();

    if ("a".equalsIgnoreCase(tag)) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
    }

    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
        LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling());
        LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling());
    }
}

The output I get is:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null 
    tag : h2; nextNodeSibling: 
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null

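There are several problems:

Many elements from the main HTML source are reported with the tag a, but none of them come from the small piece of HTML I am checking
It looks like <a> is being captured as <h2>
element.nextElementSibling() is null in most cases

However, if I test the small piece on its own, the problems disappear. So it looks like Jsoup fails to identify tags correctly when they appear inside a larger HTML source.

Any idea why?

Thanks.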

EDIT 2

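The intent behind this exercise is to clean up web pages. That is why I traverse the whole HTML rather than a specific part, as @Stephan suggested. I only singled out one specific part that looked problematic.

But after checking @luksch's response, I went back to the original HTML and found the anomaly in the part the snippet was taken from. The code looks through all tags but makes an exception for a. In the main source we have article followed by a, figure (containing i, img, img, small, small), h2. The problem seems to be that all tags (except a) are removed (which works as required), but their text is left behind. That is why I ended up with a snippet that is not the original HTML source.

The "Jill Martin rescues Savannah Guthrie from her guest room mess" text is the <h2> text, but the <h2> is removed, leaving its text behind. Interestingly, Jsoup still considers the text to come from h2, even though the final output has no h2.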

Source

2016-02-05 Mugoma J. Okomba

The snippet is part of a larger piece of code. The original link is "http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861". So the larger document should be 'Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares- tips-living-luxury-less-t70861").get();'

The URL gives me a 404 – luksch

@luksch, something went wrong when I copy-pasted. This is the call: Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get(); The word after 'living' is 'luxury', but the copy-paste mangled it.

I think the selector needs to be more specific.

Instead of document.select("*"), try document.select("a").
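A minimal sketch of that suggestion, assuming a small inline HTML string instead of the live page (the example.com links and the class name SelectAnchors are made up for illustration). It selects only the a elements and reads their text and href:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectAnchors {
    // Collects "text -> href" for every <a> element in the given HTML.
    static List<String> anchorSummaries(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element a : doc.select("a")) {
            // text() includes the text of nested elements, unlike ownText()
            out.add(a.text() + " -> " + a.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical markup, not taken from the real page
        String html = "<p>"
            + "<a href=\"http://example.com/one\"> First link </a>"
            + "<a href=\"http://example.com/two\"> Second link </a>"
            + "</p>";
        for (String summary : anchorSummaries(html)) {
            System.out.println(summary);
        }
    }
}
```

Because only a elements are returned, there is no need to compare tagName() inside the loop.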

Source

2016-02-05 06:42:03 Stephan

I cannot reproduce this. The following program prints exactly what you would expect:

String html = "" 
     +"<p>" 
     +" <a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> " 
     +" <a href=\"http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a> " 
     +" <a href=\"http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814\" rel=\"nofollow\"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> " 
     +" <a href=\"http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749\" rel=\"nofollow\"> Here's how to set a functional Christmas table </a> " 
     +"</p>"; 

Document doc = Jsoup.parse(html); 

String tag = null; 
for (Element element : doc.select("*")) { 
    tag = element.tagName(); 

    if ("a".equalsIgnoreCase(tag)) { 
     System.out.println("element : "+element.ownText()+"; nextElementSibling: "+element.nextElementSibling()+""); 

    } 
    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) { 
     System.out.println("element : "+element.ownText()+"; nextElementSibling: "+element.nextElementSibling()+""); 
     System.out.println("tag : "+tag+"; nextNodeSibling: "+element.nextSibling()+""); 
     System.out.println("element : "+element.ownText()+"; previousElementSibling: "+element.previousElementSibling()+""); 
    } 
}

The result is:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a> 
tag : a; nextNodeSibling: 
element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null 
element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a> 
element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a> 
element : Here's how to set a functional Christmas table; nextElementSibling: null

Maybe you are using a faulty JSoup version? The above was run with version 1.8.3.
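As a side note on the blank "nextNodeSibling:" line in the output: nextSibling() returns the next node of any kind, which here is most likely the whitespace text node between two links, while nextElementSibling() skips text nodes. A small sketch, assuming two links separated by a single space (the class name SiblingDemo and the markup are illustrative only):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class SiblingDemo {
    // Two links with one space between them (hypothetical markup)
    static final String HTML = "<p><a href=\"#1\">one</a> <a href=\"#2\">two</a></p>";

    // Returns the first <a> element of the parsed HTML.
    static Element firstAnchor(String html) {
        return Jsoup.parse(html).select("a").first();
    }

    public static void main(String[] args) {
        Element first = firstAnchor(HTML);
        Node next = first.nextSibling();               // any node type, here the whitespace
        Element nextElement = first.nextElementSibling(); // elements only
        System.out.println("nextSibling is a TextNode: " + (next instanceof TextNode));
        System.out.println("nextElementSibling text: " + nextElement.text());
    }
}
```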

Source

2016-02-05 10:38:15 luksch

This code is part of a larger piece of code. I only extracted the part I thought was not working. In general, I am trying to parse the content at 'http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861' (which contains the snippet I posted). Instead of 'Document doc = Jsoup.parse(html);' try 'Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living- luxury-less-t70861").get();'

The previous copy-paste had a problem. The correct call is 'Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();'

The URL you gave contains this element:

<a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959"> 
<figure class="player-tease"> 
    <i class="player-tease-icon icon-video-play"></i> 
    <img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play"> 
    <img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess"> 
    <small class="tease-sponsored">Sponsored Content</small> 
    <small class="tease-playing">Now Playing</small> 
</figure> 
<h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2> 
</a>

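So it seems you are comparing apples with oranges: the HTML snippet you gave is not part of the original HTML. I guess you used some tool to extract it, and that tool changed the HTML. Note that the a element does not contain any text of its own!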
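To illustrate the point about own text, here is a small sketch on a trimmed-down version of the element above (the example.com href, the shortened headline and the class name OwnTextDemo are assumptions for the example). ownText() only returns text that is a direct child of the element, so a scan over select("*") finds the headline on the h2, not on the enclosing a:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Simplified copy of the nested structure; not the full markup of the page
    static final String HTML =
        "<a class=\"player-tease-link\" href=\"http://example.com/video\">"
      + "<figure class=\"player-tease\"><small>Now Playing</small></figure>"
      + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie</h2>"
      + "</a>";

    // Returns the tag name of the first element whose OWN text contains the needle,
    // or null if there is none.
    static String tagOwning(String html, String needle) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.select("*")) {
            if (element.ownText().contains(needle)) {
                return element.tagName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Element a = Jsoup.parse(HTML).select("a").first();
        System.out.println("a.ownText(): '" + a.ownText() + "'"); // direct text children only
        System.out.println("a.text():    '" + a.text() + "'");    // includes descendants
        System.out.println("tag owning the headline: " + tagOwning(HTML, "Jill Martin"));
    }
}
```

This is consistent with the "tag : h2" line in the question's output: the filter there is based on ownText(), and the headline text belongs to the h2.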

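A good idea would be to follow @Stephan's advice and learn how to use CSS selectors properly. That should be more efficient than selecting everything and then filtering manually in program code. Here is an example of what you can do: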

Elements interestingAs = document.select("a:matches(^Jill Martin)");

This will select all a elements whose text starts with "Jill Martin".
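One caveat, offered as a sketch rather than a definitive statement: as far as I understand Jsoup's selector syntax, :matches(regex) tests the element's combined text (its own and that of its descendants), while :matchesOwn(regex) tests only its own text. On the nested structure from the real page, the combined text of the a may start with the figure's captions, so the anchored pattern might not match there. The markup, the class name MatchesDemo and the count helper below are illustrative assumptions:

```java
import org.jsoup.Jsoup;

public class MatchesDemo {
    // Flat link, as in the snippet from the question (shortened)
    static final String FLAT =
        "<p><a href=\"#\"> Jill Martin rescues Savannah Guthrie </a></p>";

    // Nested link, simplified from the structure found on the page
    static final String NESTED =
        "<a href=\"#\"><figure><small>Now Playing</small></figure>"
      + "<h2>Jill Martin rescues Savannah Guthrie</h2></a>";

    // Number of elements in the parsed HTML that match the CSS query.
    static int count(String html, String css) {
        return Jsoup.parse(html).select(css).size();
    }

    public static void main(String[] args) {
        // Anchored regex against the combined text of each <a>
        System.out.println(count(FLAT, "a:matches(^Jill Martin)"));
        System.out.println(count(NESTED, "a:matches(^Jill Martin)"));
        // Select the <a> through the <h2> that owns the headline text
        System.out.println(count(NESTED, "a:has(h2:matchesOwn(^Jill Martin))"));
        // Unanchored, case-insensitive substring match on the combined text
        System.out.println(count(NESTED, "a:contains(Jill Martin)"));
    }
}
```

If the headline always sits in an h2 inside the link, the :has(...) form is probably the most precise of the three.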

Source

2016-02-06 13:18:39 luksch

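I reviewed the HTML source and compared it with the final output, and found the anomaly. In short, some tags were removed but their "text" was left behind. If the parent was not removed, the leftover text was assigned to that tag (the parent). We end up with tags carrying the wrong text in the output.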
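The thread does not show the actual cleaning code, so the following is only a sketch of one way the effect described in this comment can arise. It assumes the cleanup behaves like Jsoup's unwrap(), which removes an element but keeps its children; the markup and the class name UnwrapDemo are made up for the example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class UnwrapDemo {
    // Hypothetical, simplified markup: a headline wrapped in an <h2> inside a link
    static final String HTML =
        "<a href=\"#\"><h2>Jill Martin rescues Savannah Guthrie</h2></a>";

    // Own text of the first <a> before any cleanup.
    static String anchorOwnText(String html) {
        return Jsoup.parse(html).select("a").first().ownText();
    }

    // Own text of the first <a> after unwrapping every element matching cssToUnwrap.
    static String anchorOwnTextAfterUnwrap(String html, String cssToUnwrap) {
        Document doc = Jsoup.parse(html);
        doc.select(cssToUnwrap).unwrap(); // drops the tags, keeps their child nodes
        return doc.select("a").first().ownText();
    }

    public static void main(String[] args) {
        System.out.println("before: '" + anchorOwnText(HTML) + "'");
        System.out.println("after:  '" + anchorOwnTextAfterUnwrap(HTML, "h2") + "'");
    }
}
```

After the unwrap, the headline becomes a direct text child of the a, which would explain output where the text appears under the parent tag. Calling remove() instead of unwrap() would drop the text together with the tag.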
