Get the text from anchors within a text node

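I'm parsing product reviews on Amazon, and I want to get the complete text of each review, including the text inside links.

I'm currently using jSoup and, as good as it is, it ignores the anchors. Of course, I could use a selector to get all the text from the anchors, but then I would lose the information about the context that text appears in.

I think an example is the best way to explain myself.

Sample structure: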

<div class="container">
    <div style="a">Something...</div>
    <div style="b">...Nested spans and divs... </div>
    <div class="tiny">_____ </div>
    " From the makers of the incredible <a href="SOMELINK">SOMEPRODUCT</a> we have this other product that blablabla.... Amazing specs, but <a href="SOME_OTHER_LINK">this other product</a> is somehow better".
</div>

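What I get: " From the makers of the incredible we have this other product that blablabla.... Amazing specs, but is somehow better".

What I want: " From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better".

My code, using jSoup: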

Elements allContainers = doc.select(".container");
for (Element container : allContainers) {
    String reviewText = container.ownText(); // THIS EXCLUDES TEXT FROM LINKS
    StdOut.println(reviewText);
}

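I can't find a way to do this, because it doesn't look like jSoup treats text nodes as actual nodes, so those anchors don't seem to be considered children of the node.

I'm also open to other ideas, such as trying a :not selector to get them, but I can't believe jSoup doesn't allow keeping the link text. It's too common a need for them to have overlooked it.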

2012-10-24 Tex

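"it doesn't look like jSoup treats text nodes as actual nodes,"

No. jsoup text nodes are actual nodes, just as elements are.

The way you describe the problem, you have a very specific requirement, and I agree that nothing built in does exactly what you want in a single call. However, with a simple helper method the problem is solvable.

First let's review the problem. The parent div has the following children: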

div div div #text a #text a #text
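
To see that child sequence for yourself, here is a small dependency-free sketch. Note the assumptions: it uses the JDK's built-in W3C DOM parser rather than jsoup (the sample happens to be well-formed XML once the outer div is closed), the sample text is slightly shortened, and the names ChildNodeNames, parseRoot and childNames are made up for illustration. Whitespace-only text nodes between the divs are skipped so that the output lines up with the list above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK's W3C DOM parser, not jsoup.
final class ChildNodeNames {

    // The sample from the question, with the outer div closed so it parses as XML.
    static final String SAMPLE =
          "<div class=\"container\">\n"
        + "  <div style=\"a\">Something...</div>\n"
        + "  <div style=\"b\">...Nested spans and divs...</div>\n"
        + "  <div class=\"tiny\">_____</div>\n"
        + "  From the makers of the incredible <a href=\"SOMELINK\">SOMEPRODUCT</a>"
        + " we have this other product. Amazing specs, but"
        + " <a href=\"SOME_OTHER_LINK\">this other product</a> is somehow better.\n"
        + "</div>";

    // Parses the markup and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("sample is not well-formed XML", e);
        }
    }

    // Names of the direct children, skipping whitespace-only text nodes.
    static List<String> childNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            boolean blankText = n.getNodeType() == Node.TEXT_NODE
                    && n.getNodeValue().trim().isEmpty();
            if (!blankText) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints: div div div #text a #text a #text
        System.out.println(String.join(" ", childNames(parseRoot(SAMPLE))));
    }
}
```

The point carries over to jsoup: the anchors and the loose pieces of text are siblings under the same parent, so they can be walked in document order.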

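Of course, each of the div and a elements has further children of its own, including text nodes. Based on your example, it sounds like you want to walk through the children, ignoring everything that comes before the first text node. Once the first text node is found, collect its text and the text of every node that follows it.

Definitely doable, but I'm not surprised there is no built-in method for it.

Here is an implementation that solves the problem: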

import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public static String textPlus(Element elem)
{
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty())
        return "";

    StringBuilder result = new StringBuilder();
    // start at the first text node that is a direct child of elem
    Node currentNode = textNodes.get(0);
    while (currentNode != null)
    {
        // append the text of this node and of every following sibling
        if (currentNode instanceof TextNode)
        {
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
        }
        else if (currentNode instanceof Element)
        {
            // text() is deep: it includes the text of all descendants
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString();
}

To call it, use:

Elements allContainers = doc.select(".container"); 
for (Element container : allContainers) { 
    String reviewText = textPlus(container); 
    StdOut.println(reviewText); 
}

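Given your sample HTML, this code returns:

"From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better."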
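
One caveat worth checking on real input: if the parser keeps whitespace-only text nodes between the child divs, the first text node may be a blank one sitting before the first div, and the walk would then pick up the divs as well. Below is a hedged sketch of the same sibling walk that starts at the first non-blank text node. To stay dependency-free it again uses the JDK's W3C DOM parser instead of jsoup, with a shortened version of the sample; TextFromFirstTextNode, parseRoot, isNonBlankText and textPlus are illustrative names, not library API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK's W3C DOM parser, not jsoup.
final class TextFromFirstTextNode {

    // Shortened sample from the question, closed so that it parses as XML.
    static final String SAMPLE =
          "<div class=\"container\">\n"
        + "  <div style=\"a\">Something...</div>\n"
        + "  <div style=\"b\">...Nested spans and divs...</div>\n"
        + "  <div class=\"tiny\">_____</div>\n"
        + "  From the makers of the incredible <a href=\"SOMELINK\">SOMEPRODUCT</a>"
        + " we have this other product. Amazing specs, but"
        + " <a href=\"SOME_OTHER_LINK\">this other product</a> is somehow better.\n"
        + "</div>";

    // Parses the markup and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("sample is not well-formed XML", e);
        }
    }

    // True for a text node that contains something other than whitespace.
    static boolean isNonBlankText(Node n) {
        return n.getNodeType() == Node.TEXT_NODE
                && !n.getNodeValue().trim().isEmpty();
    }

    // Text of the first non-blank child text node and of every later sibling.
    static String textPlus(Element parent) {
        Node start = parent.getFirstChild();
        // Skip elements and blank text nodes until the real text begins.
        while (start != null && !isNonBlankText(start)) {
            start = start.getNextSibling();
        }
        StringBuilder result = new StringBuilder();
        for (Node n = start; n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                result.append(n.getNodeValue());
            } else if (n.getNodeType() == Node.ELEMENT_NODE) {
                // getTextContent() is deep: it includes all descendant text
                result.append(n.getTextContent());
            }
        }
        // Collapse runs of whitespace, as a rendered page would.
        return result.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(textPlus(parseRoot(SAMPLE)));
    }
}
```

In jsoup the equivalent of the blank check would be TextNode.isBlank(); the rest of the walk maps onto nextSibling() and text() as in the method above.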

Hope this helps.

2012-10-24 03:27:49

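Unfortunately not! If I use container.text(), I get EVERYTHING contained in the div. Going back to the example, the result would be: "Something... (text included in) Nested spans and divs... ____ \"From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better\"" – Tex

Got it. I've updated the answer. –

Very close, so accepted :-) – Tex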

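I accepted Guido's answer because, even though it didn't work for me as written, it definitely put me on the right track.

Guido's code takes the text from the first text node and then iterates through its siblings. Unfortunately, my real-world case had two further complications:

1 - It still did not ask specifically for the text from anchors and nothing else. I wanted something more robust, so I added that selection inside Guido's structure.

2 - It would still pick up text from unwanted links, such as the "Comment" and "Permalink" links at the end of every Amazon review. The additional selectors are there to weed those out.

I'm posting the code that did work for me, for future reference. Hope it helps :-)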

import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public static String textPlus(Element elem)
{
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty())
        return "";

    StringBuilder result = new StringBuilder();

    // start at the first text node that is a direct child of elem
    Node currentNode = textNodes.get(0);

    while (currentNode != null)
    {
        if (currentNode instanceof TextNode)
        {
            // plain text between the links
            TextNode currentText = (TextNode) currentNode;
            result.append("\n\n").append(currentText.text());
        }
        else if (currentNode instanceof Element)
        {
            // keep only anchor text, dropping the "Comment" and "Permalink" links
            Element currentElement = (Element) currentNode;
            Elements anchorElements = currentElement.select("a[href]").select(":not(:contains(Comment))").select(":not(:contains(Permalink))");
            for (Element anchorElement : anchorElements)
                result.append("\n\n").append(anchorElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString().trim();
}

2012-10-24 22:05:26 Tex

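I haven't tested this, but according to the jsoup API documentation for the Element class, you should use the text method instead of ownText.

text

public String text()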

Gets the combined text of this element and all its children. 

For example, given HTML <p>Hello <b>there</b> now!</p>, p.text() returns "Hello there now!" 

Returns: 
    unencoded text, or empty string if none. 
See Also: 
    ownText(), textNodes()

ownText

public String ownText()

Gets the text owned by this element only; does not get the combined text of all children. 

For example, given HTML <p>Hello <b>there</b> now!</p>, p.ownText() returns "Hello now!", whereas p.text() returns "Hello there now!". Note that the text within the b element is not returned, as it is not a direct child of the p element. 

Returns: 
    unencoded text, or empty string if none. 
See Also: 
    text(), textNodes()
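
The difference between the two methods can be reproduced on the documentation's own example. The sketch below is only an analogy, under stated assumptions: it uses the JDK's W3C DOM parser rather than jsoup, and OwnVersusDeepText, parseRoot, deepText, ownText and normalize are hypothetical helpers written to mimic what the quoted documentation describes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; mimics the documented behaviour with the JDK's DOM.
final class OwnVersusDeepText {

    // Parses the markup and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("markup is not well-formed XML", e);
        }
    }

    // Collapses runs of whitespace into single spaces and trims the ends.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Deep text: the element's own text plus that of all descendants.
    static String deepText(Element e) {
        return normalize(e.getTextContent());
    }

    // Own text: only the text nodes that are direct children of the element.
    static String ownText(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return normalize(sb.toString());
    }

    public static void main(String[] args) {
        Element p = parseRoot("<p>Hello <b>there</b> now!</p>");
        System.out.println(deepText(p)); // Hello there now!
        System.out.println(ownText(p));  // Hello now!
    }
}
```

Neither variant alone fits the question: the deep text also pulls in the leading divs, while the own text drops the anchors.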

2012-11-05 00:06:56 mirek

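Yes, but unfortunately the DIVs are children of the outer DIV that holds the text, so using text() alone won't work :-) So in the end I did use text(), but together with a filter that eliminates all non-link nodes (element.select("a[href]")) – Tex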
