With Jsoup

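I want to use KrovetzStemmer to stem the words in pages I have downloaded. My biggest problem is that I cannot simply call body().text() on the document and then stem every word, because the href links must not be stemmed at all. So I thought that if I could get the body together with the href links, I could split on the links and then use a LinkedHashMap with Element as the key and either a Boolean or an enum type as the value, specifying whether each Element is text or a link.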

So the problem is: given this HTML

<!DOCTYPE html> 
<html> 
<body> 
<h1> This is the heading part. This is for testing purposes only.</h1> 
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a> 
<p>This is the first paragraph to be considered.</p> 
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a> 
<p>This is the second paragraph to be considered.</p> 
<img border="0" src="/images/pulpit.jpg" alt="Pulpit rock" width="304" height="228"> 
<a href="http://www.thirdsite.com">Third Link</a> 
</body> 
</html>

I want to be able to get only this:

This is the heading part. This is for testing purposes only. 
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a> 
This is the first paragraph to be considered. 
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a> 
This is the second paragraph to be considered. 
<a href="http://www.thirdsite.com">Third Link</a>

Then I would split these and insert them into a LinkedHashMap, so that if I do something like this:

int i = 1;
// A false value marks plain text; a true value marks a link.
for (Entry<Element, Boolean> entry : splitedList.entrySet()) {
    if (!entry.getValue()) {
        System.out.println(i + ": " + entry.getKey());
    }
    i++;
}

it would print:

1: This is the heading part. This is for testing purposes only. 
3: This is the first paragraph to be considered. 
5: This is the second paragraph to be considered.

That way I can apply the stemmer and still keep the iteration order.
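One way to picture the structure described above is an ordered list of segments, each tagged with an enum, rather than a map. This is only a minimal sketch: the names `SegmentListSketch`, `Segment`, `Kind` and `numberedText` are made up for illustration, and the segments are plain strings here instead of Jsoup `Element` objects. A list keeps insertion order and also keeps duplicate entries.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentListSketch {
    // Hypothetical tag saying whether a segment is plain text or a link.
    enum Kind { TEXT, LINK }

    // Hypothetical holder for one piece of the page body.
    static final class Segment {
        final Kind kind;
        final String content;

        Segment(Kind kind, String content) {
            this.kind = kind;
            this.content = content;
        }
    }

    // Returns "position: content" for TEXT segments only.
    // Positions are 1-based and counted over all segments, links included.
    static List<String> numberedText(List<Segment> segments) {
        List<String> out = new ArrayList<>();
        int i = 1;
        for (Segment s : segments) {
            if (s.kind == Kind.TEXT) {
                out.add(i + ": " + s.content);
            }
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment(Kind.TEXT, "This is the heading part. This is for testing purposes only."));
        segments.add(new Segment(Kind.LINK, "<a href=\"http://www.firstsite.com/this is a sub directory/\">First Link</a>"));
        segments.add(new Segment(Kind.TEXT, "This is the first paragraph to be considered."));
        segments.add(new Segment(Kind.LINK, "<a href=\"http://www.secondsite.com/it is the correct page/\">Second Link</a>"));
        segments.add(new Segment(Kind.TEXT, "This is the second paragraph to be considered."));
        segments.add(new Segment(Kind.LINK, "<a href=\"http://www.thirdsite.com\">Third Link</a>"));
        numberedText(segments).forEach(System.out::println);
    }
}
```

With the six segments from the example page, only positions 1, 3 and 5 are printed, which is the numbering the question asks for.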

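Now, I don't know how to implement this, because I don't know how to:

a) Get the body text together with only the href links

b) Split the body (I know we can always use String's split(), but I am talking about the elements of the page body)

How can I accomplish these two things?

I'm also not sure whether my approach is a good one. Is there a better or simpler way to do this?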

2014-03-30 Sarp Kaya

For better help, try adding an example input and the expected output/result, along with some explanation of why it should be that way. – Pshemo

@Pshemo I've added an example now. –

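Now that I understand what you are asking for, I have updated this post with a new answer:

So, assuming doc is the HTML document obtained by parsing the given HTML:

You can select all the a tags and wrap each of them in an <xmp> tag (see here)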

for (Element element : doc.body().select("a")) 
    element.wrap("<xmp></xmp>");

Now you need to load the new HTML into doc again, so that Jsoup does not parse what is inside the <xmp> tags

doc = Jsoup.parse(doc.html()); 
System.out.println(doc.body().text());

The output will be:

This is the heading part. This is for testing purposes only. 
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a> 
This is the first paragraph to be considered. 
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a> 
This is the second paragraph to be considered. 
<a href="http://www.thirdsite.com">Third Link</a>

Now you can go on and do whatever you want with the output.
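As one possible next step, the flattened text above could be cut into alternating text and link pieces with a regular expression. This is a rough sketch using only the standard library: the class `AnchorSplitSketch` and its methods are illustrative names, and the pattern assumes the links appear literally as `<a ...>...</a>` in the text, as in the output shown. It is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorSplitSketch {
    // Matches a literal <a ...>...</a> inside already-flattened text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s[^>]*>.*?</a>", Pattern.DOTALL);

    // Splits the text into pieces, keeping links whole and in their original order.
    static List<String> split(String flat) {
        List<String> parts = new ArrayList<>();
        Matcher m = ANCHOR.matcher(flat);
        int last = 0;
        while (m.find()) {
            addIfNotBlank(parts, flat.substring(last, m.start()));
            parts.add(m.group());
            last = m.end();
        }
        addIfNotBlank(parts, flat.substring(last));
        return parts;
    }

    // True when the whole piece is a single link.
    static boolean isLink(String part) {
        return ANCHOR.matcher(part).matches();
    }

    private static void addIfNotBlank(List<String> parts, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String flat = "This is the first paragraph to be considered. "
                + "<a href=\"http://www.thirdsite.com\">Third Link</a>";
        for (String part : split(flat)) {
            System.out.println((isLink(part) ? "LINK: " : "TEXT: ") + part);
        }
    }
}
```

The pieces come back in document order, so the link pieces can be skipped and the text pieces processed without losing their positions.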

Update: code for splitting, based on the comments

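// Needs java.util.Arrays and java.util.List.
// Wrap each link in <xmp> and surround it with a marker to split on later.
for (Element element : doc.body().select("a"))
    element.wrap("<xmp>split-me-here</xmp>split-me-here");

// Re-parse so that the contents of <xmp> are kept as raw text.
doc = Jsoup.parse(doc.html());

int cnt = 0;
List<String> splitText = Arrays.asList(doc.body().text().split("split-me-here"));
for (String text : splitText) {
    cnt++;
    // Print only the segments that are not links.
    if (!text.contains("</a>"))
        System.out.println(cnt + "." + text.trim());
}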

The code above will print the following output:

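1.This is the heading part. This is for testing purposes only.
3.This is the first paragraph to be considered.
5.This is the second paragraph to be considered.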
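From here, stemming only the text segments and passing the links through unchanged might look roughly like the sketch below. The stemmer is passed in as a function. `TOY_STEMMER` is a deliberately crude stand-in that just drops a trailing "s" and is not the Krovetz algorithm; a real KrovetzStemmer call would be plugged in at that point. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class StemTextOnlySketch {
    // Toy stand-in for a real stemmer: drops one trailing "s" from longer words.
    static final UnaryOperator<String> TOY_STEMMER =
            w -> (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;

    // Stems every word of the non-link segments; link segments are copied as-is.
    static List<String> stemTextSegments(List<String> segments, UnaryOperator<String> stemWord) {
        List<String> out = new ArrayList<>();
        for (String segment : segments) {
            if (segment.contains("</a>")) {
                out.add(segment); // same link test as in the splitting code above
                continue;
            }
            StringBuilder sb = new StringBuilder();
            for (String word : segment.trim().split("\\s+")) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(stemWord.apply(word));
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> segments = Arrays.asList(
                "two paragraphs here",
                "<a href=\"http://www.thirdsite.com\">Third Link</a>");
        stemTextSegments(segments, TOY_STEMMER).forEach(System.out::println);
    }
}
```

Because the result list has the same length and order as the input, the stemmed text and the untouched links can be joined back together afterwards.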

2014-03-30 09:01:21 AKS

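I don't think you understood it correctly. I don't want to remove anything from the elements. As I mentioned, I could simply stem all the words by calling '.body().text()', which already returns the scraped text. –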

So do you need the body text or the element text? – AKS

I need the text in the document body and the 'href' elements. –
