Getting text from multiple pages with JSoup

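I have a set of 1000 pages (links) that I obtained by sending queries to Google. I am using JSoup. I want to get rid of images, links, menus, videos, etc. and keep only the main article from each page.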

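My problem is that every page has a different DOM tree, so I can't use the same command on every page! Do you know of a way to handle all 1000 pages at once? I suppose I have to use regular expressions. Perhaps something like this: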

textdoc.body().select("[id*=main]").text();    // text of elements whose id contains "main"
textdoc.body().select("[class*=main]").text(); // text of elements whose class contains "main"
textdoc.body().select("[id*=content]").text(); // text of elements whose id contains "content"

But I feel I will always miss something. Any better ideas?
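One way to make the selector idea above less brittle is to try several candidate selectors in order and fall back to a rough text-density guess when none matches. The following is only a minimal sketch under assumptions: the class name ArticleText, the candidate list, and the list of removed tags are illustrative choices, not anything confirmed in this thread, and the heuristic will still miss some layouts.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleText {
    // Hypothetical candidate selectors, tried in order; a guess, not an exhaustive list.
    private static final String[] CANDIDATES = {
        "article", "[id*=main]", "[class*=main]", "[id*=content]", "[class*=content]"
    };

    /** Returns the text of the best-guess article container in the given HTML. */
    static String extract(String html) {
        Document doc = Jsoup.parse(html);
        // Drop elements that are rarely part of the article text (assumed list).
        doc.select("script, style, nav, header, footer, aside, img, video, iframe").remove();

        // First pass: the first non-empty match among the candidate selectors.
        for (String selector : CANDIDATES) {
            Element hit = doc.select(selector).first(); // null when nothing matches
            if (hit != null && !hit.text().isEmpty()) {
                return hit.text();
            }
        }

        // Fallback: the div/section whose direct <p> children hold the most text.
        Element best = doc.body();
        int bestLen = 0;
        for (Element block : doc.select("div, section")) {
            int len = 0;
            for (Element child : block.children()) {
                if (child.tagName().equals("p")) {
                    len += child.text().length();
                }
            }
            if (len > bestLen) {
                bestLen = len;
                best = block;
            }
        }
        return best.text();
    }
}
```

For a live page, the HTML could come from Jsoup.connect(url).get().html(), with extract called once per URL in a loop over the 1000 links.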


2012-01-19 argi

Element main = doc.select("div.main").first(); 
Elements links = main.select("a[href]");

Do all of the different pages use a main class for the main article?
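Since some pages may have no div.main at all, first() can return null and the second line of the snippet above would then throw a NullPointerException. A minimal null-safe sketch, assuming a hypothetical helper class MainLinks that works on an HTML string:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MainLinks {
    /** Returns the links inside the first div.main, or an empty collection if there is none. */
    static Elements linksInMain(String html) {
        Document doc = Jsoup.parse(html);
        Element main = doc.select("div.main").first(); // null when nothing matches
        if (main == null) {
            return new Elements(); // empty result instead of a NullPointerException
        }
        return main.select("a[href]"); // only anchors that carry an href attribute
    }
}
```

Returning an empty collection lets a loop over many pages skip the ones that do not match instead of failing on the first of them.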


2012-01-19 11:56:44 JackTurky

That's the problem... I guess not... – argi

Is there anything that is similar on every page? – JackTurky

I can't check 1000 pages :p :p – argi
