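Jsoup: getting thread titles from a forum
I am extracting threads from a crawled HTML file. The parser is supposed to pull out the thread title, the user who posted it, and the total view count. I managed to select the HTML tags, but the problem is that it does not retrieve all of the thread titles, only some of them.
The HTML (sorry for the poor alignment, I copied it from the site's page source):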
<tbody id="threadbits_forum_2">
<tr>
<td class="alt1" id="td_threadstatusicon_3396832">
<img src="http://www.hardwarezone.com.sg/img/forums/hwz/statusicon/thread_hot.gif" id="thread_statusicon_3396832" alt="" border="" />
</td>
<td class="alt2"> </td>
<td class="alt1" id="td_threadtitle_3396832" title="Updated on 3 October 2011
Please check Price Guides for latest prices
A PC Buyer’s Guide that is everything to everyone is simply not possible. This is a simple guide to putting together a PC with a local flavour. Be sure to read PC Buyer’s Guide from other media.
If you have any...">
<div>
<span style="float:right">
<img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/misc/sticky.gif" alt="Sticky Thread" />
</span>
<font color=red><b>Sticky: </b></font>
<a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832" id="thread_title_3396832">Buyer's Guide II: Extreme, High-End, Mid-Range, Budget, and Entry Level Systems - Part 2</a>
<span class="smallfont" style="white-space:nowrap">(<img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/misc/multipage.gif" alt="Multi-page thread" border="0" /> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832">1</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=2">2</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=3">3</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=4">4</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=5">5</a> ... <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=17">Last Page</a>)</span>
</div>
<div class="smallfont">
<span style="cursor:pointer" onclick="window.open('member.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&u=39963', '_self')">adrianlee</span>
</div>
My code so far:
try (BufferedReader br = new BufferedReader(new FileReader(pageThread)))
{
    String html = "";
    while (br.readLine() != null)
    {
        html += br.readLine() + "\n";
    }
    Document doc = Jsoup.parse(html);
    // To get the thread list
    Elements threadsList = doc.select("tbody[id^=threadbits_forum]").select("tr");
    for (Element e : threadsList)
    {
        // To get the title
        System.out.println("Title: " + e.select("a[id^=thread_title]").text());
    }
    System.exit(0);
} catch (Exception e)
{
    e.printStackTrace();
}
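Result:
- Title:
- Title: Want to be part of the HardwareZone Editorial Team?
- Title:
- Title: pa9797 back to PC, new rig!
- Title: EPIC another first for Andyson, Platinum modular PSU
- Title:
- Title: Which shop in SLS is good for buying a new CPU?
... and so on.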
Do you have a solution to this problem?
Thanks.
Please provide a link to the site you are trying to parse!
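One thing worth checking in the question's code is the reading loop: `br.readLine()` is called once in the `while` condition and again in the body, so the line consumed by the condition is discarded and only every second line of the file reaches Jsoup. That alone could explain why some titles go missing. Below is a minimal sketch of the difference, using only the standard library; the class name `PageReader` and both method names are made up for illustration and are not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class PageReader {
    // Pattern from the question: readLine() is called twice per iteration,
    // so the line read by the loop condition is thrown away.
    static String readSkippingLines(BufferedReader br) throws IOException {
        StringBuilder html = new StringBuilder();
        while (br.readLine() != null) {
            html.append(br.readLine()).append("\n");
        }
        return html.toString();
    }

    // Each line is read exactly once, stored in a variable, and then appended.
    static String readAllLines(BufferedReader br) throws IOException {
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            html.append(line).append("\n");
        }
        return html.toString();
    }
}
```

The string returned by `readAllLines` can be passed to `Jsoup.parse(html)` exactly as in the question. Separately, the blank `Title:` lines may come from `tr` rows that contain no `a[id^=thread_title]` link; selecting the links directly with `doc.select("tbody[id^=threadbits_forum] a[id^=thread_title]")` and printing `text()` for each should skip such rows, although this has not been run against the actual page.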