I try to solve this problem with pattern matching. You annotate the source of the webpage itself and use it as the example for the matching, so there is no need to write special rules.
For example, if you look at the source of this page, you see:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?</p>
Then you delete the text and add {.} to mark the place as relevant, and you get:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
{.}
(Usually you also need the closing tags, but for a single element they are not necessary.)
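For reference, a sketch of what the same pattern might look like with the closing tags written out explicitly (the structure simply mirrors the snippet above):
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
{.}
</div>
</div>
</td>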
Then you pass it as a pattern to Xidel (SO seems to block the default user agent, so it needs to be changed):
xidel 'http://stackoverflow.com/questions/36066030/web-crawler-for-unstructured-data' --user-agent "Mozilla/5.0 (compatible; Xidel)" -e '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'
It outputs the text:
Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?
I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them.
I was trying to use Scrapy and get all text without tags and scripts, placed in a body, but it include a lot of un-relevant stuff, like menu items, ad blocks, etc.
site_body = selector.xpath('//body').extract_first()
But doing NLP over such kind of content will not be very precise.
So is there any other tools or approaches for doing such tasks?
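For comparison, since the question already works with XPath selectors in Scrapy, the same element can also be selected with a plain XPath expression instead of a pattern; this is a sketch under that assumption (the expression below is mine, not part of the original answer), and it should print the same text:
xidel 'http://stackoverflow.com/questions/36066030/web-crawler-for-unstructured-data' --user-agent "Mozilla/5.0 (compatible; Xidel)" -e '//td[@class="postcell"]//div[@class="post-text"]'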
Have you tried a visual approach? I would suggest checking out [portia](http://scrapinghub.com/portia/) – eLRuLL