Web crawler for unstructured data

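Are there any web crawlers adapted for parsing many unstructured websites (news, articles) and extracting the main block of content from them without previously defined rules?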

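I mean that when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP on it. I have a lot of websites, and it would take forever to look into their DOM and write rules for each of them.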

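I tried using Scrapy to get all the text inside the body, without tags and scripts, but it includes a lot of irrelevant material such as menu items, ad blocks, etc.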

site_body = selector.xpath('//body').extract_first()

But doing NLP over this kind of content will not be very precise.

So are there any other tools or approaches for such tasks?

Source

2016-03-17 Vit D

Have you tried a visual approach? I suggest checking out [portia](http://scrapinghub.com/portia/) – eLRuLL

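I tried to solve this problem with pattern matching. You annotate the source of the web page itself and use it as an example to match against, so you do not need to write special rules.

For example, if you look at the source code of this page, you see: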

<td class="postcell"> 
<div> 
    <div class="post-text" itemprop="text"> 

<p>Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?</p>

Then you remove the text and add {.} to mark that place as relevant, which gives:

<td class="postcell"> 
<div> 
<div class="post-text" itemprop="text"> 
{.}

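(Usually you need the closing tags as well, but for a single element they are not necessary.)

Then you pass it as a pattern to Xidel (SO seems to block the default user agent, so it needs to be changed):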

xidel 'http://stackoverflow.com/questions/36066030/web-crawler-for-unstructured-data' --user-agent "Mozilla/5.0 (compatible; Xidel)" -e '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'

and it outputs the text:

Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules? 

I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them. 

I was trying to use Scrapy and get all text without tags and scripts, placed in a body, but it include a lot of un-relevant stuff, like menu items, ad blocks, etc. 

site_body = selector.xpath('//body').extract_first() 


But doing NLP over such kind of content will not be very precise. 

So is there any other tools or approaches for doing such tasks?
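For readers working in Python rather than Xidel, the idea of pointing at the element that holds the main text can be roughly approximated with a CSS selector in Beautiful Soup. This is only a sketch and not Xidel's template matching: the selector is derived from the markup shown above, while `extract_block` and `SAMPLE_HTML` are hypothetical names made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet above (an assumption for illustration);
# a real page would be downloaded first.
SAMPLE_HTML = """
<html><body>
<div class="menu"><a href="/">Home</a></div>
<table><tr><td class="postcell">
<div>
  <div class="post-text" itemprop="text">
    <p>Are there any web-crawlers adapted for parsing unstructured websites?</p>
    <p>I want to extract the main content block.</p>
  </div>
</div>
</td></tr></table>
</body></html>
"""


def extract_block(html, selector="td.postcell div.post-text"):
    """Return the text of the first element matching selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        return None
    # Join the text fragments with single spaces and trim whitespace.
    return node.get_text(separator=" ", strip=True)


text = extract_block(SAMPLE_HTML)
print(text)
```

As with the Xidel pattern, the selector still has to be copied from each site's markup once, so this only saves writing extraction logic, not inspecting the page.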

Source

2016-03-17 17:04:31 BeniBela

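With this approach you still need to define all the **div** blocks and their IDs. That does not work for hundreds of websites. –

The idea is that you do not have to write them, you just copy them from the web page. I have used this approach on more than 200 library web pages – BeniBela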

You can use Beautiful Soup inside your parse() and call get_text():

from bs4 import BeautifulSoup, Comment 

soup = BeautifulSoup(response.body, 'html.parser') 

yield {'body': soup.get_text() }

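You can also manually remove the things you do not want (although you may find that you want to keep some markup, e.g. <h1> or <b> tags may be useful signals):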

# Remove invisible tags 
#for i in soup.findAll(lambda tag: tag.name in ['script', 'link', 'meta']): 
#  i.extract()

You could do something similar to whitelist a few tags.
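A minimal sketch of both ideas together, removing invisible tags and then keeping text only from a whitelist, might look like the following. The tag lists and the `whitelisted_text` helper are assumptions chosen for illustration, not part of the original answer, and would need tuning per site.

```python
from bs4 import BeautifulSoup

# Tags whose text is kept; this list is an assumption, adjust it per site.
WHITELIST = ["h1", "h2", "p"]
# Tags that never carry visible article text.
INVISIBLE = ["script", "style", "link", "meta", "noscript"]


def whitelisted_text(html):
    """Return the text of whitelisted tags, one tag per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop invisible elements together with their contents.
    for tag in soup.find_all(INVISIBLE):
        tag.decompose()
    # Collect text only from whitelisted tags, in document order.
    parts = [t.get_text(separator=" ", strip=True) for t in soup.find_all(WHITELIST)]
    return "\n".join(p for p in parts if p)


sample = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<ul><li>Menu item</li></ul>"
    "<h1>Title</h1><p>First <b>bold</b> paragraph.</p>"
    "<div>Ad block</div></body></html>"
)
result = whitelisted_text(sample)
print(result)
```

Note that if the whitelist contains tags that can nest inside each other (for example both div and p), the inner text would be collected twice, so the list is best kept to leaf-like elements.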

Source

2016-03-18 16:06:44 neverlastn
