2011-03-22 67 views
7

Scrapy Body Text Only

I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.

Wishing some scholars might be able to help me here scraping all the text from the <body> tag.

Answers

4

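Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body>? (This assumes it is nested in <html>.) It might be even simpler to use the //body selector: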

x.select("//body").extract() # extract body 

You can find more information about the selectors Scrapy provides here.

+0

Thanks, I know that part. But my question is about getting plain text rather than HTML. Do you know of any way to do that in Scrapy? – mmrs151 2011-03-24 09:40:09

+0

@mmrs151: Try appending '/text()' to the selector. – 2011-03-24 11:19:27

+1

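Appending /text() gets only the text directly inside the body, while //text() gets the text of all of the body's descendant elements as well. But some of those elements contain unwanted content, such as script tags. – spazm 2012-06-09 02:25:12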

2

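It would be nice to get output similar to what lynx -nolist -dump produces, which renders the page and then dumps the visible text. I've come close by extracting the text of all descendants of the paragraph elements.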

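I started with //body//text(), which pulls every text node inside the body, but that includes the contents of script elements. //body//p gets all paragraph elements inside the body, including the implied paragraph tags around untagged text. Extracting text with //body//p/text() misses the text inside sub-tags (such as bold, italic, span, div). //body//p//text() seems to get most of the desired content, as long as the page has no script tags embedded in paragraphs.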
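For pages where script text still leaks through, one option is to filter it out while collecting text. The following is only a minimal sketch using Python's standard-library html.parser rather than Scrapy's selectors; the names VisibleTextParser and visible_text and the sample markup are made up for illustration, and it is nowhere near a full lynx-style renderer.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the non-script, non-style text of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page with a script embedded in a paragraph.
sample = (
    "<html><body><p>Hello <b>bold</b> world"
    "<script>var x = 1;</script></p>"
    "<div>Not in a paragraph</div></body></html>"
)
print(visible_text(sample))
```

Unlike //body//p//text(), this sketch keeps text outside paragraphs too (the div above), so whether it suits a given page depends on how much non-paragraph text you want.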

In XPath, / selects direct children only, while // selects all descendants.
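The child-versus-descendant difference can be tried without Scrapy. This is a rough sketch using the standard library's xml.etree.ElementTree, which supports only a small XPath subset (no text() node test), so text is read through .text and itertext() instead. The snippet is a hypothetical, well-formed fragment; real-world HTML usually needs a forgiving parser.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (an assumption; the stdlib XML parser
# rejects the sloppy HTML found on most real pages).
doc = ET.fromstring(
    "<html><body>"
    "<p>direct <b>nested</b></p>"
    "<div><p>deeper</p></div>"
    "</body></html>"
)
body = doc.find("body")

# "./p" behaves like body/p: only <p> elements that are direct children.
direct = body.findall("./p")
# ".//p" behaves like body//p: <p> elements at any depth.
descendants = body.findall(".//p")
print(len(direct), len(descendants))

first_p = direct[0]
# .text is the text before the first child element; for this snippet
# that matches what p/text() would return.
print(repr(first_p.text))
# itertext() walks all descendant text, similar in spirit to p//text().
print(list(first_p.itertext()))
```

The bold text shows up only in the itertext() result, which mirrors why //body//p/text() drops text inside sub-tags while //body//p//text() keeps it.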

% scrapy shell 
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only') 
In[2]: hxs.select('//body//p//text()').extract() 

Out[2]: 
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.", 
u'Wishing some scholars might be able to help me here scraping all the text from the ', 
u'&lt;body&gt;', 
u' tag.', 
u'Thank you in advance for your time.', 
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ', 
u'/html/body', 
u' path to extract ', 
u'&lt;body&gt;', 
u"? (assuming it's nested in ", 
u'&lt;html&gt;', 
u'). It might be even simpler to use the ', 
u'//body', 
u' selector:', 
u'You can find more information about the selectors Scrapy provides ', 
u'here', 

Join the strings together with spaces and you get fairly readable output:

In [43]: ' '.join(hxs.select("//body//p//text()").extract()) 
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the &lt;body&gt; tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the /html/body path to extract &lt;body&gt; ? (assuming it's nested in &lt;html&gt;). It might be even simpler to use the //body selector: You can find more information about the selectors Scrapy provides here . This is a collaboratively edited question and answer site for professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n    tagged asked 1 year ago viewed 280 times active 1 year ago"
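The joined string still carries runs of non-breaking spaces (\xa0) and stray \r\n sequences. A small post-processing step can collapse them. This is a sketch in Python 3; join_text_nodes is a hypothetical helper name, and the fragments list is a shortened stand-in for the extract() result.

```python
def join_text_nodes(nodes):
    """Join text fragments with single spaces, collapsing whitespace runs."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes \r, \n and the non-breaking space \xa0.
    words = " ".join(nodes).split()
    return " ".join(words)


# Stand-in fragments modelled on the tail of the output above.
fragments = ["about \xbb", "\xa0\xa0\xa0", " faq \xbb ", "\r\n    tagged"]
print(join_text_nodes(fragments))
```

Collapsing whitespace this way also removes line breaks between paragraphs, so keep the original list if paragraph boundaries matter.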