Extracting individual text with Scrapy

This is source code from a website, http://www.example.com, and I want to use a Scrapy crawler to extract every THIS IS A TEXT.

<tr> 
<td> 
<table> 
    <tr> 
     <td colspan="5" style="text-align:left;padding-left:4px;" class="category"> <img src="http://www.example.com/images/menu.gif"> 
     THIS IS A TEXT </td> 
    </tr> 
           <tr> 
     <td class="date" colspan="5">THIS IS A TEXT</td> 
    </tr> 
           <tr> 
     <td style="test-align:left;width:40px;">THIS IS A TEXT</td> 
     <td style="padding-right:4px; width:180px;text-align:right"> 
     THIS IS A TEXT </td> 
             <td style="width:40px;text-align:center"> <nobr><a id="I1" name="I1" 
href="javascript:MoreInformation(1,'1141','1563513','TT','home');"> 
     THIS IS A TEXT</a></nobr> 
     </td> 
     <td style="padding-left:5px; width:180px;text-align:left"> 
     THIS IS A TEXT </td> 
     <td style="width:40px;text-align:center"></td> 
    </tr> 
           <tr> 
     <td style="test-align:left;width:40px;">THIS IS A TEXT </td> 
     <td style="padding-right:4px; width:180px;text-align:right"> 
     THIS IS A TEXT </td> 
             <td style="width:40px;text-align:center"> THIS IS A TEXT </td> 
     <td style="padding-left:5px; width:180px;text-align:left"> 
     THIS IS A TEXT </td> 
     <td style="width:40px;text-align:center"></td> 
    </tr> 
</table> 
</td> 
</tr>

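This is my scrapy_project.py. I tried to extract from the td elements with rows = hxs.select('.//td'), but I don't know how to extract each separate THIS IS A TEXT. I receive this error: u'\n\t\t\t\t\t\t\t\t'. Can someone help me?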

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dirbot.items import Website

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/",
        "",
    ]

    def parse(self, response):

        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@id="content"]//table/tr')
        items = []

        for row in rows:
            item = Website()
            item["job"] = row.select("td[1]/text()").extract()
            item["description"] = row.select("td[0]/a/nobr/text()").extract()
            item["name"] = row.select("td[2]/text()").extract()
            items.append(item)

        return items

Another question: how can I get rid of this: u'\n\t\t\t\t\t\t\t\t'

2013-08-01 Floriano

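_I don't know how to extract each separate THIS IS A TEXT_ is not clear to me. That said, 'td[0]/a/nobr/text()' will not match anything, because positions in XPath start at '1'. 'u'\n\t\t\t\t\t\t\t\t'' is just the leading whitespace extracted by the selector. You can use the '.strip()' method to remove leading and trailing whitespace characters. If you can, you should use more accurate selectors for the 'td' elements, such as 'td[@class="category"]' or 'td[@class="date"]' –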
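The three points in this comment (1-based positions, class predicates, and '.strip()') can be tried out without running a spider. The following is only a rough sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, the fragment and its cell texts are hypothetical, and the helper name clean is made up for illustration. ElementTree supports only a subset of XPath and needs well-formed markup, so treat it as a way to check the expressions, not as what Scrapy does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment modeled on the table in the question.
FRAGMENT = """
<table>
  <tr><td class="category">
      THIS IS A CATEGORY </td></tr>
  <tr><td class="date">THIS IS A DATE</td></tr>
  <tr>
    <td>
      FIRST CELL </td>
    <td>
      SECOND CELL </td>
  </tr>
</table>
"""

table = ET.fromstring(FRAGMENT)


def clean(text):
    """Strip leading/trailing whitespace; treat a missing text node as empty."""
    return (text or "").strip()


# Class-based predicates pick out specific cells regardless of their position.
category = clean(table.find(".//td[@class='category']").text)
date = clean(table.find(".//td[@class='date']").text)

# XPath positions are 1-based: td[1] is the first cell of a row, td[2] the second.
third_row = table.findall("tr")[2]
first_cell = clean(third_row.find("td[1]").text)
second_cell = clean(third_row.find("td[2]").text)

print(category, date, first_cell, second_cell, sep=" | ")
```

In a spider the same cleanup would apply to each string in the list returned by '.extract()', for example with a list comprehension that calls '.strip()' on every element.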

I tried: rows = hxs.select('//td[@class="category"]/a/@href') but I get the error: 'raise ValueError('Missing scheme in request url: %s' % self._url) exceptions.ValueError: Missing scheme in request url:'. Can someone tell me where I went wrong? – Floriano

Can you post your stack trace/console log (preferably on some pastebin service)? Which line of code produces this error? Are you generating 'Request' objects? –
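One possible cause, offered as an assumption since no traceback was posted in the thread: the start_urls list in the question contains an empty string, and a URL with no scheme such as http:// is exactly what this ValueError message describes. A minimal sketch using only the standard library (the helper name has_scheme is hypothetical) to flag such entries before they are turned into requests:

```python
from urllib.parse import urlparse

# The start_urls list as written in the question, including the empty entry.
start_urls = [
    "http://www.example.com/",
    "",
]


def has_scheme(url):
    """Return True when the URL carries a scheme such as http or https."""
    return bool(urlparse(url).scheme)


# Split the list into entries that can be requested and entries that cannot.
valid = [url for url in start_urls if has_scheme(url)]
invalid = [url for url in start_urls if not has_scheme(url)]

print("usable:", valid)
print("missing scheme:", invalid)
```

If this is indeed the cause, removing the empty entry from start_urls would be enough. The same check may help for href values extracted from a page, which are often relative and need to be joined with the page URL first.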

To remove the \n\t\t\t\t\t\t\t\t you can use a regular expression. In your code, instead of .extract() you can use .re(), like:

row.select("td[0]/a/nobr/text()").re('[^\t\n]+')

It will remove your \n\t. Hope this helps :)
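To see what that pattern keeps and drops, here is a small sketch that applies it with the standard re module to strings shaped like the output in the question. The sample strings and the function name tab_newline_free_runs are hypothetical, and re.findall is only an approximation of how '.re()' applies a pattern to each selected node. Note that the 'td[0]' in the line above still matches nothing, as pointed out in the comments, so the position would need to be 1 or higher.

```python
import re

# Hypothetical strings shaped like the .extract() output in the question.
extracted = [
    "\n\t\t\t\t\t\t\t\t",
    "\n     THIS IS A TEXT ",
    "THIS IS A TEXT",
]


def tab_newline_free_runs(strings, pattern=r"[^\t\n]+"):
    """Collect every run of characters that contains no tab or newline."""
    matches = []
    for value in strings:
        matches.extend(re.findall(pattern, value))
    return matches


runs = tab_newline_free_runs(extracted)

# The pattern drops tabs and newlines but keeps ordinary spaces,
# so a final strip() is still useful.
cleaned = [run.strip() for run in runs if run.strip()]

print(runs)
print(cleaned)
```

The string that consists only of tabs and newlines produces no match at all, which is why it disappears from the result.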

2013-08-21 14:13:38 Tushar
