2016-09-15
Python Scrapy: dynamic item with grouping

My page, which I am parsing with XPath, looks like this:

<div style="width:100%;" id="innerTSpec">
     <table width="100%" cellpadding="0" cellspacing="0" class="PrintIE7in80PercentWidth PrintIE6in80PercentWidth">
      <tr><td ></td><td class="techspecheading"> Header1</td></tr>
      <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class="">  </td></tr>
      <tr><td ></td><td class="techspecheading"> </td></tr>
      <tr><td ></td><td class="techspecdata"> My Attribute1: </td><td width="10px"></td><td class="techspecdata"> Value1 </td></tr>
      <tr><td ></td><td class="techspecheading"> </td></tr>
      <tr><td ></td><td class="techspecdata"> My Attribute2: </td><td width="10px"></td><td class="techspecdata"> Value2  </td></tr>
      <tr><td ></td><td class="techspecheading"> </td></tr>
--->  <tr><td ></td><td class="techspecheading"> <hr></td></tr>
      <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class="">  </td></tr>
      <tr><td ></td><td class="techspecheading"> Header2</td></tr>
      <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class="">  </td></tr>
      <tr><td ></td><td class="techspecheading"> </td></tr>
      <tr><td ></td><td class="techspecdata"> My Attribute3: </td><td width="10px"></td><td class="techspecdata"> More Value1  </td></tr>
      <tr><td ></td><td class="techspecheading"> </td></tr>
      <tr><td ></td><td class="techspecdata"> My Attribute4: </td><td width="10px"></td><td class="techspecdata"> More Value2  </td></tr>
      <tr><td ></td><td class="techspecheading"> </td></tr>
      <tr><td ></td><td class="techspecdata"> My Attribute5: </td><td width="10px"></td><td class="techspecdata"> More Value3  </td></tr>
--->  <tr><td ></td><td class="techspecheading"> <hr></td></tr>
     </table>
    </div>

The headers and attributes are not in fixed positions; they change from page to page. I am trying to get output like the following:

Header1              | Header2                   | ...
------------------------------------------------------
My Attribute1:Value1 | My Attribute3:More Value1 | ...
My Attribute2:Value2 | My Attribute4:More Value2 | ...
                     | My Attribute5:More Value3 | ...

Note: the item I am using looks like this:

My Item is as below
--------------------------------------
from scrapy import Item, Field

class Website(Item):
    # Declare a Field on the fly the first time an unknown key is assigned.
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value
--------------------------------------
and in the spider I add values as below
--------------------------------------
item[Heading] = Body.xpath('..........').extract()

Answer


I don't have Scrapy installed, but I think you can easily modify this to use Scrapy Items:

from lxml.html import fromstring 


html = """ 
<div style="width:100%;" id="innerTSpec"> 
     <table width="100%" cellpadding="0" cellspacing="0" class="PrintIE7in80PercentWidth PrintIE6in80PercentWidth"> 
      <tr><td ></td><td class="techspecheading"> Header1</td></tr> 
      <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class="">  </td></tr> 
      <tr><td ></td><td class="techspecheading"> </td></tr> 
      <tr><td ></td><td class="techspecdata"> My Attribute1: </td><td width="10px"></td><td class="techspecdata"> Value1 </td></tr> 
      <tr><td ></td><td class="techspecheading"> </td></tr> 
      <tr><td ></td><td class="techspecdata"> My Attribute2: </td><td width="10px"></td><td class="techspecdata"> Value2  </td></tr> 
      <tr><td ></td><td class="techspecheading"> </td></tr> 
--->  <tr><td ></td><td class="techspecheading"> <hr></td></tr> 
      <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class="">  </td></tr> 
      <tr><td ></td><td class="techspecheading"> Header2</td></tr> 
      <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class="">  </td></tr> 
      <tr><td ></td><td class="techspecheading"> </td></tr> 
      <tr><td ></td><td class="techspecdata"> My Attribute3: </td><td width="10px"></td><td class="techspecdata"> More Value1  </td></tr> 
      <tr><td ></td><td class="techspecheading"> </td></tr> 
      <tr><td ></td><td class="techspecdata"> My Attribute4: </td><td width="10px"></td><td class="techspecdata"> More Value2  </td></tr> 
      <tr><td ></td><td class="techspecheading"> </td></tr> 
      <tr><td ></td><td class="techspecdata"> My Attribute5: </td><td width="10px"></td><td class="techspecdata"> More Value3  </td></tr> 
--->  <tr><td ></td><td class="techspecheading"> <hr></td></tr> 
     </table> 
    </div> 
""" 
body = fromstring(html)

heading = None
item = {}
for tr in body.xpath(r'//div[@id="innerTSpec"]//tr'):
    # Extract row data. Skip rows without data.
    data = tr.xpath(r'.//td[@class]/text()')
    data = list(filter(None, [txt.strip() for txt in data]))
    if not data:
        continue

    # Populate item: a row with a single text cell is a heading,
    # any other non-empty row is an attribute/value pair under that heading.
    if len(data) == 1:
        heading = data[0]
    else:
        item.setdefault(heading, []).append(''.join(data))
print(item)

Resulting item:

{ 
    'Header1': ['My Attribute1:Value1', 'My Attribute2:Value2'], 
    'Header2': ['My Attribute3:More Value1', 'My Attribute4:More Value2', 'My Attribute5:More Value3'] 
} 
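
To get from a grouped dictionary like this to the column layout asked for in the question, something along these lines might work. It is only a sketch: `render_columns` and its `sep` parameter are names made up for this illustration, and it assumes every value is a list of strings. `itertools.zip_longest` pads the shorter columns so the rows stay aligned.

```python
from itertools import zip_longest


def render_columns(item, sep=" | "):
    """Lay out {heading: [values]} as one text column per heading."""
    headings = list(item)
    # Column width is the longest cell (heading or value) in that column.
    widths = [max([len(h)] + [len(v) for v in item[h]]) for h in headings]

    lines = [sep.join(h.ljust(w) for h, w in zip(headings, widths))]
    lines.append("-" * len(lines[0]))
    # zip_longest fills missing cells with "" when a column is shorter.
    for row in zip_longest(*(item[h] for h in headings), fillvalue=""):
        lines.append(sep.join(c.ljust(w) for c, w in zip(row, widths)))
    # Drop trailing padding on each line.
    return "\n".join(line.rstrip() for line in lines)


item = {
    'Header1': ['My Attribute1:Value1', 'My Attribute2:Value2'],
    'Header2': ['My Attribute3:More Value1', 'My Attribute4:More Value2',
                'My Attribute5:More Value3'],
}
print(render_columns(item))
```

The dictionary keeps headings in insertion order, so the columns appear in the same order as the headings on the page.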

Awesome.. thanks @DJV – rajnish3patel
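
For use inside a spider, the grouping loop from the answer could be pulled out into a small helper that only sees lists of cell text, so it does not depend on lxml or Scrapy. This is a rough sketch, not tested against a live Scrapy response: `group_rows` is a hypothetical name, and the Scrapy wiring in the comments assumes `response` is the page response passed to the callback and `Website` is the dynamic item from the question.

```python
def group_rows(rows):
    """Group attribute rows under the most recent heading row.

    `rows` is an iterable of lists of raw cell text, one list per <tr>.
    """
    grouped = {}
    heading = None
    for cells in rows:
        texts = [c.strip() for c in cells if c.strip()]
        if not texts:
            continue  # spacer or <hr> row
        if len(texts) == 1:
            heading = texts[0]  # a single text cell is treated as a heading
        else:
            grouped.setdefault(heading, []).append(''.join(texts))
    return grouped


# Assumed wiring inside a Scrapy callback (not executed here):
#   rows = (tr.xpath('.//td[@class]/text()').extract()
#           for tr in response.xpath('//div[@id="innerTSpec"]//tr'))
#   item = Website()
#   for heading, values in group_rows(rows).items():
#       item[heading] = values

# Stand-in data shaped like the cell text of the sample table.
sample_rows = [
    [' Header1'],
    [' ', '  '],
    [' My Attribute1: ', ' Value1 '],
    [' '],
    [' Header2'],
    [' My Attribute3: ', ' More Value1  '],
]
print(group_rows(sample_rows))
```

As in the answer's loop, attribute rows that appear before any heading end up under the key `None`, and an attribute row whose value cell is empty would be mistaken for a heading.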