Extracting href with Scrapy: it crawls but does not extract

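I am using Selenium and Scrapy to navigate to a data table, and I want to extract the links (the href values) to a CSV file. Nothing I have tried so far seems to work, and I'm not sure what to try next or how to get the links.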

Here is the relevant part of the table I am trying to get the link/href from:

<tr class="even"> 

<td class="paddingColumnValue"> </td> 

<td class="nameColumnValue"><a href="/m/app?service=external/sdata_details&sp=12812" class="sdata" title="Click here for additional details.">click</a></td> 

<td class="amountColumnValue">$600,000.00</td> 

<td class="myListColumnValue"><a href="" onclick="doMyListButton(this.firstChild.getAttribute('src'),this.name);myListHandler(this.name);return false;" onmouseover="return true" name="12812"><img src="/m/images/add.gif" border="0" title="Click to add this to your list" name="A12812"></a></td>


</tr>

The closest I have come to getting actual data is with this code (note that the table has id = search_results):

import time
from scrapy.item import Item, Field
from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ElyseAvenueItem(Item):
    link = Field()

class ElyseAvenueSpider(BaseSpider):
    name = "elyse"
    allowed_domains = ["domain.com"]
    start_urls = [
        'http://www.domain.com']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        el1 = self.driver.find_element_by_xpath("//*[@id='headerRelatedLinks']/ul/li[5]/a")
        el1.click()
        time.sleep(2)
        el2 = self.driver.find_element_by_xpath("/html/body/form/table/tbody/tr[2]/td[2]/table/tbody/tr/td[3]/p[3]/a[1]")
        if el2:
            el2.click()
            time.sleep(2)
        el3 = self.driver.find_element_by_xpath("/html/body/form/table/tbody/tr[2]/td[2]/table[1]/tbody/tr/td[3]/a")
        if el3:
            el3.click()
            time.sleep(20)

            titles = self.driver.find_elements_by_class_name("sdata")
            items = []
            for titles in titles:
                item = ElyseAvenueItem()
                item ["link"] = titles.find_element_by_xpath("//*[@id='search_results']/tbody/tr[2]/td[2]/a")
                items.append(item)
                return item

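Output to CSV: <selenium.webdriver.remote.webelement.WebElement object at 0x03F16E90>

Thank you for your help. If it would be useful, I can post more of my attempts and their output. As I said, all I need is the href; I just can't figure out how to get it.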

2013-07-25 user2608626

You are scraping the Selenium WebElement instance, not its text. Replace:

item ["link"] = titles.find_element_by_xpath("//*[@id='search_results']/tbody/tr[2]/td[2]/a")

with

link = titles.find_element_by_xpath("//*[@id='search_results']/tbody/tr[2]/td[2]/a") 
item ["link"] = link.get_attribute('href')

Hope that helps.
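Going a little beyond the replacement above (this is not part of the original answer), here is a minimal sketch of collecting the href of every matching link rather than one. Two things in the posted loop are worth noting: an XPath that starts with `//` is evaluated from the document root even when called on an element, so each iteration would find the same row-2 link, and `return item` inside the loop stops after the first item. The helper name `collect_links` and the selector `#search_results a.sdata` are assumptions based on the HTML in the question. The string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR`; the generic `find_elements(by, value)` call is used because the `find_element_by_*` helpers were removed in later Selenium 4 releases.

```python
# Locator strategy string; equal to selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def collect_links(driver, selector="#search_results a.sdata"):
    """Return the href of every anchor matched by `selector`.

    `driver` is expected to be a Selenium WebDriver (or anything exposing
    the same find_elements / get_attribute interface).
    """
    anchors = driver.find_elements(CSS_SELECTOR, selector)
    hrefs = []
    for anchor in anchors:
        # get_attribute returns the attribute value, not the WebElement itself.
        href = anchor.get_attribute("href")
        if href:  # skip anchors with an empty or missing href
            hrefs.append(href)
    return hrefs
```

Inside the spider's `parse`, one could then write `for href in collect_links(self.driver): yield ElyseAvenueItem(link=href)`, so that every link becomes its own item instead of returning after the first.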

2013-07-25 21:39:52 alecxe

Thanks for your help. It's close, but it extracted what is between the tags rather than what is inside the tag. I need what is in href="". – user2608626
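The difference between an element's text (what sits between the tags) and its href attribute (what sits inside the opening tag) can be illustrated without a browser. The sketch below uses Python's standard-library `html.parser` instead of Selenium, applied to the cell from the question; `AnchorParser` and `text_and_href` are names made up for this illustration.

```python
from html.parser import HTMLParser

# The name cell from the question's table row.
ROW = (
    '<td class="nameColumnValue">'
    '<a href="/m/app?service=external/sdata_details&sp=12812" class="sdata" '
    'title="Click here for additional details.">click</a></td>'
)


class AnchorParser(HTMLParser):
    """Collects (text, href) for every <a class="sdata"> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "sdata":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


def text_and_href(html):
    """Return a list of (link text, href) pairs found in `html`."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    return parser.links


links = text_and_href(ROW)
```

Here the first element of each pair corresponds to what Selenium exposes as `.text`, and the second to what `get_attribute('href')` reads, although Selenium typically resolves the href to an absolute URL.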

Thanks. Strangely, nothing was scraped this time. – user2608626

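This may sound strange, but I seem to have a problem with formatting in Python. I have read that tabs don't work well and that a line should be indented with four spaces instead. I'm using Notepad++, and so far I haven't been able to type code in myself; I have had to copy and paste other code and then modify it to do what I need. Do you have any suggestions? Could this be why it isn't working? The Firefox browser definitely opens and goes through the links as it is supposed to. The stats don't mention any crawling being performed at all. Sorry for such a rubbish question. :( – user2608626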
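On the tabs-versus-spaces worry: Python 3 raises `TabError` when tabs and spaces are mixed inconsistently, while older interpreters could quietly treat a line as belonging to a different block. A quick way to rule this out is to scan the file for tabs in leading whitespace. The following is a minimal sketch; `find_tab_indents` and the sample source are made up for illustration. Most editors, Notepad++ included, can also be set to insert spaces when Tab is pressed.

```python
def find_tab_indents(source):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip would remove from the left.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad


# Hypothetical spider fragment: lines 3 and 4 are indented with a tab.
SAMPLE = (
    "class Spider:\n"
    "    name = 'elyse'\n"
    "\tdef parse(self, response):\n"
    "\t    pass\n"
)

tab_lines = find_tab_indents(SAMPLE)
```

To check a real file, pass its contents in, for example `find_tab_indents(open("spider.py").read())`, where `spider.py` stands for whatever the spider file is called.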
