Scrape the text of all <a> tags under a span tag using scrapy

I am using scrapy to extract data from a web page. I want to extract the text of the anchor tags under a span tag, as shown below:

<span>.....</span> 
<span id = "size_selection_list"> 
    <a>....</a> 
    <a>....</a> 
    . 
    . 
    . 
    <a> 
</span>

I am using the following XPath logic:

t = sel.xpath('//div[starts-with(@id,"size_selection_container")]/span[2]') 
for x in t.xpath('.//a'): 
....

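This reaches the span element in question, but the loop over the <a> tags never iterates. What is wrong here? Also, each <a> has an href that contains JavaScript. Could that be the cause of the problem?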

Source

2016-11-18 Neel Shah

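Your logic works with the sample HTML you provided: http://pastebin.com/hxSZ041j. So either you are not sharing the code as it really is, or the sample HTML is not the markup you are actually working with.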

If it were up to me, I would use requests and BeautifulSoup4.

Please note that this code is untested, but it should work.

import requests
from bs4 import BeautifulSoup

# yoururlhere is a placeholder: set it to the URL of the page you want to scrape
r = requests.get(yoururlhere).text
# You can use lxml or another parser; the standard parser is used here for compatibility
soup = BeautifulSoup(r, 'html.parser')
# 'theclass' is a placeholder for the class of the element that contains the links
span = soup.find('div', {'class': 'theclass'})
tags = span.find_all('a', href=True)
for i in tags:
    print(i.getText())  # the link text
    print(i['href'])  # the link target

I hope this works; please let me know.

Source

2016-11-18 01:06:21 Will

But I want to crawl with a spider, which is why I would prefer a scrapy-based solution.

May I ask why you are using scrapy or a spider? – Will

This is the best I can do, because your HTML code is incomplete.

import lxml.html 
string = '''<span>.....</span> 
<span id = "size_selection_list"> 
    <a>....</a> 
    <a>....</a> 
    . 
    . 
    . 
    <a>....</a> 
</span>''' 

html = lxml.html.fromstring(string) 
for a in html.xpath('//span[@id="size_selection_list"]//a'): 
    print(a.tag)

Output:

a 
a 
a

Source

2016-11-18 05:29:00

This gives an error.

What error does it give?
