Scraping table data from a web page in Python

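I want to scrape a table of data from a web page. All the tutorials I have found online are too specific and do not explain what each parameter/element is, so I cannot work out how to apply them to my example. Any advice on where to find a good tutorial for scraping this kind of data would be appreciated.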

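import requests
from lxml import html

company = 'company'  # search term
# Let requests build the query string instead of encoding it by hand
page = requests.get('http://www.hoovers.com/company-information/company-search.html', params={'term': company})
tree = html.fromstring(page.text)

# Every XPath step needs a node test: '//[@id="shell"]' is invalid, '//*[@id="shell"]' is valid
table = tree.xpath('//*[@id="shell"]/div/div/div[2]/div[5]/div[1]/div/div[1]')

# Can't get the XPath correct
# This will create a list of companies:
companies = tree.xpath('//...')
# This will create a list of locations:
locations = tree.xpath('//....')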
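Since the question asks what each part of an XPath expression means, here is a minimal sketch on a made-up, well-formed page. It uses the standard library's xml.etree.ElementTree (a swapped-in stand-in for lxml that supports only a subset of XPath and only well-formed markup) so that it runs without installing anything. The SAMPLE markup and its contents are assumptions, not the real Hoovers page. The same expressions should also be accepted by lxml's tree.xpath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the search results page.
SAMPLE = """
<html><body>
  <div id="shell">
    <table>
      <tr><th>Company</th><th>Location</th></tr>
      <tr><td>Acme Corp</td><td>Austin, TX</td></tr>
      <tr><td>Globex</td><td>Springfield</td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# './/*[@id="shell"]' = any element, at any depth, whose id attribute is "shell".
# The '*' is the node test that '//[@id="shell"]' is missing.
shell = root.find('.//*[@id="shell"]')

# './/tr' = every table row below that element; 'td[1]' = the first <td> child
# of the row (XPath positions start at 1, not 0). Header rows hold <th> cells
# only, so they contribute nothing.
companies = [td.text for td in shell.findall('.//tr/td[1]')]
locations = [td.text for td in shell.findall('.//tr/td[2]')]

print(companies)
print(locations)
```

Short expressions anchored on an id or class tend to survive page changes better than long positional chains such as div[2]/div[5]/div[1].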

I have also tried:

import urllib2
from bs4 import BeautifulSoup  # assuming Beautiful Soup 4

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find("table", {"class": "clear data-table sortable-header dashed-table-tr alternate-rows"})

f = open('output.csv', 'w')
for row in table.findAll('tr'):
    # The slice needs ':' ([1:-1]); writing it as [1;-1] raises "SyntaxError: invalid syntax"
    f.write(','.join(''.join([str(i).replace(',','') for i in row.findAll('td',text=True) if i[0]!='&']).split('\n')[1:-1])+'\n')
f.close()

But I get an invalid syntax error on the second-to-last line.
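For the CSV step, a possible alternative to joining strings and stripping commas by hand is the standard library's csv module, which quotes fields that contain commas. This is a Python 3 sketch under assumptions: the rows list is invented sample data, and io.StringIO stands in for the output file.

```python
import csv
import io

# Hypothetical rows, as they might come out of the table after parsing.
rows = [
    ["Company", "Location"],
    ["Acme, Inc.", "Austin, TX"],
    ["Globex", "Springfield"],
]

# io.StringIO stands in for a real file here; with a file, use
# open('output.csv', 'w', newline='') instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)  # fields containing commas are quoted automatically

print(buffer.getvalue())
```

Because the writer handles quoting, the commas inside a company name or location can be kept rather than removed.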

Source

2015-06-15 russell_i

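Yes, Beautiful Soup is the right tool here. The invalid syntax error comes from the slice [1;-1], which should be [1:-1]. Here is a simple example that gets the names: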

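import urllib2
from bs4 import BeautifulSoup  # assuming Beautiful Soup 4

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
# A urllib2 response has no .text attribute (that belongs to requests); read the body instead
soup = BeautifulSoup(page.read(), 'html.parser')
trs = soup.find("div", attrs={"class": "clear data-table sortable-header dashed-table-tr alternate-rows"}).find("table").findAll("tr")
for tr in trs:
    tds = tr.findAll("td")
    if len(tds) < 1:
        continue  # skip rows without <td> cells, e.g. header rows
    name = tds[0].text
    print(name)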
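The row-by-row logic above (take the first cell of every row, skip rows without td cells) can also be sketched as a reusable Python 3 function. To stay runnable without third-party packages, this version swaps Beautiful Soup for the standard library's html.parser. The names FirstCellParser and company_names and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class FirstCellParser(HTMLParser):
    """Collect the text of the first <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.names = []            # one entry per row that has a <td>
        self._td_index = -1        # index of the current <td> within its row
        self._in_first_td = False
        self._chunks = []          # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._td_index = -1    # new row: no <td> seen yet
        elif tag == "td":
            self._td_index += 1
            self._in_first_td = self._td_index == 0
            self._chunks = []

    def handle_data(self, data):
        if self._in_first_td:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_first_td:
            self.names.append("".join(self._chunks).strip())
            self._in_first_td = False


def company_names(html_text):
    """Return the first-column text of each data row in html_text."""
    parser = FirstCellParser()
    parser.feed(html_text)
    return parser.names


# Made-up markup; the header row has only <th> cells and is skipped.
sample = """
<table>
  <tr><th>Company</th><th>Location</th></tr>
  <tr><td><a href="#">Acme Corp</a></td><td>Austin, TX</td></tr>
  <tr><td>Globex</td><td>Springfield</td></tr>
</table>
"""
print(company_names(sample))
```

Taking the markup as an argument keeps the function independent of how the page was fetched, so it can be fed text from urllib2, requests, or a saved file.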

Source

2015-06-15 14:33:19 cfraschetti

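Thanks! Very helpful, but I am trying to do this by fetching the page source rather than reading in an HTML page, because the goal is to build this into a function: hoovers = 'http://www.hoovers.com/company-information/company-search.html?term=company' req = urllib2.Request(hoovers) page = urllib2.urlopen(req) soup = BeautifulSoup(page), but I get an AttributeError when running the third line of the solution! –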

The BeautifulSoup constructor accepts a stream or a string, so you should be able to pass page.text or a stream version. – cfraschetti

Sorry, I don't understand: what do you mean by passing page.text or a stream version? –
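To illustrate the "string or stream" distinction: a stream is a file-like object with a read() method, which is what urllib2.urlopen (urllib.request.urlopen in Python 3) returns. A string is the markup itself, which is what the .text attribute of a requests response holds. An AttributeError is what you would expect when asking a urlopen response for .text. The sketch below is Python 3 with the standard library only. The helper read_markup is hypothetical, io.BytesIO stands in for a real response, and UTF-8 is an assumed encoding.

```python
import io


def read_markup(page):
    """Return markup as text, whether page is a str, bytes, or a file-like object."""
    if hasattr(page, "read"):      # stream: e.g. the object returned by urlopen()
        page = page.read()
    if isinstance(page, bytes):    # raw bytes: decode before parsing (UTF-8 assumed)
        page = page.decode("utf-8", errors="replace")
    return page


# io.BytesIO stands in for a urlopen() response: it has .read() but no .text.
fake_response = io.BytesIO(b"<table><tr><td>Acme Corp</td></tr></table>")
print(hasattr(fake_response, "text"))
markup = read_markup(fake_response)
print(markup)
```

With Beautiful Soup, either form should work: pass the response object itself, or pass the result of its read() call.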
