How to scrape a web page with multiple pages using R or Python

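I want to scrape a web page to collect data for learning data mining. The page contains one large table that spans 43 pages, and it also hides some stocks in an expandable menu at the far right. How can I scrape a web page with multiple pages using R or Python?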


The web page is as follows.

http://data.10jqka.com.cn/market/longhu/yyb/

import bs4
import requests

url = "http://data.10jqka.com.cn/market/longhu/yyb/"

response = requests.get(url)
# Raise on a 4xx/5xx response instead of continuing with an undefined `content`
response.raise_for_status()
content = response.content

# Name the parser explicitly so the result does not depend on what is installed
soup = bs4.BeautifulSoup(content, "html.parser")
table_results = soup.find_all("table", {"class": "m_table"})
for item in table_results:
    # [0] takes only the first matching cell in each table
    company_name = item.find_all("td", {"class": "tl"})[0].text.strip()
    detail = item.find_all("td", {"class": "tc"})[0].text.strip()
    c_rise = item.find_all("td", {"class": "c_rise"})[0].text.strip()
    c_fall = item.find_all("td", {"class": "c_fall"})[0].text.strip()
    cur = item.find_all("td", {"class": "cur"})[0].text.strip()
    lhb_stocklist = item.find_all("div", {"class": "lhb_stocklist"})[0].text.strip()
    print(company_name, detail, c_rise, c_fall, lhb_stocklist)

Source

2014-11-04 Lu Yu

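What have you done so far? Any code? – Eric 2014-11-04 03:43:45

@yan9yu, I tried R with XML and Curl, because I am more comfortable with R. But I still don't know how to scrape this table. I will update my code while you are trying. – 2014-11-04 03:48:05

@yan9yu, hello, could you help me? Thanks! – 2014-11-04 05:52:17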

A solution based on requests, BeautifulSoup, and lxml:

import json
import requests
from bs4 import BeautifulSoup

URL = 'http://data.10jqka.com.cn/interface/market/longhuyyb/stocknum/desc/%d/20'
# Configure end_page as needed, or parse http://data.10jqka.com.cn/market/longhu/yyb/ to detect it automatically
end_page = 2

result = []
for page_idx in range(1, end_page + 1):
    print('Extracting page', page_idx)
    raw_response = requests.get(URL % page_idx)
    # The 'data' field of the JSON response is parsed as HTML below
    page_content = json.loads(raw_response.text)['data']
    html = BeautifulSoup(page_content, 'lxml')
    for row in html.tbody.find_all('tr'):
        company = row.find(class_='tl').text
        detail_link = row.find(class_='tl').a['href']
        buy = float(row.find(class_='c_rise').text)
        sell = float(row.find(class_='c_fall').text)
        stock_cnt = int(row.find(class_='cur').text)
        # Collect (name, link) pairs from the hidden stock list of this row
        stocks = []
        for a in row.find(class_='lhb_stocklist_box hide').p.find_all('a'):
            stocks.append((a.text, a['href']))
        result.append({
            'company': company,
            'detail_link': detail_link,
            'buy': buy,
            'sell': sell,
            'stock_cnt': stock_cnt,
            'stocks': stocks,
        })

print('Company number:', len(result))

I put all the data into a list of dictionaries for easy access. You can modify the code to write directly to CSV or some other format.
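The loop above relies on a hand-set end_page. One possible alternative, sketched here in Python 3, is to keep requesting pages until one comes back with no rows. This rests on an assumption that has not been checked against the site: that the interface returns an empty table once the page index passes the last page. The names scrape_until_empty, fetch_page_rows, max_pages, and fake_fetch are all hypothetical; fake_fetch stands in for the real request-and-parse step so the sketch runs offline.

```python
def scrape_until_empty(fetch_page_rows, max_pages=200):
    """Call fetch_page_rows(1), fetch_page_rows(2), ... until a page is empty.

    fetch_page_rows is a hypothetical callable that returns the list of
    parsed rows for one page. max_pages is a safety cap so the loop always
    terminates, even if the site never returns an empty page.
    """
    result = []
    for page_idx in range(1, max_pages + 1):
        rows = fetch_page_rows(page_idx)
        if not rows:  # assumed signal that we are past the last page
            break
        result.extend(rows)
    return result


def fake_fetch(page_idx):
    """Offline stand-in for requesting URL % page_idx and parsing its rows."""
    pages = {
        1: [{'company': 'A'}, {'company': 'B'}],
        2: [{'company': 'C'}],
    }
    return pages.get(page_idx, [])


all_rows = scrape_until_empty(fake_fetch)
print('Company number:', len(all_rows))
```

In real use, fetch_page_rows would wrap the body of the page loop above (the request, the JSON decoding, and the row parsing) and return the dictionaries for that single page.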

Source

2014-11-04 07:46:32 ZZY

I just updated my code. I ran your code, and it is not exactly what I want, apart from the use of "current month". You can run my code to see what I am after. Thanks! – 2014-11-04 17:52:45

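As far as I can see, the difference is that you are using "http://data.10jqka.com.cn/market/longhu/yyb/". That URL only gives you the records on the first page, doesn't it? So have you solved your problem? – ZZY 2014-11-05 02:03:39

Almost. I mean, could you add code to write the results to a CSV file on the desktop? – 2014-11-05 03:24:54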
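For the CSV request above, a minimal sketch using the csv module from the Python 3 standard library. It assumes result is the list of dictionaries built by the answer's loop, with the same keys. The function name write_results_csv, the file name longhu_yyb.csv, and the Desktop folder under the home directory are illustrative assumptions, and joining the stock names with "; " is just one possible way to fit the nested list into a single cell.

```python
import csv
import os

FIELDNAMES = ['company', 'detail_link', 'buy', 'sell', 'stock_cnt', 'stocks']


def write_results_csv(result, path):
    """Write the list of row dictionaries to a CSV file at `path`."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        for entry in result:
            row = dict(entry)  # copy, so the caller's data is not modified
            # Flatten the (name, link) pairs into one cell, keeping only names
            row['stocks'] = '; '.join(name for name, _link in entry['stocks'])
            writer.writerow(row)


# Assumed location: a Desktop folder in the user's home directory,
# which may not exist on every system.
desktop_csv = os.path.join(os.path.expanduser('~'), 'Desktop', 'longhu_yyb.csv')

# Usage, once `result` has been filled by the scraping loop:
# write_results_csv(result, desktop_csv)
```

If the file will be opened in Excel and the company names show up garbled, changing the encoding to 'utf-8-sig' is worth trying.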
