2011-06-21 77 views
4

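How can I extract data from the Israel Central Bureau of Statistics web query tool?

The following URL:

http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7

gives a data generator published by the Israeli government that limits each extraction to a maximum of 50 series at a time. I would like to know whether it is possible (and if so, how) to write a web scraper, in your favorite language or software, that follows the clicks at each step so that it can retrieve all of the series in a specific topic.

Thanks.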

+3

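You are going to have to pick a language and learn it. Unless you already know Python, Perl, and R, how will you take the solution to the next step once someone answers? Is it possible? Yes. With a specific language? Any of the languages you mentioned will handle your task. Which one do you know well enough to implement your solution? – DavidO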

+0

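I know R well enough to deal with this. But I was not sure whether R could handle it, so I wanted to ask people who use other languages whether it is possible at all or whether there is some fundamental problem. –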

Answers

3

To submit the forms, you can use Python's mechanize module:

import mechanize 
import pprint 
import lxml.etree as ET 
import lxml.html as lh 
import urllib 
import urllib2 

browser=mechanize.Browser() 
browser.open("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7") 
browser.select_form(nr=0) 

Here we take a peek at the available options:

pprint.pprint(browser.form.controls[-2].items) 
# [<Item name='1' id=None selected='selected' contents='Volume of orders for the domestic market' value='1' label='Volume of orders for the domestic market'>, 
# <Item name='2' id=None contents='Orders for export' value='2' label='Orders for export'>, 
# <Item name='3' id=None contents='The volume of production' value='3' label='The volume of production'>, 
# <Item name='4' id=None contents='The volume of sales' value='4' label='The volume of sales'>, 
# <Item name='5' id=None contents='Stocks of finished goods' value='5' label='Stocks of finished goods'>, 
# <Item name='6' id=None contents='Access to credit for the company' value='6' label='Access to credit for the company'>, 
# <Item name='7' id=None contents='Change in the number of employees' value='7' label='Change in the number of employees'>] 

choices=[item.attrs['value'] for item in browser.form.controls[-2].items] 
print(choices) 
# ['1', '2', '3', '4', '5', '6', '7'] 

browser.form['name_tatser']=['2'] 
browser.submit() 

We can repeat this for each of the subsequent forms:

browser.select_form(nr=1) 

choices=[item.attrs['value'] for item in browser.form.controls[-2].items] 
print(choices) 
# ['1576', '1581', '1594', '1595', '1596', '1598', '1597', '1593'] 

browser.form['name_ser']=['1576'] 
browser.submit() 

browser.select_form(nr=2) 

choices=[item.attrs['value'] for item in browser.form.controls[-2].items] 
print(choices) 
# ['32', '33', '34', '35', '36', '37', '38', '39', '40', '41'] 

browser.form['data_kind']=['33'] 
browser.submit() 

browser.select_form(nr=3) 
browser.form['ybegin']=['2010'] 
browser.form['mbegin']=['1'] 
browser.form['yend']=['2011'] 
browser.form['mend']=['5'] 
browser.submit() 
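To collect every series in a topic rather than one hand-picked path, the same select-and-submit steps can be driven by a loop. Because the options on each form depend on what was chosen on the previous one, a depth-first walk is one way to enumerate the combinations. The sketch below is written for Python 3 using only the standard library; `walk_choices`, `get_choices` and `FAKE_SITE` are hypothetical names, and the dictionary merely stands in for the real site. In a real scraper, `get_choices(path)` would presumably open a fresh browser, replay the selections in `path`, and return the values offered by the next form.

```python
from typing import Callable, Iterator, Sequence, Tuple

Path = Tuple[str, ...]


def walk_choices(
    get_choices: Callable[[Path], Sequence[str]],
    depth: int,
    path: Path = (),
) -> Iterator[Path]:
    """Yield every complete selection path of length `depth`, depth-first."""
    if len(path) == depth:
        yield path
        return
    for value in get_choices(path):
        # Descend one form further with this value selected.
        yield from walk_choices(get_choices, depth, path + (value,))


# Hypothetical stand-in for the site: maps the selections made so far
# to the option values offered by the next form.
FAKE_SITE = {
    (): ['1', '2'],
    ('1',): ['1576', '1581'],
    ('2',): ['1594'],
    ('1', '1576'): ['32', '33'],
    ('1', '1581'): ['32'],
    ('2', '1594'): ['40', '41'],
}


def fake_get_choices(path: Path) -> Sequence[str]:
    return FAKE_SITE[path]


paths = list(walk_choices(fake_get_choices, depth=3))
for p in paths:
    print(p)
```

Each tuple in `paths` is one complete set of selections (name_tatser, name_ser, data_kind) that could then be replayed through the forms.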

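At this point, you have three options:

  1. Parse the data from the HTML source
  2. Download an .xls file
  3. Download an XML file

I have no experience parsing .xls files in Python, so I passed on that option.

Parsing the HTML is possible with BeautifulSoup or lxml. Perhaps that would be the shortest solution, but the right XPaths for the HTML were not immediately clear to me, so I went for the XML:

To download the XML from the cbs.gov.il website, you click a button that calls a JavaScript function. Uh oh: mechanize cannot execute JavaScript. Thankfully, the JavaScript merely assembles a new URL. Pulling out the parameters with lxml is easy: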

content=browser.response().read() 
doc=lh.fromstring(content) 
params=dict((elt.attrib['name'],elt.attrib['value']) for elt in doc.xpath('//input')) 
params['king_format']=2 
url='http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml' 
params=urllib.urlencode(dict((p,params[p]) for p in [ 
    'king_format', 
    'tod', 
    'time_unit_list', 
    'mend', 
    'yend', 
    'co_code_list', 
    'name_tatser_list', 
    'ybegin', 
    'mbegin', 
    'code_list', 
    'co_name_tatser_list', 
    'level_1', 
    'level_2', 
    'level_3'])) 

browser.open(url+'?'+params) 
content=browser.response().read() 
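For readers on Python 3, where `urllib.urlencode` has moved to `urllib.parse.urlencode`, the URL-assembly step can be sketched with the standard library alone. This is a minimal illustration, not the code used above: `InputCollector` and `build_xml_url` are hypothetical names, and the sample HTML is made up to show the idea. It also skips `<input>` elements that have no `name` attribute.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class InputCollector(HTMLParser):
    """Collect name/value pairs from every named <input> element."""

    def __init__(self):
        super().__init__()
        self.params = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attrs = dict(attrs)
            if 'name' in attrs:
                self.params[attrs['name']] = attrs.get('value') or ''


def build_xml_url(html, wanted, base_url):
    """Build the download URL from the form fields named in `wanted`."""
    collector = InputCollector()
    collector.feed(html)
    params = {name: collector.params[name] for name in wanted}
    params['king_format'] = 2  # 2 selects the XML format, as in the code above
    return base_url + '?' + urlencode(params)


# Made-up sample page, for illustration only.
sample_html = (
    '<form>'
    '<input type="hidden" name="ybegin" value="2010">'
    '<input type="hidden" name="mbegin" value="1">'
    '<input type="submit" value="Go">'
    '</form>'
)
xml_url = build_xml_url(
    sample_html,
    ['ybegin', 'mbegin'],
    'http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml',
)
print(xml_url)
```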

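Now we reach another stumbling block: the XML is encoded as iso-8859-8-i, and Python does not recognize this encoding. Not knowing what else to do, I simply replaced iso-8859-8-i with iso-8859-8. I do not know what bad side effects this might cause.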

# A hack, since I do not know how to deal with iso-8859-8-i 
content=content.replace('iso-8859-8-i','iso-8859-8') 
doc=ET.fromstring(content) 
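As far as I know, the `-i` suffix (registered in RFC 1556) only signals implicit text directionality for display, while the byte-to-character mapping is the same as plain iso-8859-8, so the substitution should not corrupt the characters themselves. An alternative that does not depend on what the XML parser knows about encodings is to decode the bytes yourself and drop the XML declaration before parsing. The sketch below assumes Python 3 and the standard-library `xml.etree.ElementTree` rather than lxml; `parse_hebrew_xml` is a hypothetical helper and the sample bytes are made up.

```python
import re
import xml.etree.ElementTree as ET


def parse_hebrew_xml(raw: bytes) -> ET.Element:
    """Decode ISO-8859-8 bytes ourselves, then parse the resulting text."""
    text = raw.decode('iso8859_8')
    # Remove the XML declaration, which still names the original encoding.
    text = re.sub(r'^\s*<\?xml[^>]*\?>', '', text, count=1)
    return ET.fromstring(text)


# Made-up sample: 0xE0 and 0xE1 are the Hebrew letters alef and bet.
sample = (
    b'<?xml version="1.0" encoding="iso-8859-8-i"?>'
    b'<series_ts><Series name_ser="\xe0\xe1"/></series_ts>'
)
root = parse_hebrew_xml(sample)
name = root.find('Series').get('name_ser')
print(repr(name))
```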

Once you have come this far, parsing the XML is easy:

for series in doc.xpath('/series_ts/Data_Set/Series'): 
    print(series.attrib) 
    # {'calc_kind': 'Weighted', 
    # 'name_ser': 'Number Of Companies That Answered', 
    # 'get_time': '2011-06-21', 
    # 'name_topic': "Business Tendency Survey - Distributions Of Businesses By Industry, Kind Of Questions And Answers - Manufacturing - Company'S Experience Over The Past Three Months - Orders For Export", 
    # 'time_unit': 'Month', 
    # 'code_series': '22978', 
    # 'data_kind': '5-10 Employed Persons', 
    # 'decimals': '0', 
    # 'unit_kind': 'Number'} 

    for elt in series.xpath('obs'): 
        print(elt.attrib)
        # {'time_period': ' 2010-12', 'value': '40'}
        # {'time_period': ' 2011-01', 'value': '38'}
        # {'time_period': ' 2011-02', 'value': '40'}
        # {'time_period': ' 2011-03', 'value': '36'}
        # {'time_period': ' 2011-04', 'value': '30'}
        # {'time_period': ' 2011-05', 'value': '33'}
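If the goal is to feed the observations into another tool such as R, flattening them into CSV rows is one option. The sketch below assumes Python 3 and the standard library (`xml.etree.ElementTree` in place of lxml); `series_rows` is a hypothetical helper, and the sample document is a cut-down imitation of the structure printed above. Note that the `time_period` values carry a leading space, which is stripped here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Cut-down sample mirroring the structure shown above.
SAMPLE = """<series_ts><Data_Set>
<Series code_series="22978" name_ser="Number Of Companies That Answered">
<obs time_period=" 2010-12" value="40"/>
<obs time_period=" 2011-01" value="38"/>
</Series></Data_Set></series_ts>"""


def series_rows(root):
    """Yield (code_series, time_period, value) for every observation."""
    for series in root.findall('./Data_Set/Series'):
        code = series.get('code_series')
        for obs in series.findall('obs'):
            yield (code, obs.get('time_period').strip(), obs.get('value'))


root = ET.fromstring(SAMPLE)
rows = list(series_rows(root))

# Write the rows to an in-memory CSV; a real script would open a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['code_series', 'time_period', 'value'])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```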
+0

OK, this is very impressive, thank you! –

8

Take a look at WWW::Mechanize and WWW::HtmlUnit.

#!/usr/bin/perl 

use strict; 
use warnings; 

use WWW::Mechanize; 

my $m = WWW::Mechanize->new; 

#get page 
$m->get("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7"); 

#submit the form on the first page 
$m->submit_form(
    with_fields => {
        name_tatser => 2, #Orders for export
    }
);

#now that we have the second page, submit the form on it
$m->submit_form(
    with_fields => {
        name_ser => 1576, #Number of companies that answered
    }
);

#and so on... 

#printing the source HTML is a good way 
#to find out what you need to do next 
print $m->content; 
+0

Wow. Impressive, thank you! –