2015-12-21 81 views

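Creating a DataFrame from a list scraped from a JavaScript table

I have used Selenium to scrape a dynamic JavaScript table of federal employee positions and salary information from http://www.fedsdatacenter.com/federal-pay-rates/index.php?n=&l=&a=SECURITIES+AND+EXCHANGE+COMMISSION&o=&y=all. (Note: this is all public-domain data, so there is no concern about personal information.)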

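I am trying to get it into a pandas DataFrame for analysis. My problem is that the data coming out of Selenium is a list that prints like this: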

[u'DOE,JON'], [u'14'], [u'SK'], [u'$176,571.00'], [u'$2,000.00'], [u'SECURITIES AND EXCHANGE COMMISSION'], [u'WASHINGTON'], [u'GENERAL ATTORNEY'], [u'2012']], ... 

What I would like to end up with is a DataFrame like this, for any number of records processed:

NAME  GRADE SCALE SALARY  BONUS  AGENCY LOCATION POSITION YEAR 
Doe, Jon 14 SK $176,571.00 $2,000.00 SEC DC  ATTY  2012 
. 
. 
. 

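I have tried converting the list to a dictionary, using zip() with the column names as a tuple and the data as a list, and so on, all to no avail, although it has been a good tour of Python's features. What should I do with the data once I have it? Or should I be reading the data in a different way?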

Current scraper code:

from selenium import webdriver 

path_to_chromedriver = '/Users/xxx/Documents/webdriver/chromedriver' # change path as needed 
browser = webdriver.Chrome(executable_path = path_to_chromedriver) 

url = 'http://www.fedsdatacenter.com/federal-pay-rates/index.php' 
browser.get(url) 

inputAgency = browser.find_element_by_id('a') 
inputYear = browser.find_element_by_id('y') 

# Send data 
inputAgency.send_keys('SECURITIES AND EXCHANGE COMMISSION') 
inputYear.send_keys('All') 

# Submit the form, then pick the fourth option in the rows-per-page dropdown 
browser.find_element_by_css_selector('input[type="submit"]').click() 
browser.find_element_by_xpath('//*[@id="example_length"]/label/select/option[4]').click() 

SMRtable = browser.find_element_by_id('example') 

scrapedData = [] 

for td in SMRtable.find_elements_by_xpath('.//td'): 
    scrapedData.append([td.get_attribute('innerHTML')]) 
    print(td.get_attribute('innerHTML')) 

Answers


You can do this with pandas alone.

First, look at the page source of the web page:

http://www.fedsdatacenter.com/federal-pay-rates/index.php?n=&l=&a=SECURITIES+AND+EXCHANGE+COMMISSION&o=&y=all

Check lines 14807 - 14826:

// data table initialization 
$(document).ready(function() { 
    $('#example').dataTable({ 
     "bPaginate": true, 
     "bFilter": false, 
     "bProcessing": true, 
     "bServerSide": true, 
     "aoColumns": [ 
     null, 
     null, 
     null, 
     { "sType": 'currency' }, // set currency columns to allow sorting 
     { "sType": 'currency' }, // set second column to currency to allow sorting 
     null, 
     null, 
     null, 
     null 
     ], 
     "sAjaxSource": "output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all" 
    }); 
}); 

This means the page uses dataTables and the data is loaded from an AJAX source as JSON.

So instead of scraping the HTML, you can get nice, clean JSON:

output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all 

And the final link is (use %20 instead of spaces):

http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all
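
If you want to build that link for other agencies, the percent-encoding can be left to the standard library. This is only a sketch under a few assumptions: Python 3, and a hypothetical helper name `build_output_url`; the parameter names (n, a, l, o, y) are the ones visible in the sAjaxSource string.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.fedsdatacenter.com/federal-pay-rates/output.php"

def build_output_url(agency, year="all"):
    """Return the output.php URL with spaces percent-encoded as %20."""
    # Keep the parameter order used by the page's sAjaxSource string.
    params = [("n", ""), ("a", agency), ("l", ""), ("o", ""), ("y", year)]
    # quote_via=quote encodes spaces as %20 instead of the default '+'.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

url = build_output_url("SECURITIES AND EXCHANGE COMMISSION")
print(url)
```

Passing `quote_via=quote` matters here because `urlencode` would otherwise turn the spaces into `+` signs.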

JSON:

{"sEcho":0,"iTotalRecords":"7072900","iTotalDisplayRecords":"19919","aaData":[ 
["ZUVER,SHAHEEN H","14","SK","$170,960.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"], 
["ZUR,MIA C.","14","SK","$164,875.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"], 
["ZUNDEL,JENNET LEONG","14","SK","$204,638.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","SAN FRANCISCO","ACCOUNTING","2014"], 
["ZUKOWSKI,DAVID W","04","SK","$38,382.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","BOSTON","ADMIN AND OFFICE SUPPORT STUDENT TRAINEE","2014"], 
... 

So you can parse this JSON in pandas with read_json:

import pandas as pd 

df = pd.read_json("http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all") 
print(df.head()) 
               aaData iTotalDisplayRecords \ 
0 [ZUVER,SHAHEEN H, 14, SK, $170,960.00, $0.00, ...     19919 
1 [ZUR,MIA C., 14, SK, $164,875.00, $0.00, SECUR...     19919 
2 [ZUNDEL,JENNET LEONG, 14, SK, $204,638.00, $0....     19919 
3 [ZUKOWSKI,DAVID W, 04, SK, $38,382.00, $0.00, ...     19919 
4 [ZOU,FAN, 14, SK, $166,650.00, $0.00, SECURITI...     19919 

    iTotalRecords sEcho 
0  7072900  0 
1  7072900  0 
2  7072900  0 
3  7072900  0 
4  7072900  0 

Then you build a new DataFrame from the aaData column, using a list comprehension:

df1 = pd.DataFrame([ x for x in df['aaData'] ]) 
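
To see what that step does without hitting the site, here is a minimal sketch on two rows copied from the JSON sample above. It assumes Python 3 and a reasonably recent pandas, and it uses `tolist()`, which I am swapping in as an equivalent of the list comprehension rather than quoting from the original line.

```python
import pandas as pd

# Two rows copied from the JSON sample; each cell of 'aaData' holds one list.
df = pd.DataFrame({
    "aaData": [
        ["ZUVER,SHAHEEN H", "14", "SK", "$170,960.00", "$0.00",
         "SECURITIES AND EXCHANGE COMMISSION", "WASHINGTON",
         "GENERAL ATTORNEY", "2014"],
        ["ZUR,MIA C.", "14", "SK", "$164,875.00", "$0.00",
         "SECURITIES AND EXCHANGE COMMISSION", "WASHINGTON",
         "GENERAL ATTORNEY", "2014"],
    ]
})

# Each inner list becomes one row, and each list item becomes one column.
df1 = pd.DataFrame(df["aaData"].tolist())
print(df1.shape)
```

The resulting columns are numbered 0 to 8 until you assign names to them.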

Set the column names:

df1.columns = ['NAME','GRADE','SCALE','SALARY','BONUS','AGENCY','LOCATION','POSITION','YEAR'] 

print(df1.head()) 
        NAME GRADE SCALE  SALARY BONUS \ 
0  ZUVER,SHAHEEN H 14 SK $170,960.00 $0.00 
1   ZUR,MIA C. 14 SK $164,875.00 $0.00 
2 ZUNDEL,JENNET LEONG 14 SK $204,638.00 $0.00 
3  ZUKOWSKI,DAVID W 04 SK $38,382.00 $0.00 
4    ZOU,FAN 14 SK $166,650.00 $0.00 

           AGENCY  LOCATION \ 
0 SECURITIES AND EXCHANGE COMMISSION  WASHINGTON 
1 SECURITIES AND EXCHANGE COMMISSION  WASHINGTON 
2 SECURITIES AND EXCHANGE COMMISSION SAN FRANCISCO 
3 SECURITIES AND EXCHANGE COMMISSION   BOSTON 
4 SECURITIES AND EXCHANGE COMMISSION  WASHINGTON 

            POSITION YEAR 
0       GENERAL ATTORNEY 2014 
1       GENERAL ATTORNEY 2014 
2        ACCOUNTING 2014 
3 ADMIN AND OFFICE SUPPORT STUDENT TRAINEE 2014 
4   INFORMATION TECHNOLOGY MANAGEMENT 2014 
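
If you would rather keep the Selenium output from the question, the flat list of one-item lists can be regrouped into rows before it goes to pandas. A rough sketch, assuming Python 3, that every table row has exactly nine cells, and that no placeholder cells (such as a "processing" row) were scraped; `cells_to_rows` is a hypothetical helper name.

```python
COLUMNS = ['NAME', 'GRADE', 'SCALE', 'SALARY', 'BONUS',
           'AGENCY', 'LOCATION', 'POSITION', 'YEAR']

def cells_to_rows(scraped, width=len(COLUMNS)):
    """Group a flat list of one-item lists into rows of `width` cells."""
    # Unwrap [[cell], [cell], ...] into [cell, cell, ...].
    flat = [cell[0] for cell in scraped]
    if len(flat) % width:
        raise ValueError("cell count is not a multiple of the row width")
    # Slice the flat list into consecutive chunks, one chunk per table row.
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# The single record shown in the question, in the shape the scraper produces.
scraped_data = [['DOE,JON'], ['14'], ['SK'], ['$176,571.00'], ['$2,000.00'],
                ['SECURITIES AND EXCHANGE COMMISSION'], ['WASHINGTON'],
                ['GENERAL ATTORNEY'], ['2012']]
rows = cells_to_rows(scraped_data)
print(rows)
```

From there, `pd.DataFrame(rows, columns=COLUMNS)` should give the nine-column frame the question asks for.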

This is great, thank you! I also need to get a better grasp of JavaScript. – user2559269


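Actually, I found a further limitation which suggests scraping may still be necessary: while "iTotalDisplayRecords" is "19919", the resulting DataFrame contains only 100 rows, matching the maximum of 100 rows that the row-selection element allows. Do you know of any way around this? – user2559269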


You could try this URL: http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all&sEcho=4&iColumns=9&sColumns=&iDisplayStart=0&iDisplayLength=100000 and maybe try changing the last number, 100000. – jezrael
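
If one very large iDisplayLength turns out not to work, another option might be to request the records page by page. The sketch below only builds the URLs. It assumes Python 3 and that the server honours iDisplayStart and iDisplayLength as offset and page size, which I have not verified; `page_urls` is a hypothetical helper name, and the sEcho and iColumns values are copied from the URL in the comment above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.fedsdatacenter.com/federal-pay-rates/output.php"

def page_urls(agency, total, page_size=100):
    """Yield one output.php URL per page of `page_size` records."""
    for start in range(0, total, page_size):
        params = [("n", ""), ("a", agency), ("l", ""), ("o", ""), ("y", "all"),
                  ("sEcho", 4), ("iColumns", 9), ("sColumns", ""),
                  ("iDisplayStart", start), ("iDisplayLength", page_size)]
        # quote_via=quote keeps spaces as %20, matching the URLs above.
        yield BASE_URL + "?" + urlencode(params, quote_via=quote)

# 19919 is the iTotalDisplayRecords value reported in the JSON sample.
urls = list(page_urls("SECURITIES AND EXCHANGE COMMISSION", total=19919))
print(len(urls))
```

Each URL could then be read with pd.read_json and the per-page frames combined with pd.concat, ideally with a short pause between requests.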