The data is retrieved with an Ajax request. You can do a GET against the same endpoint and it returns nicely formatted JSON:
import requests

json = requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/?q=*:*&fl=Dataset-PersistentId,Dataset-ShortName-Full&rows=2147483647&fq=DatasetPolicy-AccessType-Full:(OPEN+OR+PREVIEW+OR+SIMULATED+OR+REMOTE)+AND+DatasetPolicy-ViewOnline:Y&wt=json").json()
print(json)
We only need a couple of keys to pull out the data:
from pprint import pprint as pp
pp(json["response"]["docs"])
A snippet of the output:
[{'Dataset-PersistentId': 'PODAAC-MODST-M8D9N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_8DAY_9KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODST-MAN4N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_ANNUAL_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-MMO9N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_MONTHLY_9KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODST-M1D9N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_DAILY_9KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-GHMTG-2PN01',
'Dataset-ShortName-Full': 'NAVO-L2P-AVHRRMTA_G'},
{'Dataset-PersistentId': 'PODAAC-GHBDM-4FD01',
'Dataset-ShortName-Full': 'DMI-L4UHfnd-NSEABALTIC-DMI_OI'},
{'Dataset-PersistentId': 'PODAAC-GHGOY-4FE01',
'Dataset-ShortName-Full': 'EUR-L4HRfnd-GLOB-ODYSSEA'},
{'Dataset-PersistentId': 'PODAAC-GHMED-4FE01',
'Dataset-ShortName-Full': 'EUR-L4UHFnd-MED-v01'},
{'Dataset-PersistentId': 'PODAAC-NSGDR-L2X02',
'Dataset-ShortName-Full': 'NSCAT_LEVEL_2_V2'},
{'Dataset-PersistentId': 'PODAAC-MODST-M1D4N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_DAILY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-MMO4N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_MONTHLY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODST-MMO4N',
'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_MONTHLY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-MAN9N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_ANNUAL_9KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-M8D4N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_8DAY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-MODSA-M1D4N',
'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_DAILY_4KM_NIGHTTIME'},
{'Dataset-PersistentId': 'PODAAC-GOES3-24HOR',
'Dataset-ShortName-Full': 'GOES_L3_SST_6km_NRT_SST_24HOUR'},
That gives you every ID and short-name pair from the table, with no need for BS4 at all.
To get the IDs, just access each dict with the key Dataset-PersistentId:
for d in json["response"]["docs"]:
    print("ID for {Dataset-ShortName-Full} is {Dataset-PersistentId}".format(**d))
Some of the output:
ID for OSTM_L2_OST_OGDR_GPS is PODAAC-J2ODR-GPS00
ID for JPL-L4UHblend-NCAMERICA-RTO_SST_Ad is PODAAC-GHRAD-4FJ01
ID for SEAWINDS_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_IMAGES is PODAAC-SEABY-ANBIM
ID for SEAWINDS_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_MAPS_LITE is PODAAC-SEABY-ANBML
ID for CCMP_MEASURES_ATLAS_L4_OW_L3_5A_5DAY_WIND_VECTORS_FLK is PODAAC-CCF35-01AD5
ID for QSCAT_BYU_L3_OW_SIGMA0_ARCTIC_POLAR-STEREOGRAPHIC_BROWSE_MAPS_LITE is PODAAC-QSBYU-ARBML
ID for MODIS_AQUA_L3_SST_MID-IR_ANNUAL_4KM_NIGHTTIME is PODAAC-MODSA-MAN4N
ID for UCLA_DEALIASED_SASS_L3 is PODAAC-SASSX-L3UCD
ID for NSCAT_LEVEL_1.7_V2 is PODAAC-NSSDR-17X02
ID for NSCAT_LEVEL_3_V2 is PODAAC-NSJPL-L3X02
ID for AVHRR_NAVOCEANO_L3_18km_MCSST_DAYTIME is PODAAC-NAVOC-318DY
ID for QSCAT_L3_OW_JPL_BROWSE_IMAGES is PODAAC-QSXXX-L3BI0
ID for QSCAT_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_IMAGES is PODAAC-QSBYU-ANBIM
ID for NAVO-L4HR1m-GLOB-K10_SST is PODAAC-GHK10-41N01
ID for NCDC-L4LRblend-GLOB-AVHRR_AMSR_OI is PODAAC-GHAOI-4BC01
ID for SEAWINDS_LEVEL_3_V2 is PODAAC-SEAXX-L3X02
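If you want to look up an ID by short name rather than just print each pair, the same list of dicts can be folded into a mapping. This is only a minimal sketch: it assumes the docs have the shape shown in the output above, the two sample entries are copied from that output, and the name `ids_by_short_name` is hypothetical.

```python
# Sample entries copied from the output above; in practice you would use
# docs = json["response"]["docs"] from the request.
docs = [
    {"Dataset-PersistentId": "PODAAC-NSGDR-L2X02",
     "Dataset-ShortName-Full": "NSCAT_LEVEL_2_V2"},
    {"Dataset-PersistentId": "PODAAC-GHGOY-4FE01",
     "Dataset-ShortName-Full": "EUR-L4HRfnd-GLOB-ODYSSEA"},
]

# Map each short name to its persistent ID (hypothetical variable name).
ids_by_short_name = {
    d["Dataset-ShortName-Full"]: d["Dataset-PersistentId"] for d in docs
}

print(ids_by_short_name["NSCAT_LEVEL_2_V2"])  # PODAAC-NSGDR-L2X02
```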
There is a second Ajax request that returns further data:
json = requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/granule/select/?q=*&fq=Granule-AccessType:(OPEN+OR+PREVIEW+OR+SIMULATED+OR+REMOTE)+AND+Granule-Status:ONLINE&facet=true&facet.field=Dataset-ShortName-Full&rows=0&facet.limit=-1&facet.mincount=1&wt=json").json()
from pprint import pprint as pp
pp(json)
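With rows=0 and facet=true, the useful part of this response is the facet counts rather than the docs. Assuming the server uses Solr's default JSON facet layout (a flat list alternating value and count under facet_counts / facet_fields), pairing them up might look roughly like this. The sample payload and its counts are made up for illustration and are not real output, and the helper name `facet_counts` is hypothetical.

```python
# Made-up sample in the shape Solr returns by default for facet fields:
# a flat list alternating facet value and count. The counts are illustrative.
sample = {
    "facet_counts": {
        "facet_fields": {
            "Dataset-ShortName-Full": [
                "NSCAT_LEVEL_2_V2", 12,
                "NSCAT_LEVEL_3_V2", 7,
            ]
        }
    }
}


def facet_counts(payload, field="Dataset-ShortName-Full"):
    """Pair up the flat [value, count, value, count, ...] facet list."""
    flat = payload["facet_counts"]["facet_fields"][field]
    # Even positions hold the values, odd positions hold their counts.
    return dict(zip(flat[::2], flat[1::2]))


counts = facet_counts(sample)
print(counts)
```

If the real response is laid out differently, only the two lines inside the helper would need to change.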
You can also change some of the parameters to get different output.
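Editing that long URL by hand is error-prone, so one option is to build the query string from a dict and change only the values you care about. The sketch below uses the parameters from the first request. The helper name `dataset_query_url` is hypothetical, and it assumes the server accepts the percent-encoded form of characters such as `:` and `*` that `urlencode` produces.

```python
from urllib.parse import urlencode

BASE = "http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/"


def dataset_query_url(rows=10,
                      fields=("Dataset-PersistentId", "Dataset-ShortName-Full")):
    """Build the dataset query URL; rows and fields are the knobs to vary."""
    params = {
        "q": "*:*",
        "fl": ",".join(fields),
        "rows": rows,
        # Spaces are encoded as '+', matching the original URL.
        "fq": "DatasetPolicy-AccessType-Full:"
              "(OPEN OR PREVIEW OR SIMULATED OR REMOTE)"
              " AND DatasetPolicy-ViewOnline:Y",
        "wt": "json",
    }
    return BASE + "?" + urlencode(params)


url = dataset_query_url(rows=5)
print(url)
# The result can then be fetched as before:
# data = requests.get(url).json()
```

Asking for only a few rows is a cheap way to see what a changed parameter does before requesting the full result set.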
Use Selenium...
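Did you check your 'html.text'? You only get the headers because that is all the HTML contains. The table is evidently populated by some JS after the document has loaded in the browser. Since you are using 'requests', that JS is never executed, hence the empty table. It has nothing to do with "the table being big".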
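One quick way to confirm this is to count header and data cells in the HTML you actually received. The sketch below uses only the standard library. The `static_html` string is a made-up stand-in for 'html.text', and its column names are hypothetical.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count <th> and <td> start tags seen in the static HTML."""

    def __init__(self):
        super().__init__()
        self.th = 0
        self.td = 0

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.th += 1
        elif tag == "td":
            self.td += 1


# Stand-in for html.text: a table with a header row and an empty body,
# which is what a JS-populated table looks like before any script runs.
static_html = """
<table>
  <thead><tr><th>Dataset ID</th><th>Short Name</th></tr></thead>
  <tbody></tbody>
</table>
"""

counter = CellCounter()
counter.feed(static_html)
print(counter.th, counter.td)  # 2 0
```

Header cells with zero data cells suggests the rows are filled in client-side, so the data has to come from the underlying Ajax endpoint or from a real browser.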