2015-04-14 31 views
2

Extracting text from a chart with Beautiful Soup

I am relatively new to BeautifulSoup and am trying to extract data from this web page: http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx?name=GTL8gAmmdulY5GSlycy7WQ==&dataType=hIp9ibmBIwbKor1WvT5Bkg==&dataTypeText=hIp9ibmBIwbKor1WvT5Bkg==

I want to grab the numbers under the headings "Program Completers", "Employed Second Quarter", and so on. The relevant part of the HTML is:

<ul class="listbox">    
<li class="li1"> 
    <p style="cursor:help" class="listtop" title="WIA Adult 
    completers are those individuals who have exited a WIA Adult program from 
    which the individual received a core staff-assisted service (such as job 
    search or placement assistance) or an intensive service (such as 
    counseling, career planning, or job training). Those individuals who 
    participated in WIA through self-service, like OhioMeansJobs.com, or other 
    less intensive programs are not included in the dashboard.">Program 
    Completers</p> 
    <p id="programcompleters1">18</p></li> 

I want the strings "Program Completers" and "18". I have tried implementing the solutions here, here, and here, but without much luck. One version of my code is:

from bs4 import BeautifulSoup 
import urllib2 

url="http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx?name=GTL8gAmmdulY5GSlycy7WQ==&dataType=hIp9ibmBIwbKor1WvT5Bkg==&dataTypeText=hIp9ibmBIwbKor1WvT5Bkg==" 
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36', 
     'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'} 

req = urllib2.Request(url, headers=hdr) 
page = urllib2.urlopen(req) 

soup = BeautifulSoup(page) 
for tag in soup.find_all('ul'): 
    print tag.text, tag.next_sibling 

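This returns text, but it also picks up 'ul' tags from other parts of the page. I have had no success grabbing any text from inside the chart area. How can I retrieve the text I want?

Thanks for your help!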

+0

Thanks! Both answers worked, but @Matt_Davidson's solution got me more specific data. – brbarkley

Answers

0

As mentioned before, the data you are looking for is inside an iframe. Access it directly, as @chosen_codex says, here:

http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8=

Then you can access the fields you are interested in:

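# `soup` is the BeautifulSoup object built from the iframe URL above.
data = {}
for tag in soup.find_all('p'):
    # Keep only <p> tags that carry an id, keyed by that id.
    if tag.get('id'):
        data[tag.get('id')] = tag.text

print(data)

>>> print(data.get('programcompleters1'))
18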
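
If you also want the label ("Program Completers") next to each number, one option is to walk the list items and pair the two <p> tags. This is only a sketch under assumptions: it is written for Python 3, it runs against a trimmed copy of the fragment from the question (title and style attributes dropped, closing </ul> added), and it assumes every list item follows the same label-then-value layout. `extract_metrics` is a name made up for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fragment quoted in the question. The long title/style
# attributes are dropped and a closing </ul> is added so the sample is
# well-formed; the live page may differ.
SAMPLE_HTML = """
<ul class="listbox">
<li class="li1">
    <p class="listtop">Program
    Completers</p>
    <p id="programcompleters1">18</p></li>
</ul>
"""


def extract_metrics(html):
    """Return {label: value} for each <li> holding a label/value pair."""
    soup = BeautifulSoup(html, "html.parser")
    metrics = {}
    for listbox in soup.find_all("ul", class_="listbox"):
        for item in listbox.find_all("li"):
            label_tag = item.find("p", class_="listtop")
            value_tag = item.find("p", id=True)  # first <p> that has an id
            if label_tag is None or value_tag is None:
                continue  # skip items that do not match the assumed layout
            # The label wraps across lines in the source, so collapse whitespace.
            label = " ".join(label_tag.get_text().split())
            metrics[label] = value_tag.get_text(strip=True)
    return metrics


metrics = extract_metrics(SAMPLE_HTML)
print(metrics)
```

The values come back as strings, so convert them with int() if you need numbers.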
0

The element you want is inside an iframe. Try http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8=

So, to extract from that page itself, this should work:

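from bs4 import BeautifulSoup
import urllib2  # Python 2; on Python 3 use urllib.request instead

url = "http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8="
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly

# Collect every <div class="chartcontain"> on the page.
chartcontainers = soup.find_all('div', {"class": "chartcontain"})
for container in chartcontainers:
    print(container)
    # then do whatever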
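
If you would rather not dig the iframe address out of the browser by hand, something like the following might find it for you. Treat it as a rough Python 3 sketch: the outer-page markup and the `example.aspx` src below are invented for illustration (the thread never shows the real <iframe> tag), and `iframe_urls` is a hypothetical helper name. It also only works when the src is present in the static HTML rather than set by JavaScript.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(html, base_url):
    """Return absolute URLs for every <iframe> in html that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    # urljoin resolves relative src values against the outer page's URL.
    return [urljoin(base_url, frame["src"])
            for frame in soup.find_all("iframe", src=True)]


# Hypothetical outer page: the real markup and src value are not shown
# in the thread, so this string is illustrative only.
outer_html = (
    '<html><body>'
    '<iframe src="/WIAReports/example.aspx?level=county"></iframe>'
    '</body></html>'
)
base = "http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx"

urls = iframe_urls(outer_html, base)
print(urls)
```

Each URL returned can then be fetched and parsed the same way as the page above.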