2015-02-09 108 views

How do I get the following output from an HTML page? Parsing HTML in Python with BeautifulSoup

html_string = '''<td class="status_icon" rowspan="2"><img alt="QUEUED" src="images/arts/status_QUEUED.png" style="border:none" title="QUEUED"/></td> 
 

 
><td class="test"> v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200 
 
>  <div class="start">(04.02) 23:29</div> 
 
> \t \t <div class="end">~ 
 
>  <span style="color:green">() </span> 
 
> \t </div> 
 
></td> 
 

 
><td>mcordeix</td> 
 
><td>1614809</td> 
 

 
><td><a href="?command=compoundinfo&amp;test_id=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200 " onmouseover="Tip('compounds completed/running/queued')"target="_blank">0/0/0 of 0</a></td> 
 
><td>high</td> 
 
><td style="white-space:nowrap"><img class="pbar" src="images/arts/bar_green.gif" style="border-right:2px;border-right-style:solid;border-right-color:#ffffff" width="1%"/><img class="pbar" src="images/arts/bar_gray.gif" width="99%"/></td> 
 
><td></td> 
 
><td></td> 
 
><td></td> 
 
><td></td> 
 
><td colspan="4"> 
 
><!-- Florent Vial: this can be alway shown if admin=1 --> 
 
><a href="?command=getrequest&amp;test_id=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200" target="_blank">XML</a> 
 
><a href="?command=getrequest&amp;test_id=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200&amp;raw=1" target="_blank">Raw XML</a> 
 
><a href="?command=compoundinfo&amp;test_id=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200" target="_blank">CINFO</a> 
 
></td> 
 
><td></td> 
 
><td><!-- <script type="text/javascript">DIVShowHideDetails('func:DoPrintArtsDetails')</script> --> </td> 
 
><td></td> 
 
><td></td> 
 
><td></td> 
 
><td></td> 
 
''' 
 
Expected output:
 
------- 
 
Status="QUEUED" 
 
test=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200 
 
start=(04.02) 23:29 
 
end=~ 
 
user=mcordeix

Answers


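Welcome to Stack Overflow! Please read the How to ask a question section of our FAQ.

Explain how you ran into the problem you are trying to solve, and what has kept you from solving it yourself.

What have you tried so far?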


Let me give you a start.

All you need are the find and find_all functions.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, 'html.parser')

# Get the 'alt' attribute of the first <img> tag.
status = soup.find('img').get('alt')

# Find the first <td> with class="test", take its text, split it on whitespace,
# and keep the first piece.
version = soup.find('td', class_='test').text.split()[0]
time_start = soup.find('div', class_='start').text
time_end = soup.find('div', class_='end').text
user = soup.find_all('td')[2].text  # text of the third <td>

print(status)      # QUEUED
print(version)     # v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
print(time_start)  # (04.02) 23:29
print(time_end)    # ~ >  () >   (more than just '~')
print(user)        # mcordeix

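This takes about 10 minutes of reading bs4's documentation and then trying it yourself.
Just open a Python interpreter, assign the html_string variable, import the beautifulsoup library, and experiment.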
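
As a rough illustration of that kind of experiment, here is a minimal sketch of how find and find_all differ. The `sample` string is a shortened, hypothetical stand-in for the markup in the question, and `html.parser` is simply the parser bundled with Python, not necessarily the one you have to use.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample; not the full markup from the question.
sample = '''<td class="status_icon"><img alt="QUEUED" src="status.png"/></td>
<td class="test">job_1 <div class="start">(04.02) 23:29</div></td>
<td>mcordeix</td>'''

soup = BeautifulSoup(sample, 'html.parser')

first_td = soup.find('td')      # first matching tag, or None if nothing matches
all_tds = soup.find_all('td')   # list-like ResultSet with every matching tag

first_td_classes = first_td.get('class')  # 'class' is multi-valued, so this is a list
td_count = len(all_tds)
missing = soup.find('table')    # the sample contains no <table>

print(first_td_classes)
print(td_count)
print(missing)
```

Because find returns None when nothing matches, chaining `.text` or `.get(...)` directly onto it raises an AttributeError on pages that lack the tag, so a check for None is worth adding in real code.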

I'm sure you can work out the problem with the time_end content yourself. It isn't hard.
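
One possible way to approach it, sketched under assumptions: the extra text comes from the `<span>` nested inside the `<div class="end">`, so reading only the div's direct text nodes leaves it out. The `sample` markup below is a shortened, hypothetical version of the question's HTML, and it assumes the stray leading `>` characters in the pasted string are not part of the real page; if they are, they would need to be stripped as well.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the question's markup (hypothetical; it assumes
# the stray leading '>' characters are not part of the real page).
sample = '''<td class="test"> job_1
  <div class="start">(04.02) 23:29</div>
  <div class="end">~
    <span style="color:green">() </span>
  </div>
</td>'''

soup = BeautifulSoup(sample, 'html.parser')
end_div = soup.find('div', class_='end')

# Keep only the text nodes that are direct children of the <div>;
# recursive=False skips the nested <span> and its '()' text.
direct_text = end_div.find_all(string=True, recursive=False)
time_end = ''.join(direct_text).strip()

print('end=' + time_end)
```

The `string=True` argument needs a reasonably recent bs4 release; older versions spell the same filter `text=True`.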