2011-06-28

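Unable to read HTML data - Python

I am trying to parse HTML data from a website using BeautifulSoup for Python. However, neither urllib2 nor mechanize reads the full HTML of the page. The data returned is: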

<html> 
<head> 
    <title> 
    EC 4.1.2.13 - Fructose-bisphosphate aldolase </title> 
    <meta name="description" content="Information on EC 4.1.2.13 - Fructose-bisphosphate aldolase"> 
    <meta name="keywords" content="EC,Number,Enzyme,Pathway,Reaction,Organism,Substrate,Cofactor,Inhibitor,Compound,KM Value,KI Value,IC50 Value,pi Value,Turnover Number,pH,Temperature,Optimum,Range,Source Tissue,BLAST,Subunits,Modification,Crystallization,Stability,Purification"> 
</head> 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd"> 
<frameset cols="190,*" border="0"> 
    <frame name="navigation" src="flat_navigation.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no"> 
    <frameset rows="110,*" border="0"> 
      <frame name="header" src="flat_head.php4?ecno=4.1.2.13" frameborder="no"> 

     <frame name="flat" src="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no"> 

    </frameset> 
</frameset> 
<noframes> 
<body> 
<h1>EC 4.1.2.13 - Fructose-bisphosphate aldolase </h1> 

<a href="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475">More detailed information on the enzyme EC 4.1.2.13 - Fructose-bisphosphate aldolase</a> 

Sorry, but your browser doesn't support frames. Please use another browser! 
</body> 
</noframes> 
</html> 

When I open the website manually in Internet Explorer, the full HTML is readable. Is there any way to solve this with urllib2, mechanize, or BeautifulSoup?

Answer

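This is because the content is in frames. The page you fetched is only the frameset; the actual data lives in the document each <frame> points to. You can either parse the page and look up the src attribute of the main <frame> element, or request that frame directly. In most browsers you can right-click and choose "Frame properties" (or similar) to get the frame's URL.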
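
The first approach could look roughly like the sketch below. It is written for Python 3 and uses only the standard library (`html.parser` and `urllib.request`, the successor to urllib2) instead of BeautifulSoup, so it runs without extra installs. The names `FrameCollector`, `frame_url`, and `fetch_frame` are made up for this illustration, and the default frame name `"flat"` is taken from the HTML shown in the question. The page's real address does not appear in the question, so you have to pass it in yourself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class FrameCollector(HTMLParser):
    """Collect the name and src of every <frame> tag in a page."""

    def __init__(self):
        super().__init__()
        self.frames = {}  # frame name -> src attribute

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attributes = dict(attrs)
            if "src" in attributes:
                self.frames[attributes.get("name")] = attributes["src"]


def frame_url(page_url, page_html, frame_name):
    """Return the absolute URL of the named frame, or None if it is absent."""
    collector = FrameCollector()
    collector.feed(page_html)
    src = collector.frames.get(frame_name)
    if src is None:
        return None
    # The src is relative to the frameset page, and the one in the
    # question contains a space, which must be escaped before requesting.
    return urljoin(page_url, src).replace(" ", "%20")


def fetch_frame(page_url, frame_name="flat"):
    """Download the frameset page, then the document of the named frame."""
    page_html = urlopen(page_url).read().decode("utf-8", errors="replace")
    url = frame_url(page_url, page_html, frame_name)
    if url is None:
        raise ValueError("no frame named %r" % frame_name)
    return urlopen(url).read().decode("utf-8", errors="replace")
```

Calling `fetch_frame(page_url)` with the address of the frameset page should return the HTML of the `flat` frame, which you can then hand to BeautifulSoup as usual. The encoding is assumed to be UTF-8 here; adjust it if the site uses something else.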