Python HTML parsing framework

I am using Beautiful Soup, the Python HTML parsing framework, to retrieve the link (the href) from the HTML content below:

  <div class="store"> 
       <label>Store</label> 
       <span> 
        <a title="Open in Google Play" href="https://play.google.com/store/apps/details?id=com.opera.mini.android" target="_blank"> 
         <!-- ><span class="ui-icon app-store-gp"></span> --> 
         Google Play 
        </a><i class="icon-external-link"></i> 
       </span> 
      </div>

I use the following Python code to retrieve it:

pageFile = urllib.urlopen("appannie.com/apps/google-play/app/com.opera.mini.android") 
pageHtml = pageFile.read() 
pageFile.close() 
print pageHtml 
soup = BeautifulSoup("".join(pageHtml)) 
item = soup.find("a", {"title":"Open in Google Play"}) 

print item

I get NoneType as the output. Any help would be great.

I printed out the HTML of the page, and the output is as follows:

<html> 
    <head><title>503 Service Temporarily Unavailable</title></head> 
    <body bgcolor="white"> 
    <center><h1>503 Service Temporarily Unavailable</h1></center> 
    <hr><center>nginx</center> 
    </body> 
    </html>

It works fine in the browser.

Source

2013-11-25 Siddharthan Asokan

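"503 Service Temporarily Unavailable", so this is not a BeautifulSoup problem but a server problem... Are you sure you are requesting the page correctly? Try setting a generic, browser-like User-Agent and see whether it still fails. – 2013-11-25 19:18:15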

item = soup.find("a", {"title":"Open in Google Play"})

You were originally searching for a "span" with the title "Open in Google Play", but the element you are looking for is an "a" (a link).
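To double-check which element actually carries the title attribute, here is a minimal sketch that runs on an abbreviated copy of the snippet from the question. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without any third-party package; `SNIPPET` and `TitleMatcher` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser

# Abbreviated copy of the markup posted in the question.
SNIPPET = """
<div class="store">
  <label>Store</label>
  <span>
    <a title="Open in Google Play" href="https://play.google.com/store/apps/details?id=com.opera.mini.android" target="_blank">
      Google Play
    </a><i class="icon-external-link"></i>
  </span>
</div>
"""


class TitleMatcher(HTMLParser):
    """Collect (tag, href) for every start tag whose title matches."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("title") == self.title:
            self.matches.append((tag, attributes.get("href")))


matcher = TitleMatcher("Open in Google Play")
matcher.feed(SNIPPET)
print(matcher.matches)
```

The only match is an "a" tag, so a lookup on "span" with that title finds nothing in this markup, while soup.find("a", {"title": "Open in Google Play"}) is the right selector once the server returns the real page.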

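Edit: since the server appears to be returning a 503 error, you can try setting a common User-Agent with this code (untested, it may not work at all; you need to import urllib2):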

soup = BeautifulSoup(urllib2.urlopen(urllib2.Request(sampleURL, None, {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0"})).read()) 
item = soup.find("a", {"title":"Open in Google Play"}) 
print item

I also removed the useless "".join(pageHtml): read() already returns a string, so there is nothing to join.

Source
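For anyone on Python 3, where urllib2 has been merged into urllib.request, a rough equivalent might look like the sketch below. It sends the same User-Agent string and returns the HTTP status together with the body, so a 503 shows up explicitly instead of as a None from find(). The names `build_request` and `fetch_html`, and the `opener` parameter (a hook for trying the function without network access), are assumptions made for this illustration, and whether the site accepts the request is not something this sketch can guarantee.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent string copied from the answer above.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) "
              "Gecko/20100101 Firefox/25.0")


def build_request(url):
    """Return a Request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, opener=urllib.request.urlopen):
    """Return (status, body_bytes); HTTP errors are returned, not raised.

    `opener` is a hypothetical hook so the function can be exercised
    with a stand-in instead of a real network call.
    """
    try:
        response = opener(build_request(url))
        try:
            return response.getcode(), response.read()
        finally:
            response.close()
    except urllib.error.HTTPError as err:
        # urllib.request raises on 4xx/5xx responses; the error object
        # still carries the status code and the body of the error page.
        return err.code, err.read()
```

A caller would check the status first, for example `status, body = fetch_html(url)`, and only hand `body` to BeautifulSoup when `status` is 200; on a 503 there is no point in searching the error page for the link.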

2013-11-25 19:04:44

http://www.appannie.com/apps/google-play/app/com.opera.mini.android/ I tried using this as well. It does not seem to help. Still getting NoneType. –

I tried the code posted above and got a positive result. – hyleaus

@hyleaus I edited the code I am using. The link opens perfectly in the browser. –
