2017-09-21 32 views

BeautifulSoup does not parse a div class

I am trying to process this page:

https://play.google.com/store/movies/details?id=3B6EBBD94D13B4DCMV 

I use the following code to read the HTML:

from BeautifulSoup import BeautifulSoup as BS 
import requests 

def read_html(url):
    try:
        res = requests.get(url)
        if res.status_code == 200:
            html_content = res.content
            soup = BS(html_content)
            return _get_type(soup)
        else:
            print res.status_code
    except ValueError, e:
        print e


def _get_type(soup):
    """Read Movie."""

    mydivs = soup.findAll("span", {"class": "DBzzzb"})
    if mydivs:
        return 'AVAILABLE'

    mydivs = soup.findAll("span", {"class": "DBzzzb"})
    if mydivs:
        return 'PREORDER'

    mydivs = soup.findAll("div", {"class": "Wc4pU"})
    if mydivs:
        return 'NOT_AVAILABLE'

    return 'INVALID'

My condition soup.findAll("div", {"class": "Wc4pU"}) never matches, even though the HTML does contain:

<div class="Wc4pU">We'll notify you on your wishlist when movies become available</div> 

Source HTML:

view-source:https://play.google.com/store/movies/details?id=3B6EBBD94D13B4DCMV 

Any suggestions?

+2

You should use 'bs4'.

+0

Changing to BS4 worked! – spicyramen

Answer


You need to specify a parser:

soup = BS(html_content, 'html5lib') 

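This can also make the process faster, although html5lib itself is the slowest of the parsers bs4 supports; lxml is the fast choice.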
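To tie the comments and this answer together, here is a minimal sketch of the check from the question, ported to bs4 and Python 3 with the parser named explicitly. It is an illustration under assumptions, not the asker's final code: the function name get_type and the sample string are made up for the example, the class names are simply the ones quoted in the question (Google may change them at any time), and the PREORDER branch is left out because the question uses the same selector for it as for AVAILABLE. The built-in "html.parser" is used so the sketch needs nothing beyond bs4; "html5lib" or "lxml" can be passed instead if they are installed.

```python
from bs4 import BeautifulSoup


def get_type(html, parser="html.parser"):
    """Classify a page by the marker elements it contains.

    Hypothetical helper: the class names below are copied from the
    question and are assumptions about the page markup.
    """
    # Name the parser explicitly instead of letting bs4 pick one.
    soup = BeautifulSoup(html, parser)

    if soup.find_all("span", class_="DBzzzb"):
        return "AVAILABLE"
    if soup.find_all("div", class_="Wc4pU"):
        return "NOT_AVAILABLE"
    return "INVALID"


# Sample markup taken from the snippet quoted in the question.
sample = (
    '<div class="Wc4pU">We\'ll notify you on your wishlist '
    "when movies become available</div>"
)
print(get_type(sample))  # prints NOT_AVAILABLE
```

To run it against the live page, you would pass requests.get(url).content as the html argument; whether the markers are present in the raw response depends on what the server returns for that request.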