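Parsing HTML with BeautifulSoup in Python

I want to parse HTML with BeautifulSoup in Python, but I can't manage to get what I need.

This is a small module of a personal application I want to build. It includes a web login section with credentials, and once the script has logged in to the site, I need to parse some information in order to manage and process it.

The HTML I get after logging in is: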

<div class="widget_title clearfix">
     <h2>Account Balance</h2>
    </div>
    <div class="widget_body">
     <div class="widget_content">
      <table class="simple">
       <tr>
        <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
        <td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
         150
        </td>
       </tr>
       <tr>
        <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
        <td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
         500      </td>
       </tr>
       <tr>
        <td><a href="#" id="west3" title="Total Monthly earnings">Monthly Earnings</a></td>
        <td style="text-align: right; color: #119911; font-weight: bold;">
         1500      </td>
       </tr>
       <tr>
        <td><a href="#" id="west4" title="Total expenses">Total expended</a></td>
        <td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
         430      </td>
       </tr>
       <tr>
        <td><a href="#" id="west5" title="Total available">Account Balance</a></td>
        <td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
         840      </td>
       </tr>
       <tr>
        <td></td>
        <td style="padding: 5px;">
         <center>
          <form id="request_bill" method="POST" action="index.php?page=dashboard">
           <input type="hidden" name="secret_token" value="" />
           <input type="hidden" name="request_payout" value="1" />
           <input type="submit" class="btn blue large" value="Request Payout" />
          </form>
         </center>
        </td>
       </tr>
      </table>
     </div>
    </div>
</div>

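As you can see, the HTML is not very well formatted, but I need to extract the elements and their values. For example: "Daily Earnings" and "150" | "Weekly Earnings" and "500" ...

I thought the "id" attribute might help, but when I try to parse it, it crashes.

The Python code I'm working with is: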

def parseo(archivohtml): 
    html = archivohtml 
    parsed_html = BeautifulSoup(html) 
    par = parsed_html.find('td', attrs={'id':'west1'}).string 
    print par

Where archivohtml is the HTML file saved after logging in to the website.

When I run that script, I only get errors.

I've also tried doing this:

def parseo(archivohtml): 
    soup = BeautifulSoup() 
    html = archivohtml 
    parsed_html = soup(html) 
    par = soup.parsed_html.find('td', attrs={'id':'west1'}).string 
    print par

But the result is still the same.

2013-03-22 dexafree

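What errors???? – 2013-03-22 17:44:40

What does "it crashes" mean? Does it print an exception with a traceback and then exit? If so, please show us the exception and the traceback (and, of course, the code the traceback refers to). – abarnert 2013-03-22 18:01:09

File "C:\py\projectparse\logparse.py", line 53, in parseo par = parsed_html.find('td', attrs={'id':'west1'}).string AttributeError: 'NoneType' object has no attribute 'string' – dexafree 2013-03-22 18:51:05

The tag with id="west1" is an <a> tag, not a <td>. You are looking for the <td> tag that comes after this <a> tag: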

import BeautifulSoup as bs 

content = '''<div class="widget_title clearfix"> 
     <h2>Account Balance</h2> 
    </div> 
    <div class="widget_body"> 
     <div class="widget_content"> 
      <table class="simple"> 
       <tr> 
        <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td> 
        <td style="text-align: right; width: 125px; color: #119911; font-weight: bold;"> 
         150       
        </td> 
       </tr> 
       <tr> 
        <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td> 
        <td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;"> 
         500      </td> 
       </tr> 
       <tr> 
        <td><a href="#" id="west3" title="Total Monthly earnings">Monthly Earnings</a></td> 
        <td style="text-align: right; color: #119911; font-weight: bold;"> 
         1500      </td> 
       </tr> 
       <tr> 
        <td><a href="#" id="west4" title="Total expenses">Total expended</a></td> 
        <td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;"> 
         430      </td> 
       </tr> 
       <tr> 
        <td><a href="#" id="west5" title="Total available">Account Balance</a></td> 
        <td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;"> 
         840      </td> 
       </tr> 
       <tr> 
        <td></td> 
        <td style="padding: 5px;"> 
         <center> 
          <form id="request_bill" method="POST" action="index.php?page=dashboard"> 
           <input type="hidden" name="secret_token" value="" /> 
           <input type="hidden" name="request_payout" value="1" /> 
           <input type="submit" class="btn blue large" value="Request Payout" /> 
          </form> 
         </center> 
        </td> 
       </tr> 
      </table> 
     </div> 
    </div> 
</div>''' 

def parseo(archivohtml): 
    html = archivohtml 
    parsed_html = bs.BeautifulSoup(html) 
    par = parsed_html.find('a', attrs={'id':'west1'}).findNext('td')   
    print par.string.strip() 

parseo(content)

yields

150
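For anyone on a current setup, here is a minimal sketch of the same idea using the `bs4` package and Python 3 rather than the older `BeautifulSoup` module imported above. The helper name `value_for` and the shortened `SAMPLE_HTML` are made up for illustration, and `"html.parser"` is just one possible parser choice. It also guards against `find()` returning `None`, which is what caused the AttributeError in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two rows only).
SAMPLE_HTML = """
<table class="simple">
 <tr>
  <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
  <td style="text-align: right;">
   150
  </td>
 </tr>
 <tr>
  <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
  <td style="text-align: right;">
   500      </td>
 </tr>
</table>
"""


def value_for(soup, link_id):
    """Return the text of the <td> that follows the <a> with the given id.

    Hypothetical helper: returns None instead of raising when nothing matches.
    """
    link = soup.find("a", id=link_id)
    if link is None:  # find() returns None when no tag matches
        return None
    cell = link.find_next("td")  # next <td> in document order after the <a>
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(value_for(soup, "west1"))  # 150
print(value_for(soup, "west2"))  # 500
print(value_for(soup, "west9"))  # None (no such id)
```

Returning `None` for a missing id is only one design choice; raising a descriptive exception may suit a script that should stop when the page layout changes.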

2013-03-22 17:46:53 unutbu

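Thank you very much for the quick answer! I tried your code, but I have a problem with the bs.BeautifulSoup(html) expression... Where should I declare bs? My import is "from BeautifulSoup import BeautifulSoup". Do I have to add bs = BeautifulSoup() at the start? I've also seen that BeautifulSoup can be imported with "import BeautifulSoup as bs", but it still doesn't work: I get "AttributeError: 'NoneType' object has no attribute 'findNext'". I don't know what I'm doing wrong! – dexafree 2013-03-22 18:54:24

I've added runnable code. Hope that helps. – unutbu 2013-03-22 20:33:39

Thank you so much! Now the code runs and it shows what it is expected to show, and I've managed to see what the error was! The problem was that I was also saving a .html file with all the content, in order to monitor whether the whole process was going fine, but I wasn't applying BS to the HTML code itself. I was applying it to the saved .html file instead. Now it works perfectly and I can keep working! Thank you very much :) – dexafree 2013-03-23 00:29:28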
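One plausible reading of the comment above (an assumption, since the original call is not shown) is that a file path was handed to BeautifulSoup instead of the file's contents. The sketch below uses `bs4` on Python 3 and a temporary file standing in for the saved login page; the variable names are illustrative only. Depending on the `bs4` version, the first call may also print a warning that the input looks like a filename.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = '<td><a href="#" id="west1">Daily Earnings</a></td><td> 150 </td>'

# Write the markup to a temporary file to stand in for the saved login page.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write(html)
    saved_path = f.name

# Passing the path parses the path string itself, so the lookup finds nothing.
from_path = BeautifulSoup(saved_path, "html.parser")
print(from_path.find("a", id="west1"))

# Reading the file first gives BeautifulSoup the actual markup.
with open(saved_path, encoding="utf-8") as f:
    from_contents = BeautifulSoup(f.read(), "html.parser")
print(from_contents.find("a", id="west1").get_text())

os.remove(saved_path)  # clean up the temporary file
```

The first lookup comes back as `None`, which would then fail with the same kind of 'NoneType' error quoted earlier as soon as an attribute is accessed on it.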

I can't tell from your question whether this will work for you, but here's another approach:

def parseo(archivohtml): 
    html = archivohtml 
    parsed_html = BeautifulSoup(html) 
    for line in parsed_html.stripped_strings:
        print line.strip()

which yields:

Account Balance 
Daily Earnings 
150 
Weekly Earnings 
500 
Monthly Earnings 
1500 
Total expended 
430 
Account Balance 
840

If you want the data in a list:

data = [line.strip() for line in parsed_html.stripped_strings]

[u'Account Balance', u'Daily Earnings', u'150', u'Weekly Earnings', u'500', u'Monthly Earnings', u'1500', u'Total expended', u'430', u'Account Balance', u'840']
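If label/value pairs are more convenient than a flat list, one possible follow-up (not part of the original answer) is to pair the entries up. This sketch assumes Python 3, that the first entry is the heading, and that labels and values strictly alternate after it; an empty cell would break that assumption. The name `earnings` is made up for illustration.

```python
# The flat list shown above, written out so the sketch is self-contained.
data = ['Account Balance', 'Daily Earnings', '150', 'Weekly Earnings', '500',
        'Monthly Earnings', '1500', 'Total expended', '430',
        'Account Balance', '840']

rows = data[1:]                          # drop the <h2> heading
labels, values = rows[0::2], rows[1::2]  # even positions: labels, odd: values
earnings = {label: int(value) for label, value in zip(labels, values)}

print(earnings["Daily Earnings"])   # 150
print(earnings["Account Balance"])  # 840
```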

2013-03-22 17:56:07

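Thank you very much! Now the code is working, and this way of storing the information and styling it is much better than the way I was using! – dexafree 2013-03-23 11:21:40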
