2017-01-09

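Storing values captured by Beautiful Soup in a dictionary, then accessing those values

I am learning Beautiful Soup and dictionaries in Python. I am following a short Beautiful Soup tutorial from Stanford University, found here: http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html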

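Because access to the website is forbidden, I stored the text provided in the tutorial in a string and then converted that string into a soup object. Printing it gives the following: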

print(soup_string) 

<html><body><div class="ec_statements"><div id="legalert_title"><a  
href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators- 
Urging-Them-to-Support-Cloture-and-Final-Passage-of-the-Paycheck- 
Fairness-Act-S.2199">'Letter to Senators Urging Them to Support Cloture  
and Final Passage of the Paycheck Fairness Act (S.2199) 
</a> 
</div> 
<div id="legalert_date"> 
September 10, 2014 
</div> 
</div> 
<div class="ec_statements"> 
<div id="legalert_title"> 
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to- 
Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill"> 
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill 
</a> 
</div> 
<div id="legalert_date"> 
     July 30, 2014 
     </div> 
</div> 
<div class="ec_statements"> 
<div id="legalert_title"> 
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-Urging-Them-to-Vote-No-on-the-Legislation-Providing-Supplemental-Appropriations-for-the-Fiscal-Year-Ending-Sept.-30-2014"> 
     Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014 
     </a> 
</div> 
<div id="legalert_date"> 
     July 30, 2014 
     </div> 
</div> 
<div class="ec_statements"> 
<div id="legalert_title"> 
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-Urging-Them-to-Vote-Yes- 
      on-the-Motion-to-Proceed-to-the-Emergency-Supplemental-Appropriations-Act-of-2014-S.2648"></a></div></div></body></html> 

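At some point the tutor captures every element in the soup object that has the tag "div" and class_="ec_statements":

letters = soup_string.find_all("div", class_="ec_statements") 

Then the tutor says: "We will go through all of the items in our letters collection and, for each one, pull out the name and make it a key in our dictionary. The value will be another dictionary, but we have not yet found the contents of the other items, so we will create an empty dictionary object."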

The code is as follows:

lobbying = {} 
for element in letters: 
    lobbying[element.a.get_text()] = {} 

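However, when I print the lobbying dictionary, I find that the key and value for the last element, "Letter-to-Senators-Urging-Them-to-Vote-Yes-on-the-Motion-to-Proceed-to-the-Emergency-Supplemental-Appropriations-Act-of-2014-S.2648", are missing. Instead, there is an empty dictionary with no key assigned.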

for key, value in lobbying.iteritems(): 
    print key, value 

{} 

     Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014 
     {} 

     Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill 
     {} 
'Letter to Senators Urging Them to Support Cloture and Final Passage of the Paycheck Fairness Act (S.2199) 
     {} 

How do you explain this? Any suggestions would be greatly appreciated.


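The last 'div' has no text, so it creates an entry with an empty string as the key. You are reading that entry as "an empty dictionary with no key assigned". – furas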


By the way: at least use `print ">", key, "<"` and you will see that your key is an empty string, or that it contains only spaces, tabs, and newlines. – furas
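A minimal sketch of what the comment suggests, written for Python 3 and using a hand-built dictionary (the sample keys are illustrative stand-ins, not the actual scraped data). Wrapping each key in repr() makes an empty string or stray whitespace visible instead of printing as nothing:

```python
# Hand-built stand-in for the scraped dictionary (assumption: the keys mimic
# what get_text() returned, including one empty string).
lobbying = {
    "\n     Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill\n": {},
    "": {},
}


def visible_keys(d):
    """Return each key wrapped in repr() so whitespace and empty strings show up."""
    return [repr(key) for key in d]


for shown in visible_keys(lobbying):
    print(">", shown, "<")

# The empty string is a real key; it simply prints as nothing on its own.
print("" in lobbying)
```

Seen this way, the "missing" entry is present in the dictionary; only its key is invisible when printed directly.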

Answers


The <a> element of the last <div class="ec_statements"> does not have any text in it:

<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-Urging-Them-to-Vote-Yes- 
      on-the-Motion-to-Proceed-to-the-Emergency-Supplemental-Appropriations-Act-of-2014-S.2648"> 
</a> 

Compare this with another div from above:

<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to- 
Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill"> 
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill 
</a> 

As you can see, in the second example there is text after the <a> tag and before the </a> tag. In the first example, there is no such text.
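The comparison can be reproduced with a small sketch. It assumes the bs4 package is installed and uses two shortened anchors modelled on the question's HTML; the href values are abbreviated placeholders, not the real paths:

```python
from bs4 import BeautifulSoup

# Two shortened blocks modelled on the question's HTML (hrefs are
# abbreviated placeholders, not the real paths).
html = """
<div class="ec_statements"><div id="legalert_title">
<a href="/alerts/highway-trust-fund">
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill
</a></div></div>
<div class="ec_statements"><div id="legalert_title">
<a href="/alerts/s-2648"></a></div></div>
"""

soup = BeautifulSoup(html, "html.parser")
letters = soup.find_all("div", class_="ec_statements")

# get_text() returns only the text between <a ...> and </a>, never the href.
texts = [element.a.get_text() for element in letters]
for text in texts:
    print(repr(text))
```

The first anchor yields its title surrounded by newlines, while the second yields an empty string, which is what ends up as the dictionary key.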


You are calling element.a.get_text() to generate the key, but for the last element the <a> tag has no text content: <a ...></a>
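One possible way to handle this, sketched under assumptions: bs4 is installed, the HTML is a shortened stand-in with placeholder hrefs, and falling back to the last segment of the href is an illustrative choice rather than anything the tutorial prescribes. Passing strip=True to get_text() also removes the newlines and indentation around the titles:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (hrefs are placeholders).
html = """
<div class="ec_statements"><div id="legalert_title">
<a href="/alerts/highway-trust-fund">
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill
</a></div></div>
<div class="ec_statements"><div id="legalert_title">
<a href="/alerts/Letter-to-Senators-S.2648"></a></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

lobbying = {}
for element in soup.find_all("div", class_="ec_statements"):
    link = element.a
    # strip=True trims the surrounding newlines and indentation.
    title = link.get_text(strip=True)
    if not title:
        # Fallback (an assumption, not part of the tutorial): derive a
        # readable key from the last segment of the href.
        title = link["href"].rsplit("/", 1)[-1].replace("-", " ")
    lobbying[title] = {}

for key in lobbying:
    print(repr(key))
```

If entries without a title should be dropped rather than renamed, the fallback branch could be replaced with `continue`.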