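Getting clean data: is Beautiful Soup enough, or must I also use regular expressions?
I am learning Python, Beautiful Soup and dictionaries. I am following a short Beautiful Soup tutorial from Stanford University, found here: http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html
Since access to the website was forbidden, I saved the text shown in the tutorial as a string and then converted that string into a soup object. The printout is as follows: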
print(soup_string)
<html><body><div class="ec_statements"><div id="legalert_title"><a
href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-
Urging-Them-to-Support-Cloture-and-Final-Passage-of-the-Paycheck-
Fairness-Act-S.2199">'Letter to Senators Urging Them to Support Cloture
and Final Passage of the Paycheck Fairness Act (S.2199)
</a>
</div>
<div id="legalert_date">
September 10, 2014
</div>
</div>
<div class="ec_statements">
<div id="legalert_title">
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-
Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill">
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill
</a>
</div>
<div id="legalert_date">
July 30, 2014
</div>
</div>
<div class="ec_statements">
<div id="legalert_title">
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-Urging-Them-to-Vote-No-on-the-Legislation-Providing-Supplemental-Appropriations-for-the-Fiscal-Year-Ending-Sept.-30-2014">
Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014
</a>
</div>
<div id="legalert_date">
July 30, 2014
</div>
</div>
</body></html>
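For anyone who wants to reproduce this step, the conversion from a string to a soup object can be sketched roughly as follows. This is a minimal sketch and not the tutorial's code: the variable name `html_text`, the shortened markup, the example link and title, and the choice of the built-in `html.parser` are all assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the markup saved from the tutorial page.
html_text = """
<div class="ec_statements">
  <div id="legalert_title">
    <a href="/Legislation-and-Politics/Legislative-Alerts/example-letter">
      Example letter title
    </a>
  </div>
  <div id="legalert_date">
    September 10, 2014
  </div>
</div>
"""

# Parse the string; "html.parser" ships with Python, so no extra parser is required.
soup_string = BeautifulSoup(html_text, "html.parser")

# Collect every <div> whose class is "ec_statements".
letters = soup_string.find_all("div", class_="ec_statements")
print(len(letters))
```

Because the soup is built from a plain string, nothing here needs access to the website.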
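At some point the tutor collects all elements in the soup object that have the tag "div" and class_="ec_statements":
letters = soup_string.find_all("div", class_="ec_statements")
Then the tutor says: "We will go through all of the items in our letters collection, and for each one, pull out the name and make it a key in our dictionary. The value will be another dictionary, but we haven't yet found the contents for the other items, so we will just create an empty dictionary object."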
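The dictionary layout described in that quote can be sketched as below. This is only an illustration of the idea, not the tutorial's actual code: the list `names`, the dictionary name `lobbying`, and the key `"date"` are hypothetical placeholders.

```python
# Hypothetical names, standing in for the link text pulled from each letter.
names = ["Letter A", "Letter B", "Letter C"]

# Each name becomes a key; each value starts out as its own empty dictionary.
lobbying = {}
for name in names:
    lobbying[name] = {}

# Later, details can be filled in per letter without touching the others.
lobbying["Letter A"]["date"] = "September 10, 2014"
print(lobbying)
```

A loop (or a dict comprehension) is used here on purpose: `dict.fromkeys(names, {})` would make every key share one and the same inner dictionary.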
At this point I took a different approach: I decided to store the data in lists first and then in a DataFrame. The code is as follows:
import pandas as pd

# One list per column: link text, link target, and date.
lobbying_1 = []
lobbying_2 = []
lobbying_3 = []
for element in letters:
    lobbying_1.append(element.a.get_text())
    lobbying_2.append(element.a.attrs.get('href'))
    lobbying_3.append(element.find(id="legalert_date").get_text())

# Build the DataFrame from the three lists.
df = pd.DataFrame(lobbying_1, columns=['Name'])
df['href'] = lobbying_2
df['Date'] = lobbying_3
The output is as follows:
print(df)
Name \
0 \n 'Letter to Senators Urging Them to S...
1 \n Letter to Representatives Urging Th...
2 \n Letter to Representatives Urging Th...
href \
0 /Legislation-and-Politics/Legislative-Alerts/L...
1 /Legislation-and-Politics/Legislative-Alerts/L...
2 /Legislation-and-Politics/Legislative-Alerts/L...
Date
0 \n September 10, 2014\n
1 \n July 30, 2014\n
2 \n July 30, 2014\n
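My question is: is there a way to get cleaner data, i.e. strings without \n and extra spaces, just the real values, using Beautiful Soup alone? Or do I have to post-process the data with regular expressions?
Your advice would be greatly appreciated.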
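One possible way to tidy such values without regular expressions is sketched below, using only built-in string methods. The sample strings are shortened, hypothetical stand-ins for what `get_text()` returned above, and the variable names are assumptions. Beautiful Soup's `get_text()` also accepts a `strip=True` argument, which trims each piece of text as it is extracted.

```python
# Hypothetical raw values, similar in shape to what get_text() returned above.
raw_name = "\n   'Letter to Senators Urging Them to Support Cloture\nand Final Passage\n"
raw_date = "\n  September 10, 2014\n  "

# strip() removes leading and trailing whitespace, including newlines.
clean_date = raw_date.strip()

# split() with no arguments breaks on any run of whitespace, so joining the
# pieces with single spaces also collapses newlines inside the text.
clean_name = " ".join(raw_name.split())

print(repr(clean_date))
print(repr(clean_name))
```

The same two calls could be applied inside the loop, right after each `get_text()` call, so the lists already hold clean strings before the DataFrame is built.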