2017-01-09

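Getting clean data: is Beautiful Soup enough, or must I also use regular expressions?

I am learning Beautiful Soup and dictionaries in Python. I am following a short Beautiful Soup tutorial from Stanford University, found here: http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html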

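Because access to the website was forbidden, I stored the HTML shown in the tutorial as a string and then converted that string into a soup object. The printout is as follows: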

print(soup_string) 

    <html><body><div class="ec_statements"><div id="legalert_title"><a  
    href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators- 
    Urging-Them-to-Support-Cloture-and-Final-Passage-of-the-Paycheck- 
    Fairness-Act-S.2199">'Letter to Senators Urging Them to Support Cloture  
    and Final Passage of the Paycheck Fairness Act (S.2199) 
    </a> 
    </div> 
    <div id="legalert_date"> 
    September 10, 2014 
    </div> 
    </div> 
    <div class="ec_statements"> 
    <div id="legalert_title"> 
    <a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to- 
    Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill"> 
    Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill 
    </a> 
    </div> 
    <div id="legalert_date"> 
      July 30, 2014 
      </div> 
    </div> 
    <div class="ec_statements"> 
    <div id="legalert_title"> 
    <a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-Urging-Them-to-Vote-No-on-the-Legislation-Providing-Supplemental-Appropriations-for-the-Fiscal-Year-Ending-Sept.-30-2014"> 
      Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014 
      </a> 
    </div> 
    <div id="legalert_date"> 
      July 30, 2014 
      </div> 
    </div> 
</body></html> 
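
For readers who want to reproduce this setup, here is a minimal sketch of turning a stored string into a soup object. The shortened HTML and its href are hypothetical stand-ins for the tutorial page, and the choice of "html.parser" is an assumption, since the question does not say which parser was used.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the tutorial's HTML (one entry only).
html = """
<html><body>
<div class="ec_statements">
  <div id="legalert_title">
    <a href="/Legislation-and-Politics/Legislative-Alerts/example-letter">
      Example letter title
    </a>
  </div>
  <div id="legalert_date">
    July 30, 2014
  </div>
</div>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser has to be installed.
soup_string = BeautifulSoup(html, "html.parser")

# The soup can now be navigated like the one printed above.
print(soup_string.a["href"])
```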

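At some point the tutor collects all elements in the soup object that have the tag "div" and class_="ec_statements":

letters = soup_string.find_all("div", class_="ec_statements") 

Then the tutor says:

"We'll go through all of the items in our letters collection and, for each one, pull out the name and make it a key in our dictionary. The value will be another dictionary, but we haven't found the contents for the other items yet, so we'll create an empty dictionary object."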
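
The step the tutor describes could look roughly like the sketch below. It is not copied from the tutorial: the two-entry HTML, its hrefs, and the dictionary name `lobbying` are hypothetical and serve only to make the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical two-entry stand-in for the HTML printed earlier.
html = (
    '<div class="ec_statements"><div id="legalert_title">'
    '<a href="/letters/first">First letter</a></div>'
    '<div id="legalert_date">September 10, 2014</div></div>'
    '<div class="ec_statements"><div id="legalert_title">'
    '<a href="/letters/second">Second letter</a></div>'
    '<div id="legalert_date">July 30, 2014</div></div>'
)
soup_string = BeautifulSoup(html, "html.parser")
letters = soup_string.find_all("div", class_="ec_statements")

# Each letter name becomes a key; the value is an empty dict to fill in later.
lobbying = {}
for element in letters:
    lobbying[element.a.get_text()] = {}

print(lobbying)
```

Note that plain `get_text()` returns the text exactly as it appears in the HTML, so with the real page the keys would carry the surrounding line breaks and spaces.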

At this point I took a different approach: I decided to store the data in lists first and then in a DataFrame. The code is as follows:

import pandas as pd

# One list per column: link text, link target, and date.
lobbying_1 = []
lobbying_2 = []
lobbying_3 = []
for element in letters:
    lobbying_1.append(element.a.get_text())
    lobbying_2.append(element.a.attrs.get('href'))
    lobbying_3.append(element.find(id="legalert_date").get_text())

# Build the DataFrame from the name column, then attach the other two.
df = pd.DataFrame(lobbying_1, columns=['Name'])
df['href'] = lobbying_2
df['Date'] = lobbying_3

The output is as follows:

print(df) 

               Name \ 
0 \n  'Letter to Senators Urging Them to S... 
1 \n   Letter to Representatives Urging Th... 
2 \n   Letter to Representatives Urging Th... 

               href \ 
0 /Legislation-and-Politics/Legislative-Alerts/L... 
1 /Legislation-and-Politics/Legislative-Alerts/L... 
2 /Legislation-and-Politics/Legislative-Alerts/L... 

            Date 
0 \n  September 10, 2014\n   
1  \n  July 30, 2014\n   
2  \n  July 30, 2014\n 

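My question is: is there a way to get cleaner data, i.e. strings without \n and extra spaces, just the actual values, using Beautiful Soup alone? Or do I have to post-process the data with regular expressions?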

Any advice would be appreciated.

Answer

To get rid of the line breaks in the text, pass strip=True when calling get_text():

for element in letters: 
    lobbying_1.append(element.a.get_text(strip=True)) 
    lobbying_2.append(element.a.attrs.get('href')) 
    lobbying_3.append(element.find(id="legalert_date").get_text(strip=True)) 

This assumes, of course, that you still want the data in the form of a DataFrame.
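
One caveat: strip=True trims whitespace at the start and end of each text fragment, so a line break in the middle of a fragment (such as the wrapped title of the first letter) would remain. If that matters, the collected strings can be normalized afterwards. The sketch below uses only the standard library; the helper names and sample strings are hypothetical, and the regular expression variant is shown only to illustrate that it is optional.

```python
import re

# Sample strings shaped like the scraped values, with a line break inside the title.
raw_title = "\n  Letter to Senators Urging Them to Support Cloture  \n    and Final Passage\n"
raw_date = "\n  September 10, 2014\n  "


def collapse_whitespace(text):
    """Trim both ends and turn every run of whitespace into a single space."""
    return " ".join(text.split())


def collapse_whitespace_re(text):
    """Same result, written with a regular expression."""
    return re.sub(r"\s+", " ", text).strip()


print(collapse_whitespace(raw_title))
print(collapse_whitespace(raw_date))
```

Either helper could be applied to each value before it is appended to its list; the plain `split`/`join` version avoids regular expressions entirely.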