Getting clean data: is Beautiful Soup enough, or do I also need regular expressions?

I am learning Beautiful Soup and dictionaries in Python. I am following a short Beautiful Soup tutorial from Stanford University, found here: http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html

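Since access to the website is blocked, I stored the text shown in the tutorial as a string and then converted that string into a soup object. Printing it gives the following: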

print(soup_string) 

    <html><body><div class="ec_statements"><div id="legalert_title"><a  
    href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators- 
    Urging-Them-to-Support-Cloture-and-Final-Passage-of-the-Paycheck- 
    Fairness-Act-S.2199">'Letter to Senators Urging Them to Support Cloture  
    and Final Passage of the Paycheck Fairness Act (S.2199) 
    </a> 
    </div> 
    <div id="legalert_date"> 
    September 10, 2014 
    </div> 
    </div> 
    <div class="ec_statements"> 
    <div id="legalert_title"> 
    <a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to- 
    Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill"> 
    Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill 
    </a> 
    </div> 
    <div id="legalert_date"> 
      July 30, 2014 
      </div> 
    </div> 
    <div class="ec_statements"> 
    <div id="legalert_title"> 
    <a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-Urging-Them-to-Vote-No-on-the-Legislation-Providing-Supplemental-Appropriations-for-the-Fiscal-Year-Ending-Sept.-30-2014"> 
      Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014 
      </a> 
    </div> 
    <div id="legalert_date"> 
      July 30, 2014 
      </div> 
    </div> 
</body></html>

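At some point the tutor collects all elements in the soup object that have the tag "div" and class_="ec_statements":

letters = soup_string.find_all("div", class_="ec_statements")

The tutor then says: "We'll go through all of the items in our letters collection and, for each one, pull out the name and make it a key in our dictionary. The value will be another dictionary, but we haven't yet found the contents for the other items, so we'll just create an empty dictionary object."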

At this point I took a different approach: I decided to store the data in lists first and then in a DataFrame. The code is as follows:

import pandas as pd

# Collect each field in its own list, one entry per letter
lobbying_1 = []  # link text (name)
lobbying_2 = []  # href
lobbying_3 = []  # date
for element in letters:
    lobbying_1.append(element.a.get_text())
    lobbying_2.append(element.a.attrs.get('href'))
    lobbying_3.append(element.find(id="legalert_date").get_text())

# Build the DataFrame column by column
df = pd.DataFrame(lobbying_1, columns=['Name'])
df['href'] = lobbying_2
df['Date'] = lobbying_3

The output is as follows:

print(df) 

               Name \ 
0 \n  'Letter to Senators Urging Them to S... 
1 \n   Letter to Representatives Urging Th... 
2 \n   Letter to Representatives Urging Th... 

               href \ 
0 /Legislation-and-Politics/Legislative-Alerts/L... 
1 /Legislation-and-Politics/Legislative-Alerts/L... 
2 /Legislation-and-Politics/Legislative-Alerts/L... 

            Date 
0 \n  September 10, 2014\n   
1  \n  July 30, 2014\n   
2  \n  July 30, 2014\n

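My question is: is there a way to get cleaner data, i.e. strings without \n and extra spaces, just the actual values, using Beautiful Soup alone? Or do I have to post-process the data with regular expressions?

Any advice would be appreciated.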


2017-01-09 im7

To get rid of the newlines in the text, pass strip=True when calling get_text():

for element in letters: 
    lobbying_1.append(element.a.get_text(strip=True)) 
    lobbying_2.append(element.a.attrs.get('href')) 
    lobbying_3.append(element.find(id="legalert_date").get_text(strip=True))

This assumes, of course, that you still want the data in the form of a DataFrame.
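For reference, here is a minimal self-contained sketch of the same idea. It is not taken from the tutorial: the markup is a shortened copy of the question's HTML, the href values are hypothetical placeholders, and `normalize_ws` is a helper name made up for this example. One detail to be aware of: strip=True only trims whitespace at the start and end of each text fragment, so a title that wraps across lines in the HTML would keep its inner newline. The helper below collapses those with plain string methods, so no regular expression is needed for this case.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup; the href values are placeholders.
html = """
<html><body>
<div class="ec_statements">
  <div id="legalert_title">
    <a href="/alerts/paycheck-fairness">
      Letter to Senators Urging Them to Support Cloture
      and Final Passage of the Paycheck Fairness Act (S.2199)
    </a>
  </div>
  <div id="legalert_date">
    September 10, 2014
  </div>
</div>
<div class="ec_statements">
  <div id="legalert_title">
    <a href="/alerts/highway-trust-fund">
      Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill
    </a>
  </div>
  <div id="legalert_date">
    July 30, 2014
  </div>
</div>
</body></html>
"""


def normalize_ws(text):
    """Collapse every run of whitespace (newlines included) into one space."""
    return " ".join(text.split())


soup = BeautifulSoup(html, "html.parser")

rows = []
for element in soup.find_all("div", class_="ec_statements"):
    rows.append({
        # strip=True trims the ends; normalize_ws handles the wrapped line
        "Name": normalize_ws(element.a.get_text(strip=True)),
        "href": element.a["href"],
        "Date": element.find(id="legalert_date").get_text(strip=True),
    })

for row in rows:
    print(row)
```

If a DataFrame is still wanted, the list of dictionaries can be passed straight to pd.DataFrame(rows). For this input, re.sub(r"\s+", " ", text).strip() would give the same result as normalize_ws, so the regular expression is optional rather than required.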


2017-01-09 18:12:54 alecxe
