Scraping a table and getting more information from a link

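I'm scraping a table using Python and BeautifulSoup... I have a pretty good handle on getting most of the information I need. Here is a shortened version of what I'm trying to scrape.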

<tr>
<td><a href="/wiki/Joseph_Carter_Abbott" title="Joseph Carter Abbott">Joseph Carter Abbott</a></td> <td>1868–1872</td> <td>North Carolina</td> <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td>
</tr>
<tr>
<td><a href="/wiki/James_Abdnor" title="James Abdnor">James Abdnor</a></td> <td>1981–1987</td> <td>South Dakota</td> <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td>
</tr>
<tr>
<td><a href="/wiki/Hazel_Abel" title="Hazel Abel">Hazel Abel</a></td> <td>1954</td> <td>Nebraska</td> <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td>
</tr>

http://en.wikipedia.org/wiki/List_of_former_United_States_senators

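I want the name, description, years, state, and party.

The description is the first paragraph of text on each person's page. I know how to get this on its own, but I don't know how to integrate it with the name, years, state, and party, because I have to navigate to a different page.

Oh, and I need to write it to a CSV.

Thanks!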

Source

2014-04-03 user3485563

You will have to write some code that reads both web pages and combines the information they contain. Oh, and writes it to a CSV. –

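Just to elaborate on @anrosent's answer: sending a request in the middle of parsing is one of the best and most consistent ways to do this. However, the function that gets the description has to behave properly as well, because if it hits a NoneType error, the whole process falls apart.

The way I did this on my end is as follows (note that I'm using the Requests library rather than urllib or urllib2, since I'm more comfortable with it. Feel free to change it to your liking; the logic is the same anyway):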

from bs4 import BeautifulSoup as bsoup 
import requests as rq 
import csv 

ofile = open("presidents.csv", "wb") 
f = csv.writer(ofile) 
f.writerow(["Name","Description","Years","State","Party"]) 
base_url = "http://en.wikipedia.org/wiki/List_of_former_United_States_senators" 
r = rq.get(base_url) 
soup = bsoup(r.content) 
all_tables = soup.find_all("table", class_="wikitable") 

def get_description(url): 
    r = rq.get(url) 
    soup = bsoup(r.content) 
    desc = soup.find_all("p")[0].get_text().strip().encode("utf-8") 
    return desc 

# Walk every row of every wikitable and write one CSV line per senator. 
for table in all_tables: 
    trs = table.find_all("tr")[1:] # Ignore the header row. 
    for tr in trs: 
        tds = tr.find_all("td") 
        first = tds[0].find("a") # None when the first cell has no link. 
        name = first.get_text().encode("utf-8") 
        desc = get_description("http://en.wikipedia.org%s" % first["href"]) 
        years = tds[1].get_text().encode("utf-8") 
        state = tds[2].get_text().encode("utf-8") 
        party = tds[3].get_text().encode("utf-8") 
        f.writerow([name, desc, years, state, party]) 

ofile.close()
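
The script above targets Python 2, where `csv` wants a binary-mode file and text is encoded by hand. As a minimal sketch of the same writing step on Python 3, assuming the rows have already been collected as plain strings (the `write_rows` helper and the sample row are made up for illustration, not taken from the page):

```python
import csv
import io

HEADER = ["Name", "Description", "Years", "State", "Party"]

def write_rows(out, rows):
    """Write the header and then one CSV line per senator to an open text file."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(rows)

# Illustrative row only; the description is a placeholder, not scraped data.
sample = [("Hazel Abel", "First paragraph, with a comma", "1954", "Nebraska", "Republican")]

# On disk you would use: open("senators.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO(newline="")
write_rows(buffer, sample)
print(buffer.getvalue())
```

Opening the file in text mode with `newline=""` and an explicit encoding replaces both the `"wb"` mode and the `.encode("utf-8")` calls, and the writer quotes any field that contains a comma.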

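However, my script stops at the row right after David Barton. If you check the page, this probably has to do with him occupying two rows. That part is up to you to solve. The traceback is as follows: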

Traceback (most recent call last): 
    File "/home/nanashi/Documents/Python 2.7/Scrapers/presidents.py", line 25, in <module> 
    name = first.get_text().encode("utf-8") 
AttributeError: 'NoneType' object has no attribute 'get_text'
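
One possible way past this failure, assuming the cause is a row whose name cell is missing (for instance because the cell in the row above spans two rows), is to have the row parser return `None` when there is no link and to skip those rows. `parse_row` is a hypothetical helper, and the second row in the sample HTML is invented to mimic that situation:

```python
from bs4 import BeautifulSoup

def parse_row(tr):
    """Return (name, href, years, state, party), or None if the row has no name link."""
    tds = tr.find_all("td")
    if len(tds) < 4:
        return None  # continuation row: the name cell belongs to the row above
    link = tds[0].find("a")
    if link is None or not link.get("href"):
        return None  # no page to fetch a description from
    years, state, party = (td.get_text(strip=True) for td in tds[1:4])
    return link.get_text(strip=True), link["href"], years, state, party

html = """
<table>
<tr><td><a href="/wiki/Hazel_Abel">Hazel Abel</a></td><td>1954</td><td>Nebraska</td><td><a href="/wiki/Republican_Party_(United_States)">Republican</a></td></tr>
<tr><td>1955</td><td>Nebraska</td><td>Republican</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = [parse_row(tr) for tr in soup.find_all("tr")]
parsed = [row for row in rows if row is not None]
print(parsed)
```

Skipping is the simplest choice, but it drops the extra term for such a senator; remembering the previous row's name and link and reusing them for the short row is another option.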

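Also, notice that my get_description function comes before the main loop. That is simply because a function has to be defined before it is called. Finally, my get_description function is far from perfect, since it can go wrong when the first p tag on an individual page is not the one you want.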
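
A slightly more defensive take on the description step, as a sketch only: it assumes that the paragraph you want is the first `<p>` that actually contains text, which will not hold for every page. `first_paragraph` is a hypothetical name, and it works on an HTML string so that the fetching can stay separate:

```python
from bs4 import BeautifulSoup

def first_paragraph(html):
    """Return the text of the first non-empty <p>, or "" if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text:
            return text
    return ""  # never raises, so one odd page cannot stop the whole run

# Invented page fragment: an empty paragraph comes before the real one.
page = "<p>  </p><p><b>Example Person</b> was a senator.</p>"
print(first_paragraph(page))
```

Returning an empty string instead of raising keeps the CSV loop going, and the empty descriptions are easy to spot and fix afterwards.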

Sample result:

[image: sample result]

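Watch out for erroneous rows, such as the description for Maryon Allen. That one is also for you to solve.

Hope this points you in the right direction.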

Source

2014-04-03 22:48:24 Manhattan

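Whoa! Thank you! Sorry I didn't see this sooner! I'm amazed at the skill level of the members of this site. What took me days of work, you did in just a few minutes. – user3485563

@user3485563, also take a look at your other previous questions and accept answers where they deserve it. Thanks. – alecxe

Thanks for the accept. @alecxe: Thanks. It's not the best answer, but it's not a bad question either. :) – Manhattan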

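If you're using BeautifulSoup, you won't be navigating to the other page the way a stateful, browser-like client would; you simply make another request for the other page, using a URL like wiki/name. So, your code might look something like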

import urllib, csv 

# quick way to get the HTML text as a string for a given URL (Python 2) 
def get_html(url): 
    return urllib.urlopen(url).read() 

with open('out.csv', 'w') as f: 
    csv_file = csv.writer(f) 

    # loop through the rows of the table 
    for row in senator_rows: 
        name = get_name(row) 

        ... # extract the other data from the <tr> element 

        senator_page_url = get_url(row) 

        # get the description from the HTML text of the senator's page 
        description = get_description(get_html(senator_page_url)) 

        # write this row to the CSV file 
        csv_file.writerow([name, ..., description]) 

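Note that in Python 3.x you would import and use urllib.request instead of urllib, and you would have to decode the bytes that the read() call returns. It sounds like you know how to fill in the other get_* functions I left in there, so I hope this helps!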
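
For reference, a minimal sketch of what the Python 3 variant could look like. Falling back to UTF-8 when the response does not declare a charset is an assumption, and `absolute_url` is a hypothetical helper for turning the table's relative links into full URLs:

```python
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://en.wikipedia.org/wiki/List_of_former_United_States_senators"

def absolute_url(href):
    """Turn a relative link from the table, such as /wiki/Hazel_Abel, into a full URL."""
    return urljoin(BASE_URL, href)

def get_html(url):
    """Fetch a URL and return its body as str (read() returns bytes on Python 3)."""
    with urlopen(url) as response:
        # Use the charset from the Content-Type header; assume UTF-8 if absent.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

print(absolute_url("/wiki/Hazel_Abel"))
```

With these two pieces, the loop body would become something like `get_description(get_html(absolute_url(href)))`, where `href` is the link taken from the row.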

Source

2014-04-03 21:53:48 anrosent
