2013-10-28 65 views
1

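Python BeautifulSoup: print info in CSV format

I can print the information I pull from the website without any problem. But when I try to put the street names in one column and the zip codes in another, I run into trouble with the CSV file. All I get in the CSV is the two column names, followed by everything spread across the page with each character in its own column. Here is my code. I am using Python 2.7.5 and Beautiful Soup 4.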

from bs4 import BeautifulSoup 
import csv 
import urllib2 

url="http://www.conakat.com/states/ohio/cities/defiance/road_maps/" 

page=urllib2.urlopen(url) 

soup = BeautifulSoup(page.read()) 

f = csv.writer(open("Defiance Steets1.csv", "w")) 
f.writerow(["Name", "ZipCodes"]) # Write column headers as the first line 

links = soup.find_all(['i','a']) 

for link in links: 
    names = link.contents[0] 
    print unicode(names) 

f.writerow(names) 

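Your code doesn't show how you get the zip codes. Also, f.writerow is not called inside the loop where names is set, so only the last name is ever written. – Vorsprung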
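To illustrate why every character ended up in its own column: csv.writer.writerow() iterates over whatever it is given, so a bare string is written one character per field. The sketch below is only an illustration and writes to an in-memory buffer instead of a file; the street name and zip code are sample values taken from the output shown in this thread.

```python
import csv
import io

name = "1ST ST"  # sample value taken from the output in this thread

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer)

# Passing a bare string: writerow() iterates it, one character per column.
writer.writerow(name)

# Passing a list: one column per list item, which is what the question wants.
writer.writerow([name, "(43512)"])

rows = buffer.getvalue().splitlines()
for row in rows:
    print(row)
```

The first printed row has as many fields as the name has characters; the second has the two intended columns.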

Answers

2

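The data you retrieve from the URL contains more a elements than i elements. You have to filter the a elements down to the street links and then build the pairs with the Python zip builtin.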

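links = soup.find_all('a')
# Keep only the street links; .get() skips <a> tags that have no href
links = [link for link in links
         if link.get("href", "").startswith("http://www.conakat.com/map/?p=")]
zips = soup.find_all('i')

# zip() pairs the n-th street link with the n-th zip code
for l, z in zip(links, zips):
    f.writerow((l.contents[0], z.contents[0]))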

Output:

Name,ZipCodes 
1ST ST,(43512) 
E 1ST ST,(43512) 
W 1ST ST,(43512) 
2ND ST,(43512) 
E 2ND ST,(43512) 
W 2ND ST,(43512) 
3 RIVERS CT,(43512) 
3RD ST,(43512) 
E 3RD ST,(43512) 
... 
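The reason for filtering before zipping can be shown without the network. zip() pairs items purely by position, so a single extra a element at the top of the page shifts every name against the wrong zip code. The sketch below is only an illustration: it uses plain tuples as hypothetical stand-ins for the parsed a and i elements, and the "Home" link and the ?p=1 and ?p=2 query values are made up.

```python
import csv
import io

# Hypothetical stand-ins for parsed elements: (href, text) for each <a>,
# and the text of each <i>. The first link is a made-up navigation link.
anchors = [
    ("http://www.conakat.com/", "Home"),
    ("http://www.conakat.com/map/?p=1", "1ST ST"),
    ("http://www.conakat.com/map/?p=2", "E 1ST ST"),
]
zip_codes = ["(43512)", "(43512)"]

PREFIX = "http://www.conakat.com/map/?p="

# Without filtering, the navigation link is paired with the first zip code
# and the last street is silently dropped.
unfiltered = list(zip((text for _, text in anchors), zip_codes))

# With filtering, names and zip codes line up one to one.
street_names = [text for href, text in anchors if href.startswith(PREFIX)]
filtered = list(zip(street_names, zip_codes))

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
writer.writerow(["Name", "ZipCodes"])
writer.writerows(filtered)
print(buffer.getvalue())
```

Note that zip() stops at the shorter input without raising an error, so a mismatch in counts is easy to miss.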

This is exactly what I needed, thank you very much. – Codin

2

Another way (python3) is to look, for every <a> link, for the next <i> sibling, check that one was found, and extract its value:

from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2

url = "http://www.conakat.com/states/ohio/cities/defiance/road_maps/"

page = urllib2.urlopen(url)

soup = BeautifulSoup(page.read())

# newline="" keeps the csv module from writing blank rows on Windows;
# the with block closes the file once all rows are written
with open("Defiance Steets1.csv", "w", newline="") as out:
    f = csv.writer(out)
    f.writerow(["Name", "ZipCodes"])  # Write column headers as the first line

    for link in soup.find_all('a'):
        i = link.find_next_sibling('i')  # None when no <i> follows this link
        if i is not None:
            f.writerow([link.string, i.string])

It yields:

Name,ZipCodes 
1ST ST,(43512) 
E 1ST ST,(43512) 
W 1ST ST,(43512) 
2ND ST,(43512) 
E 2ND ST,(43512) 
W 2ND ST,(43512) 
3 RIVERS CT,(43512) 
3RD ST,(43512) 
E 3RD ST,(43512) 
W 3RD ST,(43512) 
... 

Your approach works well too, thank you. I have a quick question: how would you remove the () from around the zip code? Thanks – Codin


@Codin: A string ('i.string') is also a sequence, so you can use a slice to drop the first and last characters: 'a, i = link.string, i.string[1:-1]' – Birei
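A small sketch of the slice, next to str.strip("()") as an alternative that this thread does not mention. The slice removes the first and last characters whatever they are, while strip only removes leading and trailing parentheses, so it leaves a value alone if the parentheses are missing. The sample values are taken from the output shown in this thread.

```python
raw = "(43512)"  # sample zip code as it appears in the scraped output

# Slice: drop the first and last characters unconditionally.
sliced = raw[1:-1]

# Alternative: strip only parentheses from both ends.
stripped = raw.strip("()")

print(sliced, stripped)

# If a value ever arrives without parentheses, the two approaches differ:
bare = "43512"
print(bare[1:-1])        # the slice cuts off real digits
print(bare.strip("()"))  # strip leaves the value unchanged
```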


Thanks for your help – Codin