Problem with a BeautifulSoup for loop

I am trying to extract some columns from the table at http://www.immihelp.com/h1b-sponsoring-companies-database/display-2-2010.html into a CSV file.
from bs4 import BeautifulSoup
import urllib2
import csv

f = csv.writer(open("H1B_apps.csv", "w"))
f.writerow(["Name", "Jobs", "Positions", "Wage", "City", "State", "Zip"])  # Write column headers as the first line

for x in range(2, 5):
    soup = BeautifulSoup(urllib2.urlopen('http://www.immihelp.com/h1b-sponsoring-companies-database/display-' + str(x) + '-2010.html').read())
    table = soup.find('table', cellspacing='1')
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('nobr')
        for data in cols:
            name = cols[0].findAll(text=True)
            jobs = cols[1].findAll(text=True)
            position = cols[2].findAll(text=True)
            wage = cols[3].findAll(text=True)
            city = cols[4].findAll(text=True)
            state = cols[5].findAll(text=True)
            zip = cols[6].findAll(text=True)
            print(name,jobs,position,wage,city,state,zip)
            f.writerow([name,jobs,position,wage,city,state,zip])
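The code seems to work fine in general. However, I have the following problems:
- Every row of output is repeated 7 times (something is wrong with my for loop, but I can't figure out what).
- The output text comes out as [u'TEXT'] - I only want the text itself.
Here is a sample of the output: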
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
Any help would be much appreciated. Thanks.
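`findAll` is designed to find *all* matches, in case you need more than one. That is why the output of `findAll` is a list of everything it found rather than a single item. If you only want the first match, access the first element of the list (`findAll(...)[0]`), or use `find` in the first place. – poke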
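To illustrate the difference described in this comment, here is a minimal sketch. It assumes Python 3 and a recent bs4 (where `find_all` and `string=` are the current spellings of `findAll` and `text=`), and it uses a made-up one-cell HTML fragment rather than the real page. `get_text` is not mentioned in the comment; it is shown only as another way to get a plain string.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for a single <nobr> cell of the real table.
html = "<td><nobr>SOMERSET</nobr></td>"
cell = BeautifulSoup(html, "html.parser").find("nobr")

as_list = cell.find_all(string=True)  # list of every text node in the cell
first = cell.find(string=True)        # only the first text node
plain = cell.get_text(strip=True)     # all text joined into one plain string

print(as_list)
print(first)
print(plain)
```

The list form is what produces the `[u'...']` wrapper in the question's output; the other two give the text on its own.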
Thanks. When I tried (findAll(...)[0]), I got an IndexError for cols[5]. When I tried find, it worked, but I still get each data entry 7 times. – user3316270
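On the 7x repetition: in the question's code, `for data in cols:` runs its body once per `<nobr>` cell, and each pass prints and writes the same full row, so a row with 7 cells comes out 7 times. Dropping that inner loop should be enough. The IndexError on `cols[5]` may come from a cell with no text, where `findAll(text=True)` returns an empty list, but I can't confirm that without the page. Below is a rough sketch under those assumptions, written for Python 3. The function name `extract_rows`, the helper `make_row`, and the sample HTML are all hypothetical; the second sample row has its state cell blanked on purpose.

```python
from bs4 import BeautifulSoup

def extract_rows(html, expected_cols=7):
    """Return one list of plain strings per <tr> that has enough <nobr> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cols = tr.find_all("nobr")
        if len(cols) < expected_cols:
            continue  # skip header/spacer rows instead of indexing past the end
        # One entry per row: no inner "for data in cols" loop around this.
        # get_text() returns "" for an empty cell, so nothing is indexed with [0].
        rows.append([c.get_text(strip=True) for c in cols[:expected_cols]])
    return rows

def make_row(values):
    # Hypothetical helper that builds a sample <tr>; not part of the real page.
    return "<tr>" + "".join("<td><nobr>%s</nobr></td>" % v for v in values) + "</tr>"

sample = (
    '<table cellspacing="1">'
    "<tr><th>Name</th></tr>"
    + make_row(["22ND CENTURY TECHNOLOGIES, INC", "1", "COMPUTER SUPPORT SPECIALISTS",
                "43139.0/Year", "SOMERSET", "NJ", "08873"])
    + make_row(["22ND CENTURY TECHNOLOGIES, INC", "1", "COMPUTER PROGRAMMERS",
                "55994.0/Year", "SOMERSET", "", "08873"])  # state blanked for the demo
    + "</table>"
)

rows = extract_rows(sample)
for row in rows:
    print(row)
```

Each row is a flat list of strings, so it can be passed straight to `csv.writer(...).writerow(row)` without the `[u'...']` wrappers.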