2014-02-16 41 views

Problem with a for loop in BeautifulSoup

I am trying to extract some columns from the table at http://www.immihelp.com/h1b-sponsoring-companies-database/display-2-2010.html into a CSV file.

from bs4 import BeautifulSoup 
import urllib2 
import csv 

f = csv.writer(open("H1B_apps.csv", "w")) 
f.writerow(["Name", "Jobs", "Positions", "Wage", "City", "State", "Zip"]) # Write column headers as the first line 

for x in range (2,5): 

    soup = BeautifulSoup(urllib2.urlopen('http://www.immihelp.com/h1b-sponsoring-companies-database/display-'+str(x)+'-2010.html').read()) 

    table = soup.find('table', cellspacing = '1') 

    rows = table.findAll('tr') 



    for tr in rows: 
     cols = tr.findAll('nobr') 
     for data in cols: 
      name = cols[0].findAll(text=True) 
      jobs = cols[1].findAll(text=True) 
      position = cols[2].findAll(text=True) 
      wage = cols[3].findAll(text=True) 
      city = cols[4].findAll(text=True) 
      state = cols[5].findAll(text=True) 
      zip = cols[6].findAll(text=True) 

      print(name,jobs,position,wage,city,state,zip) 
      f.writerow([name,jobs,position,wage,city,state,zip]) 

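The code generally seems to work. However, I have the following problems:

  1. Each row of output is repeated 7 times (something is wrong with my for loop, but I can't figure out what).
  2. The output text comes out as [u'TEXT'], and I only want the text itself.

Here is a sample of the output: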

([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873']) ([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], 
[u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])

Any help would be greatly appreciated. Thanks.


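'findAll' is meant to find *all* matches, in case there is more than one. That is why the output of 'findAll' is a list of everything it found, not a single item. If you only want the first match, either access the first element of the list ('findAll(...)[0]') or use 'find' in the first place. – poke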


Thanks. When I try (findAll(...)[0]), I get an IndexError for cols[5]. When I try find, it works, but I still get each data entry 7 times. – user3316270
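The two symptoms discussed in these comments can be illustrated with a small sketch. The inner 'for data in cols' loop in the question never uses 'data', so its body simply runs once per cell, which would explain a 7-cell row being written 7 times. The IndexError is consistent with some rows (for example a header row) containing fewer than seven 'nobr' cells. The sketch below is written for Python 3 and parses an inline HTML string instead of fetching the page; SAMPLE_HTML and extract_rows are hypothetical names, and the markup is only an assumed stand-in for the real table, whose structure has not been checked here.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for one page of the real table (assumed structure).
SAMPLE_HTML = """
<table cellspacing="1">
  <tr><th>Name</th><th>Jobs</th></tr>
  <tr>
    <td><nobr>22ND CENTURY TECHNOLOGIES, INC</nobr></td>
    <td><nobr>1</nobr></td>
    <td><nobr>COMPUTER SUPPORT SPECIALISTS</nobr></td>
    <td><nobr>43139.0/Year</nobr></td>
    <td><nobr>SOMERSET</nobr></td>
    <td><nobr>NJ</nobr></td>
    <td><nobr>08873</nobr></td>
  </tr>
  <tr>
    <td><nobr>22ND CENTURY TECHNOLOGIES, INC</nobr></td>
    <td><nobr>1</nobr></td>
    <td><nobr>COMPUTER PROGRAMMERS</nobr></td>
    <td><nobr>55994.0/Year</nobr></td>
    <td><nobr>SOMERSET</nobr></td>
    <td><nobr>NJ</nobr></td>
    <td><nobr>08873</nobr></td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one list of seven plain strings per data row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", cellspacing="1")
    records = []
    for tr in table.find_all("tr"):
        cols = tr.find_all("nobr")
        # Rows with fewer than 7 cells would raise IndexError on cols[5],
        # so they are skipped here.
        if len(cols) < 7:
            continue
        # No inner loop over cols: each row is emitted exactly once.
        # get_text() returns a plain string rather than a list of strings.
        records.append([col.get_text(strip=True) for col in cols[:7]])
    return records


records = extract_rows(SAMPLE_HTML)

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Jobs", "Positions", "Wage", "City", "State", "Zip"])
writer.writerows(records)
print(buffer.getvalue())
```

In the original script the same idea would mean removing the 'for data in cols' line, guarding on len(cols), and replacing each findAll(text=True) call with get_text(), while keeping the urllib2 download as it is.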

Answers