2014-04-10

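Scraping a website to move data into multiple CSV columns

I'm scraping multiple categories from a web page into a CSV. I get the first category into a column, but the second column's data is not written to the CSV. The code I am using: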

import urllib2
import csv
from bs4 import BeautifulSoup

url = "http://digitalstorage.journalism.cuny.edu/sandeepjunnarkar/tests/jazz.html"
page = urllib2.urlopen(url)
soup_jazz = BeautifulSoup(page)
all_years = soup_jazz.find_all("td", class_="views-field views-field-year")
all_category = soup_jazz.find_all("td", class_="views-field views-field-category-code")
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for years in all_years:
        year_won = years.string
        if year_won:
            csv_writer.writerow([year_won.encode('utf-8')])
    for categories in all_category:
        category_won = categories.string
        if category_won:
            csv_writer.writerow([category_won.encode('utf-8')])

It writes the header for the second column, but not the category_won values.

Following your suggestion, I have changed it to read:

with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
for years, categories in zip(all_years, all_category):
    year_won = years.string
    category_won = categories.string
    if year_won and category_won:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])

But now I get the following error:

csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
ValueError: I/O operation on closed file

Answers


You can zip() the two lists together:

# This loop belongs inside the with block, while the file is still open.
for years, categories in zip(all_years, all_category):
    year_won = years.string
    category_won = categories.string
    if year_won and category_won:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
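
To see the pairing on its own, here is a minimal, self-contained sketch. It assumes Python 3 (so no .encode() calls are needed) and uses made-up sample lists and an in-memory io.StringIO buffer in place of the scraped cells and jazz.csv; the names years, categories and buffer are illustrative and do not come from the code above.

```python
import csv
import io

# Hypothetical sample data standing in for the text of the scraped <td> cells.
years = ["2012", "2012", "1959"]
categories = ["Best Improvised Jazz Solo", "Best Jazz Vocal Album", None]

# In-memory stand-in for open("jazz.csv", "w", newline="").
buffer = io.StringIO()
csv_writer = csv.writer(buffer)
csv_writer.writerow(["Year Won", "Category"])

# zip() yields one (year, category) pair per iteration,
# so every row written below carries both columns.
for year_won, category_won in zip(years, categories):
    if year_won and category_won:  # skip pairs with a missing value
        csv_writer.writerow([year_won, category_won])

rows = buffer.getvalue().splitlines()
print(rows)
```

Keep in mind that zip() stops at the shorter of the two lists, so if the two find_all() results differ in length, the extra cells are dropped without any warning.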

Unfortunately, that HTML page is somewhat broken, so you cannot search for table rows the way you would expect to.

The next best thing is to search for the year cells, then find their sibling cells:

soup_jazz = BeautifulSoup(page)
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for year_cell in soup_jazz.find_all('td', class_='views-field-year'):
        year = year_cell and year_cell.text.strip().encode('utf8')
        if not year:
            continue
        # Take the first following sibling <td> that carries the category class.
        # The None default keeps getattr() from failing on non-tag siblings.
        category = next((e for e in year_cell.next_siblings
                         if getattr(e, 'name', None) == 'td' and
                         'views-field-category-code' in e.attrs.get('class', [])),
                        None)
        category = category and category.text.strip().encode('utf8')
        if year and category:
            csv_writer.writerow([year, category])

This produces:

Year Won,Category 
2012,Best Improvised Jazz Solo 
2012,Best Jazz Vocal Album 
2012,Best Jazz Instrumental Album 
2012,Best Large Jazz Ensemble Album 
.... 
1960,Best Jazz Composition Of More Than Five Minutes Duration 
1959,Best Jazz Performance - Soloist 
1959,Best Jazz Performance - Group 
1958,"Best Jazz Performance, Individual" 
1958,"Best Jazz Performance, Group" 

Just tried it, and now I get the error I listed above. – user1922698


@user1922698: Then you are running the loop *outside* of the 'with' statement. –
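
A small sketch of what that comment describes, assuming Python 3 and a throwaway temporary file (the path and the sample rows are illustrative only): once the with block ends the file is closed, and a writerow() call made after it raises the ValueError reported in the question.

```python
import csv
import os
import tempfile

# Hypothetical scratch file so the sketch does not touch a real jazz.csv.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w", newline="") as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Year Won", "Category"])
    # Rows written here, inside the with block, succeed.
    csv_writer.writerow(["2012", "Best Jazz Vocal Album"])

# The with block has ended, so f is closed and writing now fails.
try:
    csv_writer.writerow(["1959", "Best Jazz Performance - Group"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Indenting the loop so that it sits under the with line, as in the answer's second listing, keeps every writerow() call inside the block where the file is still open.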


But the output above shows the same category again and again, whereas they are all different categories. – user1922698