Exporting multiple scraped files from Beautiful Soup into one CSV file

I have a CSV list of URLs that I need to scrape and organize into a single CSV file, with each URL's data as one row. I have about 19,000 URLs to scrape, but I'm just trying to work out the problem with a handful of them first. I'm able to scrape the pages and view the results in the terminal, but when I export them to a CSV file, only the last one shows up.

The URLs appear in the CSV file like this:

http://www.gpo.gov/fdsys/pkg/CREC-2005-01-26/html/CREC-2005-01-26-pt1-PgH199-6.htm

http://www.gpo.gov/fdsys/pkg/CREC-2005-01-26/html/CREC-2005-01-26-pt1-PgH200-3.htm

I have a feeling I'm doing something wrong with my loop, but I can't seem to figure out where. Any help would be greatly appreciated!

Here is what I have so far:

import urllib
from bs4 import BeautifulSoup
import csv
import re
import pandas as pd
import requests

with open('/Users/test/Dropbox/one_minute_json/Extracting Data/a_2005_test.csv') as f:
    reader = csv.reader(f)

    for row in reader:
        html = urllib.urlopen(row[0])
        r = requests.get(html)
        soup = BeautifulSoup(r, "lxml")

        for item in soup:

            volume = int(re.findall(r"Volume (\d{1,3})", soup.title.text)[0])
            print(volume)

            issue = int(re.findall(r"Issue (\d{1,3})", soup.title.text)[0])
            print(issue)

            date = re.findall(r"\((.*?)\)", soup.title.text)[0]
            print(date)

            page = re.findall(r"\[Page (.*?)]", soup.pre.text.split('\n')[3])[0]
            print(page)

            title = soup.pre.text.split('\n\n ')[1].strip()
            print(title)

            name = soup.pre.text.split('\n ')[2]
            print(name)

            text = soup.pre.text.split(')')[2]
            print(text)

            df = pd.DataFrame()
            df['volume'] = [volume]
            df['issue'] = [issue]
            df['date'] = [date]
            df['page'] = [page]
            df['title'] = [title]
            df['name'] = [name]
            df['text'] = [text]

            df.to_csv('test_scrape.csv', index=False)

Thanks!

Answer


Your indentation is completely off, and you're re-creating the DataFrame and rewriting test_scrape.csv on every pass through the loop, so each iteration overwrites the previous one and only the last row survives. Create the DataFrame once before the loop, append one row per URL, and write the file once at the end. The inner "for item in soup:" loop doesn't do anything useful here (it would just repeat the same work for every top-level node of the document), so drop it; and requests.get() takes the URL string directly, so there's no need for urllib.urlopen(). Try the following:

from bs4 import BeautifulSoup
import csv
import re
import pandas as pd
import requests

with open('/Users/test/Dropbox/one_minute_json/Extracting Data/a_2005_test.csv') as f:
    reader = csv.reader(f)

    index = 0
    # Create the DataFrame once, outside the loop, so rows accumulate
    # instead of being overwritten on each iteration.
    df = pd.DataFrame(columns=["volume", "issue", "date", "page", "title", "name", "text"])

    for row in reader:
        # Pass the URL string straight to requests.
        r = requests.get(row[0])
        soup = BeautifulSoup(r.text, "lxml")

        volume = int(re.findall(r"Volume (\d{1,3})", soup.title.text)[0])
        issue = int(re.findall(r"Issue (\d{1,3})", soup.title.text)[0])
        date = re.findall(r"\((.*?)\)", soup.title.text)[0]
        page = re.findall(r"\[Page (.*?)]", soup.pre.text.split('\n')[3])[0]
        title = soup.pre.text.split('\n\n ')[1].strip()
        name = soup.pre.text.split('\n ')[2]
        text = soup.pre.text.split(')')[2]

        # Append one row per URL (avoid reusing the name "row" here,
        # since it's already the csv.reader loop variable).
        df.loc[index] = [volume, issue, date, page, title, name, text]
        index += 1

    # Write the CSV once, after every URL has been processed.
    df.to_csv('test_scrape.csv', index=False)
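
For the full 19,000-URL run, it's usually better to collect one plain dict per page in a list and build the DataFrame in a single call at the end (row-by-row df.loc assignment slows down as the frame grows), and a timeout plus a try/except keeps one dead URL from aborting the whole job. Here's a minimal sketch along those lines, assuming the same CSV layout and page structure as above; the 30-second timeout and the skip-on-error behavior are my own choices, not part of the original code:

from bs4 import BeautifulSoup
import csv
import re
import pandas as pd
import requests

records = []  # one dict per successfully scraped URL

with open('/Users/test/Dropbox/one_minute_json/Extracting Data/a_2005_test.csv') as f:
    for row in csv.reader(f):
        url = row[0]
        try:
            r = requests.get(url, timeout=30)  # assumed timeout; tune as needed
            r.raise_for_status()
        except requests.RequestException as e:
            # Assumption: skip unreachable pages rather than abort the run.
            print('skipping %s: %s' % (url, e))
            continue

        soup = BeautifulSoup(r.text, 'lxml')
        records.append({
            'volume': int(re.findall(r"Volume (\d{1,3})", soup.title.text)[0]),
            'issue': int(re.findall(r"Issue (\d{1,3})", soup.title.text)[0]),
            'date': re.findall(r"\((.*?)\)", soup.title.text)[0],
            'page': re.findall(r"\[Page (.*?)]", soup.pre.text.split('\n')[3])[0],
            'title': soup.pre.text.split('\n\n ')[1].strip(),
            'name': soup.pre.text.split('\n ')[2],
            'text': soup.pre.text.split(')')[2],
        })

# Build the DataFrame once from all collected rows and write it once.
df = pd.DataFrame(records, columns=['volume', 'issue', 'date', 'page', 'title', 'name', 'text'])
df.to_csv('test_scrape.csv', index=False)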