Exporting multiple scraped files from Beautiful Soup into one CSV file

I have a CSV list of URLs that I need to scrape and organize into a single CSV file, with the data from each URL as one row of that file. I have about 19,000 URLs to go through, but I'm just trying to work the problem out with a handful of them first. I'm able to scrape the pages and see the results in the terminal, but when I export them to a CSV file, only the last one shows up.
The URLs in the CSV file look like this:
http://www.gpo.gov/fdsys/pkg/CREC-2005-01-26/html/CREC-2005-01-26-pt1-PgH199-6.htm
http://www.gpo.gov/fdsys/pkg/CREC-2005-01-26/html/CREC-2005-01-26-pt1-PgH200-3.htm
I have a feeling I'm doing something wrong with my loop, but I can't seem to figure out where. Any help would be greatly appreciated!
Here is what I have so far:
import urllib
from bs4 import BeautifulSoup
import csv
import re
import pandas as pd
import requests
with open('/Users/test/Dropbox/one_minute_json/Extracting Data/a_2005_test.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        html = urllib.urlopen(row[0])
        r = requests.get(html)
        soup = BeautifulSoup(r, "lxml")
        for item in soup:
            volume = int(re.findall(r"Volume (\d{1,3})", soup.title.text)[0])
            print(volume)
            issue = int(re.findall(r"Issue (\d{1,3})", soup.title.text)[0])
            print(issue)
            date = re.findall(r"\((.*?)\)", soup.title.text)[0]
            print(date)
            page = re.findall(r"\[Page (.*?)]", soup.pre.text.split('\n')[3])[0]
            print(page)
            title = soup.pre.text.split('\n\n ')[1].strip()
            print(title)
            name = soup.pre.text.split('\n ')[2]
            print(name)
            text = soup.pre.text.split(')')[2]
            print(text)
        df = pd.DataFrame()
        df['volume'] = [volume]
        df['issue'] = [issue]
        df['date'] = [date]
        df['page'] = [page]
        df['title'] = [title]
        df['name'] = [name]
        df['text'] = [text]
        df.to_csv('test_scrape.csv', index=False)
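
I suspect I need to collect all of the rows first and only write the CSV once at the end, something roughly like the sketch below (untested; it drops the urllib call, passes r.text to BeautifulSoup, and appends one dict per URL instead of rebuilding the DataFrame on every pass), but I'm not sure this is the right way to do it:

import csv
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

rows = []  # one dict per scraped URL

with open('/Users/test/Dropbox/one_minute_json/Extracting Data/a_2005_test.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # fetch the page once with requests and parse the HTML text
        r = requests.get(row[0])
        soup = BeautifulSoup(r.text, "lxml")

        title_text = soup.title.text
        pre_text = soup.pre.text

        rows.append({
            'volume': int(re.findall(r"Volume (\d{1,3})", title_text)[0]),
            'issue': int(re.findall(r"Issue (\d{1,3})", title_text)[0]),
            'date': re.findall(r"\((.*?)\)", title_text)[0],
            'page': re.findall(r"\[Page (.*?)]", pre_text.split('\n')[3])[0],
            'title': pre_text.split('\n\n ')[1].strip(),
            'name': pre_text.split('\n ')[2],
            'text': pre_text.split(')')[2],
        })

# build the DataFrame once, after the loop, and write everything in one go
df = pd.DataFrame(rows)
df.to_csv('test_scrape.csv', index=False)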
Thanks!