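Getting PubMed data from IDs using bs4

I am working on a project that downloads the title, abstract, publication year, and MeSH terms for about 12,000 PubMed IDs listed in a CSV file. I have written the code below: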
import urllib2
from bs4 import BeautifulSoup
import csv

CSVfile = open('srData.csv')
fileReader = csv.reader(CSVfile)
Data = list(fileReader)
i = 0
with open('blank.csv','wb') as f1:
    writer=csv.writer(f1, delimiter='\t',lineterminator='\n',)
    for id in Data:
        soup = BeautifulSoup(urllib2.urlopen("http://www.ncbi.nlm.nih.gov/pubmed/" & id).read())
        jouryear = soup.find_all(attrs={"class": "cit"})
        year = jouryear[0].get_text()
        yearlength = len(year)
        titleend = year.find(".")
        year1 = titleend+2
        year2 = year1+1
        year3 = year2+1
        year4 = year3+1
        year5 = year4+1
        published_date = (year[year1:year5])
        title = soup.find_all(attrs={"class": "rprt abstract"})
        title = (title[0].h1.string)
        abstract = (soup.find_all(attrs={"class": "abstr"}))
        abstract = (abstract[0].p.string)
        writer.writerow([published_date, title, abstract])
        i = i+1
        print i
When I run it, I get the following error:
TypeError: unsupported operand type(s) for &: 'str' and 'list'
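How can I fix this? I am also having a problem where the year, title, and abstract are written into the same cell, but I need them in separate columns. What can I do to fix that?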
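For what it's worth, here is a minimal sketch of how both problems might be approached. The TypeError comes from two things in the `urlopen` line: `&` is the bitwise AND operator rather than string concatenation, and `csv.reader` yields each row as a list, so `id` is a list and the ID itself would be `id[0]`. For the single-cell problem, one possible cause (I can't confirm it from the post) is that the file is written with `delimiter='\t'` but opened by a program expecting commas; the sketch uses the default comma delimiter instead. It is written for Python 3, uses in-memory data rather than the network, and the helper names `build_url` and `write_records` as well as the sample IDs and record are made up for illustration.

```python
import csv
import io

# Same base URL as in the question.
BASE_URL = "http://www.ncbi.nlm.nih.gov/pubmed/"


def build_url(row):
    """Return the PubMed URL for one CSV row (hypothetical helper).

    csv.reader yields each row as a list of strings, so the ID is row[0].
    Strings are joined with +, not & (which is bitwise AND).
    """
    return BASE_URL + row[0].strip()


def write_records(records, out_file):
    """Write (year, title, abstract) tuples as comma-separated columns."""
    # Default delimiter is a comma; fields containing commas get quoted.
    writer = csv.writer(out_file, lineterminator="\n")
    for record in records:
        writer.writerow(record)


# Placeholder IDs standing in for the contents of srData.csv.
rows = list(csv.reader(io.StringIO("12345678\n23456789\n")))
urls = [build_url(row) for row in rows]

# Placeholder record standing in for one scraped article.
buffer = io.StringIO()
write_records([("2014", "A title, with a comma", "An abstract.")], buffer)
```

When writing to a real file in Python 3, the usual pattern is `open('blank.csv', 'w', newline='')` instead of the Python 2 `'wb'` mode. If the tab delimiter is kept on purpose, the file would need to be imported as tab-separated for the columns to split.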
How do I use the encoding? – Toby
@Toby: You can use it as in the example above: abstract.encode('ascii', 'ignore') will try to encode the text as ASCII and drop any characters that do not fit.
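To make the `encode` call concrete, here is a small sketch of what `'ignore'` does to non-ASCII characters. The helper name `to_ascii` and the sample text are made up for illustration; only `str.encode('ascii', 'ignore')` comes from the comment above.

```python
def to_ascii(text):
    """Encode text as ASCII, silently dropping characters that do not fit."""
    return text.encode("ascii", "ignore")


# Sample text with three non-ASCII characters: e-acute, i-diaeresis, beta.
abstract = u"Caf\u00e9 na\u00efve \u03b2-blocker"
cleaned = to_ascii(abstract)

# In Python 3 the result is bytes; decode it to get a str back.
cleaned_text = cleaned.decode("ascii")
```

Note that the dropped characters are lost for good, so this trades accuracy for a file that any ASCII-only tool can read. In Python 2 the result of `encode` is already a plain `str`, so the final `decode` step would not be needed there.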