Web scraping - extracting data from a page using Python

This is the code I am using. It returns an empty list. I can't figure out what I'm doing wrong!

from urllib.request import urlopen
import re

url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'  # example of a web page
html = urlopen(url).read().decode('utf-8')  # download the page and decode it

cite_year = r'<span class="citation_year">(.+?)</span>'  # regex for the citation year
pattern = re.compile(cite_year)  # compile the regex
citation_year = pattern.findall(html)  # collect every match into a list

print(citation_year)  # prints an empty list

Source

2017-02-07 wus

Are you sure your regular expression is correct? –

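Hint: replace the first two lines with sample data (I did html = """<span class="citation_year">test ... <span class="citation_year">bar ...""" and the rest of your code worked as expected). That lets you narrow down where the problem is, and check whether the data contains the quotes and so on that you expect. Also note that SO tends to discourage parsing HTML with regular expressions. – Foon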
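The check described in the comment above can be sketched as follows. The sample markup and the year values are made up for illustration and are not taken from the real page; the only point is to run the question's regex against a string whose contents are known.

```python
import re

# Hypothetical sample data standing in for the downloaded page
sample_html = (
    '<div><span class="citation_year">2017</span>'
    '<span class="citation_year">2016</span></div>'
)

# Same pattern as in the question
pattern = re.compile(r'<span class="citation_year">(.+?)</span>')

years = pattern.findall(sample_html)
print(years)  # ['2017', '2016']

# An empty or unexpected response simply yields no matches
print(pattern.findall(''))  # []
```

If the pattern matches the sample but not the downloaded text, the regex itself is probably fine and the next thing to inspect is what the server actually returned, for example with print(html[:500]).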

Add headers to the request. I use the requests and bs4 libraries:

import requests 
import bs4 
headers = {'User-Agent':'Mozilla/5.0'} 
url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'# example of a web page 
html = requests.get(url, headers=headers) 
soup = bs4.BeautifulSoup(html.text, 'lxml') 
year = soup.find(class_="citation_year").text 
print(year)
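For readers who would rather not install anything, here is a rough standard-library sketch of the same idea: send a browser-like User-Agent with urllib.request and pull out the text with html.parser instead of a regex. The names CitationYearParser, fetch_html and extract_citation_years are hypothetical helpers, the sample markup is invented, and the assumption that the server treats the default Python client differently comes from the answer above rather than from anything verified here.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class CitationYearParser(HTMLParser):
    """Collect text from elements whose class list contains 'citation_year'.

    Assumes those elements hold plain text only (no nested tags).
    """

    def __init__(self):
        super().__init__()
        self.years = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "citation_year" in classes:
            self._capturing = True
            self.years.append("")

    def handle_endtag(self, tag):
        # Stop capturing at the first closing tag after a match
        self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.years[-1] += data


def fetch_html(url):
    # Send a browser-like User-Agent, as the answer above does with requests
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def extract_citation_years(html):
    parser = CitationYearParser()
    parser.feed(html)
    parser.close()
    return [year.strip() for year in parser.years]


# Offline check with hypothetical markup; no network access needed
sample = '<p><span class="citation_year">2017</span>, <span class="other">139</span></p>'
print(extract_citation_years(sample))  # ['2017']
```

Against the live page this would be called as extract_citation_years(fetch_html(url)). Unlike soup.find(...).text, which raises an AttributeError when nothing matches, this version returns an empty list, which is easier to check for.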

Source

2017-02-07 16:27:10
