2017-02-27
0

Given url = 'http://normanpd.normanok.gov/content/daily-activity', the site has three types of documents: arrests, incidents, and case summaries. I have been asked to use a regular expression in Python to find the URL strings of all the Incident pdf documents. The regular expression should locate the exact pdf links in the web page.

The pdfs will then be downloaded to a specified location.

I have gone through the links and found that the URLs of the Incident pdf files are of the form:

normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf 

I have written this code:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"

response = urllib.request.urlopen(url)

data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)

But the list of urls is empty. I am a beginner with python3 and regex commands. Can anyone help me?
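A likely reason the pattern matches nothing: `\sIncident\s` requires literal whitespace around "Incident", but in the page source the spaces are URL-encoded as %20, and `$` anchors the match to the end of a line. (With two capture groups, `findall` would also return tuples of group fragments rather than whole URLs.) A small sketch against a sample anchor tag, modeled on the URL form quoted above:

```python
import re

# Sample anchor line, modeled on the URL form quoted in the question.
html = '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf">'

# \s never matches the literal "%20" in the page source, and "$" anchors the
# match to the end of a line, so the original pattern finds nothing.
print(re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', html))  # []

# Matching the encoded text directly does find the link:
print(re.findall(r'/filebrowser_download\S*?Incident\S*?\.pdf', html))
# ['/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf']
```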

+0

You have Incident in your regex, but not in the string. There is [this site](https://regex101.com/) to help with python patterns. – Myonara

+0

I forgot to add the text string that I already have.

Answers

0

This is not an advisable approach. Instead, use an HTML parsing library such as bs4 (BeautifulSoup) to find the links, then use a regex only to filter the results.

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import re 

url="http://normanpd.normanok.gov/content/daily-activity" 
response = urlopen(url).read() 
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)')) 

for el in links: 
    print("http://normanpd.normanok.gov" + el['href']) 

Output:

http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf 

But if you are required to use only regular expressions, then try something simpler:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
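The question also mentions downloading the pdfs to a specified location; once the relative links are collected, `urllib.request.urlretrieve` can fetch each one. A minimal sketch, where the destination folder name `incident_pdfs` and the helper `download_pdfs` are assumptions for illustration:

```python
import os
import urllib.parse
import urllib.request

def download_pdfs(links, dest_dir='incident_pdfs',
                  base='http://normanpd.normanok.gov/'):
    """Fetch each relative link and save it under dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)  # dest_dir is a hypothetical location
    saved = []
    for link in links:
        # Use the decoded last path segment as the local file name.
        filename = urllib.parse.unquote(link.rsplit('/', 1)[-1])
        path = os.path.join(dest_dir, filename)
        urllib.request.urlretrieve(base + link, path)
        saved.append(path)
    return saved
```

For example, `download_pdfs(urls)` after the `findall` above would save each Incident summary under the chosen folder.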
+0

Your code works for me. I also wrote a regex. –

+0

Thanks @ettore-rizza. I also wrote a regex, `s = re.findall(r'\/file[\w|\/|-|%]+Incident[\w|%]*\.pdf', v)`; although it is not efficient, it has worked for me. –

+0

The most important thing is to find a recurring pattern. If all the files begin with "filebrowser_download" and end with ".pdf", why rack your brains? –

0

Here is a simple way using BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://normanpd.normanok.gov'
open_page = urlopen(url + '/content/daily-activity').read()

soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    if current and current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(url, current))