2017-02-27
0

Given url = 'http://normanpd.normanok.gov/content/daily-activity', the site has three types of documents: arrests, incidents, and case summaries. I have been asked to use a regular expression in Python to find the URL strings of all the Incident pdf documents. The regular expression should locate the exact pdf links in the web page.

The pdfs will then be downloaded to a specified location.

I have gone through the links and found that the URLs of the Incident pdf files are of the form:

normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf 

I have written this code:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"

response = urllib.request.urlopen(url)

data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)

But the list of urls is empty. I am a beginner with python3 and regex commands. Can anyone help me?
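A likely reason the pattern matches nothing: `\sIncident\s` requires literal whitespace around "Incident", but in the page source the spaces are URL-encoded as %20, and `$` anchors the match to the end of a line. (With two capture groups, `findall` would also return tuples of group fragments rather than whole URLs.) A small sketch against a sample anchor tag, modeled on the URL form quoted above:

```python
import re

# Sample anchor line, modeled on the URL form quoted in the question.
html = '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf">'

# \s never matches the literal "%20" in the page source, and "$" anchors the
# match to the end of a line, so the original pattern finds nothing.
print(re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', html))  # []

# Matching the encoded text directly does find the link:
print(re.findall(r'/filebrowser_download\S*?Incident\S*?\.pdf', html))
# ['/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf']
```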

+0

You have Incident in your regex, but not in the string. There is [this site](https://regex101.com/) to help with python patterns. – Myonara

+0

I forgot to add the text string that I already have.

Answers

0

This is not an advisable approach. Instead, use an HTML parsing library such as bs4 (BeautifulSoup) to find the links, then use a regex only to filter the results.

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import re 

url="http://normanpd.normanok.gov/content/daily-activity" 
response = urlopen(url).read() 
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)')) 

for el in links: 
    print("http://normanpd.normanok.gov" + el['href']) 

Output:

http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf 
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf 

But if you are required to use only regular expressions, then try something simpler:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
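The question also mentions downloading the pdfs to a specified location; once the relative links are collected, `urllib.request.urlretrieve` can fetch each one. A minimal sketch, where the destination folder name `incident_pdfs` and the helper `download_pdfs` are assumptions for illustration:

```python
import os
import urllib.parse
import urllib.request

def download_pdfs(links, dest_dir='incident_pdfs',
                  base='http://normanpd.normanok.gov/'):
    """Fetch each relative link and save it under dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)  # dest_dir is a hypothetical location
    saved = []
    for link in links:
        # Use the decoded last path segment as the local file name.
        filename = urllib.parse.unquote(link.rsplit('/', 1)[-1])
        path = os.path.join(dest_dir, filename)
        urllib.request.urlretrieve(base + link, path)
        saved.append(path)
    return saved
```

For example, `download_pdfs(urls)` after the `findall` above would save each Incident summary under the chosen folder.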
+0

Your code works for me. I also wrote a regex. –

+0

Thanks @ettore-rizza. I also wrote a regex, `s = re.findall(r'\/file[\w|\/|-|%]+Incident[\w|%]*\.pdf', v)`; although it is not efficient, it has worked for me. –

+0

The most important thing is to find a recurring pattern. If all the files begin with "filebrowser_download" and end with ".pdf", why rack your brains? –

0

Here is a simple way using BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://normanpd.normanok.gov'
open_page = urlopen(url + '/content/daily-activity').read()

soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    if current and current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(url, current))