Regex in Python isn't giving me the results I want

I'm new to Python and am trying to write a script that scrapes a website and returns the text inside a few links. For some reason it isn't working, and I'd like to understand why. My regex is:

> regex = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>')

Full code:

import requests, re 

response = requests.get('websithere') 

websiteDate = response.text 

regex = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>') 
mo = regex.findall(websiteDate) 
print(mo)

I put (.+) in a group, thinking it would capture whatever text is listed there. The 3 links it scans through are:

> <a target="_blank" title="Farm Business &amp; Production Management 
> Instructor" href="/uploadedpdfs/job-opportunities/Farm Business 
> Production Mgt Instructor 8-17.pdf">Farm Business &amp; Production 
> Management Instructor</a> 
> 
> <a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" 
> href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor 
> Aide.pdf">Paramedic Tech Adjunct Instructor Aide</a> 
> 
> <a target="_blank" title="Technology Support Specialist" 
> href="/uploadedpdfs/job-opportunities/Technology Support 
> Specialist.pdf">Technology Support Specialist</a>

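Instead, my result only returns: "Technology Support Specialist"

What am I doing wrong here? I'm only trying to return the text inside the tags. I've experimented a bit and couldn't get it to work. Any help would be appreciated.

Thanks!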

Source

2017-08-07 Winks

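Which statement do you execute to get the output shown in the post? Please paste all the relevant code. As a side note, do not use REGEX to parse HTML (https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la). Use BeautifulSoup. – DyZ

Do not use regex to parse HTML. –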

Simple: the title=".+" part of your regex is greedy, so it consumes everything from the start of the first title to the end of the last title:

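> Farm Business &amp; Production Management Instructor" href="/uploadedpdfs/job-opportunities/Farm Business Production Mgt Instructor 8-17.pdf">Farm Business &amp; Production Management Instructor</a> <a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor Aide.pdf">Paramedic Tech Adjunct Instructor Aide</a> <a target="_blank" title="Technology Support Specialist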
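
To make the greedy behaviour concrete, here is a minimal sketch that runs the question's pattern against the three links. It assumes the links sit on a single line in the page source (the wrapping in the question looks like display formatting), and the `bounded` pattern is only an illustration of how character classes stop the over-matching, not a recommended way to parse HTML.

```python
import re

# The three links from the question, assumed to be on one line of the page source.
html = (
    '<a target="_blank" title="Farm Business &amp; Production Management Instructor" '
    'href="/uploadedpdfs/job-opportunities/Farm Business Production Mgt Instructor 8-17.pdf">'
    'Farm Business &amp; Production Management Instructor</a> '
    '<a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" '
    'href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor Aide.pdf">'
    'Paramedic Tech Adjunct Instructor Aide</a> '
    '<a target="_blank" title="Technology Support Specialist" '
    'href="/uploadedpdfs/job-opportunities/Technology Support Specialist.pdf">'
    'Technology Support Specialist</a>'
)

# Pattern from the question: each ".+" is greedy and runs as far right as it can.
greedy = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>')

# Illustrative alternative: [^"]+ cannot cross a closing quote and [^<]+ cannot
# cross the next tag, so each match stays inside a single link.
bounded = re.compile(r'<a target="_blank" title="[^"]+" href="[^"]+\.pdf">([^<]+)</a>')

greedy_result = greedy.findall(html)
bounded_result = bounded.findall(html)

print(greedy_result)   # ['Technology Support Specialist']
print(bounded_result)  # three entries; note that &amp; is left undecoded
```

Even the bounded version breaks as soon as the attribute order changes or a link contains nested markup, and it leaves entities such as `&amp;` untouched, which is part of why a real HTML parser is the better tool.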

DO NOT USE REGEX TO PARSE HTML

Use BeautifulSoup instead.
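
As a rough sketch of the parser-based approach, the example below uses only the standard library's `html.parser`, so it runs without installing anything; it is a stand-in for BeautifulSoup, not BeautifulSoup itself. With BeautifulSoup the same idea would be roughly a `soup.find_all("a")` loop calling `get_text()` on each tag. The class name `PdfLinkText` and the shortened sample markup are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class PdfLinkText(HTMLParser):
    """Collect the text of every <a> element whose href ends in .pdf."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so &amp; arrives as &
        self.texts = []
        self._buffer = None  # list of text chunks while inside a matching <a>, else None

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.lower().endswith(".pdf"):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical sample markup modelled on the links in the question.
sample = (
    '<a target="_blank" title="Farm Business &amp; Production Management Instructor" '
    'href="/uploadedpdfs/job-opportunities/Farm Business Production Mgt Instructor 8-17.pdf">'
    'Farm Business &amp; Production Management Instructor</a> '
    '<a target="_blank" title="Technology Support Specialist" '
    'href="/uploadedpdfs/job-opportunities/Technology Support Specialist.pdf">'
    'Technology Support Specialist</a> '
    '<a href="/about.html">About</a>'
)

parser = PdfLinkText()
parser.feed(sample)
parser.close()
print(parser.texts)
```

Because the parser tracks where each tag starts and ends, every link yields its own text, the entity is decoded, and the non-PDF link is skipped, regardless of attribute order.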

Source

2017-08-07 02:46:37 DyZ

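OK, so I'm not very familiar with BeautifulSoup, but I've used it a little. Besides regex, is there something else I can use in BeautifulSoup to narrow down the results I read? And what's the reasoning behind not using regex on websites? – Winks

The extensive BS documentation has examples of how to extract link titles from HTML. Help yourself. – DyZ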
