Python - Why doesn't the findall regex find this particular text?


EDIT: Please don't downvote without leaving a comment explaining why. I am doing my best to write this post!

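I am trying to print the URL links of all the watches on a website. All of them print except one, even though that one meets exactly the same regex conditions as the others. Can someone please explain why it isn't printing? Have I got some syntax wrong somewhere? The code below can be pasted into a Python editor (e.g. IDLE) and run.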

## Import required modules 
from urllib import urlopen 
from re import findall 
import re 

## Provide URL 
dennisov_url = 'https://denissov.ru/en/' 

## Open and read URL as string named 'dennisov_html' 
dennisov_html = urlopen(dennisov_url).read() 

## Find all of the links opened when each watch is clicked (those with the
## preceding text 'window.open', then any character occurring zero or more
## times, then the text '/en/'). Exclude matches containing the word "history"
## and any " symbols in the URL.
watch_link_urls = findall('window.open.*(/en/[^history][^"]*/)', dennisov_html) 
## For every URL, convert it into a string on a new line and add the domain 
for link in watch_link_urls: 
    link = 'https://denissov.ru' + link 
## Print out the full URLs 
    print link 

## This code should show the link https://denissov.ru/en/speedster/ yet
## it isn't showing. It has exactly the same preceding text as the other
## links that are printing and is in the same div container. If you inspect
## the website and search for 'en/barracuda_mechanical/' and then
## 'en/speedster/', you will see that the speedster link is only a few lines
## below barracuda mechanical and that the preceding text of the two is no
## different, so speedster should be printing.

Source

2017-05-20 user88720

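Oh yes, the '[^history][^"]*' part is the problem: it means any one character except h, i, s, t, o, r, y, followed by any characters other than '"'. That is why '/en/speedster/' is rejected: it starts with 's'.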
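The effect described in this comment can be reproduced offline. The following is a minimal Python 3 sketch, not the asker's original Python 2 script; the sample lines are hypothetical markup modelled on the `window.open('/en/.../')` calls mentioned in the question, not copied from the site.

```python
import re

# The question's pattern. [^history] is a character class: it matches ONE
# character that is not h, i, s, t, o, r or y. It does not mean
# "anything except the word history".
PATTERN = r'window.open.*(/en/[^history][^"]*/)'

# Hypothetical sample lines (assumed markup, for illustration only).
SAMPLE_LINES = [
    '''<a onclick="window.open('/en/barracuda_mechanical/')">''',
    '''<a onclick="window.open('/en/speedster/')">''',
    '''<a onclick="window.open('/en/history/')">''',
]


def find_links(line):
    """Return the captured paths for one line of HTML."""
    return re.findall(PATTERN, line)


for line in SAMPLE_LINES:
    print(find_links(line))
```

Only the barracuda line yields a match. The speedster line is dropped for the same reason as the history line: the first character after '/en/' is one of the excluded letters.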

You can try this code, which uses the following pattern:

from urllib2 import urlopen 
import re 

url = 'https://denissov.ru/en/' 
data = urlopen(url).read() 
sub_urls = re.findall('window.open\(\'(/.*?)\'', data)
# Keep everything, without removing duplicates:
# final_urls = [k for k in sub_urls if '/history' not in k and k != '']
# Or: remove duplicates
final_urls = set(k for k in sub_urls if '/history' not in k)

for k in final_urls: 
    link = 'https://denissov.ru' + k 
    print link

It will output something like this:

https://denissov.ru/eng/denissovdesign/index.html 
https://denissov.ru/en/barracuda_limited/ 
https://denissov.ru/en/barracuda_chronograph/ 
https://denissov.ru/en/barracuda_mechanical/ 
https://denissov.ru/en/speedster/ 
https://denissov.ru/en/free_rider/ 
https://denissov.ru/en/nau_automatic/ 
https://denissov.ru/en/lady_flower/ 
https://denissov.ru/en/enigma/ 
https://denissov.ru/en/number_one/
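For readers on Python 3, the same idea can be sketched without a network call. This is an illustrative adaptation, not the answer's exact code: `SAMPLE_HTML` is a hypothetical fragment, `extract_links` is a helper name introduced here, the dot in `window\.open` is escaped, and duplicates are removed with `dict.fromkeys` so that the first-seen order is kept (a plain `set` does not guarantee order).

```python
import re

# Hypothetical HTML fragment (assumed markup, not copied from the site).
SAMPLE_HTML = """
<a onclick="window.open('/en/barracuda_limited/')">
<a onclick="window.open('/en/speedster/')">
<a onclick="window.open('/en/history/')">
<a onclick="window.open('/en/speedster/')">
"""

BASE_URL = 'https://denissov.ru'


def extract_links(html):
    """Return full URLs for every window.open('/...') path, skipping history pages."""
    # Capture the path lazily, up to the closing single quote
    paths = re.findall(r"window\.open\('(/.*?)'", html)
    # dict.fromkeys drops duplicates while keeping first-seen order
    unique_paths = dict.fromkeys(p for p in paths if '/history' not in p)
    return [BASE_URL + p for p in unique_paths]


for url in extract_links(SAMPLE_HTML):
    print(url)
```

To run it against the live page, the HTML string would come from `urllib.request.urlopen(url).read().decode()` instead of the sample, assuming the page still uses this markup.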

Source

2017-05-20 05:55:29

If you want a regex that gets all URLs that start with en/ and do not contain the word history, you should use a tempered greedy token, like this:

en\/(?:(?!history).)*?\/

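(?:(?!history).)*? is a tempered dot: it matches any single character, as long as the substring history does not begin at that position.
- (?!history) is the negative lookahead that enforces this.
- ?: marks the group as non-capturing.
- *? makes the match non-greedy, so it only matches up to the first /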
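The behaviour of the tempered dot can be checked in isolation. The sketch below is Python 3 and uses hypothetical test strings; `first_match` is a helper name introduced here for illustration. The `\/` escapes from the pattern above are written as plain `/`, which is equivalent in Python's `re` module.

```python
import re

# Tempered dot: consume a character only if "history" does not start there.
TEMPERED = re.compile(r'en/(?:(?!history).)*?/')


def first_match(text):
    """Return the first tempered-dot match in text, or None if there is none."""
    m = TEMPERED.search(text)
    return m.group(0) if m else None


print(first_match('/en/speedster/'))        # a path starting with 's' is accepted
print(first_match('/en/history/'))          # rejected by the negative lookahead
print(first_match('/en/speedster/extra/'))  # lazy *? stops at the first '/'
```

Unlike the character class in the question, the lookahead tests for the whole word, so names that merely share letters with "history" are still matched.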

Regex101 Demo

Change the Python code to this:

watch_link_urls = findall('window.open.*(/en\/(?:(?!history).)*?\/)', dennisov_html)

Output:

https://denissov.ru/en/barracuda_limited/ 
https://denissov.ru/en/barracuda_chronograph/ 
https://denissov.ru/en/barracuda_mechanical/ 
https://denissov.ru/en/speedster/ 
https://denissov.ru/en/free_rider/ 
https://denissov.ru/en/nau_automatic/ 
https://denissov.ru/en/lady_flower/ 
https://denissov.ru/en/enigma/ 
https://denissov.ru/en/number_one/

Read more about the tempered greedy token here.

Source

2017-05-20 11:28:07 degant
