2017-05-20 58 views

Python - Why doesn't this findall regex find this particular text?

Edit: Please don't downvote without leaving a comment explaining why. I'm doing my best to write this post!

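I am trying to print the URL links of all the watches on a website. All of them print except one, even though that one meets exactly the same regex conditions as the others. Can someone explain why it isn't printing? Have I got some syntax wrong somewhere? The code below can be pasted into a Python editor (e.g. IDLE) and run.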

## Import required modules 
from urllib import urlopen 
from re import findall 
import re 

## Provide URL 
dennisov_url = 'https://denissov.ru/en/' 

## Open and read URL as string named 'dennisov_html' 
dennisov_html = urlopen(dennisov_url).read() 

## Find all of the links opened when each watch is clicked (those with the
## designated preceding text 'window.open', then any character that occurs
## zero or more times, then the text '/en/'). Remove matches with the word
## "History" and any " symbols in the URL.
watch_link_urls = findall('window.open.*(/en/[^history][^"]*/)', dennisov_html) 
## For every URL, convert it into a string on a new line and add the domain 
for link in watch_link_urls: 
    link = 'https://denissov.ru' + link 
## Print out the full URLs 
    print link 

## This code should show the link https://denissov.ru/en/speedster/ yet
## it isn't showing. It has the exact preceding text as the other links
## that are printing and is in the same div container. If you inspect the
## website then search 'en/barracuda_mechanical/' and then 'en/speedster/'
## you will see that the speedster link is only a few lines below barracuda
## mechanical and there is nothing different about the two's preceding
## text, so speedster should be printing

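Oh yes, the '[^history][^"]*' part is the problem. It means one character that is not h, i, s, t, o, r or y, followed by any number of characters other than ". Since "speedster" starts with s, that link is rejected.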
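The point made in this comment can be checked on a small sample. The sketch below is written for Python 3 and uses a made-up `sample_html` string as a stand-in for the page (the real markup may differ). It compares the original pattern with one possible fix that swaps the character class for a negative lookahead; that fix is an illustration, not necessarily the only way to repair the pattern.

```python
import re

# Hypothetical stand-in for the page markup; the real HTML may differ.
sample_html = """
<div onclick="window.open('/en/barracuda_mechanical/')"></div>
<div onclick="window.open('/en/speedster/')"></div>
<div onclick="window.open('/en/history/')"></div>
"""

# Original pattern: [^history] is a character class, so it rejects any path
# whose first letter is h, i, s, t, o, r or y (including "speedster").
original = re.findall(r'window.open.*(/en/[^history][^"]*/)', sample_html)

# Negative lookahead: reject only paths that literally begin with "history".
fixed = re.findall(r'window\.open.*(/en/(?!history)[^"]*/)', sample_html)

print(original)  # the speedster link is missing
print(fixed)     # the speedster link is present, the history link is not
```

On this sample the original pattern keeps only the barracuda link, while the lookahead version also keeps speedster and still drops history.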

Answers


You can try this code, which uses the following pattern:

from urllib2 import urlopen 
import re 

url = 'https://denissov.ru/en/' 
data = urlopen(url).read() 
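# Capture the path passed to each window.open('...') call
sub_urls = re.findall(r"window\.open\('(/.*?)'", data)
# Keep everything, duplicates included:
# final_urls = [k for k in sub_urls if '/history' not in k and k != '']
# Or: remove duplicates (note that a set does not preserve order)
final_urls = set(k for k in sub_urls if '/history' not in k)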

for k in final_urls: 
    link = 'https://denissov.ru' + k 
    print link 

This will output something like:

https://denissov.ru/eng/denissovdesign/index.html 
https://denissov.ru/en/barracuda_limited/ 
https://denissov.ru/en/barracuda_chronograph/ 
https://denissov.ru/en/barracuda_mechanical/ 
https://denissov.ru/en/speedster/ 
https://denissov.ru/en/free_rider/ 
https://denissov.ru/en/nau_automatic/ 
https://denissov.ru/en/lady_flower/ 
https://denissov.ru/en/enigma/ 
https://denissov.ru/en/number_one/ 
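If the links should come out in page order, as in the listing above, a set is not guaranteed to keep that order. A minimal Python 3 sketch of the same non-greedy capture with an order-preserving filter follows; `sample_html` is a hypothetical stand-in for the downloaded page, not the real markup.

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = (
    "<a onclick=\"window.open('/en/speedster/')\">"
    "<a onclick=\"window.open('/en/history/')\">"
    "<a onclick=\"window.open('/en/speedster/')\">"
    "<a onclick=\"window.open('/en/enigma/')\">"
)

# Non-greedy capture of the path inside each window.open('...') call
sub_urls = re.findall(r"window\.open\('(/.*?)'", sample_html)

# Drop history links and duplicates while keeping first-seen order
final_urls = []
for path in sub_urls:
    if '/history' not in path and path not in final_urls:
        final_urls.append(path)

full_urls = ['https://denissov.ru' + path for path in final_urls]
for link in full_urls:
    print(link)
```

The list-membership check is quadratic, which is fine for a handful of links; for a large page a separate "seen" set would be the usual choice.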

If you want a regex that gets every URL that starts with en/ and does not contain the word history, you should use a tempered greedy solution, like this:

en\/(?:(?!history).)*?\/ 
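  • (?:(?!history).)*? is a tempered dot: it matches any character, as long as the word history does not start at that position.
    • (?!history) is the negative lookahead that enforces this.
    • ?: is added to make the group non-capturing.
    • *? makes the match non-greedy, so it only runs up to the first /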
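The behaviour described in these bullets can be tried on a few strings. This is a small Python 3 sketch; the entries in `paths` are invented for illustration and are not taken from the site, apart from the speedster and free_rider paths that appear in the thread.

```python
import re

# The pattern from this answer, with the escaped slashes kept as written.
tempered = re.compile(r'en\/(?:(?!history).)*?\/')

# Hypothetical paths, chosen only to exercise the pattern.
paths = ['/en/speedster/', '/en/history/', '/en/our_history/', '/en/free_rider/']

results = {}
for path in paths:
    match = tempered.search(path)
    results[path] = match.group(0) if match else None
    print(path, '->', results[path])
```

Note that the tempered dot rejects history anywhere in the path segment, so '/en/our_history/' is excluded as well. A single lookahead placed right after en/ would only exclude segments that begin with history.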

Regex101 Demo

Change the Python code to this:

watch_link_urls = findall('window.open.*(/en\/(?:(?!history).)*?\/)', dennisov_html) 

Output:

https://denissov.ru/en/barracuda_limited/ 
https://denissov.ru/en/barracuda_chronograph/ 
https://denissov.ru/en/barracuda_mechanical/ 
https://denissov.ru/en/speedster/ 
https://denissov.ru/en/free_rider/ 
https://denissov.ru/en/nau_automatic/ 
https://denissov.ru/en/lady_flower/ 
https://denissov.ru/en/enigma/ 
https://denissov.ru/en/number_one/ 

Read more about the tempered greedy token here
