2012-08-17 52 views
3

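URL-matching regular expression is very slow on some strings

Here is my regular expression for finding URLs in strings (I need a group for the domain, because further processing is based on the domain). I noticed that for some strings, 'fffffffff' in this example, it is very slow. Is there something obvious I'm missing?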

>>> URL_ALLOWED = r"[a-z0-9$-_.+!*'(),%]" 
>>> URL_RE = re.compile(
...  r'(?:(?:https?|ftp):\/\/)?' # protocol 
...  r'(?:www.)?' # www 
...  r'(' # host - start 
...   r'(?:' 
...    r'[a-z0-9]' # first character of domain('-' not allowed) 
...    r'(?:' 
...     r'[a-z0-0-]*' # characters in the middle of domain 
...     r'[a-z0-9]' # last character of domain('-' not allowed) 
...    r')*' 
...    r'\.' # dot before next part of domain name 
...   r')+' 
...   r'[a-z]{2,10}' # TLD 
...   r'|' # OR 
...   r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}' # IP address 
...  r')' # host - end 
...  r'(?::[0-9]+)?' # port 
...  r'(?:\/%(allowed_chars)s+/?)*' # path 
...  r'(?:\?(?:%(allowed_chars)s+=%(allowed_chars)s+&)*' # GET params 
...  r'%(allowed_chars)s+=%(allowed_chars)s+)?' # last GET param 
...  r'(?:#[^\s]*)?' % { # anchor 
...   'allowed_chars': URL_ALLOWED 
...  }, 
...  re.IGNORECASE 
...) 
>>> from time import time 
>>> strings = [ 
...  'foo bar baz', 
...  'blah blah blah blah blah blah', 
...  'f' * 10, 
...  'f' * 20, 
...  'f' * 30, 
...  'f' * 40, 
... ] 
>>> def t(): 
...  for string in strings: 
...    t1 = time() 
...    URL_RE.findall(string) 
...    print string, time() - t1 
... 
>>> t() 
foo bar baz 3.91006469727e-05 
blah blah blah blah blah blah 6.98566436768e-05 
ffffffffff 0.000313997268677 
ffffffffffffffffffff 0.183916091919 
ffffffffffffffffffffffffffffff 178.445468903 

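Yes, I know there is another solution: use a very simple regular expression (for example, one that matches words containing a dot) and then use urlparse to get the domain. But urlparse doesn't work as expected when the URL has no protocol: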

>>> urlparse('example.com') 
ParseResult(scheme='', netloc='', path='example.com', params='', query='', fragment='') 
>>> urlparse('http://example.com') 
ParseResult(scheme='http', netloc='example.com', path='', params='', query='', fragment='') 
>>> urlparse('example.com/test/test') 
ParseResult(scheme='', netloc='', path='example.com/test/test', params='', query='', fragment='') 
>>> urlparse('http://example.com/test/test') 
ParseResult(scheme='http', netloc='example.com', path='/test/test', params='', query='', fragment='') 
>>> urlparse('example.com:1234/test/test') 
ParseResult(scheme='example.com', netloc='', path='1234/test/test', params='', query='', fragment='') 
>>> urlparse('http://example.com:1234/test/test') 
ParseResult(scheme='http', netloc='example.com:1234', path='/test/test', params='', query='', fragment='') 

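Yes, prepending http:// is also a solution (though I'm not 100% sure there are no other problems with urlparse), but I'm curious what's wrong with the regular expression anyway.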
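A minimal sketch of the "prepend http://" workaround, written for Python 3 (in Python 2 the import would be `from urlparse import urlparse`). The helper name `extract_host` and the `'://'` check are illustrative assumptions, not part of the original code. The check is naive: a scheme-less string that contains `://` later on, for example inside a query string, would be passed through unchanged.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def extract_host(candidate, default_scheme='http'):
    """Return the lower-cased host of a URL-like string, or None if there is none.

    Hypothetical helper: prepends a scheme when none is present so that
    urlparse puts the host in netloc instead of path or scheme.
    """
    if '://' not in candidate:
        candidate = '%s://%s' % (default_scheme, candidate)
    return urlparse(candidate).hostname


for s in ('example.com',
          'example.com/test/test',
          'example.com:1234/test/test',
          'http://example.com:1234/test/test'):
    print(s, '->', extract_host(s))
```

All four inputs yield the same host, including the `example.com:1234/...` case that urlparse otherwise reads as a scheme.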

+3

This regex makes my brain hurt – chucksmash 2012-08-17 12:38:31

+2

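My gut feeling is that with a long run of h's or f's (or any substring that never turns into a match), a lot of possible starting points for the pattern suddenly appear. Have you considered preprocessing the string by splitting it into whitespace-delimited tokens and then running a simpler regex against those tokens? Trying to do everything in one pass isn't always the fastest approach. – 2012-08-17 13:13:55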
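The tokenizing idea from this comment could look roughly like the sketch below. The names `SIMPLE_RE` and `find_candidates`, the "contains an interior dot" rule, and the set of punctuation stripped from each token are all assumptions chosen for illustration. The filter is deliberately loose and will produce false positives such as `e.g.x`.

```python
import re

# A token counts as a candidate if it has at least one dot with
# non-dot, non-space text on both sides. The two character classes
# exclude '.', so there is only one way to split a token into parts.
SIMPLE_RE = re.compile(r'[^\s.]+(?:\.[^\s.]+)+')


def find_candidates(text):
    """Split on whitespace and keep tokens that look URL-like."""
    candidates = []
    for token in text.split():
        # Drop surrounding punctuation such as a sentence-final period.
        token = token.strip('.,;:!?()"\'')
        if SIMPLE_RE.fullmatch(token):
            candidates.append(token)
    return candidates


print(find_candidates('visit example.com/test or ffffffffff now.'))
print(find_candidates('f' * 40))
```

Each surviving token can then be handed to a stricter check or to urlparse, so the expensive work only runs on short strings.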

+0

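urlparse does work as expected. It's just that what you are passing in isn't a URL. "example.com" is not a URL, and neither is "myshellserver:22". You have to be prepared to accept that this approach will sometimes produce false positives; if that is acceptable, the simple regex-with-a-dot may be fine. Otherwise, I agree with IamChuckB – moopet 2012-08-17 14:16:52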

Answers

3

I think it's because of this part:

...   r'(?:' 
...    r'[a-z0-9]' # first character of domain('-' not allowed) 
...    r'(?:' 
...     r'[a-z0-0-]*' # characters in the middle of domain 
...     r'[a-z0-9]' # last character of domain('-' not allowed) 
...    r')*' 
...    r'\.' # dot before next part of domain name 
...   r')+' 

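You shouldn't use a construct like ([set_of_symbols#1]*[set_of_symbols#2])* when set_of_symbols#1 and set_of_symbols#2 share symbols. On a run of characters that both sets accept, the engine can divide the run between the inner and the outer repetition in exponentially many ways, and it tries all of them before giving up on a match.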
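To get a feel for the numbers, the sketch below counts how many ways a run of n identical characters can be cut into consecutive non-empty chunks, one chunk per pass through the outer group. This is a simplified model of the search space, not a trace of what the `re` engine actually does; `count_splits` is a hypothetical helper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into non-empty chunks."""
    if n == 0:
        return 1  # nothing left to split: exactly one way
    # Pick the length of the first chunk, then split whatever remains.
    return sum(count_splits(n - first) for first in range(1, n + 1))


for n in (5, 10, 20, 30):
    print(n, count_splits(n))
```

The count doubles with every extra character (2 ** (n - 1) for n >= 1), which is in line with the jump from well under a second at 20 characters to minutes at 30 in the timings in the question.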

Try the following code instead:

...   r'(?:' 
...    r'[a-z0-9]' # first character of domain('-' not allowed) 
...    r'[a-z0-9-]*' # characters in the middle of domain (typo '0-0' corrected to '0-9')
...    r'(?<=[a-z0-9])' # last character of domain('-' not allowed) 
...    r'\.' # dot before next part of domain name 
...   r')+' 

That should work better.
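A rough way to compare the two forms is sketched below. It isolates only the host part of the pattern and assumes the middle character class is meant to be `[a-z0-9-]`; the names `NESTED`, `FLAT`, and `timed_search` are illustrative. Timings depend on the machine and Python version, so none are shown.

```python
import re
import time

# Nested form from the question: a '*' inside a group that is itself starred.
NESTED = re.compile(
    r'(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])*\.)+[a-z]{2,10}', re.IGNORECASE)
# Flattened form from this answer: a single '*' plus a lookbehind that
# checks the last character of the label.
FLAT = re.compile(
    r'(?:[a-z0-9][a-z0-9-]*(?<=[a-z0-9])\.)+[a-z]{2,10}', re.IGNORECASE)


def timed_search(pattern, text):
    """Return (match, elapsed_seconds) for one search."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


# Keep n small: the nested form roughly doubles its time per extra character.
for n in (10, 14, 18):
    text = 'f' * n
    _, nested_time = timed_search(NESTED, text)
    _, flat_time = timed_search(FLAT, text)
    print(n, 'nested: %.6f s' % nested_time, 'flat: %.6f s' % flat_time)
```

Both forms should accept the same host names; only the amount of backtracking on non-matching input differs.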

+0

Thanks, after changing that part and fixing the typos @ridgerunner found, it works damn fast :) – virhilo 2012-08-17 15:25:40

0

FYI, you can also use the re.VERBOSE option to make this more readable:

URL_RE = re.compile(r""" 
    (?:(?:https?|ftp):\/\/)?       # protocol 
    (?:www\.)?          # www (dot escaped so it only matches a literal '.')
    (             # host - start 
     (?: 
      [a-z0-9]         # first character of domain('-' not allowed) 
      (?: 
       [a-z0-9-]*        # characters in the middle of domain (typo '0-0' corrected to '0-9')
       [a-z0-9]        # last character of domain('-' not allowed) 
      )* 
      \.           # dot before next part of domain name 
     )+ 
     [a-z]{2,10}          # TLD 
     |            # OR 
     (?:[0-9]{1,3}\.){3}[0-9]{1,3}     # IP address 
    )             # host - end 
    (?::[0-9]+)?          # port 
    (?:\/%(allowed_chars)s+/?)*       # path 
    (?:\?(?:%(allowed_chars)s+=%(allowed_chars)s+&)* # GET params 
    %(allowed_chars)s+=%(allowed_chars)s+)?    # last GET param 
    (?:\#[^\s]*)?          # anchor ('#' escaped; a bare '#' starts a comment under re.VERBOSE)
""" % {
     'allowed_chars': URL_ALLOWED 
    }, 
    re.IGNORECASE|re.VERBOSE 
)
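One thing to watch when moving a pattern to re.VERBOSE: outside a character class, an unescaped `#` starts a comment that runs to the end of the line, so the anchor part has to be written with `\#`. The small check below is only a sketch; `unescaped_ok` and `ANCHOR_RE` are names made up for it.

```python
import re

# With a bare '#', everything after it becomes a comment, which leaves
# the '(?:' group unclosed and makes compilation fail.
try:
    re.compile(r"(?:#[^\s]*)?  # anchor", re.VERBOSE)
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaping the '#' makes it a literal character again.
ANCHOR_RE = re.compile(r"""
    (?:\#[^\s]*)?   # anchor: the escaped '#' is matched literally
""", re.VERBOSE)

print(unescaped_ok)
print(ANCHOR_RE.match('#section-2').group(0))
```

Whitespace in the pattern is ignored in the same way, but whitespace and `#` inside a character class such as `[^\s]` keep their literal meaning.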