2017-06-23 69 views
0

I have a list of URLs that I want to parse to extract the numbers:

['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm'] 

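I want to use a regular expression to build a new list containing the number at the end of each string, plus any letters that come before the punctuation (some strings contain numbers in two places, as in the first string in the list above). So the new list would look like: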

['20170303', '20160929a', '20161005a'] 

This is what I have tried, without any luck:

code = re.search(r'?[0-9a-z]*', urls) 

Update:

Running -

[re.search(r'(\d+)\D+$', url).group(1) for url in urls] 

I get the following error -

AttributeError: 'NoneType' object has no attribute 'group' 

Also, it doesn't look like this would pick up a letter after the number, if there is one.
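A minimal sketch of where that error comes from: `re.search` returns `None` when nothing matches, so calling `.group(1)` on the result fails. The second URL below is made up for illustration (it is an assumption, not one of the real URLs); it ends in a digit, so `\D+$` has nothing to match.

```python
import re

urls = [
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://example.com/speech/2017',  # hypothetical URL ending in a digit
]

pattern = re.compile(r'(\d+)\D+$')

results = []
for url in urls:
    m = pattern.search(url)
    # re.search returns None when there is no match, so check before .group()
    results.append(m.group(1) if m else None)

print(results)  # ['20160929', None]
```

The first result also shows the second problem: the trailing `a` is consumed by `\D+` instead of being captured.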

+0

Maybe [`re.search(r'.*\D(\d\w*)', s)`](https://regex101.com/r/gZpX4t/2) will do. –

+0

You can try `\d[^/.]*(?=\.\w+$)` – horcrux
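For comparison, here is a rough sketch that runs both patterns from the comments above against the three URLs in the question. The helper `first_match` is a hypothetical name introduced only for this example.

```python
import re

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

# First comment: the greedy .* pushes the match to the last digit run
# that is preceded by a non-digit.
last_run = re.compile(r'.*\D(\d\w*)')

# Second comment: a digit, then anything except '/' or '.', which must be
# followed (lookahead) by the file extension at the end of the string.
before_ext = re.compile(r'\d[^/.]*(?=\.\w+$)')

def first_match(pattern, text, group=0):
    """Return the requested group of the first match, or None."""
    m = pattern.search(text)
    return m.group(group) if m else None

ids_a = [first_match(last_run, u, group=1) for u in urls]
ids_b = [first_match(before_ext, u) for u in urls]

print(ids_a)
print(ids_b)
```

On these three URLs both lists come out as `['20170303', '20160929a', '20161005a']`; other URL shapes may behave differently.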

Answers

0
# python3

import re
from os.path import basename
from urllib.parse import urlparse

def extract_id(url):
    # Search only the file name, so digits elsewhere in the path are ignored
    path = urlparse(url).path
    resource = basename(path)
    _id = re.search(r'\d[^.]*', resource)
    if _id:
        return _id.group(0)

urls = ['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

# Note: ids contains None for any URL where the pattern does not match
ids = [extract_id(url) for url in urls]

print(ids)

Output:

['20170303', '20160929a', '20161005a'] 
+0

This works well, except that the output for the first string in the example does not skip the first 2017. The output is: ['2017/pdf/lacker_speech_20170303', '20160929a', '20161005a'] –

+0

You must have changed something in the regex, because it works now, thanks! –
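The output reported in the comment above can be reproduced by applying the same pattern to the whole path instead of the file name. This is only a guess at what the earlier version of the answer did, sketched here to show why `basename` matters:

```python
import re
from os.path import basename
from urllib.parse import urlparse

url = 'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf'
pattern = re.compile(r'\d[^.]*')

path = urlparse(url).path

# Searching the full path starts at the first digit, i.e. the "2017" directory
on_path = pattern.search(path).group(0)

# Searching only the file name skips every directory component
on_name = pattern.search(basename(path)).group(0)

print(on_path)  # 2017/pdf/lacker_speech_20170303
print(on_name)  # 20170303
```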

0

Consider:

>>> lios=['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm'] 

You can do this:

import re

for s in lios:
    m = re.search(r'(\d+\w*)\D+$', s)
    if m:
        print(m.group(1))

Prints:

20170303 
20160929a 
20161005a 

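This is based on this regex:

(\d+\w*)\D+$
 ^^^          one or more digits
    ^^^       zero or more word characters (e.g. a trailing letter)
        ^^^   one or more non-digits (e.g. the file extension)
           ^  end of string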
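If a list is wanted rather than printed lines, the same expression can be compiled once and used in a comprehension. A small sketch, with `re.VERBOSE` used only so the pieces can be commented inline:

```python
import re

lios = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

pattern = re.compile(r'''
    (\d+\w*)   # group 1: digits, then any trailing word characters
    \D+        # one or more non-digits (the file extension)
    $          # end of string
''', re.VERBOSE)

# Strings that do not match are skipped instead of raising AttributeError
ids = [m.group(1) for m in map(pattern.search, lios) if m]

print(ids)  # ['20170303', '20160929a', '20161005a']
```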
+0

Did you look at the expected output? – horcrux

0

You can use this regex: (\d+[a-z]*)\.

Output

20170303 
20160929a 
20161005a 
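A short sketch of applying this pattern in Python. Since the pattern is not anchored, it can match anywhere a digit run is followed by a dot; taking the last match per URL is an assumption made here, on the grounds that the ID sits in the file name at the end.

```python
import re

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

pattern = re.compile(r'(\d+[a-z]*)\.')

ids = []
for url in urls:
    found = pattern.findall(url)  # list of every captured group in the URL
    # Keep the last hit (assumed to be the one in the file name), or None
    ids.append(found[-1] if found else None)

print(ids)  # ['20170303', '20160929a', '20161005a']
```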
-1
import re

patterns = {
    'url_refs': re.compile(r"(\d+[a-z]*)\."),  # YCF_L
}

def scan(iterable, pattern):
    """Yield the list of matches of pattern for each item in iterable."""
    for item in iterable:
        # To require exactly one match per item, unpack it instead:
        #     reference, = pattern.findall(item)
        # but that is less reusable.
        matches = pattern.findall(item)
        yield matches

You can then do:

hits = scan(urls, pattern=patterns['url_refs']) 
references = (item[0] for item in hits) 

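Feed references to your other functions. Since scan and references are both generators, the URLs are processed lazily, one at a time, which lets you handle larger inputs without building intermediate lists.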
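Put together, the pieces above might run end to end roughly as follows. This is a sketch, not a definitive version: the `if matches` guard is an addition made here so that a URL with no match is skipped instead of raising `IndexError` on `matches[0]`.

```python
import re

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

url_refs = re.compile(r'(\d+[a-z]*)\.')

def scan(iterable, pattern):
    """Yield the list of matches of pattern for each item in iterable."""
    for item in iterable:
        yield pattern.findall(item)

hits = scan(urls, pattern=url_refs)

# Nothing is evaluated until the generator is consumed
references = (matches[0] for matches in hits if matches)

result = list(references)
print(result)  # ['20170303', '20160929a', '20161005a']
```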