Regex to extract part of a link

My goal is to extract some auction IDs from an auction site page. The page is here.

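The page I am interested in has about 60 auction IDs. An auction ID is preceded by a dash, consists of 10 digits, and ends right before .htm. For example, in the link below the ID is 1033346952: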

<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">

I have got as far as extracting the individual links by identifying the 'a' tags. The code is at the bottom of this post.

With my limited knowledge, I would say a regex is the right way to solve this. I was thinking of a regex like this:

-...........htm

However, I have not managed to integrate the regex into the code. I thought something like

for links in soup.find_all('-...........htm'):

would do the trick, but apparently it does not.

How can I fix this code?

import bs4 
import requests 
import re 
res = requests.get('http://www.trademe.co.nz/browse/categorylistings.aspx?mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=2&sort_order=default&rptpath=5-380-50-7145-') 
res.raise_for_status() 
soup = bs4.BeautifulSoup(res.text, 'html.parser') 
for links in soup.find_all('-...........htm'): 
    print (links.get('href'))

2016-02-20 Steve

Here is code that works:

for links in soup.find_all(href=re.compile(r"auction-[0-9]{10}\.htm")):
    h = links.get('href')
    # the parentheses capture only the 10-digit ID
    m = re.search(r"auction-([0-9]{10})\.htm", h)
    if m:
        print(m.group(1))

First you need a regex to select the matching hrefs. Then you need a capturing regex to extract the ID.
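
The two steps can be tried offline on a plain list of href strings. The following is a minimal sketch, assuming the hrefs were already pulled out of the page (for example with links.get('href')). The helper name extract_auction_ids and all sample hrefs except the first are made up for illustration.

```python
import re

# One pattern serves both steps: a match selects auction links,
# and the capture group holds the 10-digit ID.
AUCTION_RE = re.compile(r"auction-([0-9]{10})\.htm")


def extract_auction_ids(hrefs):
    """Return the auction IDs found in hrefs, in order, without duplicates."""
    ids = []
    for href in hrefs:
        m = AUCTION_RE.search(href)
        if m and m.group(1) not in ids:
            ids.append(m.group(1))
    return ids


# Hypothetical sample data; only the first href comes from the question.
sample_hrefs = [
    "/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm",
    "/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm",
    "/browse/categorylistings.aspx?page=3",
    "/sports/cycling/auction-12345.htm",  # too few digits, so it is skipped
]

print(extract_auction_ids(sample_hrefs))
```

Skipping duplicates is a design choice for this sketch, on the assumption that a listing page may link to the same auction more than once.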

2016-02-20 08:37:03 skyline75489

How about expressing the 10-digit string the OP describes with something like '[0-9]{10}'? – Marichyasana

Good point. I had not noticed that before. – skyline75489

You have to hand find_all() a regular expression object; you are only handing over a string that you want to be used as a regular expression pattern.

When learning and debugging this kind of thing, it is useful to work from a cached copy of the site's data until things work:

import bs4 
import requests 
import re 
import os 

# don't want to download while experimenting
tmp_file = 'data.html'

if True and os.path.exists(tmp_file):  # switch True to False in production
    with open(tmp_file) as fp:
        data = fp.read()
else:
    res = requests.get('http://www.trademe.co.nz/browse/categorylistings.aspx?mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=2&sort_order=default&rptpath=5-380-50-7145-')
    res.raise_for_status()
    data = res.text
    with open(tmp_file, 'w') as fp:
        fp.write(data)

soup = bs4.BeautifulSoup(data, 'html.parser') 
# and start experimenting with your regular expressions 
regex = re.compile('...........htm') 
for links in soup.find_all(regex): 
    print (links.get('href')) 
# the above doesn't find anything, you need to search the hrefs 
print('try again') 
for links in soup.find_all(href=regex): 
    print (links.get('href'))

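Once you get some matches, you can improve your regex pattern with more sophisticated techniques. In my experience, though, that matters less than starting with the right "framework" for quick trial and error, rather than waiting for a download every time you test a code change.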
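
The caching idea can also be wrapped in a small helper. This is a minimal sketch, assuming a hypothetical function name load_cached and a fetch callable supplied by the caller (in the real script it would wrap the requests.get call). The fake fetcher and its href are stand-ins so the example runs offline.

```python
import tempfile
from pathlib import Path


def load_cached(path, fetch):
    """Return the text stored at path, calling fetch() only when it is missing."""
    cache = Path(path)
    if cache.exists():
        return cache.read_text(encoding="utf-8")
    data = fetch()
    cache.write_text(data, encoding="utf-8")
    return data


calls = []


def fake_fetch():
    # Stand-in for the real download; records how often it is called.
    calls.append(1)
    return '<a href="/x/auction-1033346952.htm">'  # hypothetical page content


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.html"
    first = load_cached(target, fake_fetch)   # downloads and writes the file
    second = load_cached(target, fake_fetch)  # reads the file instead

print(first == second, len(calls))
```

Passing the download in as a function keeps the helper independent of requests. To force a fresh download, delete the cache file.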

2016-02-20 08:29:48 Anthon

import re 
p = re.compile(r'-(\d{10})\.htm') 
print(p.search('<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">')) 
res = p.search('<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">') 
print(res.group(1))

-(\d{10})\.htm means you want a dash, 10 digits, and .htm. More importantly, the 10 digits are in a capturing group, so you can extract them later.

You search for this pattern and then you have two groups: one with the whole match and one with the capturing group (only the 10 digits).
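
To make the difference between the two groups concrete, here is a small sketch using the same pattern. The second link in the sample string is hypothetical and was added only to show how findall behaves with several matches.

```python
import re

pattern = re.compile(r"-(\d{10})\.htm")
html = (
    '<a href="/sports/cycling/mountain-bikes/full-suspension/'
    'auction-1033346952.htm" class="tile-2">'
    '<a href="/sports/cycling/mountain-bikes/full-suspension/'
    'auction-1033346953.htm" class="tile-2">'  # hypothetical second link
)

match = pattern.search(html)
whole = match.group(0)   # the full match: dash, digits and extension
digits = match.group(1)  # only the captured 10 digits
print(whole, digits)

# With exactly one capturing group, findall returns the group contents.
all_ids = pattern.findall(html)
print(all_ids)
```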

2016-02-20 08:34:52 bartoszukm

In Python you can do this:

import re 
text = """<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">""" 
p = re.compile(r'(?<=<a\shref=").*?(?=")') 
re.findall(p,text) ## ['/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm']

2016-02-20 08:35:20 bmbigbang

As far as I know, 'findall' was deprecated in favour of 'find_all' in 2012. – Anthon

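The re method is still called .findall according to the 3.5.1 documentation. I think you are thinking of soup.find_all(), which finds all elements of a given tag. I answered this question from a regex point of view, not a Beautiful Soup one. You can still run re.search with this pattern on each element returned by soup.find_all() and get the result. – bmbigbang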

If the OP is clearly working with BeautifulSoup4 ('import bs4'), I don't think the 3.5.1 documentation is relevant. – Anthon

This is simple; you don't need a regex. Let s be your string (I could not put the whole line here because I don't know how to handle the wrapping).

s = '<a href="....../auction-1033346952.htm......>'
i = s.find('auction-')  # index where 'auction-' starts
j = s[i+8:i+18]  # skip the 8 characters of 'auction-', take the next 10
print(j)
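
One caveat with the slicing approach is that str.find returns -1 when the marker is absent, so the slice would silently return unrelated characters. The following sketch guards against that. The function name auction_id_from and its parameters are assumptions made for this example.

```python
def auction_id_from(s, marker="auction-", length=10):
    """Return the digits that follow marker, or None if they are not there."""
    i = s.find(marker)
    if i == -1:  # str.find signals "not found" with -1
        return None
    start = i + len(marker)
    candidate = s[start:start + length]
    # Accept only a full-length run of ASCII digits.
    if len(candidate) == length and candidate.isascii() and candidate.isdigit():
        return candidate
    return None


link = '<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">'
print(auction_id_from(link))
print(auction_id_from('<a href="/browse/categorylistings.aspx">'))
```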

2016-02-20 08:57:24 Marichyasana

Without regex:

>>> s='<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">' 
>>> s.split('.htm')[0].split('-')[-1] 
'1033346952'

2016-02-20 10:13:32 josifoski
