Trouble with regex lookarounds in Python

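I am writing a short Python script that finds all the URLs pointing to photos hosted on Photobucket in a phpBB forum database and passes them to a download manager (Free Download Manager in my case), so that I can save the images on my local computer and then move them to another host (Photobucket now requires a yearly subscription to embed pictures hosted on its servers in other sites). I managed to write a regular expression with lookarounds that finds all the photos: when I tested it in two text editors that support regex search, it found what I wanted, but in my script it gives me trouble.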

import re 
import os 

main_path = input("Enter a path to the input file:") 
with open(main_path, 'r', encoding="utf8") as file: 
    file_cont = file.read() 
pattern = re.compile(r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))') 
findings = pattern.findall(file_cont) 
for finding in findings: 
    print(finding) 
os.system("pause")

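I tried to debug it by removing the download part and printing all the matches, and I get a long list of ('', '"">') instead of URLs like this one: http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg. Where did I go wrong?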

Source

2017-08-27 Emiliano S.

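Python's regex engine may differ from theirs. I suggest testing it with regex101 (http://www.regex101.com), where you can switch to Python mode. – TemporalWolf

You are right: it worked in the other test systems, but regex101 in Python mode failed to match the string. I will use it in the future. –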

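I think the following version of the regex should work.
Note that I used \" instead of "",
I replaced img src with img.+src to also support img alt="" src, and I used [^\/]+ instead of [^\/]* so that empty path segments are not accepted,
in the last part of the URL I also check that " does not occur,
and instead of requiring > to follow " immediately, I allow optional other characters after the " with .*.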

(?!(<img.+src=\"))http:\/\/i\d{3}\.photobucket\.com\/albums\/[^\/]+\/[^\/]+\/[^\/\"]+(?=\".*/>) 
                        ^^  ^^^

You can use \d\d\d, [0-9]{3}, or \d{3} instead of [0-9][0-9][0-9].

[Regex Demo]
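
A possible explanation for the ('', '"">') output, offered as a sketch rather than a definitive diagnosis: re.findall returns the contents of the capturing groups whenever a pattern has any, and both lookarounds in the question wrap their content in capturing parentheses. Also, (?!...) is a negative lookahead; a lookbehind is written (?<=...). The sample string below is made up, and its doubled quotes are an assumption meant to mirror the dump described in the question.

```python
import re

# Hypothetical sample line; the doubled quotes imitate the database dump.
sample = '<IMG src=""http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg"">'

# Pattern from the question: each lookaround contains a capturing group.
with_groups = re.compile(
    r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com'
    r'\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))'
)

# Variation: a real lookbehind and no capturing groups at all.
without_groups = re.compile(
    r'(?<=<IMG src="")http://i\d{3}\.photobucket\.com'
    r'/albums/[^/]*/[^/]*/[^/"]*(?="">)'
)

# findall returns one tuple of groups per match when groups exist.
tuples = with_groups.findall(sample)
# Without groups, findall returns the whole match.
urls = without_groups.findall(sample)
# finditer + group(0) gives the whole match even when groups exist.
full_matches = [m.group(0) for m in with_groups.finditer(sample)]

print(tuples)
print(urls)
print(full_matches)
```

If a pattern has to keep its capturing groups, iterating with finditer and reading .group(0) is one way to still get the full URL.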

Source

2017-08-27 10:56:18

Your regex pattern is not good.

I am not sure what you are trying to do, but if you need to parse HTML I suggest using BeautifulSoup instead of a regex (because regex cannot really parse HTML).
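
To illustrate the parser-based route without an extra dependency, here is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup (a swapped-in technique, not what the answer itself uses). The class name ImgSrcCollector and the sample HTML are hypothetical, and the sketch assumes ordinary HTML with single double quotes, not the doubled quotes of the dump.

```python
import re
from html.parser import HTMLParser

# Keep only URLs that look like Photobucket album images.
PHOTOBUCKET = re.compile(r'https?://i\d{3}\.photobucket\.com/albums/')


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag that matches PHOTOBUCKET."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value and PHOTOBUCKET.match(value):
                self.urls.append(value)


# Hypothetical input with one Photobucket image and one unrelated image.
html = (
    '<p><IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar" />'
    '<img alt="" src="http://example.com/other.png"></p>'
)

collector = ImgSrcCollector()
collector.feed(html)
print(collector.urls)
```

The parser takes care of attribute order and tag case, so the regex only has to recognise the URL itself.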

But anyway, with a regex, this should work:

r'<IMG src=\"(https?:\/\/i[0-9]{3}\.photobucket\.com\/albums[^\"]+)\"[^>]+\/>'

The https?:\/\/i[0-9]{3}\.photobucket\.com\/albums part filters out non-Photobucket images, and [^\"]+ is more generic: it simply extracts everything up to the closing " of the attribute.

Example:

<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>

With .group(1) this gives:

http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg
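
A short sketch of how this pattern could be used in a script. Because it has exactly one capturing group, re.findall returns the URLs directly as strings. The sample text is made up (the second tag is there only to show that other hosts are skipped), and it assumes single double quotes around the attribute.

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(
    r'<IMG src=\"(https?:\/\/i[0-9]{3}\.photobucket\.com\/albums[^\"]+)\"[^>]+\/>'
)

# Hypothetical input: one Photobucket image and one image from another host.
text = (
    '<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>\n'
    '<IMG src="http://example.com/albums/other.jpg" foo="bar"/>\n'
)

# With a single capturing group, findall yields that group for each match.
urls = pattern.findall(text)
for url in urls:
    print(url)

# search + group(1) gives the same URL for the first match.
first = pattern.search(text).group(1)
```

Note that [^>]+\/> requires at least one character between the closing quote and />, so a tag written as <IMG src="..."> would not match this pattern as it stands.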

Source

2017-08-27 10:35:38 Arount
