2017-08-31 51 views
1

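Regex - getting a value from HTML with a changing middle section

I am trying to write a regex that captures the end of a link, where the only unique identifier is the class value fl. So the regex (as far as I can tell) has to include: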

class=\"fl\" 

account for the changing middle section, where \S+ does not work, and then find and group:

data-href="http://www.twitter.com/(newyorklife) 

where the group is the part in parentheses. The entire string I am trying to parse is:

<g-link class="fl"><a href="/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=32&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwjknIy87oHWAhXHi1QKHXQdAJsQ9zAIyQEwHw&amp;url=http%3A%2F%2Fwww.twitter.com%2Fnewyorklife&amp;usg=AFQjCNHKcAcw6H6cYG3YH1j4V3UOxX1whw" onmousedown="return rwt(this,'','','','32','AFQjCNHKcAcw6H6cYG3YH1j4V3UOxX1whw','','0ahUKEwjknIy87oHWAhXHi1QKHXQdAJsQ9zAIyQEwHw','','',event)" data-href="http://www.twitter.com/newyorklife"><div jsl="$t t-XNwoAoU5dyo;$x 0;" class="r-iBA3fWkVHWLE"><g-img class="_tek"><img id="uid_4" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAIAAAD8GO2jAAABZUlEQVR4AWLQWfWQpmjUAjxo1IJRC2wBpJTDQkVRFIafKBujZrnGjbNd84wHiJNs27btrm3rrFbW1T73m20u/yMsW0cBg6zue5XCYLFQcC41JK0I3PsYaWvC+BkugYFljrbmWPp/H/86FOnhB2hGZbTg/dBhFoEBhsoEAO23Su9+5s/9nA0R/ANtXEgNJTtiAgObfB28gZaKt8Wen2ZarhRgjVL8nagGmetC+IFMb5lgqOtOZAtsLVgjcIhFZqD+RLYj0IFzGCwUcRctc7XgNNcyA7GBhAW+EWvnHK3XCjqDhg3OUpvAEegFTgAdA+nrwnuF4zCw7DSlwqOPscRxUAmtiYqY5NDXImz/6mPprlAP1sDgcjdFLokdCkPGW6Kstmbhtoim2IWNsRsvFXNsjURvBmvgiMROc11S0+BhVvmhFAUDhewrISgbg4/qlyUdeEnl+sBk7SOgfcBSb3jWaKMWjFoAABKespvtvzYlAAAAAElFTkSuQmCC" data-deferred="1" class="_WCg" height="32" width="32" alt="" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)"></g-img></div>Twitter</a></g-link> 

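I don't know whether regex has a way to skip the whole middle section, which contains so many special characters. I played around on pythex.org for a while and could not find a simple way to match the initial value and then skip everything until a specified value. Any ideas?

Edit: I want the string 'newyorklife' as output. It is a changing value, though, so what I really want is the \w+ after twitter.com/. The problem is that class=fl is the only unique identifier on the page (twitter and data-href both appear elsewhere on the page).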

+0

What is your desired output? – Ajax1234

+1

Do I understand correctly that you want to capture the bit from "rwt(" ... to ... "event)"? –

+0

In this example I am trying to get (newyorklife). It will be a changing value, though, so what I want is the \w+ after twitter.com. The only unique value is class=fl. – WolVes

Answers

1

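There is probably a way to do this with a single regex, but it would be pig-ugly and hard to read, so I would solve this in two steps. First, capture the HTML tag that follows class "fl", then find the Twitter handle inside that tag's attributes.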

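var str = document.documentElement.innerHTML;

// Step 1: capture the opening <a ...> tag that follows class="fl">
var anchorMatch = str.match(/class="fl">([^>]+)/);
var handle = null;

if (anchorMatch !== null) {
    // Step 2: read the handle from the URL-encoded href (twitter.com%2F<handle>&)
    var matches = anchorMatch[1].match(/twitter\.com%2F([^&]+)&/);
    if (matches !== null && matches.length > 1) {
        handle = matches[1];
    }
}

console.log(handle);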
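Since the question asks for Python, the same two-step idea can be sketched with the `re` module. This is only a minimal sketch: the `html` string is a shortened, hypothetical stand-in for the real markup, `twitter_handle` is a made-up helper name, and the code assumes the `<a>` tag directly follows `class="fl">` and carries a `data-href` attribute, as in the question.

```python
import re

# Shortened, hypothetical stand-in for the markup in the question.
html = (
    '<a data-href="http://www.twitter.com/someoneelse">other</a>\n'
    '<g-link class="fl"><a href="/url?sa=t&amp;url=http%3A%2F%2Fwww.twitter.com%2Fnewyorklife&amp;usg=abc"\n'
    ' data-href="http://www.twitter.com/newyorklife">Twitter</a></g-link>'
)


def twitter_handle(page):
    """Return the handle from the first class="fl" link, or None."""
    # Step 1: capture the opening <a ...> tag right after class="fl">.
    # [^>]* also matches newlines, so no re.S flag is needed here.
    tag = re.search(r'class="fl">\s*(<a\b[^>]*>)', page)
    if tag is None:
        return None
    # Step 2: search only inside that tag for the data-href handle.
    handle = re.search(r'data-href="https?://(?:www\.)?twitter\.com/(\w+)"', tag.group(1))
    return handle.group(1) if handle else None


print(twitter_handle(html))  # newyorklife
```

Restricting the second search to the captured tag is what keeps the other `data-href` values on the page out of the result.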

+1

Come on, this is not the "Python" that the question asked for! – Jan

+0

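This is the concept I was able to adapt and get working. Although all of your results look good, for whatever reason I could not get them to work. I am particularly confused about why I could not get your regex to work, @ekhumoro – WolVes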

+0

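@WolVes. The only thing I can think of is that your real html data contains newlines, which the '.' operator will not normally match unless the 're.S' flag is set. FWIW, I will update my answer to deal with that. – ekhumoro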

0

You can try this:

import re 
s = """<g-link class="fl"><a href="/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=32&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwjknIy87oHWAhXHi1QKHXQdAJsQ9zAIyQEwHw&amp;url=http%3A%2F%2Fwww.twitter.com%2Fnewyorklife&amp;usg=AFQjCNHKcAcw6H6cYG3YH1j4V3UOxX1whw" onmousedown="return rwt(this,'','','','32','AFQjCNHKcAcw6H6cYG3YH1j4V3UOxX1whw','','0ahUKEwjknIy87oHWAhXHi1QKHXQdAJsQ9zAIyQEwHw','','',event)" data-href="http://www.twitter.com/newyorklife"><div jsl="$t t-XNwoAoU5dyo;$x 0;" class="r-iBA3fWkVHWLE"><g-img class="_tek"><img id="uid_4" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAIAAAD8GO2jAAABZUlEQVR4AWLQWfWQpmjUAjxo1IJRC2wBpJTDQkVRFIafKBujZrnGjbNd84wHiJNs27btrm3rrFbW1T73m20u/yMsW0cBg6zue5XCYLFQcC41JK0I3PsYaWvC+BkugYFljrbmWPp/H/86FOnhB2hGZbTg/dBhFoEBhsoEAO23Su9+5s/9nA0R/ANtXEgNJTtiAgObfB28gZaKt8Wen2ZarhRgjVL8nagGmetC+IFMb5lgqOtOZAtsLVgjcIhFZqD+RLYj0IFzGCwUcRctc7XgNNcyA7GBhAW+EWvnHK3XCjqDhg3OUpvAEegFTgAdA+nrwnuF4zCw7DSlwqOPscRxUAmtiYqY5NDXImz/6mPprlAP1sDgcjdFLokdCkPGW6Kstmbhtoim2IWNsRsvFXNsjURvBmvgiMROc11S0+BhVvmhFAUDhewrISgbg4/qlyUdeEnl+sBk7SOgfcBSb3jWaKMWjFoAABKespvtvzYlAAAAAElFTkSuQmCC" data-deferred="1" class="_WCg" height="32" width="32" alt="" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)"></g-img></div>Twitter</a></g-link>"""
r = r'data-href="http://www\.twitter\.com/(.*?)"'  # capture everything up to the closing quote
data = re.findall(r, s) 

print(data) 

Output:

['newyorklife'] 
+0

newyorklife is not a constant, and twitter often appears as a link elsewhere. So the match has to rely on the unique identifier class=fl. – WolVes

+0

@WolVes Please see my latest edit. – Ajax1234

1

No regex needed; use a decent parser instead:

from bs4 import BeautifulSoup 

html = """<g-link class="fl"><a href="/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=32&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwjknIy87oHWAhXHi1QKHXQdAJsQ9zAIyQEwHw&amp;url=http%3A%2F%2Fwww.twitter.com%2Fnewyorklife&amp;usg=AFQjCNHKcAcw6H6cYG3YH1j4V3UOxX1whw" onmousedown="return rwt(this,'','','','32','AFQjCNHKcAcw6H6cYG3YH1j4V3UOxX1whw','','0ahUKEwjknIy87oHWAhXHi1QKHXQdAJsQ9zAIyQEwHw','','',event)" data-href="http://www.twitter.com/newyorklife"><div jsl="$t t-XNwoAoU5dyo;$x 0;" class="r-iBA3fWkVHWLE"><g-img class="_tek"><img id="uid_4" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAIAAAD8GO2jAAABZUlEQVR4AWLQWfWQpmjUAjxo1IJRC2wBpJTDQkVRFIafKBujZrnGjbNd84wHiJNs27btrm3rrFbW1T73m20u/yMsW0cBg6zue5XCYLFQcC41JK0I3PsYaWvC+BkugYFljrbmWPp/H/86FOnhB2hGZbTg/dBhFoEBhsoEAO23Su9+5s/9nA0R/ANtXEgNJTtiAgObfB28gZaKt8Wen2ZarhRgjVL8nagGmetC+IFMb5lgqOtOZAtsLVgjcIhFZqD+RLYj0IFzGCwUcRctc7XgNNcyA7GBhAW+EWvnHK3XCjqDhg3OUpvAEegFTgAdA+nrwnuF4zCw7DSlwqOPscRxUAmtiYqY5NDXImz/6mPprlAP1sDgcjdFLokdCkPGW6Kstmbhtoim2IWNsRsvFXNsjURvBmvgiMROc11S0+BhVvmhFAUDhewrISgbg4/qlyUdeEnl+sBk7SOgfcBSb3jWaKMWjFoAABKespvtvzYlAAAAAElFTkSuQmCC" data-deferred="1" class="_WCg" height="32" width="32" alt="" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)"></g-img></div>Twitter</a></g-link>""" 

soup = BeautifulSoup(html, 'html5lib') 

# select one 
user = soup.select_one('.fl > a')["data-href"].split('/')[-1] 
print(user) 
# newyorklife 

To select multiple links, use soup.findAll(); see the documentation for more information.
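If installing BeautifulSoup and html5lib is not an option, the multi-link case can also be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the approach used in the answer above. The class name `FlLinkParser` and the `sample` markup are hypothetical, and unlike the `.fl > a` selector this sketch collects every `<a>` nested anywhere inside a `class="fl"` element, not only direct children.

```python
from html.parser import HTMLParser


class FlLinkParser(HTMLParser):
    """Collect the last path segment of data-href on <a> tags inside class="fl" elements."""

    def __init__(self):
        super().__init__()
        self.fl_tag = None   # tag name of the currently open class="fl" element
        self.depth = 0       # nesting depth of same-named tags inside it
        self.handles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.fl_tag is None:
            # Not inside a class="fl" element yet: check whether this tag opens one.
            if "fl" in (attrs.get("class") or "").split():
                self.fl_tag, self.depth = tag, 1
            return
        if tag == self.fl_tag:
            self.depth += 1
        if tag == "a" and attrs.get("data-href"):
            self.handles.append(attrs["data-href"].rstrip("/").split("/")[-1])

    def handle_endtag(self, tag):
        if tag == self.fl_tag:
            self.depth -= 1
            if self.depth == 0:
                self.fl_tag = None


# Hypothetical markup: one link outside .fl and two inside.
sample = (
    '<a data-href="http://www.twitter.com/ignored">not inside .fl</a>'
    '<g-link class="fl"><a href="#" data-href="http://www.twitter.com/newyorklife">Twitter</a></g-link>'
    '<g-link class="fl"><a href="#" data-href="http://www.twitter.com/example_user">Twitter</a></g-link>'
)

parser = FlLinkParser()
parser.feed(sample)
parser.close()
print(parser.handles)  # ['newyorklife', 'example_user']
```

With BeautifulSoup itself, `soup.select('.fl > a')` returns all matching tags as a list, so the same `["data-href"].split('/')[-1]` step can be applied to each one in a loop.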

1

Here is a working regex:

>>> r = re.compile(r'\bclass="fl".*?\bdata-href="http://www\.twitter\.com/(\w+)"', re.S) 
>>> r.search(s).group(1) 
'newyorklife' 

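The key concept here is non-greedy matching. Since there may be several data-href attributes on the page, you must take care to find the first occurrence after class="fl" has been matched. The .*? expression is therefore used to match as few characters as possible before attempting to match the next data-href. The re.S flag lets the dot match newlines as well, in case the real HTML spans several lines.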
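To see why both the non-greedy quantifier and the `re.S` flag matter, here is a small comparison on a made-up multi-line page. The `page` string and the variable names are assumptions for illustration only; the pattern itself is the one from the answer.

```python
import re

# Hypothetical multi-line page with one unrelated link and two class="fl" links.
page = '''<a data-href="http://www.twitter.com/ignored">x</a>
<g-link class="fl"><a href="/url?a=1"
   data-href="http://www.twitter.com/newyorklife">Twitter</a></g-link>
<g-link class="fl"><a href="/url?a=2"
   data-href="http://www.twitter.com/example_user">Twitter</a></g-link>'''

# Non-greedy with re.S: stops at the first data-href after class="fl".
lazy = re.compile(r'\bclass="fl".*?\bdata-href="http://www\.twitter\.com/(\w+)"', re.S)
# Greedy with re.S: runs on to the last data-href on the page.
greedy = re.compile(r'\bclass="fl".*\bdata-href="http://www\.twitter\.com/(\w+)"', re.S)
# Non-greedy without re.S: the dot cannot cross the line break inside the tag.
no_flag = re.compile(r'\bclass="fl".*?\bdata-href="http://www\.twitter\.com/(\w+)"')

print(lazy.search(page).group(1))    # newyorklife
print(greedy.search(page).group(1))  # example_user
print(no_flag.search(page))          # None
print(lazy.findall(page))            # ['newyorklife', 'example_user']
```

The last line shows that the same non-greedy pattern also works with `findall` when the page contains several `class="fl"` links.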