2017-04-15 76 views
0

Web scraping - How to get a specific part of a web link

I have the following link: https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk

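I have a dataset of many links, and every link follows the same pattern. I want to extract a specific part of each link. For the link above, that part is https://cooking.nytimes.com/learn-to-cook, that is, the text starting at the second http and ending just before the first + sign.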

I don't know how to use regular expressions. I'm working in Python. Please help.

Answers

0

If every link follows the same pattern, you don't need a regular expression. You can use str.find() and string slicing:

link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk" 

# This finds the second occurrence of "https://" and returns the position 
second_https = link.find("https://", link.find("https://")+1) 
# Index of the end of the link 
end_of_link = link.find("+") 

new_link = link[second_https:end_of_link] 

print(new_link) 

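This returns "https://cooking.nytimes.com/learn-to-cook". As described, it works as long as every link follows the same pattern (the part you want starts at the second https:// and ends at the + sign).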
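Since the question mentions a whole dataset of links, the same find-and-slice idea could be wrapped in a small function. This is only a sketch: the name extract_cached_url, the sample list, and the choice to return None when a link does not match the pattern are assumptions made for illustration, not part of the answer above.

```python
def extract_cached_url(link):
    """Return the text from the second "https://" up to the next "+", or None."""
    first = link.find("https://")
    if first == -1:
        return None
    # Start searching one character past the first match to find the second one
    second = link.find("https://", first + 1)
    if second == -1:
        return None
    # Look for "+" only after the second "https://"
    end = link.find("+", second)
    if end == -1:
        end = len(link)
    return link[second:end]


# Hypothetical sample data: the link from the question plus one that does not fit
links = [
    "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk",
    "https://example.com/no-second-link",
]
results = [extract_cached_url(link) for link in links]
print(results)
```

Checking for -1 matters because str.find() returns -1 when nothing is found, and slicing with -1 would silently produce a wrong substring instead of an error.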

0

I would go with urlparse (Python 2) or urllib.parse.urlparse (Python 3) and a little bit of regex:

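import re
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
parsed = urlparse(url_example)
# parsed.query is everything after "?"; take the text from the inner URL onward,
# then keep only what comes before the first "+"
result = re.findall(r'https?.*', parsed.query)[0].split('+')[0]
print(result)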

Output:

https://cooking.nytimes.com/learn-to-cook
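For anyone who prefers a regex-only route, a single pattern with a capture group might also do the job. This is a sketch that assumes every link has the shape "cache:<id>:<target URL>+..." seen in the example; the names CACHED_URL and cached_target are made up for illustration.

```python
import re

# Assumed link shape: "cache:" + an id without colons + ":" + the target URL,
# which ends at the first "+" or "&"
CACHED_URL = re.compile(r"cache:[^:]+:(https?://[^+&]+)")


def cached_target(link):
    """Return the captured target URL, or None if the link does not match."""
    match = CACHED_URL.search(link)
    return match.group(1) if match else None


link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
print(cached_target(link))
```

The parentheses mark the part to keep, and [^+&]+ stops the match at the first + or &, so no extra split is needed.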