Web scraping - how to get a specific part of a web link

I have the following link: https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk

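I have a dataset of many links, and every link follows the same pattern. I want to extract a specific part of each link; for the link above, that is the part shown in bold. In other words, I want the text that starts at the second http and ends just before the first + sign.

I don't know how to use regular expressions. I am working in Python. Please help me.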

Source

2017-04-15 Ali Hamza

If every link follows the same pattern, you don't need a regular expression. You can use str.find() and string slicing:

link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk" 

# Find the second occurrence of "https://" by starting the search just past the first one
second_https = link.find("https://", link.find("https://") + 1)
# The wanted part ends at the first "+" that follows the embedded link
end_of_link = link.find("+", second_https)

new_link = link[second_https:end_of_link] 

print(new_link)

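This returns "https://cooking.nytimes.com/learn-to-cook" and works as long as the links follow the pattern described (the wanted part starts at the second https:// in the link and ends at the + sign).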
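Since the question mentions a whole dataset of links, the same find-and-slice idea can be wrapped in a small helper and applied to each link. The following is only a sketch: the function name `extract_inner_link`, its parameters, and the second sample link are hypothetical and not from the original answer. It assumes that returning None is an acceptable way to flag links that do not contain a second "https://".

```python
def extract_inner_link(link, scheme="https://", terminator="+"):
    """Return the text from the second `scheme` up to the next `terminator`.

    Returns None when the link does not contain `scheme` twice.
    (Hypothetical helper, written for illustration.)
    """
    first = link.find(scheme)
    if first == -1:
        return None
    second = link.find(scheme, first + 1)
    if second == -1:
        return None
    # Look for the terminator only after the embedded link starts
    end = link.find(terminator, second)
    if end == -1:
        # No terminator: take everything to the end of the string
        end = len(link)
    return link[second:end]


links = [
    "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk",
    "https://example.com/no-embedded-link",  # made-up link without an embedded URL
]
results = [extract_inner_link(link) for link in links]
print(results)
```

Checking for -1 matters because str.find() returns -1 when nothing is found, and using -1 directly as a slice bound would silently cut off the last character instead of raising an error.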

Source

2017-04-15 17:34:13

I would go with urlparse (Python 2) or urllib.parse (Python 3) and a little bit of regex:

import re 
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk" 
parsed = urlparse(url_example) 
result = re.findall('https?.*', parsed.query)[0].split('+')[0] 
print(result)

Output:

https://cooking.nytimes.com/learn-to-cook
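For comparison, the extraction can also be done with a single regular expression and no URL parsing. This is a sketch under assumptions rather than part of the answer above: the names `PATTERN` and `extract_with_regex` are hypothetical, and the pattern assumes the wanted part starts at the second "http://" or "https://" and contains no "+" character.

```python
import re

# Match the first scheme, lazily skip ahead to the next "http://" or "https://",
# then capture everything up to (but not including) the first "+".
PATTERN = re.compile(r"https?://.*?(https?://[^+]*)")


def extract_with_regex(link):
    """Return the embedded URL, or None if the link has no second scheme.

    (Hypothetical helper, written for illustration.)
    """
    match = PATTERN.search(link)
    return match.group(1) if match else None


url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
print(extract_with_regex(url_example))
```

The lazy `.*?` makes the capture group start at the earliest scheme after the first one, while `[^+]*` stops the capture at the first plus sign.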

Source

2017-04-15 18:18:13
