Python regex to remove the unwanted part of a URL

I have URLs like this one, and the first part keeps changing:

http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

I want to strip off the changing first part so that only this remains:

http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

What regular expression would I use to remove everything before it?

I can't use `startswith()`, because the `usg` value in the URL changes.

2013-11-10 user2270029

What's wrong with [`urlparse.parse_qs()`](http://docs.python.org/2/library/urlparse.html#urlparse.parse_qs)?

@MartijnPieters Are you going to post that as an answer...? :)

@Jon: Done now...

Use the right tool for the job; use the urlparse module to parse the query string:

import urlparse 

qs = urlparse.urlsplit(inputurl).query 
url = urlparse.parse_qs(qs).get('url', [None])[0]

This sets url to None if the query string has no url= element, and to the value of that element otherwise.

Demo:

>>> import urlparse 
>>> inputurl = 'http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/' 
>>> qs = urlparse.urlsplit(inputurl).query 
>>> urlparse.parse_qs(qs).get('url', [None])[0] 
'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
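The urlparse module exists only in Python 2; in Python 3 the same functions live in urllib.parse. A minimal sketch of the same approach under that assumption (the helper name extract_target is hypothetical and not part of the answer above):

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(inputurl):
    """Return the decoded value of the url= query parameter, or None if absent."""
    qs = urlsplit(inputurl).query
    # parse_qs maps each key to a list of values; take the first one.
    return parse_qs(qs).get('url', [None])[0]


inputurl = ('http://news.google.com/news/url?sa=t&fd=R'
            '&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg'
            '&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/'
            'dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/')
print(extract_target(inputurl))
```

Because the lookup goes through .get('url', [None]), a URL without a url= parameter yields None instead of raising KeyError.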

2013-11-10 01:58:00

Why not

print data.split("&url=", 1)[1].split("&", 1)[0]

Sample run

data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
print data.split("&url=", 1)[1].split("&", 1)[0]

Output

http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

2013-11-10 01:51:09 thefourtheye

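Apart from the fact that there's a better solution: you may also want to limit the split to a single split, or use partition

@JonClements Updated my answer to limit the split. Please check now – thefourtheye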
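The partition alternative mentioned in the comment could look roughly like the sketch below. The function name after_url_param is hypothetical, and like the split version it returns the raw, still-encoded text rather than a decoded value:

```python
def after_url_param(data):
    """Return the text between '&url=' and the next '&', or None if '&url=' is missing."""
    # partition splits at the first occurrence only and always returns a 3-tuple.
    _, sep, rest = data.partition('&url=')
    if not sep:
        # '&url=' was not found, so there is nothing to extract.
        return None
    return rest.partition('&')[0]


print(after_url_param('http://news.google.com/news/url?sa=t&url=http://www.washingtonpost.com/&x=1'))
```

Unlike split(...)[1], partition never raises IndexError when the separator is absent, which makes the missing-parameter case explicit.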

This will work fine:

url = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"

In [148]: url.split('&url=')[1] 
Out[148]: 'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'

I would use urlparse.parse_qs(url), as @MartijnPieters mentioned in the comments.

2013-11-10 01:54:33 moenad

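Note that what is to the right of "&url=" is not a URL. It is a URL-encoded URL. So, for example, if the original URL contains "&", the encoded value will contain "%26". Using it without decoding happens to work here, but for many sites that is not guaranteed.

As Martijn suggested, this will always work: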

import urlparse 
data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/" 
o = urlparse.urlparse(data) 
q = urlparse.parse_qs(o.query) 
print q['url']
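To illustrate the encoding point, here is a small Python 3 sketch (urllib.parse instead of the Python 2 urlparse module). The target URL and the way the wrapper URL is assembled are made up for the example; they are assumptions, not taken from a real Google News link:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical target that itself contains '?' and '&'.
target = 'http://example.com/search?q=a&page=2'

# Build a wrapper URL the way a redirect service might: the target is percent-encoded.
wrapped = 'http://news.google.com/news/url?' + urlencode({'sa': 't', 'url': target})

# Plain string splitting returns the still-encoded text.
raw = wrapped.split('&url=', 1)[1].split('&', 1)[0]

# parse_qs decodes the percent-escapes and recovers the original target.
decoded = parse_qs(urlsplit(wrapped).query)['url'][0]

print(raw)
print(decoded)
```

In this sketch the split result still carries escapes such as %26 and would need a separate decoding step, while parse_qs returns the target unchanged.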

2013-11-10 02:00:02
