A better way to use re.sub

I am cleaning up a series of sources from a Twitter stream. Here is an example of the data:

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
      '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 
      '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web', 
      '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 
      '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>'] 


import re 
for i in source: 
    print(re.sub(r'<.*?>', '', re.sub(r'(<.*?>)(Twitter for)(\s+)', r'', i)))

### This would be the expected output ### 
'Android Tablets' 
'Android' 
'foursquare' 
'web' 
'iPhone' 
'BlackBerry'

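The code above is what I have. It does the job, but it looks horrible. I was hoping there was a better way of doing this, using re.sub() or another function that might be more suitable.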

2014-05-07 marbel

s[s.index('>')+1:s.rindex('<')]. By the way: instead of '.*?' I would use '[^>]*'. – Bakuriu

Nice code golf! :) – TML

@Bakuriu thanks for the comment. What is the explanation of '[^>]*'? – marbel

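Here are some suggestions to improve your code:

compile the regular expression, so that it is not processed again each time you apply it,
use raw strings, to avoid any interpretation of the regex by Python's string handling,
use a regex that matches anything but the tag's closing character inside the tag,
you do not need to repeat the substitution, since re.sub replaces every match by default.

Here is a simpler, better result: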

>>> import re 
>>> r = re.compile(r'<[^>]+>') 
>>> for it in source: 
...  r.sub('', it) 
... 
'Twitter for Android Tablets' 
'Twitter for Android' 
'foursquare' 
'web' 
'Twitter for iPhone' 
'Twitter for BlackBerry'
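
On why '[^>]+' is suggested rather than '.*?': a minimal sketch, assuming a tag that happens to contain a line break. The sample string is made up for illustration and is not part of the question's data. By default '.' does not match a newline, so '<.*?>' cannot match such a tag, while the negated class '[^>]' matches any character except '>', newlines included.

```python
import re

# Hypothetical input: an opening tag split across two lines.
sample = '<a\nhref="http://foursquare.com">foursquare</a>'

lazy_dot = re.compile(r'<.*?>')    # '.' does not match '\n' by default
negated = re.compile(r'<[^>]+>')   # any character except '>', newline included

lazy_result = lazy_dot.sub('', sample)      # only '</a>' is removed
negated_result = negated.sub('', sample)    # both tags are removed

print(repr(lazy_result))
print(repr(negated_result))
```

The negated class also states the intent directly: a tag ends at the first '>'.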

Note: for your use case, the best solution is @Bakuriu's suggestion:

>>> for it in source: 
...  it[it.index('>')+1:it.rindex('<')] 
'Twitter for Android Tablets' 
'Twitter for Android' 
'foursquare' 
'Twitter for iPhone' 
'Twitter for BlackBerry'

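It adds no significant overhead and uses basic, fast string operations. However, this solution takes only what is between the tags instead of removing them, which may have side effects if there are tags inside the <a> and </a>, or no tags at all; i.e. it will not work for the 'web' string, where index raises ValueError. A solution that also handles the case with no tags at all: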

>>> for it in source: 
...  if '>' in it and '<' in it: 
...   it[it.index('>')+1:it.rindex('<')] 
...  else: 
...   it 
'Twitter for Android Tablets' 
'Twitter for Android' 
'foursquare' 
'web' 
'Twitter for iPhone' 
'Twitter for BlackBerry'
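
The side effect mentioned above can be seen with a made-up entry (not from the question's data) that has a tag nested inside the <a> element. This is only a sketch comparing the two approaches; slice_between is a hypothetical helper name.

```python
import re

TAG = re.compile(r'<[^>]+>')

def slice_between(s):
    """Return the text between the first '>' and the last '<', or s unchanged."""
    if '>' in s and '<' in s:
        return s[s.index('>') + 1:s.rindex('<')]
    return s

# Hypothetical entry with a nested <b> tag.
nested = '<a href="http://example.com" rel="nofollow"><b>Twitter</b> for Android</a>'

sliced = slice_between(nested)   # slicing keeps the inner markup
stripped = TAG.sub('', nested)   # the regex removes every tag

print(sliced)
print(stripped)
print(slice_between('web'))
```

Slicing is fine as long as the entries contain a single, flat <a> element; the regex is the safer choice if that cannot be guaranteed.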

2014-05-07 16:50:16 zmo

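+1 for the regex solution. Bakuriu's does not work because of the 'web' case; it has no '<' or '>'. Still, it is interesting to know, as I am very new to Python. – marbel

I will use the following: r = re.compile(r'(<[^>]+>)|(Twitter for\s+)') in order to get rid of the Twitter part. – marbel

One option, if the text really is in this consistent format, is to use plain string operations instead of a regex: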

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
      '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 
      '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web', 
      '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 
      '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>'] 

for i in source: 
    print(i.partition('>')[-1].rpartition('<')[0])

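This code finds the first '>' in the string, takes everything after it, then finds the last '<' in what remains and returns everything before it; i.e., it gives you whatever text lies between the first '>' and the last '<'.

There is also the more minimal version @Bakuriu posted in a comment, which is probably better than mine!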
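
To make those steps concrete, here is a small sketch of what partition and rpartition return at each stage, using one entry from the list above. Note that when the separator is missing, partition returns the whole string followed by two empty strings, so the one-liner yields an empty string for 'web'. The fallback in the hypothetical helper between_tags is an addition for illustration, not part of the answer.

```python
entry = '<a href="http://foursquare.com" rel="nofollow">foursquare</a>'

# partition splits at the first '>' into (head, separator, tail).
head, sep, tail = entry.partition('>')
# tail is now 'foursquare</a>'

# rpartition splits at the last '<' into (head, separator, tail).
text, sep2, rest = tail.rpartition('<')
# text is 'foursquare', rest is '/a>'

# With no '>' in the string, the tail is empty, so the one-liner returns ''.
no_tag = 'web'.partition('>')[-1].rpartition('<')[0]

def between_tags(s):
    """Hypothetical helper: return s itself when nothing was extracted."""
    inner = s.partition('>')[-1].rpartition('<')[0]
    return inner if inner else s

print(text)
print(repr(no_tag))
print(between_tags('web'))
```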

2014-05-07 16:44:26 TML

This looks less ugly to me and should work the same:

import re 
for i in source: 
    print(re.sub(r'(<.*?>)|(Twitter for\s+)', '', i))
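
A variant of the same idea that also applies the earlier advice (compiled pattern, raw string, negated character class) could look like the sketch below. The names CLEANUP and clean_source are made up for illustration, and the source list is abbreviated from the question.

```python
import re

# One alternation: either an HTML tag or the 'Twitter for ' prefix.
CLEANUP = re.compile(r'<[^>]+>|Twitter for\s+')

def clean_source(s):
    """Strip tags and the 'Twitter for ' prefix from one source string."""
    return CLEANUP.sub('', s)

source = [
    '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>',
    '<a href="http://foursquare.com" rel="nofollow">foursquare</a>',
    'web',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
]

cleaned = [clean_source(s) for s in source]
print(cleaned)
```

Strings without any markup, such as 'web', pass through unchanged because neither alternative matches.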

2014-05-07 16:47:45

Just another option, using the BeautifulSoup HTML parser:

>>> from bs4 import BeautifulSoup 
>>> for link in source: 
...  print(BeautifulSoup(link, 'html.parser').text.replace('Twitter for', '').strip()) 
... 
Android Tablets 
Android 
foursquare 
web 
iPhone 
BlackBerry

2014-05-07 16:49:06 alecxe

If you are doing a lot of these, use a library designed to handle (X)HTML. lxml works well, but I am more familiar with the BeautifulSoup wrapper.

from bs4 import BeautifulSoup 

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
     '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 
     '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web', 
     '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 
     '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>'] 

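soup = BeautifulSoup('\n'.join(source), 'html.parser')  # name the parser explicitly
# Only text inside <a> tags is printed; the bare 'web' entry is skipped.
for tag in soup.find_all('a'): 
    print(tag.text)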

Although this might be overkill for your use case.

2014-05-07 16:49:58