2015-11-26 21 views
0

Using Python regex to alter variations of strings?

I have a list of strings that follow this general pattern:

X (a, b, c, d) 

where:

X is the item description string

a, b, c, d are comma-separated values: text, symbols, and numbers, with some variation.

I'm trying to turn it into this, removing the parentheses and the text outside them:

a, b, c, d 

I have noticed some horrible variations in the input:

# ideal input 
items (lcd, cardboard, hats on rack, keyboard cat) 

# Sometimes missing/extra space (both outside text and inside) 
items(lcd , cardboard,hats on rack , keyboard cat) 

# Outside text may contain other symbols and words 
items & descrips: (lcd, cardboard, hats on rack, keyboard cat) 

# Inner text may contain parenthesis, brackets, other enclosures 
descriptions & items: (lcd (for computer), cardboard {brown & white colored}, hats on rack, keyboard cat[dept. 11]) 

# Parent parenthesis may not be closed 
items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11) 

# Using semi-colons instead of commas 
item (lcd; cardboard; hats on rack; keyboard cat) 

# Some text have non-ascii characters 
item (lcd\u2122, cardboard) 

The ideal output would be:

lcd, cardboard, hats on rack, keyboard cat 

A point of clarification:

(1) Any inner enclosures (and their contents) should be removed

i.e.:

descriptions & items: (lcd (for computer), cardboard {brown & white colored}, hats on rack, keyboard cat[dept. 11]) 

should become:

lcd, cardboard, hats on rack, keyboard cat 

What is the appropriate regex for this? With my limited regex skills, the different variations make this very difficult.

Sample input array:

a = [ 
"items (lcd, cardboard, hats on rack, keyboard cat)", 
"items(lcd , cardboard,hats on rack , keyboard cat)", 
"items & descrips: (lcd, cardboard, hats on rack, keyboard cat)", 
"descriptions & items: (lcd (for computer), cardboard {brown & white colored}, hats on rack, keyboard cat[dept. 11])", 
"items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)", 
"items: (lcd, cardboard, hats on rack, keyboard cat [dept. 11]", 
"item (lcd; cardboard; hats on rack; keyboard cat)", 
u"item (lcd\u2122, cardboard)" 
] 
+1

What about 'items (lcd, cardboard) other stuff (can have parens too)'? –

+0

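@KevinGuan That's a special case. The trailing text after 'item (lcd, cardboard)' shouldn't be valid, so just ignore the whole string (leave it as is). I'll deal with it at a later stage. I guess that, technically, the regex shouldn't even consider it (and so should ignore the string completely). I removed that part from the post. – rublex

Answers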

2

Hmm... I'm not sure whether this is what you want, but it works fine if a looks like your example list:

import re 

a = [ 
"items (lcd, cardboard, hats on rack, keyboard cat)", 
"items(lcd , cardboard,hats on rack , keyboard cat)", 
"items & descrips: (lcd, cardboard, hats on rack, keyboard cat)", 
"descriptions & items: (lcd (for computer), cardboard {brown & white colored}, hats on rack, keyboard cat[dept. 11])", 
"items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)", 
"items: (lcd, cardboard, hats on rack, keyboard cat [dept. 11]", 
"item (lcd; cardboard; hats on rack; keyboard cat)", 
u"item (lcd\u2122, cardboard)" 
] 

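for item in a:
    # 1. Capture everything after the first "(" (the parent parenthesis may be unclosed).
    text = re.search(r'\((.*)', item).group(1)
    # 2. Drop inner (...), [...] and {...} enclosures together with their contents.
    text = re.sub(r'\(.+?\)|\[.+?\]|\{.+?\}', '', text)
    # 3. Normalize "," / ";" separators and the spaces around them to ", ".
    text = re.sub(r' *[,;] *', ', ', text).strip()
    # 4. Remove the parent's closing ")" at the end of the line, if present.
    if text.endswith(')'):
        text = text[:-1]
    # 5. Ignore strings that still contain any bracket character.
    if not re.search(r'[()\[\]{}]', text):
        print(text)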

Output:

lcd, cardboard, hats on rack, keyboard cat 
lcd, cardboard, hats on rack, keyboard cat 
lcd, cardboard, hats on rack, keyboard cat 
lcd, cardboard, hats on rack, keyboard cat 
lcd, cardboard, hats on rack, keyboard cat 
lcd, cardboard, hats on rack, keyboard cat 
lcd, cardboard, hats on rack, keyboard cat 
lcd™, cardboard 

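So this does the following:

  1. Matches (<text> in the string (since, as you said, the parent parenthesis may not be closed).

  2. Uses re.sub() to remove (<string>), [<string>] and {<string>} inside <text>.

  3. Reformats the result into something readable: the pattern " *[,;] *" matches every "," or ";" together with the spaces around it, and each match is replaced with ", ".

  4. Removes the ) at the end of the line... if there is one.

  5. If <text> still contains some brackets, as in the case I asked about in the comments (did you remove that example from the new list? Okay, I'll keep this step anyway), ignores it.

  6. Prints the resulting string (you could also put the results in a list... if you prefer).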
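As step 6 suggests, the results can be collected in a list instead of printed. The following is only a minimal sketch of the same six steps wrapped in a function. The name clean_items, the choice to return None for ignored strings, and treating a string with no "(" as ignored are assumptions made for illustration, not part of the answer above.

```python
import re

def clean_items(line):
    """Hypothetical helper: return the cleaned item list, or None to ignore the line."""
    match = re.search(r'\((.*)', line)           # step 1: text after the first "("
    if match is None:                            # assumption: no "(" means "ignore"
        return None
    # step 2: drop inner (...), [...] and {...} enclosures with their contents
    text = re.sub(r'\(.+?\)|\[.+?\]|\{.+?\}', '', match.group(1))
    # step 3: normalize "," / ";" separators to ", "
    text = re.sub(r' *[,;] *', ', ', text).strip()
    # step 4: remove the parent's closing ")" if present
    if text.endswith(')'):
        text = text[:-1].rstrip()
    # step 5: leftover brackets mean the line does not fit the pattern
    if re.search(r'[()\[\]{}]', text):
        return None
    return text

samples = [
    "items(lcd , cardboard,hats on rack , keyboard cat)",
    "items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)",
    "items (lcd, cardboard) other stuff (can have parens too)",  # trailing text
]
cleaned = [clean_items(s) for s in samples]
print(cleaned)
```

The third sample is the trailing-text case raised in the comments: after step 2 a stray ")" is still left in the middle of the text, so step 5 rejects the line and the function returns None, leaving the caller free to keep the original string as is.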
