Using Python regular expressions to normalize variants of a string?
I have a list of strings that follow this general pattern:
X (a, b, c, d)
where:
X is the item description string
a, b, c, d are comma-separated values, some containing text, some symbols, some numbers.
I am trying to reduce each string to the following, removing the parentheses and the text outside them:
a, b, c, d
I have noticed some nasty variations in the input:
# ideal input
items (lcd, cardboard, hats on rack, keyboard cat)
# Sometimes missing/extra space (both outside text and inside)
items(lcd , cardboard,hats on rack , keyboard cat)
# Outside text may contain other symbols and words
items & descrips: (lcd, cardboard, hats on rack, keyboard cat)
# Inner text may contain parenthesis, brackets, other enclosures
descriptions & items: (lcd (for computer), cardboard {brown & white colored}, hats on rack, keyboard cat[dept. 11])
# Parent parenthesis may not be closed
items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)
# Using semi-colons instead of commas
item (lcd; cardboard; hats on rack; keyboard cat)
# Some text have non-ascii characters
item (lcd\u2122, cardboard)
The ideal output would be:
lcd, cardboard, hats on rack, keyboard cat
A few points of clarification:
(1) Any inner enclosure (and its contents) should be removed,
i.e.:
descriptions & items: (lcd (for computer), cardboard {brown & white colored}, hats on rack, keyboard cat[dept. 11])
should become:
lcd, cardboard, hats on rack, keyboard cat
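One possible way to express rule (1) is to delete innermost enclosures repeatedly until none are left, which also copes with nesting. This is only a minimal sketch: the names INNER and strip_enclosures are made up for illustration, and it assumes the text passed in is the part already taken from inside the parent parentheses.

```python
import re

# Hypothetical helper pattern: one innermost (...), [...] or {...}
# (no brackets nested inside it), plus any whitespace right before it.
INNER = re.compile(r"\s*(?:\([^()\[\]{}]*\)|\[[^()\[\]{}]*\]|\{[^()\[\]{}]*\})")

def strip_enclosures(text):
    """Delete innermost enclosures repeatedly until the text stops changing."""
    previous = None
    while previous != text:
        previous = text
        text = INNER.sub("", text)
    return text

inner = ("lcd (for computer), cardboard {brown & white colored}, "
         "hats on rack, keyboard cat[dept. 11]")
print(strip_enclosures(inner))
# -> lcd, cardboard, hats on rack, keyboard cat
```

The loop is needed because a single pass only removes enclosures that contain no other brackets; an outer enclosure becomes removable on the next pass.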
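What would be an appropriate regular expression for this? With my limited regex skills, the number of variants makes this very difficult.
Sample input array: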
a = [
"items (lcd, cardboard, hats on rack, keyboard cat)",
"items(lcd , cardboard,hats on rack , keyboard cat)",
"items & descrips: (lcd, cardboard, hats on rack, keyboard cat)",
"descriptions & items: (lcd (for computer), cardboard {brown & white colored}, hats on rack, keyboard cat[dept. 11])",
"items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)",
"items: (lcd, cardboard, hats on rack, keyboard cat [dept. 11]",
"item (lcd; cardboard; hats on rack; keyboard cat)",
u"item (lcd\u2122, cardboard)"
]
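For illustration, here is a rough sketch of how the sample strings above might be normalized. It is not a single regular expression but a small function that combines two patterns with ordinary string methods. The names ENCLOSURE, SEPARATOR and normalize are hypothetical, and the sketch assumes that everything after the first "(" belongs to the item list. Strings that still contain a bracket after cleanup (for example trailing text after the closing parenthesis) are returned unchanged, which is an assumption about the desired behavior.

```python
import re

# One innermost (...), [...] or {...}, plus whitespace right before it.
ENCLOSURE = re.compile(r"\s*(?:\([^()\[\]{}]*\)|\[[^()\[\]{}]*\]|\{[^()\[\]{}]*\})")
# Items may be separated by commas or semicolons.
SEPARATOR = re.compile(r"[,;]")

def normalize(s):
    """Return the cleaned item list, or s unchanged if it does not fit."""
    _, paren, body = s.partition("(")   # drop the text before the first "("
    if not paren:
        return s                        # no parenthesis at all
    previous = None
    while previous != body:             # remove inner enclosures, innermost first
        previous = body
        body = ENCLOSURE.sub("", body)
    body = body.strip()
    if body.endswith(")"):              # parent parenthesis, if it was closed
        body = body[:-1]
    if re.search(r"[()\[\]{}]", body):  # leftover bracket: unexpected shape
        return s
    parts = [p.strip() for p in SEPARATOR.split(body)]
    return ", ".join(p for p in parts if p)

samples = [
    "items(lcd , cardboard,hats on rack , keyboard cat)",
    "items: (lcd, cardboard, hats on rack, keyboard cat (dept. 11)",
    "item (lcd; cardboard; hats on rack; keyboard cat)",
    "item (lcd\u2122, cardboard)",
]
for s in samples:
    print(normalize(s))
```

Splitting on the separators and re-joining with ", " is what evens out the missing or extra spaces, so no separate whitespace pattern is needed.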
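What about 'item (lcd, cardboard) other stuff (may also have parens)'? –
@KevinGuan That is a special case. Trailing text after 'item (lcd, cardboard)' should not be valid, so just ignore the entire string (leave it as is). I will handle that at a later stage. I suppose that, technically, the regex should not even consider it (and so ignore the string entirely). I have removed that part from the post. – rublex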