Extracting key-value pairs from a string with quotes

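I am having trouble writing an 'elegant' parser for this requirement (one that doesn't look like a piece of C breakfast). The input is a string of key-value pairs separated by ',' and joined with '='.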

key1=value1,key2=value2

The part that trips me up is that values can be quoted ("), and a ',' inside the quotes does not end the key-value pair.

key1=value1,key2="value2,still_value2"

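This last part has made it tricky for me to use split or re.split, and I ended up resorting to for-i-in-range loops :(.

Can anyone demonstrate a clean way to do this?

It is OK to assume that quotes occur only in values, and that there is no whitespace or non-alphanumeric characters.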

Source

2016-08-03 Evan Benn

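Could you please post the expected output? –

Does the value of 'key2' in the second example include the quotes? That is, in your example, does 'key2' map to 'value2,still_value2' or to '"value2,still_value2"'? – EvilTak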

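I would advise against using regular expressions for this task, because the language you want to parse is not regular.

You have a string of multiple key-value pairs. The best way to parse it is not to match patterns against it, but to tokenize it properly.

The Python standard library has a module called shlex that mimics the parsing done by POSIX shells and provides a lexer implementation that can easily be customized to your needs.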

from shlex import shlex 

def parse_kv_pairs(text, item_sep=",", value_sep="="): 
    """Parse key-value pairs from a shell-like text.""" 
    # initialize a lexer, in POSIX mode (to properly handle escaping) 
    lexer = shlex(text, posix=True) 
    # set ',' as whitespace for the lexer 
    # (the lexer will use this character to separate words) 
    lexer.whitespace = item_sep 
    # include '=' as a word character 
    # (this is done so that the lexer returns a list of key-value pairs) 
    # (if your option key or value contains any unquoted special character, you will need to add it here) 
    lexer.wordchars += value_sep 
    # then we separate option keys and values to build the resulting dictionary 
    # (maxsplit is required to make sure that '=' in value will not be a problem) 
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

Example run:

parse_kv_pairs(
    'key1=value1,key2=\'value2,still_value2,not_key1="not_value1"\'' 
)

Output:

{'key1': 'value1', 'key2': 'value2,still_value2,not_key1="not_value1"'}

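EDIT: I forgot to add that the reason I usually stick with shlex rather than regular expressions (which are faster in this case) is that it gives you fewer surprises, especially if you later need to allow more kinds of input. I have never found a way to parse such key-value pairs properly with regular expressions; there will always be inputs (for example: A="B=\"1,2,3\"") that trick the engine.

If you do not care about such inputs (or, put another way, if you can ensure that your input follows the definition of a regular language), regular expressions are perfectly fine.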
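As a rough illustration of the tricky input mentioned above, the sketch below feeds A="B=\"1,2,3\"" to the same shlex-based function and, for contrast, to a plain split on commas. The function is repeated only so the snippet runs on its own; the names tricky, naive_pieces and parsed are illustrative, not part of the original answer.

```python
from shlex import shlex

def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Same approach as the function above, repeated so this runs standalone."""
    lexer = shlex(text, posix=True)
    lexer.whitespace = item_sep
    lexer.wordchars += value_sep
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

# Raw string, so the backslashes reach the lexer unchanged
tricky = r'A="B=\"1,2,3\""'

# A plain split cuts the quoted value into three pieces
naive_pieces = tricky.split(",")

# The POSIX-mode lexer treats \" as an escaped quote inside double quotes
parsed = parse_kv_pairs(tricky)

print(naive_pieces)
print(parsed)
```

With this input the lexer should return a single pair whose value still contains the inner quotes and commas.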

EDIT2: split has a maxsplit argument, which is much cleaner than splitting/slicing/joining. Thanks to @cdlane for the sound input!
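A small sketch of what maxsplit changes when a value itself contains '='. The sample string and variable names are made up for illustration.

```python
pair = "key=a=b"

# Without maxsplit every '=' is a split point
all_parts = pair.split("=")

# With maxsplit=1 only the first '=' separates key from value
key, value = pair.split("=", maxsplit=1)

print(all_parts)
print(key, value)
```

str.partition("=") would be another way to cut at the first '=' only.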

Source

2016-08-03 09:00:01 pistache

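I believe 'shlex' is a solid production solution, and this is a fine example of how to tweak it to the problem at hand. However, this answer loses all its elegance for me in the return statement: split() the same data twice, then 'join()' to clean up after the excessive split(), just so you can use a dictionary comprehension? How about 'return dict(word.split(value_sep, maxsplit=1) for word in lexer)' – cdlane

Yes, that's much better. I had forgotten about the 'maxsplit' argument when writing it, and it did become less elegant when I added support for '=' in values. Thanks for the suggestion, I have edited the answer. – pistache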

I'm not sure that this doesn't look like a C breakfast, or that it is all that elegant :)

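data = {}
original = 'key1=value1,key2="value2,still_value2"'
converted = ''

# Replace every comma outside quotes with a newline
is_open = False  # True while inside a quoted value
for c in original:
    if c == ',' and not is_open:
        c = '\n'
    elif c in ('"', "'"):
        is_open = not is_open
    converted += c

# Each line now holds one key=value pair (quotes are kept in the value)
for item in converted.split('\n'):
    k, v = item.split('=', 1)  # split on the first '=' only
    data[k] = v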
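The same character-by-character idea can be wrapped in a small generator. This is only a sketch under a few assumptions: the name split_unquoted is hypothetical, quotes are kept in the output, and unlike the loop above it remembers which quote character opened the value, so a ' inside "..." does not close it.

```python
def split_unquoted(text, sep=","):
    """Yield pieces of text split on sep, ignoring sep inside quotes."""
    piece = []
    open_quote = None  # the quote character currently open, if any
    for ch in text:
        if open_quote:
            # Inside quotes: only the matching quote character closes them
            if ch == open_quote:
                open_quote = None
        elif ch in "\"'":
            open_quote = ch
        elif ch == sep:
            # Unquoted separator: emit the finished piece and start a new one
            yield "".join(piece)
            piece = []
            continue
        piece.append(ch)
    yield "".join(piece)

original = 'key1=value1,key2="value2,still_value2"'
pieces = list(split_unquoted(original))
data = dict(p.split("=", 1) for p in pieces)
print(data)
```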

Source

2016-08-03 08:25:58

Using some regex magic from "Split a string, respect and preserve quotes", we can do this:

import re 

string = 'key1=value1,key2="value2,still_value2"' 

key_value_pairs = re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', string) 

for key_value_pair in key_value_pairs: 
    key, value = key_value_pair.split("=")

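Per BioGeek, here is my attempt to guess, I mean explain, the regex Janne Karila used: the pattern breaks the string on commas but respects double-quoted sections (which may contain commas) in the process. It has two separate alternatives: runs of characters that do not involve quotes; and double-quoted runs of characters, where a double quote ends the run unless it is escaped with a backslash: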

(?:                # parentheses for alternation (|), not memory
[^\s,"]            # any 1 character except white space, comma or quote
|                  # or
"(?:\\.|[^"])*"    # a quoted string containing 0 or more characters
                   # other than quotes (unless escaped)
)+                 # one or more of the above
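The commented pattern above can be compiled as written by passing re.VERBOSE, which lets the comments live inside the pattern. A minimal sketch; the names PAIR_RE, pairs and result are illustrative, and the quotes are left in the value.

```python
import re

# re.VERBOSE ignores whitespace and '#' comments outside character classes
PAIR_RE = re.compile(r'''
    (?:                  # non-capturing group, used only for alternation
        [^\s,"]          # one character that is not whitespace, comma or double quote
      |                  # or
        "(?:\\.|[^"])*"  # a double-quoted run, allowing backslash-escaped characters
    )+                   # one or more of the above
''', re.VERBOSE)

text = 'key1=value1,key2="value2,still_value2"'

# No capturing groups, so findall returns each whole match
pairs = PAIR_RE.findall(text)

# Split each pair on its first '=' only
result = dict(p.split("=", 1) for p in pairs)
print(result)
```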

Source

2016-08-03 08:32:07 cdlane

You could add some explanation of how the regex works. – BioGeek

@BioGeek, I tried to do as you asked, let me know if I succeeded! – cdlane

cdlane, thanks for the explanation! – BioGeek

I came up with this regular expression solution:

import re 
match = re.findall(r'([^=]+)=(("[^"]+")|([^,]+)),?', 'key1=value1,key2=value2,key3="value3,stillvalue3",key4=value4')

This makes match equal to:

[('key1', 'value1', '', 'value1'), ('key2', 'value2', '', 'value2'), ('key3', '"value3,stillvalue3"', '"value3,stillvalue3"', ''), ('key4', 'value4', '', 'value4')]

Then you can use a for loop to get the keys and values:

for m in match: 
    key = m[0] 
    value = m[1]
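If a dictionary is the goal, the loop can be folded into a comprehension. This is a sketch using the same pattern and input; stripping the surrounding double quotes from the value is an assumption on my part, since the question does not say whether they should be kept.

```python
import re

text = 'key1=value1,key2=value2,key3="value3,stillvalue3",key4=value4'
matches = re.findall(r'([^=]+)=(("[^"]+")|([^,]+)),?', text)

# Group 1 (m[0]) is the key and group 2 (m[1]) the whole value.
# strip('"') removes the surrounding quotes (an assumption, see above).
result = {m[0]: m[1].strip('"') for m in matches}
print(result)
```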

Source

2016-08-03 08:37:03

Based on several other answers, I came up with the following solution:

import re 
import itertools 

data = 'key1=value1,key2="value2,still_value2"' 

# Based on Alan Moore's answer on http://stackoverflow.com/questions/2785755/how-to-split-but-ignore-separators-in-quoted-strings-in-python 
def split_on_non_quoted_equals(string): 
    return re.split('''=(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', string) 
def split_on_non_quoted_comma(string): 
    return re.split(''',(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', string) 

split1 = split_on_non_quoted_equals(data) 
split2 = map(lambda x: split_on_non_quoted_comma(x), split1) 

# 'Unpack' the sublists in to a single list. Based on Alex Martelli's answer on http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python 
flattened = [item for sublist in split2 for item in sublist] 

# Convert alternating elements of a list into keys and values of a dictionary. Based on Sven Marnach's answer on http://stackoverflow.com/questions/6900955/python-convert-list-to-dictionary 
# On Python 3 the function is zip_longest (it was izip_longest on Python 2)
d = dict(itertools.zip_longest(*[iter(flattened)] * 2, fillvalue=""))

This leaves d as the following dictionary:

{'key1': 'value1', 'key2': '"value2,still_value2"'}
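A shorter variant of the same lookahead idea, sketched under the assumption that splitting on unquoted commas first and then cutting each pair at its first '=' is acceptable. The names UNQUOTED_COMMA and parse_pairs are hypothetical.

```python
import re

# A comma counts as a separator only if the rest of the string, up to the
# end, consists of unquoted characters and complete quoted sections.
UNQUOTED_COMMA = re.compile(r''',(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''')

def parse_pairs(text):
    """Split on unquoted commas, then on the first '=' of each pair."""
    return dict(item.split("=", 1) for item in UNQUOTED_COMMA.split(text))

data = 'key1=value1,key2="value2,still_value2"'
d = parse_pairs(data)
print(d)
```

Because the lookahead rescans the remainder of the string for every comma, this is likely to be slow on long inputs.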

Source

2016-08-03 08:48:45
