2011-09-09 36 views
1

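Reading a file of dictionaries with inconsistent quoting

I have a file containing a list of dictionaries, most of which are not marked up with quotes. An example follows: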

{game:Available,player:Available,location:"Chelsea, London, England",time:Available} 
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"} 

As you can see, the keys can also differ from one dictionary to another.

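I tried to read it with the json module and with the csv module's DictReader, but each time I run into trouble because the double quotes are always present around the location value, but not always around the other keys or values. So far I see two possibilities: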

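  1. Replace "," with ";" inside the location value, and get rid of all the quotes.
  2. Add quotes around every key and value, except the location value.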
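
As a rough illustration of option 2, the sketch below quotes every bare token and then hands the line to the json module; strings that are already quoted, such as the location value, are left untouched. The names `TOKEN_RE`, `_quote` and `to_dict` are made up for this example. It assumes that bare keys and values never contain `{`, `}`, `:`, `,` or `"`, and that quoted strings contain no escaped quotes. An empty value (as in `location:,`) would still be rejected by json.

```python
import json
import re

# Either an already-quoted string, or a bare token that runs up to the
# next structural character ({ } : , ").
TOKEN_RE = re.compile(r'"[^"]*"|[^\s{}:,"][^{}:,"]*')

def _quote(match):
    token = match.group()
    if token.startswith('"'):
        return token                    # already quoted: keep as is
    return json.dumps(token.strip())    # bare token: add quotes

def to_dict(line):
    """Turn one loosely quoted {...} line into a dict (hypothetical helper)."""
    return json.loads(TOKEN_RE.sub(_quote, line))

line = '{game:Available,player:Available,location:"Chelsea, London, England",time:Available}'
print(to_dict(line))
```

Every value comes back as a string, so a token such as 15h18 or 42 is not converted to a number.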

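PS: My final goal is to format all these dictionaries into an SQL table whose columns are the union of the keys of all the dictionaries and whose rows are the individual dictionaries, with blanks where values are missing.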
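
The table described in the PS could look like the sketch below, once the lines have been parsed into Python dicts by whatever method. The question does not name a database, so the use of the standard sqlite3 module with an in-memory database is an assumption, and `build_table` and the table name `games` are hypothetical.

```python
import sqlite3

def build_table(dicts, table="games"):
    """Create a table whose columns are the union of all keys (hypothetical helper)."""
    # Union of the keys, in order of first appearance.
    columns = []
    for d in dicts:
        for key in d:
            if key not in columns:
                columns.append(key)

    conn = sqlite3.connect(":memory:")
    # Identifiers cannot be bound as parameters, so they are quoted by hand.
    col_sql = ", ".join('"%s" TEXT' % c.replace('"', '""') for c in columns)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, col_sql))

    # Missing keys become empty strings (the "blanks").
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table, placeholders),
        ([d.get(c, "") for c in columns] for d in dicts),
    )
    return conn, columns

dicts = [
    {"game": "Available", "location": "Chelsea, London, England"},
    {"game": "Available", "date": "Available"},
]
conn, columns = build_table(dicts)
print(columns)
print(conn.execute('SELECT * FROM "games" ORDER BY rowid').fetchall())
```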

Answers

1

Here is a fairly complete piece of code, I think.

First, I created the following file:

{surprise : "perturbating at start ", game:Available Universal Dices Game, 
    player:FTROE875574,location 
:"Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada",time:15h18} 

{"game":"Available"," player":"LOI4531", 
"location": "Perth, Australia","time":"08h13","date":"Available"} 

{"game":Available,player:PLLI874,location:"Chelsea, London, England",time:20h35} 

{special:"midnight happening",game:"Available","player":YTR44, 
"location":"Paris, France","time":"02h24" 
, 
"date":"Available"} 

{game:Available,surprise:" hretyuuhuhu ",player:FT875,location 
:,"time":11h22} 

{"game":"Available","player":"LOI4531","location": 
"Damas,Syria","time":"unavailable","date":"Available"} 

{"surprise " : GARAMANANATALA Tower , game:Available Dices,player : 
    PuLuLu874,location:" Westminster, London, England ",time:20h01} 

{"game":"Available",special:"overnight", "player":YTR44,"location": 
"Madrid, Spain" ,  "time": 
"12h33", 
date:"Available" 
} 

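Then the following code processes the content of the file in two phases:

  • first, while running through the content, all the keys that occur in any of the dictionaries are collected

  • the dictionary posis is deduced from them; it gives, for each key, the position that the corresponding value must occupy in a row

  • second, during another run through the file, the rows are built one after the other and collected in a list

By the way, note that the condition on the value associated with the key "location" is respected: its quotes are kept.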

import re

# Body of one dictionary: everything after '{' up to and including '}'
dicreg = re.compile(r'(?<=\{)[^}]*}')

# One key:value pair; group 2 is the key, group 5 is the value.
# Surrounding quotes are dropped, except around the value of 'location'.
kvregx = re.compile(r'[ \r\n]*'
                    r'(" *)?((location)|[^:]+?)(?(1) *")'
                    r'[ \r\n]*'
                    r':'
                    r'[ \r\n]*'
                    r'(?(3)|(" *)?)([^:]*?)(?(4) *")'
                    r'[ \r\n]*(?:,(?=[^,]+?:)|\})')


checking_dict = {}   # key -> number of times the key was seen
checking_list = []   # keys, in order of first appearance

filename = 'zzz.txt'

with open(filename) as f:

    ######## First part: to gather all the keys in all the dictionaries

    prec, chunk = '', 'go'
    ecr = []   # trace of what is read and found, printed afterwards
    while chunk:
        chunk = f.read(120)
        ss = ''.join((prec, chunk))
        ecr.append('\n\n------------------------------------------------------------\nss == %r' % ss)
        mat_dic = None
        for mat_dic in dicreg.finditer(ss):
            ecr.append('\nmmmmmmm dictionary found in ss mmmmmmmmmmmmmm')
            for mat_kv in kvregx.finditer(mat_dic.group()):
                k, v = mat_kv.group(2, 5)
                ecr.append('%s : %s' % (k, v))
                if k in checking_list:
                    checking_dict[k] += 1
                else:
                    checking_list.append(k)
                    checking_dict[k] = 1
        if mat_dic:
            # keep only what follows the last complete dictionary
            prec = ss[mat_dic.end():]
        else:
            # no complete dictionary yet: keep everything read so far
            prec += chunk

    print('\n'.join(ecr))
    print('\n\n\nchecking_dict == %s\n\nchecking_list  == %s' % (checking_dict, checking_list))

    ######## The keys are sorted so that the less frequent ones come last
    checking_list.sort(key=lambda k: checking_dict[k], reverse=True)
    posis = dict((k, i) for i, k in enumerate(checking_list))
    print('\nchecking_list sorted == %s\n\nposis == %s' % (checking_list, posis))



    ######## Now, the file is read again to build a list of rows

    f.seek(0, 0)  # the file pointer is moved back to the beginning of the file

    prec, chunk = '', 'go'
    base = ['' for i in range(len(checking_list))]
    rows = []
    while chunk:
        chunk = f.read(110)
        ss = ''.join((prec, chunk))
        mat_dic = None
        for mat_dic in dicreg.finditer(ss):
            li = base[:]
            for mat_kv in kvregx.finditer(mat_dic.group()):
                k, v = mat_kv.group(2, 5)
                li[posis[k]] = v
            rows.append(li)
        if mat_dic:
            prec = ss[mat_dic.end():]
        else:
            prec += chunk


    print('\n\n%s\n%s' % (checking_list, 30 * '___'))
    print('\n'.join(str(li) for li in rows))

Result

------------------------------------------------------------ 
ss == '{surprise : "perturbating at start ", game:Available Universal Dices Game,\n player:FTROE875574,location\n:"Lakeview S' 


------------------------------------------------------------ 
ss == '{surprise : "perturbating at start ", game:Available Universal Dices Game,\n player:FTROE875574,location\n:"Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada",time:15h18}\n\n{"game":"Available"," player":"LOI4531",\n"l' 

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm 
surprise : perturbating at start 
game : Available Universal Dices Game 
player : FTROE875574 
location : "Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada" 
time : 15h18 


------------------------------------------------------------ 
ss == '\n\n{"game":"Available"," player":"LOI4531",\n"location": "Perth, Australia","time":"08h13","date":"Available"}\n\n{"game":Available,player:PLLI874,location:"Chelsea, Lo' 

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm 
game : Available 
player : LOI4531 
location : "Perth, Australia" 
time : 08h13 
date : Available 


------------------------------------------------------------ 
ss == '\n\n{"game":Available,player:PLLI874,location:"Chelsea, London, England",time:20h35}\n\n{special:"midnight happening",game:"Available","player":YTR44,\n"location":"Paris, France","t' 

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm 
game : Available 
player : PLLI874 
location : "Chelsea, London, England" 
time : 20h35 


------------------------------------------------------------ 
ss == '\n\n{special:"midnight happening",game:"Available","player":YTR44,\n"location":"Paris, France","time":"02h24"\n,\n"date":"Available"}\n\n{game:Available,surprise:" hretyuuhuhu ",player:FT875,location\n:,"time":11h22}\n\n{"' 

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm 
special : midnight happening 
game : Available 
player : YTR44 
location : "Paris, France" 
time : 02h24 
date : Available 

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm 
game : Available 
surprise : hretyuuhuhu 
player : FT875 
location : 
time : 11h22 


------------------------------------------------------------ 
ss == '\n\n{"game":"Available","player":"LOI4531","location":\n"Damas,Syria","time":"unavailable","date":"Available"}\n\n{"surprise " ' 

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm 
game : Available 
player : LOI4531 
location : "Damas,Syria" 
time : unavailable 
date : Available 


------------------------------------------------------------ 
ss == '\n\n{"surprise " : GARAMANANATALA Tower , game:Available Dices,player :\n PuLuLu874,location:" Westminster, London, England ",time:20' 


------------------------------------------------------------ 
ss == '\n\n{"surprise " : GARAMANANATALA Tower , game:Available Dices,player :\n PuLuLu874,location:" Westminster, London, England ",time:20h01}\n\n{"game":"Available",special:"overnight", "player":YTR44,"location":\n"Madrid, Spain" ,  "time":\n"12h33",\nda' 

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm 
surprise : GARAMANANATALA Tower 
game : Available Dices 
player : PuLuLu874 
location : " Westminster, London, England " 
time : 20h01 


------------------------------------------------------------ 
ss == '\n\n{"game":"Available",special:"overnight", "player":YTR44,"location":\n"Madrid, Spain" ,  "time":\n"12h33",\ndate:"Available"\n}' 

mmmmmmm dictionary found in ss mmmmmmmmmmmmmm 
game : Available 
special : overnight 
player : YTR44 
location : "Madrid, Spain" 
time : 12h33 
date : Available 


------------------------------------------------------------ 
ss == '' 



checking_dict == {'player': 8, 'game': 8, 'location': 8, 'time': 8, 'date': 4, 'surprise': 3, 'special': 2} 

checking_list  == ['surprise', 'game', 'player', 'location', 'time', 'date', 'special'] 

checking_list sorted == ['game', 'player', 'location', 'time', 'date', 'surprise', 'special'] 

posis == {'player': 1, 'game': 0, 'location': 2, 'time': 3, 'date': 4, 'surprise': 5, 'special': 6} 


['game', 'player', 'location', 'time', 'date', 'surprise', 'special'] 
__________________________________________________________________________________________ 
['Available Universal Dices Game', 'FTROE875574', '"Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada"', '15h18', '', 'perturbating at start', ''] 
['Available', 'LOI4531', '"Perth, Australia"', '08h13', 'Available', '', ''] 
['Available', 'PLLI874', '"Chelsea, London, England"', '20h35', '', '', ''] 
['Available', 'YTR44', '"Paris, France"', '02h24', 'Available', '', 'midnight happening'] 
['Available', 'FT875', '', '11h22', '', 'hretyuuhuhu', ''] 
['Available', 'LOI4531', '"Damas,Syria"', 'unavailable', 'Available', '', ''] 
['Available Dices', 'PuLuLu874', '" Westminster, London, England "', '20h01', '', 'GARAMANANATALA Tower', ''] 
['Available', 'YTR44', '"Madrid, Spain"', '12h33', 'Available', '', 'overnight'] 

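I wrote the code above with a huge file of several GB in mind, one that cannot be read into memory in its entirety: such a very large file must be processed chunk after chunk. That is why there are these instructions: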

while chunk:
    chunk = f.read(120)
    ss = ''.join((prec, chunk))
    ecr.append('\n\n------------------------------------------------------------\nss == %r' % ss)
    mat_dic = None
    for mat_dic in dicreg.finditer(ss):
        ............
        ...............
    if mat_dic:
        prec = ss[mat_dic.end():]
    else:
        prec += chunk

But obviously, if the file is not too big and can therefore be read in one go, the code can be simplified to:

import re

# Body of one dictionary: everything after '{' up to and including '}'
dicreg = re.compile(r'(?<=\{)[^}]*}')

# One key:value pair; group 2 is the key, group 5 is the value.
# Surrounding quotes are dropped, except around the value of 'location'.
kvregx = re.compile(r'[ \r\n]*'
                    r'(" *)?((location)|[^:]+?)(?(1) *")'
                    r'[ \r\n]*'
                    r':'
                    r'[ \r\n]*'
                    r'(?(3)|(" *)?)([^:]*?)(?(4) *")'
                    r'[ \r\n]*(?:,(?=[^,]+?:)|\})')


checking_dict = {}   # key -> number of times the key was seen
checking_list = []   # keys, in order of first appearance

filename = 'zzz.txt'

with open(filename) as f:
    content = f.read()


######## First part: to gather all the keys in all the dictionaries

ecr = []   # trace of what is found, printed afterwards

for mat_dic in dicreg.finditer(content):
    ecr.append('\nmmmmmmm dictionary found in ss mmmmmmmmmmmmmm')
    for mat_kv in kvregx.finditer(mat_dic.group()):
        k, v = mat_kv.group(2, 5)
        ecr.append('%s : %s' % (k, v))
        if k in checking_list:
            checking_dict[k] += 1
        else:
            checking_list.append(k)
            checking_dict[k] = 1


print('\n'.join(ecr))
print('\n\n\nchecking_dict == %s\n\nchecking_list  == %s' % (checking_dict, checking_list))

######## The keys are sorted so that the less frequent ones come last
checking_list.sort(key=lambda k: checking_dict[k], reverse=True)
posis = dict((k, i) for i, k in enumerate(checking_list))
print('\nchecking_list sorted == %s\n\nposis == %s' % (checking_list, posis))



######## Now, the content is scanned again to build a list of rows

base = ['' for i in range(len(checking_list))]
rows = []

for mat_dic in dicreg.finditer(content):
    li = base[:]
    for mat_kv in kvregx.finditer(mat_dic.group()):
        k, v = mat_kv.group(2, 5)
        li[posis[k]] = v
    rows.append(li)


print('\n\n%s\n%s' % (checking_list, 30 * '___'))
print('\n'.join(str(li) for li in rows))
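
The two phases described at the start of this answer (count the keys, derive posis, build the rows) can also be looked at separately from the parsing. The sketch below is not the answer's code: it assumes the dictionaries are already available as Python dicts, uses collections.Counter instead of the checking_dict/checking_list pair, and `rows_from_dicts` is a hypothetical name.

```python
from collections import Counter

def rows_from_dicts(dicts):
    """Return (columns, rows) with the most frequent keys first (hypothetical helper)."""
    # Phase 1: count how often each key occurs across all dictionaries.
    counts = Counter(key for d in dicts for key in d)

    # sorted() is stable, so keys with equal counts keep first-seen order.
    columns = sorted(counts, key=lambda k: -counts[k])
    posis = {key: i for i, key in enumerate(columns)}

    # Phase 2: place each value at the position assigned to its key.
    rows = []
    for d in dicts:
        row = [''] * len(columns)
        for key, value in d.items():
            row[posis[key]] = value
        rows.append(row)
    return columns, rows

columns, rows = rows_from_dicts([
    {'surprise': 's', 'game': 'g1', 'time': 't1'},
    {'game': 'g2', 'time': 't2', 'date': 'd2'},
])
print(columns)
for row in rows:
    print(row)
```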
+0

Thanks a lot for this answer. After trying it on a very large file, I got a memory error: ecr.append('\n\n------------------------------------------------------------\nss == %r' %ss) MemoryError. I will try to figure it out – Xavier
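
A likely contributor to this MemoryError is that ecr stores the repr of every chunk that is read, so the trace alone grows beyond the size of the file. A minimal sketch of chunked reading without any trace is given below; it is not the answer's code. `iter_dict_texts` and `DICT_RE` are hypothetical names, the chunk size is arbitrary, and the sketch assumes (as the answer's dicreg does) that no value contains a brace.

```python
import io
import re

# One complete {...} block with no nested braces.
DICT_RE = re.compile(r'\{[^{}]*\}')

def iter_dict_texts(fileobj, chunk_size=64 * 1024):
    """Yield the text of each {...} block, one at a time (hypothetical helper)."""
    pending = ''                      # text read but not yet matched
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        end = 0
        for match in DICT_RE.finditer(pending):
            yield match.group()
            end = match.end()
        # Drop everything up to the last complete block.
        pending = pending[end:]

source = io.StringIO('{a:1,b:"x, y"}\n\n{"c":2}\n')
for text in iter_dict_texts(source, chunk_size=5):
    print(text)
```

Because the function is a generator, only the current chunk and the unmatched remainder are held in memory; the key/value regex can then be applied to each yielded block.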

1

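If the data is more complex than the example you gave, or if it has to be faster, you should take a look at pyparsing.

Otherwise, you can write something more hacky, like this: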

contentlines = [
    """{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}""",
    """{game:Available,player:Available,location:"Chelsea, London, England",time:Available}""",
]

def get_dict(line):
    keys = []
    values = []
    line = line.replace("{", "").replace("}", "")
    # Each piece between two colons holds a value followed by the next key
    contlist = line.split(":")
    keys.append(contlist[0].strip('"').strip("'"))
    for entry in contlist[1:-1]:
        entry = entry.strip()
        if entry[0] == "'" or entry[0] == '"':
            # quoted value: it ends at the matching quote
            endpos = entry[1:].find(entry[0]) + 2
        else:
            # bare value: it ends at the first comma
            endpos = entry.find(",")
        values.append(entry[0:endpos].strip('"').strip("'"))
        keys.append(entry[endpos + 1:].strip('"').strip("'"))
    values.append(contlist[-1].strip('"').strip("'"))
    return dict(zip(keys, values))


for line in contentlines:
    print(get_dict(line))
0
import re 

text = """ 
{game:Available,player:Available,location:"Chelsea, London, England",time:Available} 
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"} 
""" 

dicts = re.findall(r"{.+?}", text)       # Split the dicts 
for dict_ in dicts: 
    dict_ = dict(re.findall(r'(\w+|".*?"):(\w+|".*?")', dict_)) # Get the elements 
    print(dict_)

>>>{'player': 'Available', 'game': 'Available', 'location': '"Chelsea, London, England"', 'time': 'Available'} 
>>>{'"game"': '"Available"', '"time"': '"Available"', '"player"': '"Available"', '"date"': '"Available"', '"location"': '"Chelsea, London, England"'} 
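
As the output above shows, the quotes stay in the keys and values. A variant with separate capture groups for the quoted and the bare form drops them. This is only a sketch: `PAIR_RE` and `parse_line` are hypothetical names, and like the `\w+` pattern above it assumes that bare keys and values are single words.

```python
import re

# A token is either "quoted text" (group without the quotes) or a bare word.
TOKEN = r'(?:"([^"]*)"|(\w+))'
PAIR_RE = re.compile(TOKEN + r'\s*:\s*' + TOKEN)

def parse_line(line):
    """Return the key/value pairs of one {...} line with quotes removed."""
    result = {}
    # findall yields '' for the alternative that did not match.
    for quoted_key, bare_key, quoted_value, bare_value in PAIR_RE.findall(line):
        result[quoted_key or bare_key] = quoted_value or bare_value
    return result

text = """
{game:Available,player:Available,location:"Chelsea, London, England",time:Available}
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}
"""

for line in re.findall(r"{.+?}", text):
    print(parse_line(line))
```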
0

Hopefully this pyparsing solution is easier to follow and maintain over time:

data = """\ 
{game:Available,player:Available,location:"Chelsea, London, England",time:Available} 
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}""" 

from pyparsing import Suppress, Word, alphas, alphanums, QuotedString, Group, Dict, delimitedList 

LBRACE,RBRACE,COLON = map(Suppress, "{}:") 
key = QuotedString('"') | Word(alphas) 
value = QuotedString('"') | Word(alphanums+"_") 
keyvalue = Group(key + COLON + value) 

dictExpr = LBRACE + Dict(delimitedList(keyvalue)) + RBRACE 

for d in dictExpr.searchString(data): 
    print(d.asDict())

Prints:

{'player': 'Available', 'game': 'Available', 'location': 'Chelsea, London, England', 'time': 'Available'} 
{'date': 'Available', 'player': 'Available', 'game': 'Available', 'location': 'Chelsea, London, England', 'time': 'Available'} 
+0

Indeed! I used it to sidestep the memory problem with eyquem's answer. PyParsing seems very powerful – Xavier

+0

Edit: it works well, but sadly it is very slow at parsing. Thanks anyway – Xavier