2011-10-08 32 views

How do I replace blank entries in a text table with 0 in Python?

I have tables that look like this:

text = """ 
ID = 1234 

Hello World    135,343 117,668 81,228 
Another line of text (30,632)    (48,063) 
More text     0   11,205  0  
Even more text      1,447  681 

ID = 18372 

Another table      35,323    38,302  909,381 
Another line with text     13     15 
More text here            7   0  
Even more text here     7,011    1,447  681 
""" 

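Is there a way to replace the "blank" entries in each table with 0? I want to put a delimiter between the entries, but the code below cannot cope with the blank spots in the table: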

for line in text.splitlines():
    if 'ID' not in line:
        # Split on whitespace and treat the last three tokens as the columns.
        line1 = line.split()
        line = '|'.join((' '.join(line1[:-3]), '|'.join(line1[-3:])))
        print(line)
    else:
        print(line)

The output is:

ID = 1234 
| 
Hello World|135,343|117,668|81,228 
Another line of|text|(30,632)|(48,063) 
More text|0|11,205|0 
Even more|text|1,447|681 
| 
ID = 18372 
| 
Another table|35,323|38,302|909,381 
Another line with|text|13|15 
More text|here|7|0 
Even more text here|7,011|1,447|681 

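As you can see, the first problem shows up in the second row of the first table: the word "text" is treated as the first column. Is there any way to fix this in Python so that blank entries are replaced with 0?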
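
The failure can be reproduced in isolation. The short sketch below uses two rows copied from the first table and shows that whitespace splitting loses the position of the missing value, so the last three tokens are not always the three numeric columns:

```python
rows = [
    "Hello World    135,343 117,668 81,228",
    "Another line of text (30,632)    (48,063)",
]

# str.split() collapses any run of whitespace, so a blank cell simply vanishes.
tails = [row.split()[-3:] for row in rows]
for tail in tails:
    print(tail)
```

In the second row only two numbers survive the split, so the slice reaches back into the label and picks up "text".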

Answer

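Here is a function that finds the columns in a group of lines. The second argument, pat, defines what counts as a separator between columns and can be any regular expression.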

import itertools as it 
import re 

def find_columns(lines, pat=r' '):
    '''
    Return the sorted offsets at which columns start.

    Usage:
        widths = find_columns(lines)
        for line in lines:
            if not line: continue
            vals = [line[widths[i]:widths[i+1]].strip() for i in range(len(widths)-1)]
    '''
    widths = []
    maxlen = max(len(line) for line in lines)
    for line in lines:
        # Pad every line to the same length so trailing blanks count as separators.
        line = line.ljust(maxlen)
        candidates = []
        for match in re.finditer(pat, line):
            candidates.extend(range(match.start(), match.end()+1))
        widths.append(set(candidates))
    # Keep only the offsets that are separator positions in every line.
    widths = sorted(set.intersection(*widths))
    # Keep the first offset of each run of consecutive offsets.
    diffs = [widths[i+1]-widths[i] for i in range(len(widths)-1)]
    diffs = [None]+diffs
    widths = [w for d, w in zip(diffs, widths) if d != 1]
    # Guard against an empty result before looking at widths[0].
    if not widths or widths[0] != 0:
        widths = [0]+widths
    return widths

def report(text):
    # Group consecutive lines by whether they are an 'ID = ...' header.
    for key, group in it.groupby(text.splitlines(), lambda line: line.startswith('ID')):
        lines = list(group)
        if key:
            print('\n'.join(lines))
        else:
            # r'\s(?![a-zA-Z])' treats as a separator any whitespace character
            # that is not followed by an alphabetic character.
            widths = find_columns(lines, pat=r'\s(?![a-zA-Z])')
            for line in lines:
                if not line: continue
                vals = [line[widths[i]:widths[i+1]] for i in range(len(widths)-1)]
                # A cell that is entirely blank becomes '0', keeping its width.
                vals = [v if v.strip() else v[1:]+'0' for v in vals]
                print('|'.join(vals))

text = """\ 
ID = 1234 

Hello World    135,343 117,668 81,228 
Another line of text (30,632)    (48,063) 
More text     0   11,205  0  
Even more text      1,447  681 

ID = 18372 

Another table      35,323    38,302  909,381 
Another line with text     13     15 
More text here            7   0  
Even more text here     7,011    1,447  681 
""" 

report(text) 

This produces:

ID = 1234 
Hello World   |  135,343| 117,668| 81,228 
Another line of text| (30,632)|   0| (48,063) 
More text   |  0 |  11,205|  0 
Even more text  |   0|  1,447 |  681 
ID = 18372 
Another table   |    35,323|    38,302|  909,381 
Another line with text|     13 |    15|0 
More text here  |     0|     7 |   0 
Even more text here |    7,011|    1,447|  681 
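
The core idea, intersecting the separator positions of all rows, can also be tried on its own. The following is a simplified sketch of that idea rather than the code above: the helper names column_bounds and split_row are hypothetical, the sample rows are built with str.format so that their columns are known to be aligned, and it assumes right-aligned values and labels that contain only letters and single spaces.

```python
import re

def column_bounds(lines, pat=r' (?![A-Za-z])'):
    """Return slice boundaries shared by every line (hypothetical helper)."""
    width = max(len(line) for line in lines)
    common = None
    for line in lines:
        padded = line.ljust(width)
        # Offsets of blanks that are not followed by a letter.
        hits = {m.start() for m in re.finditer(pat, padded)}
        common = hits if common is None else common & hits
    positions = sorted(common)
    # The first offset of each run of consecutive offsets starts a column.
    starts = [p for i, p in enumerate(positions)
              if i == 0 or p != positions[i - 1] + 1]
    return [0] + [s for s in starts if s != 0] + [width]

def split_row(line, bounds, blank='0'):
    """Slice one line at the boundaries; empty value cells become `blank`."""
    padded = line.ljust(bounds[-1])
    cells = [padded[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [cells[0]] + [c or blank for c in cells[1:]]

# Made-up sample: one label column and three right-aligned value columns.
fmt = "{:<16}{:>10}{:>10}{:>10}"
table = [
    fmt.format("Hello World", "135,343", "117,668", "81,228"),
    fmt.format("Another line", "(30,632)", "", "(48,063)"),
    fmt.format("More text", "0", "11,205", "0"),
    fmt.format("Even more", "", "1,447", "681"),
]

bounds = column_bounds(table)
rows = [split_row(line, bounds) for line in table]
for row in rows:
    print('|'.join(row))
```

The approach only works when the values of each column overlap the same character range in every row; a column that is blank in all rows of a table cannot be detected this way.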

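This works for the tables I listed above. However, the column positions may differ from table to table, and there are thousands of tables. Can this be modified so that I don't have to check the column start positions of each table by hand? – myname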


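The columns of table 2 may not line up directly beneath the columns of table 1 the way they do in the example above. Each table has only 3 columns. – myname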
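
On the per-table concern: report() in the answer calls find_columns once for each block of lines between ID headers, so the offsets are recomputed for every table rather than shared between tables. A minimal sketch of that grouping step in isolation follows; the helper name tables_by_id and the sample rows are made up for illustration, and it assumes that only header lines start with "ID".

```python
import itertools

def tables_by_id(text):
    """Yield (header, body_lines) pairs, one per 'ID = ...' block."""
    header = None
    grouped = itertools.groupby(text.splitlines(),
                                lambda line: line.startswith('ID'))
    for is_header, group in grouped:
        lines = list(group)
        if is_header:
            header = lines[-1]
        else:
            # Drop blank lines; what remains is one table's rows.
            body = [line for line in lines if line.strip()]
            if body:
                yield header, body

sample = """\
ID = 1234

A  1 2 3
B  4   6

ID = 18372

C  7 8 9
"""

blocks = list(tables_by_id(sample))
for header, body in blocks:
    print(header, len(body))
```

Each body list can then be handed to a column finder separately, so tables with different layouts do not interfere with one another.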


Thank you very much! – myname
