Combining Python Pandas DataFrames more efficiently than the list-append approach

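I have been doing the following to build DataFrames from a small pipeline that processes individual JSON lines. Is there a more efficient way to do this than appending each DataFrame to a list and then concatenating them? Also, I don't even need the "keys" shown below as column labels, but I wasn't sure how to leave them out without getting a DataFrame construction error: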

import pandas as pd
import ujson

def readfiles(pattern, textfile):
    # Yield every parsed JSON line that contains `pattern` as a key.
    with open(textfile) as f:
        for line in f:
            try:
                parsed = ujson.loads(line.rstrip('\n').rstrip(','))
                if pattern in parsed:
                    yield parsed
            except ValueError:
                pass  # skip lines that are not valid JSON

def convertodf(lines):
    # Build one small DataFrame per line, then concatenate them all.
    dfs = []
    for line in lines:
        dfs.append(pd.DataFrame({'key1': line['value'],
                                 'key2': line['value']['value'],
                                 'key3': line['value'],
                                 'key4': line['value']['value'],
                                 'key5': line['value']['value']}))

    pd.concat(dfs, ignore_index=True).to_csv('testdf2.csv', index=False, header=False)

def main(pattern, filenames):
    lines = readfiles(pattern, filenames)
    convertodf(lines)

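The neat part of the implementation above is that one of the line['value'] elements is actually a list of comma-separated integers, e.g. [1,2,3], and the other values end up being replicated automatically to match it, for example: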

'key1' 'key2' 
    1  california 
    2  california 
    3  california 
     ...
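
The replication shown above comes from the DataFrame constructor itself: when a dict mixes a list with scalar values, each scalar is repeated once per list element. A minimal sketch, assuming a hypothetical record whose field names `ids` and `state` are made up for illustration and do not come from the original data:

```python
import pandas as pd

# Hypothetical record: one list-valued field and one scalar field.
record = {"ids": [1, 2, 3], "state": "california"}

# Because "key1" is a list, the scalar under "key2" is repeated
# once for every element of that list.
df = pd.DataFrame({"key1": record["ids"], "key2": record["state"]})
print(df)
```

If every value in the dict were a scalar, the same constructor would raise a ValueError unless an index is passed, which may be the construction error mentioned in the question.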

Here is the final working version I went with, thanks to unutbu's help.

import glob
import zipfile

import pandas as pd
import ujson

def readfiles(pattern, filedir):
    for f in glob.glob(filedir + '*.zip'):
        try:
            with zipfile.ZipFile(f, 'r') as myzip:
                for logfile in myzip.namelist():
                    for line in myzip.open(logfile):
                        try:
                            # ZipFile.open yields bytes; decoding as UTF-8 is an assumption.
                            line = ujson.loads(line.decode('utf-8').rstrip('\n').rstrip(','))
                            if pattern in line:
                                # Yield one row per element of the list-valued field.
                                for i in line['key1']:
                                    yield (i, line['key1']['key2'],
                                           line['key3'], line['key4']['key5'],
                                           line['key6']['key7'])
                        except ValueError:
                            pass  # skip lines that are not valid JSON
        except zipfile.BadZipFile:
            pass  # skip archives that cannot be opened

def convertdfcsv(lines):
    df = pd.DataFrame.from_records(lines)
    df.to_csv('testdf2.csv', index=False, header=False)

def main(pattern, filedir):
    lines = readfiles(pattern, filedir)
    convertdfcsv(lines)

2014-10-11 horatio1701d

Would it be possible to load the entire contents of `textfile` with a single call to `ujson.load`? – unutbu 2014-10-11 23:50:02

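Each text file contains roughly 50K lines, and each line is a separate JSON object, so I don't think so. That is why I have to loop over the lines of the text file. – horatio1701d 2014-10-12 00:11:58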

You could use DataFrame.from_records to build the DataFrame from an iterator of rows. A simple example showing how from_records works:

import pandas as pd

iterator = (item for item in [[1, 2, 3], [2, 3, 4, 5]])
df = pd.DataFrame.from_records(iterator, columns=list('abcd'))
print(df)
#    a  b  c    d
# 0  1  2  3  NaN
# 1  2  3  4    5
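
To relate this back to the list-append approach from the question, the sketch below builds the same table both ways. The `rows` generator and its values are hypothetical stand-ins for parsed JSON lines, not the asker's data:

```python
import pandas as pd

def rows():
    # Hypothetical rows standing in for parsed JSON lines.
    for i in range(5):
        yield i, i * 2, i / 2

columns = ["a", "b", "c"]

# List-append approach: one small DataFrame per row, then a single concat.
frames = [pd.DataFrame([row], columns=columns) for row in rows()]
via_concat = pd.concat(frames, ignore_index=True)

# from_records approach: one DataFrame built directly from the generator.
via_records = pd.DataFrame.from_records(rows(), columns=columns)

# Both approaches should hold the same values.
print(via_concat.values.tolist() == via_records.values.tolist())
```

The second approach avoids constructing a separate DataFrame object for every row, which is where most of the overhead in the first approach is likely to come from.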

Applied to your case, the code might look something like this:

import pandas as pd
import ujson

def readfiles(pattern, filenames):
    for textfile in filenames:
        # Text mode, so each line is a str and rstrip('\n') works as intended.
        with open(textfile, 'r') as f:
            for line in f:
                try:
                    line = ujson.loads(line.rstrip('\n').rstrip(','))
                    if pattern in line:
                        # Yield one tuple per line; each tuple becomes one row.
                        yield (line['value'], line['value']['value'], line['value'],
                               line['value']['value'], line['value']['value'])
                except ValueError:
                    pass  # skip lines that are not valid JSON

def convertodf(lines):
    df = pd.DataFrame.from_records(lines)
    df.to_csv('testdf2.csv', index=False, header=False)

def main(pattern, filenames):
    lines = readfiles(pattern, filenames)
    convertodf(lines)
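
For a version that can be run without any input files, here is a minimal sketch of the same pipeline. It swaps in the standard-library `json` module for `ujson` and reads from an in-memory buffer; the sample lines, the `read_rows` helper, and the field names `state` and `count` are all assumptions made for illustration:

```python
import io
import json

import pandas as pd

# Hypothetical input: one JSON object per line, one with a trailing comma,
# one malformed line, and one object that lacks the wanted key.
sample = io.StringIO(
    '{"state": "california", "count": 3},\n'
    'not json\n'
    '{"state": "oregon", "count": 5}\n'
    '{"other": 1}\n'
)

def read_rows(pattern, handle):
    for line in handle:
        try:
            parsed = json.loads(line.rstrip("\n").rstrip(","))
        except ValueError:
            continue  # skip lines that are not valid JSON
        if pattern in parsed:
            yield parsed["state"], parsed["count"]

df = pd.DataFrame.from_records(read_rows("state", sample),
                               columns=["state", "count"])

# Write to an in-memory buffer instead of 'testdf2.csv' for the demo.
out = io.StringIO()
df.to_csv(out, index=False, header=False)
print(out.getvalue())
```

Only the two well-formed lines containing the `state` key end up in the CSV; the malformed line and the object without that key are dropped.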

2014-10-12 00:24:10 unutbu

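This is really great. The one quirk I should have mentioned is that in the dict-based implementation I posted, one key:value pair is a list of integers. It behaves somewhat like an index, so all the other values get replicated and joined to each element of the list. Is there a way to get the same behavior for, say, a line['value'] in the yield? – horatio1701d 2014-10-12 01:15:09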

In that case, use a for-loop to iterate over the list and yield one row per element. – unutbu 2014-10-12 02:59:59

Excellent. I've updated the question with the final version. – horatio1701d 2014-10-12 10:21:10
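
The for-loop suggestion from the comments can be sketched in isolation as follows. The records, the `expand` helper, and the field names `ids` and `state` are hypothetical and only illustrate the pattern:

```python
import pandas as pd

# Hypothetical parsed records; "ids" holds a list of integers.
records = [
    {"ids": [1, 2, 3], "state": "california"},
    {"ids": [7], "state": "oregon"},
]

def expand(records):
    for rec in records:
        # One output row per list element, repeating the scalar field.
        for i in rec["ids"]:
            yield i, rec["state"]

df = pd.DataFrame.from_records(expand(records), columns=["id", "state"])
print(df)
```

This reproduces the replication that the dict-based constructor did implicitly, while still building a single DataFrame from one generator.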
