Filtering out extra headers in the middle of a table

I am trying to import a very large data file. It is a text file structured like this:

***** Information about Data *********** 
Information about data 
Information about Data 
Information about Data 

Information about Data 

    Col1  Col2 
    1.0  1.0 
    1.0  1.0 
    1.0  1.0 
    1.0  1.0 
    ...(10k+ lines) 
    1.0  1.0 
    1.0  1.0 
***** Information about Data *********** 
Information about data 
Information about Data 
Information about Data 

Information about Data 

    Col1  Col2 
    1.0  1.0 
    1.0  1.0 
    1.0  1.0 
    1.0  1.0 
    ...(10k+ lines) 
    1.0  1.0 
    1.0  1.0

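This pattern repeats an arbitrary number of times. The number of rows between headers varies, and the whole file is more than 1 million lines long.

Is there a way to strip these headers without going through the file line by line? I have already written a line-by-line search, but it is too slow to be practical.

The header is slightly different each time it appears.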

Source

2017-05-09 Davidallen353

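Is 'Header info' literally 'Header info'? – piRSquared

No, I will edit – Davidallen353

'np.genfromtxt' accepts input from anything that feeds it one line at a time. Since it already reads a file with 'readline', inserting a line-by-line filter into the pipeline should not slow things down much. With the compiled pandas reader it may be a different story. – hpaulj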
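The idea in hpaulj's comment could be sketched roughly as follows. This is only an illustration: the helper name `numeric_lines`, the in-memory sample, and the rule "keep a line only if every field parses as a float" are assumptions, not something given in the question.

```python
import io
import numpy as np

def numeric_lines(lines):
    """Yield only the lines whose whitespace-separated fields all parse as floats.

    Hypothetical filter: header text, column names and blank lines are skipped.
    A header line made up entirely of numbers would slip through.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        try:
            for field in fields:
                float(field)
        except ValueError:
            continue  # some field is not numeric, so treat the line as header text
        yield line

# Small stand-in for the real file; with a file on disk, pass the open file
# object to numeric_lines() instead of the StringIO.
sample = (
    "***** Information about Data ***********\n"
    "Information about data\n"
    "\n"
    "    Col1  Col2\n"
    "    1.0  1.0\n"
    "    1.0  1.0\n"
    "***** Information about Data ***********\n"
    "Information about data\n"
    "\n"
    "    Col1  Col2\n"
    "    1.0  1.0\n"
    "    1.0  1.0\n"
)

# genfromtxt accepts a generator of strings, so the filter sits directly in front of it.
data = np.genfromtxt(numeric_lines(io.StringIO(sample)))
```

This still touches every line in Python, so it is a filter in front of the parser rather than a way of avoiding the line-by-line pass.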

Assume your file is named test.txt

Read the entire file in as a single string

split it on '\n*'

 new line 
      \ 
    1.0  1.0 
***** Information about Data *********** 
\ 
    followed by an asterisk

rsplit each resulting piece on '\n\n' and take the last part

 first new line 
        \ 
Information about Data 

\ 
    second new line 
    Col1  Col2 
    1.0  1.0 
    1.0  1.0 
    1.0  1.0

Parse each remaining piece with read_csv
Combine the pieces with pd.concat

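from io import StringIO
import pandas as pd

def rtxt(txt):
    # Parse one whitespace-delimited table held in a string.
    # sep=r'\s+' replaces delim_whitespace=True, which newer pandas deprecates.
    return pd.read_csv(StringIO(txt), sep=r'\s+')

fname = 'test.txt'

# Read the whole file into memory and close it again.
with open(fname) as f:
    text = f.read()

# Each piece begins at a header; the table is whatever follows its last blank line.
pd.concat(
    [rtxt(st.rsplit('\n\n', 1)[-1])
     for st in text.split('\n*')],
    ignore_index=True
)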

    Col1  Col2
0    1.0   1.0
1    1.0   1.0
2    1.0   1.0
3    1.0   1.0
4    1.0   1.0
5    1.0   1.0
6    1.0   1.0
7    1.0   1.0
8    1.0   1.0
9    1.0   1.0
10   1.0   1.0
11   1.0   1.0
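To try the steps without a test.txt on disk, the same sequence can be run on an in-memory string and the intermediate results inspected. This is a minimal sketch: the `section` text is a made-up stand-in for the real file, and the names `chunks`, `tables` and `frames` are only illustrative.

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for the file contents: two header-plus-table sections.
section = (
    "***** Information about Data ***********\n"
    "Information about data\n"
    "\n"
    "Information about Data\n"
    "\n"
    "    Col1  Col2\n"
    "    1.0  1.0\n"
    "    1.0  1.0\n"
)
text = section + section

# Step 1: split wherever a newline is followed by an asterisk (the start of a header).
chunks = text.split('\n*')

# Step 2: in each chunk, keep only what follows the last blank line (the table itself).
tables = [chunk.rsplit('\n\n', 1)[-1] for chunk in chunks]

# Step 3: parse each table and stack the results into one frame.
frames = [pd.read_csv(StringIO(t), sep=r'\s+') for t in tables]
df = pd.concat(frames, ignore_index=True)
```

The approach relies on two properties of the layout shown in the question: every header starts with an asterisk at the beginning of a line, and a blank line sits directly above the column names. If either does not hold in the real file, the split points would need adjusting. It also reads the whole file into memory at once.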

Source

2017-05-09 02:00:23 piRSquared
