2017-01-06 45 views
3

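I am trying to read a .txt file containing two data series into pandas. So far I have tried variations taken from other posts on Stack Overflow, but the file is only ever read in as a single series. How do I read a .txt file in pandas?

The data I am using is available here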

icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", delim_whitespace=True, header=None) 
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, sep="/t") 
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, delimiter=r"\s+") 

I know I am doing something really obviously wrong, but I can't see it.

Answers

4

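Try using sep=r'\s{2,}' as the separator. It means that a run of two or more whitespace characters (spaces or tabs) is treated as the delimiter: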

In [28]: df = pd.read_csv(url, sep=r'\s{2,}', engine='python', header=None, names=['ID','Name']) 

In [29]: df 
Out[29]: 
     ID            Name 
0  A000 Cholera due to Vibrio cholerae 01, biovar cholerae 
1  A001  Cholera due to Vibrio cholerae 01, biovar eltor 
2  A009        Cholera, unspecified 
3 A0100       Typhoid fever, unspecified 
4 A0101         Typhoid meningitis 
5 A0102    Typhoid fever with heart involvement 
6 A0103         Typhoid pneumonia 
7 A0104         Typhoid arthritis 
8 A0105        Typhoid osteomyelitis 
9 A0109    Typhoid fever with other complications 
10 A011         Paratyphoid fever A 
11 A012         Paratyphoid fever B 
12 A013         Paratyphoid fever C 
13 A014      Paratyphoid fever, unspecified 
14 A020        Salmonella enteritis 
15 A021         Salmonella sepsis 
16 A0220   Localized salmonella infection, unspecified 
17 A0221        Salmonella meningitis 
18 A0222        Salmonella pneumonia 
19 A0223        Salmonella arthritis 
20 A0224       Salmonella osteomyelitis 
21 A0225       Salmonella pyelonephritis 
22 A0229   Salmonella with other localized infection 
23 A028    Other specified salmonella infections 
24 A029     Salmonella infection, unspecified 
..  ...             ... 
671 B188      Other chronic viral hepatitis 
672 B189    Chronic viral hepatitis, unspecified 
673 B190  Unspecified viral hepatitis with hepatic coma 
674 B1910 Unspecified viral hepatitis B without hepatic coma 
675 B1911  Unspecified viral hepatitis B with hepatic coma 
676 B1920 Unspecified viral hepatitis C without hepatic coma 
677 B1921  Unspecified viral hepatitis C with hepatic coma 
678 B199 Unspecified viral hepatitis without hepatic coma 
679 B20   Human immunodeficiency virus [HIV] disease 
680 B250       Cytomegaloviral pneumonitis 
681 B251       Cytomegaloviral hepatitis 
682 B252      Cytomegaloviral pancreatitis 
683 B258      Other cytomegaloviral diseases 
684 B259    Cytomegaloviral disease, unspecified 
685 B260          Mumps orchitis 
686 B261         Mumps meningitis 
687 B262         Mumps encephalitis 
688 B263         Mumps pancreatitis 
689 B2681          Mumps hepatitis 
690 B2682         Mumps myocarditis 
691 B2683          Mumps nephritis 
692 B2684        Mumps polyneuropathy 
693 B2685          Mumps arthritis 
694 B2689       Other mumps complications 
695 B269       Mumps without complication 

[696 rows x 2 columns] 
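To illustrate what the pattern does, here is a minimal sketch using only the standard `re` module. The sample line is an assumption about how one row of the file is laid out (a code, several spaces, then the description). As far as I know, pandas' default C parser does not support regex separators other than `\s+`, which is why `engine='python'` is passed explicitly above.

```python
import re

# Hypothetical sample row: code, padding spaces, then the description
line = "A000    Cholera due to Vibrio cholerae 01, biovar cholerae"

# \s+ splits on every run of whitespace, so the description is broken into words
every_space = re.split(r"\s+", line)

# \s{2,} splits only on runs of two or more whitespace characters,
# so the single spaces inside the description are left alone
two_or_more = re.split(r"\s{2,}", line)

print(len(every_space))
print(two_or_more)
```

Note that this only works as long as the code and the description are separated by at least two whitespace characters on every row.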

Alternatively, you can use the read_fwf() function.

+1

Could you please explain the `sep=r'\s{2,}', engine='python'` parameters? I have never used `r'\s{2,}'` as a separator, or declared an engine with `engine='python'`.

+2

I couldn't get the first option to work, but read_fwf() with header and names is working now. read_fwf() is completely new to me. I need to read up on it.

+1

Thanks for explaining sep=r'\s{2,}'. Really useful!

3

Your file is a fixed-width file, so you can use read_fwf. Here the default parameters are able to infer the column widths:

In [106]: 
df = pd.read_fwf(r'icd10cm_codes_2017.txt', header=None) 
df.head() 

Out[106]: 
     0             1 
0 A000 Cholera due to Vibrio cholerae 01, biovar chol... 
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor 
2 A009        Cholera, unspecified 
3 A0100       Typhoid fever, unspecified 
4 A0101         Typhoid meningitis 
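If the inferred widths ever come out wrong (inference only looks at a sample of rows), the column boundaries can be passed explicitly via `colspecs`. The following is a minimal sketch on an in-memory sample rather than the real file. The 8-character width of the code field and the column names `code` and `description` are assumptions for illustration, so check them against your actual data.

```python
import io

import pandas as pd

# Hypothetical sample rows: the code field is assumed to be padded to 8 characters
sample = (
    "A000    Cholera due to Vibrio cholerae 01, biovar cholerae\n"
    "A001    Cholera due to Vibrio cholerae 01, biovar eltor\n"
    "A009    Cholera, unspecified\n"
    "A0100   Typhoid fever, unspecified\n"
)

# colspecs are half-open intervals: characters 0-7 hold the code,
# everything from character 8 to the end of the line is the description
df = pd.read_fwf(
    io.StringIO(sample),
    colspecs=[(0, 8), (8, None)],
    header=None,
    names=["code", "description"],
)

print(df)
```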

If you already know the column names you want, you can pass them to read_fwf:

In [107]: 
df = pd.read_fwf(r'C:\Users\alanwo\Downloads\icd10cm_codes_2017.txt', header=None, names=['col1', 'col2']) 
df.head() 

Out[107]: 
    col1            col2 
0 A000 Cholera due to Vibrio cholerae 01, biovar chol... 
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor 
2 A009        Cholera, unspecified 
3 A0100       Typhoid fever, unspecified 
4 A0101         Typhoid meningitis 

Or simply overwrite the columns attribute after reading:

df.columns = ['col1', 'col2'] 

As for why your attempts failed: read_table uses a tab as its default separator, but the file contains only spaces and is fixed-width.
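A related detail in the second attempt: `"/t"` is a literal slash followed by the letter t, not a tab, which is written `"\t"`. The sketch below uses a hypothetical sample row (assumed to match the file's layout) to show that neither separator occurs in such a line, so splitting on a tab leaves it as a single field.

```python
# Hypothetical sample row: padded with spaces, no tab characters
line = "A000    Cholera due to Vibrio cholerae 01, biovar cholerae"

# "\t" is one character (a tab); "/t" is two ordinary characters
tab_length = len("\t")
slash_t_length = len("/t")

# Neither separator appears in the line
has_tab = "\t" in line
has_slash_t = "/t" in line

# Splitting on a tab therefore returns the whole line as one field
fields_by_tab = line.split("\t")

print(has_tab, has_slash_t, len(fields_by_tab))
```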