2017-06-16 119 views
1

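Modify a pandas DataFrame by row name

First of all, I don't think the question title explains the problem very well. Feel free to change it or suggest a better one.

I am reading a CSV file in this format: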

"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup" 
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base sequence quality","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per tile sequence quality","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per sequence quality scores","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base sequence content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per sequence GC content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base N content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Sequence Length Distribution","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Sequence Duplication Levels","WARN","15529112","62",47,41.66 
"ERR435952_cleaned_1","Overrepresented sequences","WARN","15529112","62",47,41.66 
"ERR435952_cleaned_1","Adapter Content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Kmer Content","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base sequence quality","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per tile sequence quality","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per sequence quality scores","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base sequence content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per sequence GC content","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base N content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Sequence Length Distribution","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Sequence Duplication Levels","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Overrepresented sequences","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Adapter Content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Kmer Content","FAIL","15529112","62",48,42.44 

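I want to convert it into a table with one row per sample and one column per module, so that I can build a simple heatmap from the PASS/FAIL/WARN values (including the total number of reads, tot.seq).

I know this could be done by counting rows (the values for each module/feature repeat at a fixed interval), but that is not very clean and I am not sure it would work for large datasets. Is there a way to do it by name rather than by interval (i.e. i, i + n, and so on)?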

Answer

2

Use set_index + unstack. To turn the index levels back into columns, add reset_index; rename_axis removes the leftover columns name (module):

# Long to wide: one row per (sample, tot.seq), one column per module
df = (df.set_index(['sample', 'tot.seq', 'module'])['status']
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
                sample   tot.seq  Adapter Content  Basic Statistics \
0  ERR435952_cleaned_1  15529112             PASS              PASS
1  ERR435952_cleaned_2  15529112             PASS              PASS

   Kmer Content  Overrepresented sequences  Per base N content \
0          FAIL                       WARN                PASS
1          FAIL                       WARN                PASS

   Per base sequence content  Per base sequence quality  Per sequence GC content \
0                       PASS                       FAIL                     PASS
1                       PASS                       PASS                     WARN

   Per sequence quality scores  Per tile sequence quality \
0                         PASS                       FAIL
1                         PASS                       WARN

   Sequence Duplication Levels  Sequence Length Distribution
0                         WARN                          PASS
1                         WARN                          PASS
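For anyone who wants to try this without the original file, here is a minimal self-contained sketch. The six-row `long_df` is a made-up miniature of the CSV above, and the 0/1/2 coding in `codes` is only an assumed scale for feeding a heatmap; neither comes from the question.

```python
import pandas as pd

# Hypothetical miniature of the CSV above (two samples, three modules).
long_df = pd.DataFrame({
    "sample": ["ERR435952_cleaned_1"] * 3 + ["ERR435952_cleaned_2"] * 3,
    "module": ["Basic Statistics", "Per base sequence quality", "Kmer Content"] * 2,
    "status": ["PASS", "FAIL", "FAIL", "PASS", "PASS", "FAIL"],
    "tot.seq": ["15529112"] * 6,
})

# Long to wide: one row per (sample, tot.seq), one column per module.
wide = (long_df.set_index(["sample", "tot.seq", "module"])["status"]
               .unstack()
               .reset_index()
               .rename_axis(None, axis=1))

# Assumed numeric coding for a heatmap; the 0/1/2 scale is an arbitrary choice.
codes = {"PASS": 0, "WARN": 1, "FAIL": 2}
heat = (wide.set_index("sample")
            .drop(columns="tot.seq")
            .apply(lambda col: col.map(codes)))
print(heat)
```

Because the reshape is keyed on the values of `sample` and `module`, it does not depend on the row order or on every sample having the same number of rows.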

But if you get:

ValueError: Index contains duplicate entries, cannot reshape

then the data contains duplicates and needs to be aggregated:

print(df)
                sample                       module  status   tot.seq \
0  ERR435952_cleaned_1             Basic Statistics    PASS  15529112
1  ERR435952_cleaned_1    Per base sequence quality    FAIL  15529112
2  ERR435952_cleaned_1    Per base sequence quality    FAIL  15529112
3  ERR435952_cleaned_1  Per sequence quality scores    PASS  15529112

   seq.length  pct.gc  pct.dup
0          62      47    41.66
1          62      47    41.66
2          62      47    41.66
3          62      47    41.66

# Option 1: pivot_table, joining duplicate statuses into one string
df = (df.pivot_table(index=['sample', 'tot.seq'], columns='module',
                     values='status', aggfunc=', '.join)
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
                sample   tot.seq  Basic Statistics  Per base sequence quality \
0  ERR435952_cleaned_1  15529112              PASS                 FAIL, FAIL

   Per sequence quality scores
0                         PASS

# Option 2 (instead of option 1, on the original long-format df): groupby + join + unstack
df = (df.groupby(['sample', 'tot.seq', 'module'])['status']
        .apply(', '.join)
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
                sample   tot.seq  Basic Statistics  Per base sequence quality \
0  ERR435952_cleaned_1  15529112              PASS                 FAIL, FAIL

   Per sequence quality scores
0                         PASS
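If the duplicated rows are exact repeats, joining them into "FAIL, FAIL" may not be what a heatmap needs. One possible alternative, sketched here on made-up data that mirrors the example above, is to inspect the offending rows first and then drop the identical copies before reshaping. The names `long_df`, `keys` and `dupes` are illustrative only.

```python
import pandas as pd

# Hypothetical data with one exact duplicate row, mirroring the example above.
long_df = pd.DataFrame({
    "sample": ["ERR435952_cleaned_1"] * 4,
    "module": ["Basic Statistics", "Per base sequence quality",
               "Per base sequence quality", "Per sequence quality scores"],
    "status": ["PASS", "FAIL", "FAIL", "PASS"],
    "tot.seq": ["15529112"] * 4,
})

keys = ["sample", "tot.seq", "module"]

# Rows whose key combination occurs more than once; these trigger the ValueError.
dupes = long_df[long_df.duplicated(subset=keys, keep=False)]
print(dupes)

# If the repeats are identical, keep one copy and reshape as before.
wide = (long_df.drop_duplicates(subset=keys + ["status"])
               .set_index(keys)["status"]
               .unstack()
               .reset_index()
               .rename_axis(None, axis=1))
print(wide)
```

This only helps when the repeated rows agree. If the same sample and module carry two different statuses, both rows survive drop_duplicates, unstack still raises, and an aggregation such as the join above is needed.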

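Thanks! I forgot to include the read count (tot.seq) in my original question. Since it is the same value for each sample (repeated for every module), how can I add it only once? – Siddharth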