2017-10-17 82 views

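Easy way to create an xarray Dataset from metadata + values?

I'm working with single-cell RNA-sequencing data, which lately is 10k-100k samples (cells) x 20k features (genes) of sparse values. It also comes with a lot of metadata, e.g. the tissue of origin ("brain" vs. "liver"). The metadata is ~10-100 columns, which I store as a pandas.DataFrame. Right now I build an xarray.Dataset by turning the metadata into a dict and adding its columns as coordinates. This seems clunky and error-prone, since I copy the snippet between notebooks. Is there an easier way?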

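import xarray as xr

# cell_metadata: DataFrame with one row per cell
# counts: DataFrame of cells (index) x genes (columns)
# Turn each metadata column into a coordinate along the 'cell' dimension
cell_metadata_dict = cell_metadata.to_dict(orient='list')
coords = {k: ('cell', v) for k, v in cell_metadata_dict.items()}
coords.update(dict(gene=counts.columns, cell=counts.index))

ds = xr.Dataset(
    {'counts': (['cell', 'gene'], counts)},
    coords=coords)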

Edit:

To show some data, here is cell_metadata.head().to_csv():

cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex 
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 

counts.iloc[:5, :20].to_csv()

cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik 
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37 
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65 

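Re: pandas.DataFrame.to_xarray() - this is incredibly slow, and it seems odd to me to encode so much numeric and categorical data as a 100-level MultiIndex. On top of that, every time I've tried to use a MultiIndex it has led me to say "oh, this is why I don't use MultiIndex" and revert to keeping separate metadata and counts DataFrames.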

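Can you provide a sample of your DataFrame (`df.head()`) and a detailed description of your target Dataset or DataArray? Have you tried using pandas' to_xarray() method? – jhamman

To add to Joe's comment, definitely take a look at the [working with pandas](http://xarray.pydata.org/en/stable/pandas.html) section of the xarray docs and see if that helps. If you can set up an appropriate `pandas.MultiIndex` for your data, converting to xarray is *usually* easy. – shoyer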

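Answer

Xarray uses the pandas index/column labels as default metadata. When all variables share the same dimensions, you can do the conversion in a single function call. If different variables have different dimensions, you need to convert them from pandas separately and then put them together on the xarray side. For example: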

import io

import pandas as pd
import xarray

# read your data
cell_metadata = pd.read_csv(io.StringIO(u"""\
cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex 
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F""")) 
counts = pd.read_csv(io.StringIO(u"""\
cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik 
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37 
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65""")) 

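# build the output: wrap counts as a 2-D DataArray, then attach each
# metadata column as a coordinate along the 'cell' dimension
xarray_counts = xarray.DataArray(counts.set_index('cell'), dims=['cell', 'gene'])
xarray_counts.coords.update(cell_metadata.set_index('cell').to_xarray())
print(xarray_counts)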

This results in a nice, tidy xarray.DataArray for the counts:

<xarray.DataArray (cell: 5, gene: 20)> 
array([[308, 289, 81, 0, 4, 88, 52, 0, 0, 104, 65, 0, 1, 0, 
      9, 8, 12, 283, 12, 37], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [375, 325, 70, 0, 2, 72, 36, 13, 0, 60, 105, 0, 13, 0, 
      0, 29, 15, 264, 0, 65]]) 
Coordinates: 
    * cell       (cell) object 'A1-MAA100140-3_57_F-1-1' ... 
    * gene       (gene) object '0610005C13Rik' ... 
    Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717 
    Number of input reads   (cell) int64 502312 360285 431800 446705 918 
    EXP_ID      (cell) object '170928_A00111_0068_AH3YKKDMXX' ... 
    TAXON       (cell) object 'mus' 'mus' 'mus' 'mus' 'mus' 
    WELL_MAPPING     (cell) object 'MAA100140' 'MAA100140' ... 
    Lysis Plate Batch    (cell) float64 nan nan nan nan nan 
    dNTP.batch     (cell) float64 nan nan nan nan nan 
    oligodT.order.no    (cell) float64 nan nan nan nan nan 
    plate.type     (cell) object 'Biorad 96well' ... 
    preparation.site    (cell) object 'Stanford' 'Stanford' ... 
    date.prepared     (cell) float64 nan nan nan nan nan 
    date.sorted     (cell) int64 170720 170720 170720 170720 ... 
    tissue      (cell) object 'Liver' 'Liver' 'Liver' ... 
    subtissue      (cell) object 'Hepatocytes' 'Hepatocytes' ... 
    mouse.id      (cell) object '3_57_F' '3_57_F' '3_57_F' ... 
    FACS.selection    (cell) float64 nan nan nan nan nan 
    nozzle.size     (cell) float64 nan nan nan nan nan 
    FACS.instument    (cell) float64 nan nan nan nan nan 
    Experiment ID     (cell) float64 nan nan nan nan nan 
    Columns sorted    (cell) float64 nan nan nan nan nan 
    Double check     (cell) float64 nan nan nan nan nan 
    Plate       (cell) float64 nan nan nan nan nan 
    Location      (cell) float64 nan nan nan nan nan 
    Comments      (cell) float64 nan nan nan nan nan 
    mouse.age      (cell) int64 3 3 3 3 3 
    mouse.number     (cell) int64 57 57 57 57 57 
    mouse.sex      (cell) object 'F' 'F' 'F' 'F' 'F' 
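Since the question mentions copying the snippet between notebooks, one option is to wrap these steps in a small function. The following is only a sketch: `counts_with_metadata` is a hypothetical name, the toy values are made up, and it assumes both frames are indexed by cell ID. It passes the metadata columns to the constructor rather than calling `coords.update`, which is a swapped-in variant of the approach above, not the same call.

```python
import pandas as pd
import xarray as xr


def counts_with_metadata(counts, cell_metadata, dims=("cell", "gene")):
    """Hypothetical helper: cells x genes DataFrame -> DataArray with per-cell coords."""
    row_dim, col_dim = dims
    # Reorder metadata rows to match the counts index so values line up;
    # cells missing from the metadata end up with NaN.
    aligned = cell_metadata.reindex(counts.index)
    # One coordinate per metadata column, all along the row dimension.
    coords = {col: (row_dim, aligned[col].to_numpy()) for col in aligned.columns}
    coords[row_dim] = counts.index.to_numpy()
    coords[col_dim] = counts.columns.to_numpy()
    return xr.DataArray(counts.to_numpy(), dims=[row_dim, col_dim], coords=coords)


# Toy inputs (made-up values), both indexed by cell; the metadata rows are
# deliberately in a different order than the counts rows.
counts = pd.DataFrame(
    [[308, 289], [0, 0], [375, 325]],
    index=["A1", "A10", "A2"],
    columns=["0610005C13Rik", "0610007C21Rik"],
)
cell_metadata = pd.DataFrame(
    {"tissue": ["Brain", "Liver", "Liver"], "mouse.age": [3, 3, 3]},
    index=["A2", "A10", "A1"],
)

da = counts_with_metadata(counts, cell_metadata)
# Metadata coordinates can then drive selection, e.g. keep only liver cells.
liver = da.isel(cell=(da["tissue"] == "Liver").values)
```

The `reindex` step is a design choice: it guards against the two frames being sorted differently, at the cost of silently inserting NaN for cells that have no metadata row.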

If you want a Dataset instead, pass the DataArray object into the Dataset constructor, e.g.,

# shouldn't really need to use .data_vars here, that might be an xarray bug
>>> xarray.Dataset({'counts': xarray.DataArray(counts.set_index('cell'),
...                                            dims=['cell', 'gene'])},
...                coords=cell_metadata.set_index('cell').to_xarray().data_vars)
<xarray.Dataset>
Dimensions:      (cell: 5, gene: 20)
Coordinates: 
    * cell       (cell) object 'A1-MAA100140-3_57_F-1-1' ... 
    * gene       (gene) object '0610005C13Rik' ... 
    Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717 
    Number of input reads   (cell) int64 502312 360285 431800 446705 918 
    EXP_ID      (cell) object '170928_A00111_0068_AH3YKKDMXX' ... 
    TAXON       (cell) object 'mus' 'mus' 'mus' 'mus' 'mus' 
    WELL_MAPPING     (cell) object 'MAA100140' 'MAA100140' ... 
    Lysis Plate Batch    (cell) float64 nan nan nan nan nan 
    dNTP.batch     (cell) float64 nan nan nan nan nan 
    oligodT.order.no    (cell) float64 nan nan nan nan nan 
    plate.type     (cell) object 'Biorad 96well' ... 
    preparation.site    (cell) object 'Stanford' 'Stanford' ... 
    date.prepared     (cell) float64 nan nan nan nan nan 
    date.sorted     (cell) int64 170720 170720 170720 170720 ... 
    tissue      (cell) object 'Liver' 'Liver' 'Liver' ... 
    subtissue      (cell) object 'Hepatocytes' 'Hepatocytes' ... 
    mouse.id      (cell) object '3_57_F' '3_57_F' '3_57_F' ... 
    FACS.selection    (cell) float64 nan nan nan nan nan 
    nozzle.size     (cell) float64 nan nan nan nan nan 
    FACS.instument    (cell) float64 nan nan nan nan nan 
    Experiment ID     (cell) float64 nan nan nan nan nan 
    Columns sorted    (cell) float64 nan nan nan nan nan 
    Double check     (cell) float64 nan nan nan nan nan 
    Plate       (cell) float64 nan nan nan nan nan 
    Location      (cell) float64 nan nan nan nan nan 
    Comments      (cell) float64 nan nan nan nan nan 
    mouse.age      (cell) int64 3 3 3 3 3 
    mouse.number     (cell) int64 57 57 57 57 57 
    mouse.sex      (cell) object 'F' 'F' 'F' 'F' 'F' 
Data variables: 
    counts      (cell, gene) int64 308 289 81 0 4 88 52 0 ... 
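The "convert separately, then combine on the xarray side" idea can also be spelled out with `xarray.merge` followed by `Dataset.set_coords`, which avoids passing `.data_vars` to the constructor. This is a sketch on made-up toy frames, assuming both are indexed by an index named `cell` with identical labels. It has not been checked against data at the 10k-100k cell scale mentioned in the question.

```python
import pandas as pd
import xarray as xr

# Toy frames (made-up values), both indexed by 'cell'.
cells = pd.Index(["A1", "A10", "A2"], name="cell")
counts = pd.DataFrame(
    [[308, 289], [0, 0], [375, 325]],
    index=cells,
    columns=["0610005C13Rik", "0610007C21Rik"],
)
cell_metadata = pd.DataFrame(
    {"tissue": ["Liver", "Liver", "Liver"], "mouse.age": [3, 3, 3]},
    index=cells,
)

# 2-D counts: a single variable over (cell, gene).
counts_ds = xr.DataArray(counts, dims=["cell", "gene"]).to_dataset(name="counts")
# 1-D metadata: every column becomes a variable over (cell,).
meta_ds = cell_metadata.to_xarray()

# Combine on the xarray side, then promote the metadata variables to coordinates.
ds = xr.merge([counts_ds, meta_ds]).set_coords(list(meta_ds.data_vars))
```

After this, `ds.data_vars` holds only `counts`, and every metadata column is a non-index coordinate along `cell`.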