2017-04-12 31 views
0

謝謝你的幫助。我想使用下面的python代碼來讀取和處理來自affymetrix微陣列數據集的數據。我想闡明單核細胞中克隆氏病和潰瘍性結腸炎的疾病狀態中的差異基因表達。代碼運行完美,但是當我嘗試查看X的內容時,我在輸出處得到一個空數組(如:array([],dtype = float64)),這當然沒有用。這裏是一個鏈接到原始數據集:https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1615 我試圖很長時間來弄清楚爲什麼我有一個空的和無法處理的輸出,但無濟於事。下面是代碼:如何獲取微陣列數據?

import gzip 
import numpy as np 

""" 
Read in a SOFT format data file. The following values can be exported: 

GID : A list of gene identifiers of length d 
SID : A list of sample identifiers of length n 
STP : A list of sample descriptions of length d 
X : A dxn array of gene expression values 
""" 
#path to the data file 
fname = "../data/GDS1615_full.soft.gz" 

## Open the data file directly as a gzip file 
with gzip.open(fname) as fid: 
    SIF = {} 
    for line in fid: 
     if line.startswith(line, len("!dataset_table_begin")): 
      break 
     elif line.startswith(line, len("!subject_description")): 
      subset_description = line.split("=")[1].strip() 
     elif line.startswith(line, len("!subset_sample_id")): 
      subset_ids = [x.strip() for x in subset_ids] 
      for k in subset_ids: 
       SIF[k] = subset_description 
    ## Next line is the column headers (sample id's) 
    SID = next(fid).split("\t") 

    ## The column indices that contain gene expression data 
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")] 

    ## Restrict the column headers to those that we keep 
    SID = [SID[i] for i in I] 

    ## Get a list of sample labels 
    STP = [SIF[k] for k in SID] 

    ## Read the gene expression data as a list of lists, also get the gene 
    ## identifiers 
    GID,X = [],[] 
    for line in fid: 

     ## This is what signals the end of the gene expression data 
     ## section in the file 
     if line.startswith("!dataset_table_end"): 
      break 

     V = line.split("\t") 

     ## Extract the values that correspond to gene expression measures 
     ## and convert the strings to numbers 
     x = [float(V[i]) for i in I] 

     X.append(x) 
     GID.append(V[0] + ";" + V[1]) 
X = np.array(X) 

## The indices of samples for the ulcerative colitis group 
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"] 

## The indices of samples for the Crohn's disease group 
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"] 

在控制檯,我得到這樣的輸出: X 缺貨[94]:陣列([],D型細胞= float64)

X.shape 缺貨[95] :(0,)

再次感謝您的建議。

回答

0

這完美地工作:

import gzip 
    import numpy as np 


    """ 
    Read in a SOFT format data file. The following values can be exported: 

    GID : A list of gene identifiers of length d 
    SID : A list of sample identifiers of length n 
    STP : A list of sample desriptions of length d 
    X : A dxn array of gene expression values 
    """ 
    #path to the data file 
    fname = "../data/GDS1615_full.soft.gz" 

    ## Open the data file directly as a gzip file 
    with gzip.open(fname) as fid: 
     SIF = {} 
     for line in fid: 
      if line.startswith(b"!dataset_table_begin"): 
       break 
      elif line.startswith(b"!subset_description"): 

       subset_description = line.decode('utf8').split("=")[1].strip() 
      elif line.startswith(b"!subset_sample_id"): 
       subset_ids = line.decode('utf8').split("=")[1].split(",") 
       subset_ids = [x.strip() for x in subset_ids] 
       for k in subset_ids: 
        SIF[k] = subset_description 
     ## Next line is the column headers (sample id's) 
     SID = next(fid).split(b"\t") 
     ## The column indices that contain gene expression data 
     I = [i for i,x in enumerate(SID) if x.startswith(b"GSM")] 
     ## Restrict the column headers to those that we keep 
     SID = [SID[i] for i in I] 
     ## Get a list of sample labels 
     STP = [SIF[k.decode('utf8')] for k in SID] 
    ## Read the gene expression data as a list of lists, also get the gene 
    ## identifiers 
    GID,X = [],[] 
    for line in fid: 
     ## This is what signals the end of the gene expression data 
     ## section in the file 
     if line.startswith(b"!dataset_table_end"): 
      break 
     V = line.split(b"\t") 
     ## Extract the values that correspond to gene expression measures 
     ## and convert the strings to numbers 
     x = [float(V[i]) for i in I] 
     X.append(x) 
     GID.append(V[0].decode() + ";" + V[1].decode()) 

X = np.array(X) 
## The indices of samples for the ulcerative colitis group 
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"] 
## The indices of samples for the Crohn's disease group 
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"] 

結果:

X.shape 缺貨[4]:(22283,127)