0
謝謝你的幫助。我想使用下面的python代碼來讀取和處理來自affymetrix微陣列數據集的數據。我想闡明單核細胞中克隆氏病和潰瘍性結腸炎的疾病狀態中的差異基因表達。代碼運行完美,但是當我嘗試查看X的內容時,我在輸出處得到一個空數組(如:array([],dtype = float64)),這當然沒有用。這裏是一個鏈接到原始數據集:https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1615 我試圖很長時間來弄清楚爲什麼我有一個空的和無法處理的輸出,但無濟於事。下面是代碼:如何獲取微陣列數據?
import gzip
import numpy as np
"""
Read in a SOFT format data file. The following values can be exported:
GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length d
X : A dxn array of gene expression values
"""
#path to the data file
fname = "../data/GDS1615_full.soft.gz"
## Open the data file directly as a gzip file
with gzip.open(fname) as fid:
SIF = {}
for line in fid:
if line.startswith(line, len("!dataset_table_begin")):
break
elif line.startswith(line, len("!subject_description")):
subset_description = line.split("=")[1].strip()
elif line.startswith(line, len("!subset_sample_id")):
subset_ids = [x.strip() for x in subset_ids]
for k in subset_ids:
SIF[k] = subset_description
## Next line is the column headers (sample id's)
SID = next(fid).split("\t")
## The column indices that contain gene expression data
I = [i for i,x in enumerate(SID) if x.startswith("GSM")]
## Restrict the column headers to those that we keep
SID = [SID[i] for i in I]
## Get a list of sample labels
STP = [SIF[k] for k in SID]
## Read the gene expression data as a list of lists, also get the gene
## identifiers
GID,X = [],[]
for line in fid:
## This is what signals the end of the gene expression data
## section in the file
if line.startswith("!dataset_table_end"):
break
V = line.split("\t")
## Extract the values that correspond to gene expression measures
## and convert the strings to numbers
x = [float(V[i]) for i in I]
X.append(x)
GID.append(V[0] + ";" + V[1])
X = np.array(X)
## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]
## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]
在控制檯,我得到這樣的輸出: X 缺貨[94]:陣列([],D型細胞= float64)
X.shape 缺貨[95] :(0,)
再次感謝您的建議。