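Import from a text file and create a cell array in MATLAB
I have a text file that contains gene information in the form of is_a and part_of relations between GO terms.
The file has one block per GO term (a GO term is a node identified by a code such as GO:0030436). Each block contains the GO term ID (the first line of the block), its is_a terms, if any (listed between "isa:" and "end of isa"), and its part_of terms, if any (listed between "partof:" and "end of partof"). A small sample of the file: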
GO:0030436
isa:
GO:0034297
GO:0043936
GO:0048315
end of isa
partof:
GO:0042243
end of partof
genes:
end of genes
GO:0034297
isa:
end of isa
partof:
end of partof
genes:
end of genes
GO:0043936
isa:
GO:0001410
GO:0034300
GO:0034301
GO:0034302
GO:0034303
GO:0034304
end of isa
partof:
end of partof
genes:
end of genes
I need to read this text file, extract these three pieces of data, and build a cell array with 3 columns, like this:
map=
ID GoTerms is_a partof
GO:0030436 GO:0034297 GO:0042243
GO:0030436 GO:0043936 0
GO:0030436 GO:0048315 0
GO:0034297 0 0
GO:0043936 GO:0001410 0
GO:0043936 GO:0034300 0
GO:0043936 GO:0034301 0
GO:0043936 GO:0034302 0
GO:0043936 GO:0034303 0
GO:0043936 GO:0034304 0
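Note that if a GO term has more than one is_a or part_of term, the GO term ID should be repeated on each row so that the cell array stays consistent and well organized.
Any ideas on how to write this code?
I tried writing some code, but it does not work because I don't know how to handle more than one is_a and part_of term: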
s = {};
fid = fopen('Opt.pad'); % read from the text file
tline = fgetl(fid);
while ischar(tline)
    s = [s; tline];
    tline = fgetl(fid);
end
fclose(fid);
% find the start position of every GO term marker in s
terms = [find(~cellfun('isempty', regexp(s, '^GO:\w*'))); numel(s)+1];
% for every term section, run the previously implemented regexps
% and save the results into a map - a cell array with 3 columns
map = cell(0,3);
for term = 1:numel(terms)-1
    % extract single term data
    s_term = s(terms(term):terms(term+1)-1);
    % match regexps
    % To generate the GO_Terms vector from the text file
    tok = regexp(s_term, '^(GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    GO_Terms = cellfun(@(x) x{1}{1}, tok(idx), 'UniformOutput', false);
    % To generate the is_a relations vector from the text file
    tok = regexp(s_term, '^isa: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    is_a_relations = cellfun(@(x) x{1}{1}, tok(idx), 'UniformOutput', false);
    % To generate the part_of relations vector from the text file
    tok = regexp(s_term, '^partof: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    part_of_relations = cellfun(@(x) x{1}{1}, tok(idx), 'UniformOutput', false);
    % map. note the end+1 - here we create a new map row. Only once!
    map{end+1,1} = GO_Terms;
    map{end, 2} = is_a_relations;
    map{end, 3} = part_of_relations;
end
map(cellfun(@isempty, map)) = {0};
Edit: modified the structure of the output cell array.
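One possible direction, offered only as a rough sketch: instead of regexps over whole sections, walk the lines once and track which block ("isa:", "partof:", "genes:") is currently open. A GO line seen outside any block starts a new term. When a term is finished, it is expanded into max(number of is_a, number of part_of, 1) rows, padded with 0. The names parseGoLines and termRows are hypothetical helpers, not built-in MATLAB functions, and the sketch assumes the file follows the sample layout above exactly.

```matlab
function map = parseGoLines(lines)
% parseGoLines  Hypothetical helper: build an N-by-3 cell array
% {ID, is_a, partof} from a cell array of text lines.
% Assumed usage: lines = regexp(fileread('Opt.pad'), '\r?\n', 'split');
    map = cell(0, 3);
    id = '';
    isaList = {};
    partofList = {};
    section = '';                      % name of the block currently open
    for k = 1:numel(lines)
        txt = strtrim(lines{k});
        if strcmp(txt, 'isa:')
            section = 'isa';
        elseif strcmp(txt, 'partof:')
            section = 'partof';
        elseif strcmp(txt, 'genes:')
            section = 'genes';
        elseif strncmp(txt, 'end of', 6)
            section = '';              % block closed
        elseif strncmp(txt, 'GO:', 3)
            if strcmp(section, 'isa')
                isaList{end+1} = txt;
            elseif strcmp(section, 'partof')
                partofList{end+1} = txt;
            elseif isempty(section)
                % A GO line outside any block starts a new term:
                % store the previous term first, then reset.
                map = [map; termRows(id, isaList, partofList)];
                id = txt;
                isaList = {};
                partofList = {};
            end
        end
    end
    % Store the last term in the file.
    map = [map; termRows(id, isaList, partofList)];
end

function rows = termRows(id, isaList, partofList)
% Expand one term into rows, repeating the ID and padding with 0.
    if isempty(id)
        rows = cell(0, 3);
        return;
    end
    n = max([numel(isaList), numel(partofList), 1]);
    rows = repmat({0}, n, 3);
    rows(:, 1) = {id};
    if ~isempty(isaList)
        rows(1:numel(isaList), 2) = isaList(:);
    end
    if ~isempty(partofList)
        rows(1:numel(partofList), 3) = partofList(:);
    end
end
```

Anything listed inside the genes block is ignored here, since the desired output has no gene column. If gene entries are needed, they would need a fourth column and their own list.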