2012-12-16 44 views

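Import from a text file and create a cell array in MATLAB

I have a text file that contains gene information, such as the is_a and part_of relationships between genes.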

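The text file contains one paragraph per GO term (a GO term is a node identified by a code number, e.g. GO:0030436). Each paragraph contains the GO term ID (the first line of the paragraph), the is_a terms, if any (the list starts with `isa:` and is terminated by `end of isa`), and the part_of GO terms, if any (the list starts with `partof:` and is terminated by `end of partof`). A small sample from this text file: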

GO:0030436 
isa: 
GO:0034297 
GO:0043936 
GO:0048315 
end of isa 
partof: 
GO:0042243 
end of partof 
genes: 
end of genes 
GO:0034297 
isa: 
end of isa 
partof: 
end of partof 
genes: 
end of genes 
GO:0043936 
isa: 
GO:0001410 
GO:0034300 
GO:0034301 
GO:0034302 
GO:0034303 
GO:0034304 
end of isa 
partof: 
end of partof 
genes: 
end of genes 

I need to read this text file, extract these three pieces of data, and build a cell matrix with 3 columns, as follows:

map= 

ID GoTerms    is_a   partof 
GO:0030436    GO:0034297  GO:0042243 
GO:0030436    GO:0043936    0 
GO:0030436    GO:0048315    0 
GO:0034297     0     0 
GO:0043936    GO:0001410    0 
GO:0043936    GO:0034300    0 
GO:0043936    GO:0034301    0 
GO:0043936    GO:0034302    0 
GO:0043936    GO:0034303    0 
GO:0043936    GO:0034304    0 

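Note that if a GO term has more than one is_a or part_of term, the GO term ID should be repeated so that the cell matrix stays consistent and well organized.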

Any ideas on how to write this code?

I tried to write the code, but it does not work, because I do not know how to handle more than one isa and partof term:

s = {};
fid = fopen('Opt.pad'); % read from the text file
tline = fgetl(fid);
while ischar(tline)
    s = [s; tline];
    tline = fgetl(fid);
end
% find start and end positions of every [Term] marker in s
terms = [find(~cellfun('isempty', regexp(s, '\GO:\w*'))); numel(s)+1];
% for every [Term] section, run the previously implemented regexps
% and save the results into a map - a cell array with 3 columns
map = cell(0,3);
for term = 1:numel(terms)-1
    % extract single [Term] data
    s_term = s(terms(term):terms(term+1)-1);
    % match regexps
    % To generate the GO_Terms vector from the text file
    tok = regexp(s_term, '^(GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    GO_Terms = cellfun(@(x)x{1}, (tok(idx)));
    % To generate the is_a relations vector from the text file
    tok = regexp(s_term, '^isa: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    is_a_relations = cellfun(@(x)x{1}, (tok(idx)));
    % To generate the part_of relations vector from the text file
    tok = regexp(s_term, '^partof: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    part_of_relations = cellfun(@(x)x{1}, (tok(idx)));
    % map. note the end+1 - here we create a new map row. Only once!
    map{end+1,1} = GO_Terms;
    map{end, 2} = is_a_relations;
    map{end, 3} = part_of_relations;
end
map(cellfun(@isempty, map)) = {0};

Answers


A short and simple solution (although perhaps not the fastest):

% # Parse text file 
C = textread('Opt.pad', '%s', 'delimiter', ''); 

% # Obtain indices for isa elements 
idx = reshape(find(~cellfun(@isempty, strfind(C, 'isa')))', 2, []); 
isa = arrayfun(@(x, y)x + 1:y - 1, idx(1, :), idx(2, :), 'Uniform', false); 

% # Obtain indices for partof elements 
idx = reshape(find(~cellfun(@isempty, strfind(C, 'partof')))', 2, []); 
partof = arrayfun(@(x, y)x + 1:y - 1, idx(1, :), idx(2, :), 'Uniform', false); 

% # Obtain indices of GO term elements and IDs 
go = find(cellfun(@(s)any(strfind(s, 'GO:')), C)); 
id = go(~ismember(go, [isa{:}, partof{:}])); 

% # Construct a new cell array 
N = cellfun(@(x, y)max([numel(x), numel(y), 1]), isa, partof); 
k = cumsum([1, N(1:end - 1)]); 
X = cell(sum(N), 3); % # Preallocate memory! 
repcell = @(x, n)arrayfun(@(y)x, 1:n, 'Uniform', false); 
for ii = 1:numel(id) 
    idx = k(ii):k(ii) + N(ii) - 1; 
    X(idx, 1) = repcell(C{id(ii)}, N(ii)); 
    X(idx, 2) = [C{isa{ii}}, repcell('0', N(ii) - numel(isa{ii}))]; 
    X(idx, 3) = [C{partof{ii}}, repcell('0', N(ii) - numel(partof{ii}))]; 
end 

This should produce the following output:

X = 

    'GO:0030436' 'GO:0034297' 'GO:0042243' 
    'GO:0030436' 'GO:0043936' '0'   
    'GO:0030436' 'GO:0048315' '0'   
    'GO:0034297' '0'    '0'   
    'GO:0043936' 'GO:0001410' '0'   
    'GO:0043936' 'GO:0034300' '0'   
    'GO:0043936' 'GO:0034301' '0'   
    'GO:0043936' 'GO:0034302' '0'   
    'GO:0043936' 'GO:0034303' '0'   
    'GO:0043936' 'GO:0034304' '0' 
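
For readers who prefer an explicit loop, the same table can also be built with a small line-by-line state machine. The following is only a sketch, not the method used in the code above: `goLinesToMap` and `termRows` are hypothetical names, and the code assumes the file layout shown in the question (sections opened by `isa:`, `partof:` or `genes:` and closed by an `end of ...` line) and that the input is a cell array with one string per line, for instance the `s` built by the `fgetl` loop in the question.

```matlab
function map = goLinesToMap(txtLines)
% GOLINESTOMAP  Hypothetical helper, not part of the original posts.
%   txtLines : cell array of strings, one per line of the text file
%   map      : N-by-3 cell array {GO id, is_a, partof}, padded with '0'
map = cell(0, 3);
id = '';                 % GO id of the term currently being read
isaList = {};            % is_a entries collected for that term
partofList = {};         % partof entries collected for that term
section = '';            % '', 'isa', 'partof' or 'genes'
for k = 1:numel(txtLines)
    txt = strtrim(txtLines{k});
    if isempty(txt)
        continue
    elseif any(strcmp(txt, {'isa:', 'partof:', 'genes:'}))
        section = txt(1:end-1);          % entering a section
    elseif strncmp(txt, 'end of', 6)
        section = '';                    % leaving the current section
    elseif strncmp(txt, 'GO:', 3)
        if strcmp(section, 'isa')
            isaList{end+1} = txt;
        elseif strcmp(section, 'partof')
            partofList{end+1} = txt;
        elseif isempty(section)
            % A GO id outside any section starts a new term:
            % write out the previous term first, then reset.
            map = [map; termRows(id, isaList, partofList)];
            id = txt;
            isaList = {};
            partofList = {};
        end
    end
end
map = [map; termRows(id, isaList, partofList)];   % flush the last term
end

function rows = termRows(id, isaList, partofList)
% Build the padded rows for one term; returns 0-by-3 if no id was read yet.
if isempty(id)
    rows = cell(0, 3);
    return
end
n = max([numel(isaList), numel(partofList), 1]);
col2 = [isaList(:); repmat({'0'}, n - numel(isaList), 1)];
col3 = [partofList(:); repmat({'0'}, n - numel(partofList), 1)];
rows = [repmat({id}, n, 1), col2, col3];
end
```

Saved as `goLinesToMap.m`, it could be called as `map = goLinesToMap(s);` after the read loop. Each term produces `max(number of is_a, number of partof, 1)` rows, so a term with no relations still appears once with `'0'` in both columns.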

Edit: modified the structure of the output cell array.
