2011-03-15 54 views
0

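Finding similar lines in text - phasing - dynamic comparison

I have a small bioinformatics problem that I think should be easy to solve. It is related to "genotype phasing", but I don't know how to approach it. In the excerpt below, the first column is an identifier and the subsequent columns are binary genotypes labelled "a" or "b". "-" denotes a missing value.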

Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b 
Si_gnF.scaffold10533.76297bp_tag414484 a b b a a b a b b b a b a a b a b b a - b b a a 
Si_gnF.scaffold10533.98416bp_tag414526 a b b a a b a b b b a b a a b a b b a a b b a a 
Si_gnF.scaffold10534.48805bp_tag414546 b a a b a b a b b b b b b a a a a b a b b b b a 
Si_gnF.scaffold10535.1091787bp_tag414684 a a a b b a a a b a b a a a a b b b a a b b a a 
Si_gnF.scaffold10535.1151107bp_tag414765 b b b a a b b b a b a - b b b a a a b b a a b b 
Si_gnF.scaffold10535.1220879bp_tag414877 a a a b b a a a b a b a a a a b b b a a b b a a 
Si_gnF.scaffold10535.1304464bp_tag414988 b b b a a b b b a b a b b b b a a a b b a a b b 
Si_gnF.scaffold10535.1347462bp_tag415047 b b b a a b b b a b a b b b b a a a b b a a b b 
Si_gnF.scaffold10535.1379804bp_tag415090 b b b a a b b b a b a b b b b a a a b b a a b b 
Si_gnF.scaffold10535.1540335bp_tag415345 a a a b b a a a b a b a a a a b b b a a b b a a 
Si_gnF.scaffold10535.1585442bp_tag415410 a a a b b a a a b a b a a a a b b b a a b b a a 
Si_gnF.scaffold10535.1609908bp_tag415431 b b b a a b a b a b a b b b b a a a b b a a b b 
Si_gnF.scaffold10535.1711158bp_tag415567 b b b a a b b b a b a b b b b a a a b b a a b b 
Si_gnF.scaffold10535.1744394bp_tag415609 b b b a a b b b a b a b b b b a a a b b a a b b 
Si_gnF.scaffold10535.1751886bp_tag415620 a a a b b a a a b a b a a a a b b b a a b b a a 
Si_gnF.scaffold10535.1752774bp_tag415622 a a a b b a a a b a b a a a a b b b a a b b a a 
Si_gnF.scaffold10535.1789478bp_tag415675 b b - a a b b b a b a b b b b a a a b b a a b b 
Si_gnF.scaffold10535.1800135bp_tag415687 b b b a a b b b a b a b b b b a a a b b a a b b 
Si_gnF.scaffold10535.1885424bp_tag415814 a a a b b a a a b a b a a a a b b b a a b b a a 

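Basically, I want to minimize the number of differences between rows. (I cannot edit individual columns, but I can flip the labels of an entire row, swapping every "a" for "b" and vice versa.) For the first four lines, the result would look like this: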

Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b 
Si_gnF.scaffold10533.76297bp_tag414484 b a a b b a b a a a b a b b a b a a b - a a b b <-- this one flipped 
Si_gnF.scaffold10533.98416bp_tag414526 b a a b b a b a a a b a b b a b a a b b a a b b <-- this one flipped 
Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b 

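As a first step, I need to do pairwise comparisons. But what is a good way to quantify the differences so that I know which rows must have their labels flipped? (Two consecutive rows rarely match 100%; there may be several, even many, mismatches as well as missing values.)

(Preferably in Ruby or R)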

+0

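You should make clear which part you mean by the "first column". Does "genotype" here mean the same thing as "label"? If so, you should use one term consistently. Some labels sit inside certain rows... the difference between the labels is hard to follow. – sawa 2011-03-16 03:54:58

Answers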

2

You can use the Levenshtein algorithm to quantify the difference between two strings. One way to do it:

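require 'text' # See http://rubygems.org/gems/text

# lines is assumed to be an array holding one input line per element

def compare(line1, line2)
  # Drop the identifier (the first field) and keep only the genotype calls.
  # String has no #sort, so split into fields, sort those, and rejoin.
  calls1 = line1.split.drop(1).sort.join
  calls2 = line2.split.drop(1).sort.join
  Text::Levenshtein.distance(calls1, calls2)
end

compare(lines[0], lines[1]) # => 1 (one value different)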

(If "abaa" should not be treated as equal to "aaab", remove sort from the method.)
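Since the columns in the question are positional and a whole row can only be flipped, a position-by-position count of mismatches may fit better than an edit distance: Levenshtein also allows insertions and deletions, so it can report a small distance for two rows that are exact complements of each other. The following is only a sketch of that idea, not part of the original answer. The helper names `parse`, `flip`, `mismatches` and `phase` are hypothetical, the sample rows are made up, and it assumes every row has the same number of calls and that "-" marks a missing value to be skipped.

```ruby
# Hypothetical helpers for illustration; none of these names come from the answer above.

# Split a line into its identifier and an array of genotype calls.
def parse(line)
  id, *calls = line.split
  [id, calls]
end

# Swap "a" and "b" across a whole row; missing values ("-") stay as they are.
def flip(calls)
  swap = { 'a' => 'b', 'b' => 'a' }
  calls.map { |c| swap.fetch(c, c) }
end

# Count positions where both rows have a call and the calls differ.
def mismatches(calls1, calls2)
  calls1.zip(calls2).count { |x, y| x != '-' && y != '-' && x != y }
end

# Walk down the rows and flip each one when the flipped version
# has fewer mismatches against the previous (already phased) row.
def phase(lines)
  previous = nil
  lines.map do |line|
    id, calls = parse(line)
    if previous && mismatches(previous, flip(calls)) < mismatches(previous, calls)
      calls = flip(calls)
    end
    previous = calls
    [id, *calls].join(' ')
  end
end

# Made-up sample rows in the same layout as the question's data.
rows = [
  'tag1 b a a b b a',
  'tag2 a b b a - b',
  'tag3 a b b a a b'
]

puts phase(rows)
```

In this sketch each row is compared only with the row directly above it, so one very noisy row could mislead the rows after it, and a tie leaves the row unflipped. Comparing against a consensus of several preceding rows would be one possible refinement.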

+0

Excellent, thank you! – 2011-03-16 07:58:53