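How can I extract strings from a large file with Ruby only when a specific string has appeared earlier?

I want to extract information from a large file, but I can't figure out how to extract a string from a line only when an earlier line in the same record has already been matched by a regular expression. An example of one record in the file is as follows: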

*NEW RECORD 
RECTYPE = D 
MH = Informed Consent 
AQ = ES HI LJ PX SN ST 
ENTRY = Consent, Informed 
MN = N03.706.437.650.312 
MN = N03.706.535.489 
FX = Disclosure 
FX = Mental Competency 
FX = Therapeutic Misconception 
FX = Treatment Refusal 
ST = T058 
ST = T078 
AN = competency to consent: coordinate IM with MENTAL COMPETENCY (IM) 
PI = Jurisprudence (1966-1970) 
PI = Physician-Patient Relations (1966-1970) 
MS = Voluntary authorization, by a patient or research subject, etc,...

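The file contains more than 20,000 records like this one. I want to use the "MH" field to identify a small subset of these records. In this example, I want to find "Informed Consent" and then use regular expressions to extract the information in the FX, AN and MS fields of that record only. So far I have opened the file, accessed the hash in which the MH terms are stored, and been able to extract those terms from the records in the file. I also have a working regular expression that identifies the contents of the "FX" field.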

File.open('mesh_descriptor.bin').each do |file_line|
  file_line = file_line.chomp

  # read each key of candidate_descriptor_keys
  candidate_descriptor_keys.each do |cand_term|
    if file_line =~ /^MH\s=\s(#{cand_term})$/
      mesh_header = $1
      puts "MH from Mesh Descriptor file is: #{mesh_header}"

      if file_line =~ /^FX\s=\s(.*)$/
        see_also = $1
        puts " See_Also from Descriptor file is: #{see_also}"
      end
    end
  end
end

The hash contains the following MH terms (keys):

candidate_descriptor_keys = ["Body Weight", "Obesity", "Thinness", "Fetal Weight", "Overweight"]

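When I moved the "FX" "if" statement outside the "if" statement that extracts "MH", I did extract "FX", but every "FX" in the whole file was retrieved, which is not what I need. I thought that nesting the "FX" "if" statement inside the preceding "if" statement would limit the results to those found only when the first condition is true, but with this approach I get no results (and no errors). What I would like as a result is: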

> Informed Consent 
> Disclosure 
> Mental Competency 
> Therapeutic Misconception 
> Treatment Refusal

as well as the strings in the "AN" and "MS" fields, only for records whose "MH" matches. Any advice would be helpful!

Source

2014-04-13 user3385593

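Just before the "Code" section of my answer I wrote a few lines beginning "I assume...". I suggest you add something similar to your question, perhaps right after your paragraph ending "The script so far is as follows:" (and move that sentence so it follows the added text). Once you have done that, I will delete that part of my answer. If you want to use what I wrote, I have no objection.

Readers: the asker and I had a long exchange in the comments. If you don't understand the question, read the opening part of my answer, before the "Code" section. Also note the asker's comments on my answer. Tomorrow the asker will clean up the question and delete his/her comments that are no longer relevant. (I have already deleted mine.)

I think this may be what you are looking for, but if not, let me know and I will change it. In particular, look at the example at the end to see whether it is the kind of output you want (for an input with two records, both having an "MH" field). Once I am sure I understand your question correctly, I will also add an "Explanation" section at the end.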

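I assume that each record begins with

*NEW RECORD

and that you wish to identify all lines beginning with "MH" whose field is one of the elements of: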

candidate_descriptor_keys = 
    ["Body Weight", "Obesity", "Thinness", "Informed Consent"]

and that, for each match, you would like to print the contents of the lines of the same record that begin with "FX", "AN" and "MS".

Code
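As a side note on why the nested "if" in the question prints nothing: both tests are applied to the same line, and a line that starts with "MH" cannot also start with "FX". The small check below only illustrates that point; `nested_fx_hit?` is a name made up for this demonstration and is not part of the question's script.

```ruby
# Hypothetical helper: mimics the nested tests from the question on ONE line.
def nested_fx_hit?(file_line)
  if file_line =~ /^MH\s=\s(.*)$/
    # We are still looking at the very same line, which begins with "MH",
    # so the FX pattern cannot match here.
    return true if file_line =~ /^FX\s=\s(.*)$/
  end
  false
end

p nested_fx_hit?("MH = Obesity")     # false: the line is MH, not FX
p nested_fx_hit?("FX = Disclosure")  # false: the outer MH test already failed
```

This is why some state has to be carried from the "MH" line to the later lines of the same record.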

NEW_RECORD_MARKER = "*NEW RECORD"

def getem(fname, candidate_descriptor_keys)
  line = 0          # zero-based line counter
  found_mh = false  # true once the current record's MH matched a candidate
  File.foreach(fname) do |file_line|
    file_line = file_line.strip
    case
    when file_line == NEW_RECORD_MARKER
      puts # space between records
      found_mh = false
    when found_mh == false
      candidate_descriptor_keys.each do |cand_term|
        # Regexp.escape guards against regex metacharacters in a term
        if file_line =~ /^MH\s=\s(#{Regexp.escape(cand_term)})$/
          found_mh = true
          puts "MH from line #{line} of file is: #{cand_term}"
          break
        end
      end
    when found_mh
      ["FX", "AN", "MS"].each do |des|
        if file_line =~ /^#{des}\s=\s(.*)$/
          see_also = $1
          puts " Line #{line} of file is: #{des}: #{see_also}"
        end
      end
    end
    line += 1
  end
end
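If you would rather collect the matches than print them, the same flag-style scan can build a hash instead. This is only a sketch under the assumptions stated above (every record starts with "*NEW RECORD", and "MH" comes before the fields of interest within a record). The method name `collect_fields` and its `wanted` parameter are my own inventions for illustration.

```ruby
# Hypothetical helper (not part of the answer's getem): collects the wanted
# fields per matching MH term instead of printing them.
# `lines` is any enumerable of strings, e.g. File.foreach(fname).
def collect_fields(lines, candidate_descriptor_keys, wanted = %w[FX AN MS])
  results = {}
  current = nil # fields of the record being collected; nil when not in a match

  lines.each do |raw|
    line = raw.strip
    if line == "*NEW RECORD"
      current = nil # a new record starts; forget the previous match
    elsif (m = line.match(/\AMH\s=\s(.*)\z/))
      # Start collecting only if this MH term is one we care about.
      current = candidate_descriptor_keys.include?(m[1]) ? (results[m[1]] = {}) : nil
    elsif current && (m = line.match(/\A(\w+)\s=\s(.*)\z/)) && wanted.include?(m[1])
      (current[m[1]] ||= []) << m[2]
    end
  end
  results
end

sample = [
  "*NEW RECORD", "MH = Obesity", "FX = 1st FX", "FX = 2nd FX", "MS = Only MS",
  "*NEW RECORD", "MH = Something Else", "FX = ignored"
]
p collect_fields(sample, ["Obesity"])
```

With a real file you could pass `File.foreach('mesh_descriptor.bin')` as the first argument, since calling it without a block returns an enumerator of lines.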

Example

First, let's create a file, starting with a "here document" that contains two records:

records =<<_ 
*NEW RECORD 
RECTYPE = D 
MH = Informed Consent 
AQ = ES HI LJ PX SN ST 
ENTRY = Consent, Informed 
MN = N03.706.437.650.312 
MN = N03.706.535.489 
FX = Disclosure 
FX = Mental Competency 
FX = Therapeutic Misconception 
FX = Treatment Refusal 
ST = T058 
ST = T078 
AN = competency to consent 
PI = Jurisprudence (1966-1970) 
PI = Physician-Patient Relations (1966-1970) 
MS = Voluntary authorization 
*NEW RECORD 
MH = Obesity 
AQ = ES HI LJ PX SN ST 
ENTRY = Obesity 
MN = N03.706.437.650.312 
MN = N03.706.535.489 
FX = 1st FX 
FX = 2nd FX 
AN = Only AN 
PI = Jurisprudence (1966-1970) 
PI = Physician-Patient Relations (1966-1970) 
MS = Only MS 
_

If you 'puts records' you will see that it is just a string. (You will notice that I shortened two of the fields.) Now write it to a file:

File.write('mesh_descriptor', records)

If you want to confirm the file contents, you can do this:

puts File.read('mesh_descriptor')

We also need to define the array candidate_descriptor_keys:

candidate_descriptor_keys = 
    ["Body Weight", "Obesity", "Thinness", "Informed Consent"]

We can now execute the method getem:

getem('mesh_descriptor', candidate_descriptor_keys) 

MH from line 2 of file is: Informed Consent 
Line 7 of file is: FX: Disclosure 
Line 8 of file is: FX: Mental Competency 
Line 9 of file is: FX: Therapeutic Misconception 
Line 10 of file is: FX: Treatment Refusal 
Line 13 of file is: AN: competency to consent 
Line 16 of file is: MS: Voluntary authorization 

MH from line 18 of file is: Obesity 
Line 23 of file is: FX: 1st FX 
Line 24 of file is: FX: 2nd FX 
Line 25 of file is: AN: Only AN 
Line 28 of file is: MS: Only MS
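Another way to get the same information, again assuming that every record begins with a "*NEW RECORD" line, is to group the lines into records first and then test each record as a whole. The sketch below uses `Enumerable#slice_before` for the grouping and `filter_map` (Ruby 2.7 or later) for the selection. `matching_records` and its `wanted` parameter are hypothetical names, not something from the question.

```ruby
# Hypothetical helper: returns [mh_term, [[field, value], ...]] for each
# record whose MH term is one of the candidates.
def matching_records(lines, candidate_descriptor_keys, wanted = %w[FX AN MS])
  # Start a new slice at every "*NEW RECORD" line.
  records = lines.slice_before { |l| l.strip == "*NEW RECORD" }

  records.filter_map do |record|
    # Turn "NAME = value" lines into [name, value] pairs; skip other lines.
    fields = record.filter_map { |l| l.strip.match(/\A(\w+)\s=\s(.*)\z/)&.captures }
    mh = fields.find { |name, _| name == "MH" }&.last
    next unless candidate_descriptor_keys.include?(mh)

    [mh, fields.select { |name, _| wanted.include?(name) }]
  end
end

sample = [
  "*NEW RECORD\n", "MH = Obesity\n", "FX = 1st FX\n", "AN = Only AN\n",
  "*NEW RECORD\n", "MH = Other\n", "FX = ignored\n"
]
matching_records(sample, ["Obesity"]).each do |mh, fields|
  puts mh
  fields.each { |name, value| puts "  #{name}: #{value}" }
end
```

One difference from getem is that the position of the "MH" line within the record no longer matters, because the whole record is inspected before anything is selected.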

Source

2014-04-13 21:17:45

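Thank you for the excellent answer! Your assumptions are correct, and the output is what I was looking for. I will get back to the script tomorrow and give you feedback - thanks! – user3385593

I think the 'case' statement is fine. When nothing follows 'case', it executes the first 'when' that evaluates to true, which is what we want, because we check 'file_line == NEW_RECORD_MARKER' first and then 'found_mh == false'. Leave 'def getem(fname, candidate_descriptor_keys)' as it is and call it with 'getem("mesh_descriptor.bin", candidate_descriptor_keys)'. If you run it from a terminal, make sure the current directory contains "mesh_descriptor.bin" before invoking Ruby (so that it can find the file). If you still have problems, please give me the exact error message and where it occurs.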
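To illustrate the point about 'case' with nothing after it: each 'when' is an ordinary boolean expression, and only the first truthy branch runs. The `classify` method below is invented purely for this demonstration and mirrors the order of the three tests in getem.

```ruby
# Hypothetical demo of a subject-less case: branches are tried top to bottom.
def classify(file_line, found_mh)
  case
  when file_line == "*NEW RECORD" then :new_record
  when found_mh == false          then :looking_for_mh
  when found_mh                   then :collecting_fields
  end
end

p classify("*NEW RECORD", true)       # :new_record (first branch wins)
p classify("FX = Disclosure", false)  # :looking_for_mh
p classify("FX = Disclosure", true)   # :collecting_fields
```

Because the record-marker test comes first, a new record always resets the state, whatever the value of the flag.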
