2011-06-21 117 views
2

Ruby regex not matching

I'm writing a short class to extract email addresses from documents. Here is my code so far:

# Class to scrape documents for email addresses 

class EmailScraper

    EmailRegex = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

    def EmailScraper.scrape(doc)
        email_addresses = []
        File.open(doc) do |file|
            while line = file.gets
                temp = line.scan(EmailRegex)

                temp.each do |email_address|
                    puts email_address
                    emails_addresses << email_address
                end

            end
        end
        return email_addresses
    end
end


if EmailScraper.scrape("email_tests.txt").empty?
    puts "Empty array"
else
    puts EmailScraper.scrape("email_tests.txt")
end

My "email_tests.txt" file looks like this:

[email protected] 
[email protected] 
[email protected] 

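When I run the script, all I get is the "Empty array" printout. However, when I start irb and type in the regex above, strings containing email addresses match it, and String#scan returns an array of all the email addresses in each string. Why does this work in irb but not in my script?

Answers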

3

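Several things (some already mentioned, and expanded on below):

  • \z matches the end of the string, which with IO#gets typically includes a \n character. \Z (upper-case 'Z') matches the end of the string unless the string ends with \n, in which case it matches just before it.
  • The typo emails_addresses.
  • Using \A and \Z is fine as long as each whole line either is or is not an email address. You say you want to extract addresses from documents, though, so I'd consider using \b at each end to extract emails delimited by word boundaries.
  • You could use File.foreach()... rather than the clumsy-looking File.open...while...gets construct.
  • I'm not convinced by the regex; there is a substantial body of work on this already:

Here's a smarter one: http://www.regular-expressions.info/email.html (clicking that odd little inline icon takes you to a piece-by-piece explanation). The discussion is worth reading, as it points out several potential pitfalls.

An even more mind-bogglingly complex one can be found here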

class EmailScraper

    EmailRegex = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\Z/i # changed \z to \Z

    def EmailScraper.scrape(doc)
        email_addresses = []

        File.foreach(doc) do |line| # less code, same effect
            temp = line.scan(EmailRegex)
            temp.each do |email_address|
                email_addresses << email_address
            end
        end
        email_addresses # "return" isn't needed
    end
end

result = EmailScraper.scrape("email_tests.txt") # store it so we don't print them twice if successful
if result.empty?
    puts "Empty array"
else
    puts result
end
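
The difference between \z and \Z, and the \b suggestion, can be checked with a short sketch. The sample addresses and the variable names below are assumptions made for illustration; they do not come from the original test file.

```ruby
# Hypothetical sample line, shaped like what IO#gets returns (trailing newline included).
line = "user@example.com\n"

anchored_lower_z = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
anchored_upper_z = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\Z/i
# Word boundaries instead of string anchors, to pick addresses out of running text.
bounded = /\b[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\b/i

lower_z_result = line.scan(anchored_lower_z) # \z requires the true end of the string
upper_z_result = line.scan(anchored_upper_z) # \Z also matches just before a final newline
bounded_result = "Contact user@example.com or admin@example.org today.\n".scan(bounded)

p lower_z_result
p upper_z_result
p bounded_result
```

With these inputs the \z pattern finds nothing, the \Z pattern finds the address, and the \b pattern picks both addresses out of the sentence.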
+1 for explaining the difference between '\z' and '\Z', which I didn't know. – stema

3

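It looks like you're putting the results into emails_addresses but returning email_addresses. That means you always return the empty array you defined for email_addresses, which makes the "Empty array" response correct.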
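
One caveat can be sketched here, assuming nothing else in the program defines a variable or method named emails_addresses: Ruby raises NameError when the misspelled line actually executes, so the typo only goes unnoticed while scan returns no matches. The collect helper and the sample address are hypothetical.

```ruby
# Hypothetical helper reproducing the misspelling from the question.
def collect(matches)
  email_addresses = []
  matches.each do |email_address|
    emails_addresses << email_address # misspelled: no such variable or method
  end
  email_addresses
end

# With no matches the block never runs, so the typo is never evaluated.
empty_case = collect([])

# With at least one match the misspelled name is evaluated and raises.
error_class =
  begin
    collect(["user@example.com"])
    nil
  rescue NameError => e
    e.class
  end

p empty_case
p error_class
```

This suggests the script printed "Empty array" without an error because the regex never produced a match in the first place.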

0

You have a typo; try:

class EmailScraper

    EmailRegex = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

    def EmailScraper.scrape(doc)
        email_addresses = []
        File.open(doc) do |file|
            while line = file.gets
                temp = line.scan(EmailRegex)

                temp.each do |email_address|
                    puts email_address
                    email_addresses << email_address
                end

            end
        end
        return email_addresses
    end
end


if EmailScraper.scrape("email_tests.txt").empty?
    puts "Empty array"
else
    puts EmailScraper.scrape("email_tests.txt")
end
0

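When you read the file, the line ending makes the regex fail. In irb, there probably is no line ending. If that is the case, chomp the lines first.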

regex=/\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
line_from_irb = "[email protected]" 
line_from_file = line_from_irb + "\n" # "\n" (newline), as returned by IO#gets

p line_from_irb.scan(regex) # => ["[email protected]"] 
p line_from_file.scan(regex) # => []
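
The chomp suggestion can be sketched as follows. This is a minimal illustration, assuming made-up sample addresses and an in-memory StringIO standing in for the real file:

```ruby
require "stringio"

email_regex = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

# StringIO stands in for File.open here; the addresses are made-up samples.
io = StringIO.new("user@example.com\nnot an address\nadmin@example.org\n")

found = []
io.each_line do |line|
  # chomp strips the trailing newline so that \z can match at the real end
  found.concat(line.chomp.scan(email_regex))
end

p found
```

Chomping keeps the stricter \z anchor usable; switching to \Z, as in the earlier answer, is the alternative that leaves the line untouched.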