2009-07-17 14 views
4

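Better way to extract a phone number and reformat it?

Phone number data arrives in various formats (I chose these examples because the incoming data is unreliable and not in the expected formats):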

+1 480-874-4666 
404-581-4000 
(805) 682-4726 
978-851-7321, Ext 2606 
413- 658-1100 
(513) 287-7000,Toll Free (800) 733-2077 
1 (813) 274-8130 
212-363-3200,Media Relations: 212-668-2251. 
323/221-2164 

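My Ruby code extracts all the digits, removes any leading 1 (the US country code), then uses the first 10 digits to build the "new" phone number in the desired format: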

nums = phone_number_string.scan(/[0-9]+/)
if nums.size > 0
  all_nums = nums.join
  all_nums = all_nums[0..0] == "1" ? all_nums[1..-1] : all_nums
  if all_nums.size >= 10
    ten_nums = all_nums[0..9]
    final_phone = "#{ten_nums[0..2]}-#{ten_nums[3..5]}-#{ten_nums[6..9]}"
  else
    final_phone = ""
  end
  puts "#{final_phone}"
else
  puts "No number to fix."
end

The results are fine:

480-874-4666 
404-581-4000 
805-682-4726 
978-851-7321 
413-658-1100 
513-287-7000 
813-274-8130 
212-363-3200 
323-221-2164 

However, I think there is a better way. Can you refactor this to be more efficient, more legible, or more useful?

+0

+ 49- 2345-123456789 – Svante 2009-07-17 10:28:40

Answers

13

Here is a much simpler approach using only regular expressions and substitution:

def extract_phone_number(input)
  if input.gsub(/\D/, "").match(/^1?(\d{3})(\d{3})(\d{4})/)
    [$1, $2, $3].join("-")
  end
end

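This strips all non-digit characters (\D), skips an optional leading one (^1?), then extracts the first 10 remaining digits in chunks ((\d{3})(\d{3})(\d{4})) and formats them.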
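To make those three steps visible, the sketch below performs them one at a time and returns the intermediate values. The helper name explain_extraction is hypothetical and only for illustration; it is not part of the answer's code.

```ruby
# Hypothetical helper: the same steps as extract_phone_number, spelled out.
def explain_extraction(input)
  # 1. Strip every non-digit character.
  digits = input.gsub(/\D/, "")
  # 2. Skip an optional leading 1, then 3. capture the next 10 digits as 3-3-4.
  match = digits.match(/^1?(\d{3})(\d{3})(\d{4})/)
  return nil unless match

  { digits: digits, parts: match.captures, formatted: match.captures.join("-") }
end

steps = explain_extraction("1 (813) 274-8130")
puts steps[:digits]     # the digits-only string
puts steps[:formatted]  # the reformatted number
```

Because the pattern is anchored only at the start, any digits after the first ten (an extension, a second number) are silently ignored.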

Here are the tests:

test_data = {
  "+1 480-874-4666" => "480-874-4666",
  "404-581-4000" => "404-581-4000",
  "(805) 682-4726" => "805-682-4726",
  "978-851-7321, Ext 2606" => "978-851-7321",
  "413- 658-1100" => "413-658-1100",
  "(513) 287-7000,Toll Free (800) 733-2077" => "513-287-7000",
  "1 (813) 274-8130" => "813-274-8130",
  "212-363-3200,Media Relations: 212-668-2251." => "212-363-3200",
  "323/221-2164" => "323-221-2164",
  "" => nil,
  "foobar" => nil,
  "1234567" => nil,
}

test_data.each do |input, expected_output|
  extracted = extract_phone_number(input)
  print "FAIL (expected #{expected_output}): " unless extracted == expected_output
  puts extracted
end
+0

This approach is probably faster. – 2009-07-17 14:24:33

0

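For the North American plan, one could extract the first number using phone_number_string.gsub(/\D/, '').match(/^1?(\d{10})/)[1] (note that this raises NoMethodError when nothing matches, because match then returns nil).

For example: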

test_phone_numbers = [
  "+1 480-874-4666",
  "404-581-4000",
  "(805) 682-4726",
  "978-851-7321, Ext 2606",
  "413- 658-1100",
  "(513) 287-7000,Toll Free (800) 733-2077",
  "1 (813) 274-8130",
  "212-363-3200,Media Relations: 212-668-2251.",
  "323/221-2164",
  "foobar"
]

test_phone_numbers.each do |phone_number_string|
  match = phone_number_string.gsub(/\D/, '').match(/^1?(\d{10})/)
  puts(
    if match
      "#{match[1][0..2]}-#{match[1][3..5]}-#{match[1][6..9]}"
    else
      "No number to fix."
    end
  )
end

As with the original code, this does not capture multiple numbers, e.g. "(513) 287-7000,Toll Free (800) 733-2077".

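FWIW, I have found that in the long run it is easier to capture and store the full number, i.e. including the country code and without separators. Guess at capture time which numbering plan applies to numbers that are missing the prefix, and choose the format, e.g. NANP vs. DE, at render time.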
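A minimal sketch of that idea follows. Everything in it is an assumption made for illustration: the names canonical_number, render_nanp and DEFAULT_COUNTRY_CODE are hypothetical, inputs without a prefix are guessed to be NANP numbers, and each input is assumed to hold a single number with no extension.

```ruby
# Assumption: a number without a country code belongs to the NANP (+1).
DEFAULT_COUNTRY_CODE = "1"

# Capture: return digits only, country code included, or nil if the input
# does not look like a single NANP number.
def canonical_number(input)
  digits = input.gsub(/\D/, "")
  case digits.length
  when 10 then DEFAULT_COUNTRY_CODE + digits           # guess the missing prefix
  when 11 then digits.start_with?("1") ? digits : nil  # already has the +1 prefix
  end
end

# Render: choose the display format only when the number is shown.
# A string that is not an 11-digit NANP number is returned unchanged.
def render_nanp(canonical)
  canonical.sub(/\A1(\d{3})(\d{3})(\d{4})\z/, '\1-\2-\3')
end

stored = canonical_number("(805) 682-4726")
puts stored
puts render_nanp(stored)
```

Keeping the stored value free of separators means a different display format can be introduced later without touching the stored data.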

2

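My approach is a bit different (and better, IMHO :-): I needed to not miss any phone numbers, even if there were 2 on a line. I also did not want to match lines that have 3 groups of digits far apart from each other (see the cookies example), and I did not want to mistake an IP address for a phone number.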

Code that allows multiple numbers per line, but also requires the groups of digits to be "close" to each other:

def extract_phone_number(input)
  result = input.scan(/(\d{3})\D{0,3}(\d{3})\D{0,3}(\d{4})/).map { |e| e.join('-') }
  # <result> is an Array of whatever phone numbers were extracted, and the remapping
  # takes care of cleaning up each number in the Array into a format of 800-432-1234
  result = result.join(' :: ')
  # <result> is now a String, with the numbers separated by ' :: '
  # ... or there is another way to do it (see text below the code) that gets only the
  # first phone number.

  # Details of the regular expression and what it is doing
  # 1. (\d{3}) -- get 3 digits (and keep them)
  # 2. \D{0,3} -- allow skipping of up to 3 non-digits. This handles hyphens, parentheses, periods, etc.
  # 3. (\d{3}) -- get 3 more digits (and keep them)
  # 4. \D{0,3} -- skip 0 to 3 non-digits
  # 5. (\d{4}) -- keep the final 4 digits

  result.empty? ? nil : result
end

Here are the tests (with a few extra cases):

test_data = {
  "DB=Sequel('postgres://user:[email protected]/test_test')" => nil, # DON'T MISTAKE IP ADDRESSES AS PHONE NUMBERS
  "100 cookies + 950 cookes = 1050 cookies" => nil, # THIS IS NEW
  "this 123 is a 456 bad number 7890" => nil, # THIS IS NEW
  "212-363-3200,Media Relations: 212-668-2251." => "212-363-3200 :: 212-668-2251", # THIS IS CHANGED
  "this is +1 480-874-4666" => "480-874-4666",
  "something 404-581-4000" => "404-581-4000",
  "other (805) 682-4726" => "805-682-4726",
  "978-851-7321, Ext 2606" => "978-851-7321",
  "413- 658-1100" => "413-658-1100",
  "(513) 287-7000,Toll Free (800) 733-2077" => "513-287-7000 :: 800-733-2077", # THIS IS CHANGED
  "1 (813) 274-8130" => "813-274-8130",
  "323/221-2164" => "323-221-2164",
  "" => nil,
  "foobar" => nil,
  "1234567" => nil,
}

def test_it(test_data)
  test_data.each do |input, expected_output|
    extracted = extract_phone_number(input)
    puts "#{extracted == expected_output ? 'good' : 'BAD!'} ::#{input} => #{extracted.inspect}"
  end
end

test_it(test_data)

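Alternative implementation: "scan" automatically applies the regular expression as many times as it matches, which is nice if you want more than one phone number per line. If you only want the first phone number on a line, you could also use: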

# match returns nil when nothing is found, so guard before reading the captures
m = /(\d{3})\D{0,3}(\d{3})\D{0,3}(\d{4})/.match(input)
first_phone_number = m && [m[1], m[2], m[3]].join('-')

(Just a different way of doing things, using the regular expression "match" function.)
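As a rough side-by-side of the two calls, the sketch below runs the same pattern through scan and match on one of the sample lines. The constant name PHONE_PATTERN and the variable names are hypothetical, chosen only for this illustration.

```ruby
# Same pattern as in the answer, stored in a (hypothetical) constant for reuse.
PHONE_PATTERN = /(\d{3})\D{0,3}(\d{3})\D{0,3}(\d{4})/

line = "(513) 287-7000,Toll Free (800) 733-2077"

# scan: one array of captures per occurrence in the line
all_numbers = line.scan(PHONE_PATTERN).map { |parts| parts.join("-") }

# match: only the first occurrence, or nil when there is none
first_match = PHONE_PATTERN.match(line)
first_number = first_match && first_match.captures.join("-")

p all_numbers       # → ["513-287-7000", "800-733-2077"]
puts first_number   # → 513-287-7000
```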