2016-09-23

In Ruby, how do I handle non-UTF-8 characters in PDF content?

I'm using Rails 4.2.7. I download PDF content from the web and write it to a file, like this:

res1 = Net::HTTP.SOCKSProxy('127.0.0.1', 50001).start(uri.host, uri.port) do |http|
  puts "launching #{uri}"
  resp = http.get(uri)
  status = resp.code
  content = resp.body
  content_type = resp['content-type']
  content_encoding = resp['content-encoding']
end
…
if content_type == 'application/pdf' || content_type.include?('application/x-javascript')
  File.open(file_location, "w") { |file| file.write content }

I've noticed that for some content I get the following error:

Error during processing: "\xC2" from ASCII-8BIT to UTF-8 
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `write' 
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `block in pre_process_data' 
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `open' 
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data' 
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:76:in `process_race_data' 
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_race_finder_service.rb:75:in `process_race_link' 
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:29:in `block in process_data' 
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `each' 
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `process_data' 
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers' 
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each' 

I tried to account for it by replacing the invalid characters, like this:

File.open(file_location, "w") { |file| file.write content } 
content.encode('UTF-8', :invalid => :replace, :undef => :replace) 

But then I get the error

error: PDF malformed, expected 'endstream' but found 0 instead 

when trying to read the PDF file. Does anyone know a better way to handle downloaded PDF files that doesn't corrupt them?
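For illustration, here is a minimal sketch of why the replacement attempt is likely to damage the file. The `sample` string is a made-up stand-in for PDF stream bytes, not real PDF data, and the variable names are only for this example. It shows that transcoding binary data to UTF-8 raises the same class of error as above, and that replacing the offending bytes changes the byte count, which would throw off anything in the PDF that depends on exact stream lengths.

```ruby
# Hypothetical stand-in for binary PDF bytes. The error message in the
# question indicates the downloaded body is tagged ASCII-8BIT, like this.
sample = "stream\n\xC2\xFF\x00data\nendstream".b

# Transcoding binary data to UTF-8 raises, because bytes >= 0x80 have no
# defined mapping from ASCII-8BIT.
raised = nil
begin
  sample.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  raised = e.class
end

# Replacing the unmappable bytes avoids the exception but rewrites the data:
# each high byte becomes U+FFFD, which takes three bytes in UTF-8.
replaced = sample.encode('UTF-8', invalid: :replace, undef: :replace)

puts raised
puts "original bytes: #{sample.bytesize}"
puts "after replace:  #{replaced.bytesize}"
```

The two sizes differ, so the written file no longer contains the bytes the server sent.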

Answers

I think the simplest solution is to write the content in binary mode using IO.binwrite:

File.binwrite(file_location, content) 
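As a rough sketch of the binary-write approach, the example below writes a small byte string and reads it back to check that nothing changed. The `content` value and the temporary `file_location` path are assumptions made up for the example. It also shows the `"wb"` mode flag, which should be an equivalent way to keep the original `File.open` block form.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a downloaded PDF body (binary, ASCII-8BIT).
content = "%PDF-1.4\n\xC2\xFF\x00\nendstream".b

written = nil
reopened = nil

Dir.mktmpdir do |dir|
  # Temporary path chosen only for this example.
  file_location = File.join(dir, 'sample.pdf')

  # Binary write: no encoding conversion and no newline translation.
  File.binwrite(file_location, content)
  written = File.binread(file_location)

  # Block form with the "wb" mode flag, closer to the code in the question.
  File.open(file_location, 'wb') { |file| file.write(content) }
  reopened = File.binread(file_location)
end

puts written == content
puts reopened == content
```

Either form writes the bytes exactly as received, so the PDF structure is left alone.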

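The above might not be enough if the file you receive is in a different encoding, in which case I would try the following. Note that it transcodes the bytes, so it is better suited to text content than to binary PDF data: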

content.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
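To make the effect of that call concrete, here is a small sketch under the assumption that the input is arbitrary binary data. The `original` bytes are made up for the example. Reinterpreting the bytes as ISO-8859-1 never raises, because every byte value is a valid ISO-8859-1 character, but the UTF-8 result is longer than the input.

```ruby
# Hypothetical binary bytes standing in for part of a PDF stream.
original = "\xC2\xA9\xFF\x00".b

# dup first: force_encoding mutates its receiver, and we want to keep the
# original string tagged as binary for the comparison below.
as_utf8 = original.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Every byte >= 0x80 becomes a two-byte UTF-8 sequence, so the data grows.
puts "original bytes: #{original.bytesize}"
puts "as UTF-8:       #{as_utf8.bytesize}"

# The mapping is reversible: encoding back to ISO-8859-1 restores the bytes.
restored = as_utf8.encode(Encoding::ISO_8859_1).b
puts restored == original
```

So this conversion is lossless in the sense that it can be undone, but writing the UTF-8 result to disk would not produce the original file.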