2017-04-10

ruby libsvm for multi-class problems

For multi-class prediction, following the library example given for this gem returns a slightly inaccurate prediction.

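With this code, the test sentence ("The teacher yelled at the student who was late to class but later apologized.") should have been classified as EDUCATION, but HEALTH is returned instead.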

require 'libsvm' 

# Let's take our documents and create word vectors out of them.
# 
documents = [ # 0 is JOKES, 1 is EDUCATION and 2 is HEALTH 
      [0, "Why did the chicken cross the road? Because a car was coming"], 
      [0, "You're an elevator tech? I bet that job has its ups and downs"], 
      [0, "Why did the chicken cross the road? To get the worm"], 

      [1, "The university admitted more students this year and dropout rate is lessening."], 
      [1, "The students turned in their homework at school before summer break."], 
      [1, "The students and teachers agreed on a plan for study."], 

      [2, "The cold outbreak was bad but not an epidemic."], 
      [2, "The doctor and the nurse advised be to get rest because of my cold."], 
      [2, "The doctor had to go to the hospital."] 
     ] 

# Let's create a dictionary of unique words and then we can
# create our vectors. This is a very simple example. If you 
# were doing this in a production system you'd do things like 
# stemming and removing all punctuation (in a less casual way). 
# 
dictionary = documents.map(&:last).map(&:split).flatten.uniq 
dictionary = dictionary.map { |x| x.gsub(/\?|,|\.|\-/,'') } 

training_set = [] 
documents.each do |doc| 
    # One 0/1 feature per dictionary word (a plain local variable is enough here)
    features = dictionary.map { |x| doc.last.include?(x) ? 1 : 0 }
    training_set << [doc.first, Libsvm::Node.features(features)]
end 

# Let's set up libsvm so that we can test our prediction
# using the test set 
# 
problem = Libsvm::Problem.new 
parameter = Libsvm::SvmParameter.new 

parameter.cache_size = 1 # in megabytes 
parameter.eps = 0.001 
parameter.c = 10 

# Train classifier using training set 
# 
problem.set_examples(training_set.map(&:first), training_set.map(&:last)) 
model = Libsvm::Model.train(problem, parameter) 

# Now let's test our classifier using the test set
# 
test_set = [1, "The teacher yelled at the student who was late to class but later apologized."] 
test_document = test_set.last.split.map{ |x| x.gsub(/\?|,|\.|\-/,'') } 

doc_features = dictionary.map{|x| test_document.include?(x) ? 1 : 0 } 
pred = model.predict(Libsvm::Node.features(doc_features)) 
puts pred # returns 2.0 BUT should have been 1.0 
result = case pred 
    when 0.0 then "predicted #{pred} as joke" 
    when 1.0 then "predicted #{pred} as education" 
    when 2.0 then "predicted #{pred} as health" 
end 
puts result 

Is there a problem with the code, or do I need to try other kernels and parameters?


From the code's point of view, I am not particularly clear on the multi-class implementation. – arjun

Answer


There is no specific problem with the code itself. The cause is simply a lack of training data.
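
One way to see how little signal nine sentences carry is to count which words the test sentence shares with each class. The sketch below is only an illustration, not what the SVM computes: it matches whole tokens, whereas the question's training loop uses substring matching (`doc.last.include?(x)`). The names `tokens` and `shared_tokens_by_class` are hypothetical helpers introduced here.

```ruby
# Training sentences copied from the question, grouped by label.
TRAINING = {
  "JOKES" => [
    "Why did the chicken cross the road? Because a car was coming",
    "You're an elevator tech? I bet that job has its ups and downs",
    "Why did the chicken cross the road? To get the worm"
  ],
  "EDUCATION" => [
    "The university admitted more students this year and dropout rate is lessening.",
    "The students turned in their homework at school before summer break.",
    "The students and teachers agreed on a plan for study."
  ],
  "HEALTH" => [
    "The cold outbreak was bad but not an epidemic.",
    "The doctor and the nurse advised be to get rest because of my cold.",
    "The doctor had to go to the hospital."
  ]
}.freeze

TEST_SENTENCE = "The teacher yelled at the student who was late to class but later apologized."

# Same punctuation stripping as the question's code; matching stays case-sensitive.
def tokens(text)
  text.split.map { |word| word.gsub(/\?|,|\.|\-/, '') }.uniq
end

# For each label, list the test-sentence tokens that also occur in that class.
def shared_tokens_by_class(test_sentence, training)
  test_tokens = tokens(test_sentence)
  training.transform_values do |sentences|
    class_vocabulary = sentences.flat_map { |sentence| tokens(sentence) }.uniq
    test_tokens & class_vocabulary
  end
end

shared_tokens_by_class(TEST_SENTENCE, TRAINING).each do |label, words|
  puts "#{label}: #{words.size} shared (#{words.join(', ')})"
end
```

On these sentences the test sentence shares five tokens with HEALTH (The, the, was, to, but) and only two each with EDUCATION (The, at) and JOKES (the, was). None of the shared words is topical, so with this little data the classifier is effectively deciding on function words.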

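Try it with "The university admitted more students this year and dropout rate is lessening.", which is identical to one of the training instances. The program classifies it as EDUCATION.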

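Three training instances per class are not enough for an SVM. The best approach is to use more training data and to tune the parameter C using cross-validation.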
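
A minimal sketch of k-fold cross-validation for choosing C is given below. The helpers `each_fold` and `cross_validated_accuracy` are hypothetical and not part of the gem. The runnable part uses a toy nearest-neighbour stand-in so that it works without libsvm. The commented section shows how the question's own calls (`set_examples`, `Libsvm::Model.train`, `predict`) might be plugged in, and that part is untested.

```ruby
# Hypothetical helper: deals the examples into k folds round-robin and
# yields each (training, held_out) pair in turn.
def each_fold(examples, k)
  folds = Array.new(k) { [] }
  examples.each_with_index { |example, i| folds[i % k] << example }
  folds.each_index do |i|
    training = folds.each_with_index.reject { |_, j| j == i }.flat_map(&:first)
    yield training, folds[i]
  end
end

# Examples are [label, features] pairs. The block receives the training part
# of a fold and must return an object that responds to call(features).
def cross_validated_accuracy(examples, k)
  correct = 0
  each_fold(examples, k) do |training, held_out|
    predictor = yield(training)
    correct += held_out.count { |label, features| predictor.call(features) == label }
  end
  correct.to_f / examples.size
end

# Toy stand-in classifier (1-nearest neighbour on a single feature).
nearest_neighbour = lambda do |training|
  ->(features) { training.min_by { |_, seen| (seen[0] - features[0]).abs }.first }
end

toy = [[0, [0.0]], [0, [0.1]], [0, [0.2]], [1, [1.0]], [1, [1.1]], [1, [1.2]]]
puts cross_validated_accuracy(toy, 3) { |training| nearest_neighbour.call(training) }

# Untested sketch of the same loop with the question's libsvm objects:
#
#   best_c = [0.1, 1, 10, 100].max_by do |c|
#     cross_validated_accuracy(training_set, 3) do |training|
#       parameter.c = c
#       problem = Libsvm::Problem.new
#       problem.set_examples(training.map(&:first), training.map(&:last))
#       model = Libsvm::Model.train(problem, parameter)
#       ->(features) { model.predict(features) }
#     end
#   end
```

With only nine sentences each fold holds out three of them, so the accuracy estimate will be very noisy. The loop becomes useful once more training data is available.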