Cheapest way to classify HTTP POST objects

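I can classify text on my own machine using SciPy, but I need to classify string objects from HTTP POST requests in real time or near real time. If my goals are high concurrency, near-real-time output, and a small memory footprint, which algorithms should I look into? I think I could get by with a Support Vector Machine (SVM) implementation in Go, but is that the best algorithm for my use case?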

Source

2016-10-27 Louisrr

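Yes, an SVM (with a linear kernel) should be a good starting point. You can use scikit-learn (I believe it wraps liblinear) to train your model. Once the model is learned, it is just a list of feature:weight pairs for each class you want to classify into. Something like this (assuming you only have 3 classes):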

class1[feature1] = weight11 
class1[feature2] = weight12 
... 
class1[featurek] = weight1k ------- for class 1 

... different <feature, weight> ------ for class 2 
... different <feature, weight> ------ for class 3 , etc

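At prediction time you don't need scikit-learn at all; you can do the linear computation in whatever language your server backend uses. Suppose a particular POST request contains the features (feature3, feature5). What you need to do is this: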

linear_score[class1] = 0 
linear_score[class1] += lookup weight of feature3 in class1 
linear_score[class1] += lookup weight of feature5 in class1 

linear_score[class2] = 0 
linear_score[class2] += lookup weight of feature3 in class2 
linear_score[class2] += lookup weight of feature5 in class2 

..... same thing for class3 
pick class1, or class2 or class3 whichever has the highest linear_score
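
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the weights in `MODEL` and the helper names `linear_scores` and `predict` are made up for the example, and a real model would supply the trained weights.

```python
# Hypothetical weights for illustration only; a trained model would supply these.
MODEL = {
    "class1": {"feature3": 0.8, "feature5": -0.2},
    "class2": {"feature3": -0.5, "feature5": 0.9},
    "class3": {"feature3": 0.1, "feature5": 0.1},
}


def linear_scores(model, active_features):
    """For each class, sum the weights of the features present in the request."""
    scores = {}
    for label, weights in model.items():
        # A feature the class has no weight for contributes 0.
        scores[label] = sum(weights.get(f, 0.0) for f in active_features)
    return scores


def predict(model, active_features):
    """Return the class with the highest linear score."""
    scores = linear_scores(model, active_features)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(predict(MODEL, ["feature3", "feature5"]))
```

A trained linear model usually also carries a per-class intercept term; like the pseudocode above, this sketch leaves it out.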

Going one step further: if you have some way to define feature weights (e.g., using the TF-IDF score of each token), then your prediction could become:

linear_score[class1] += class1[feature3] x feature_weight[feature3] 
so on and so forth.
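
As a rough sketch of the weighted variant, assuming the per-request feature weights (for example TF-IDF scores) have already been computed elsewhere and arrive as a dict. The numbers and the names `CLASS_WEIGHTS` and `weighted_scores` are hypothetical:

```python
# Hypothetical per-class model weights, for illustration only.
CLASS_WEIGHTS = {
    "class1": {"feature3": 0.8, "feature5": -0.2},
    "class2": {"feature3": -0.5, "feature5": 0.9},
    "class3": {"feature3": 0.1, "feature5": 0.1},
}


def weighted_scores(class_weights, feature_weight):
    """Score each class as the sum of class weight x per-request feature weight.

    feature_weight maps each active feature in the request to its weight
    (e.g. a TF-IDF score computed for this request).
    """
    scores = {}
    for label, weights in class_weights.items():
        scores[label] = sum(
            weights.get(feature, 0.0) * w for feature, w in feature_weight.items()
        )
    return scores


def predict_weighted(class_weights, feature_weight):
    """Return the class with the highest weighted linear score."""
    scores = weighted_scores(class_weights, feature_weight)
    return max(scores, key=scores.get)
```

With these made-up numbers, a request where feature5 dominates, such as `{"feature3": 0.5, "feature5": 2.0}`, is assigned to class2, while one where feature3 dominates goes to class1.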

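Note that feature_weight[feature k] is usually different for each request. Since the number of active features in any one request must be far smaller than the total number of features considered (think 50 tokens or features versus a whole vocabulary of 1 MM tokens), prediction should be very fast. I can imagine that once your model is ready, the prediction could be implemented on top of a key-value store (e.g., redis).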
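
One possible layout for the key-value idea, which the answer does not spell out, is to key the store by feature and keep that feature's weight for every class as the value, so a request costs one lookup per active feature. In the sketch below a plain dict stands in for the store (no redis client is used), and `CLASSES`, `STORE`, and `classify` are hypothetical names with made-up weights:

```python
CLASSES = ("class1", "class2", "class3")

# A plain dict standing in for a key-value store such as redis.
# key = feature name, value = that feature's weight for each class, in CLASSES order.
STORE = {
    "feature3": (0.8, -0.5, 0.1),
    "feature5": (-0.2, 0.9, 0.1),
}


def classify(store, feature_weight):
    """Accumulate per-class scores with one store lookup per active feature."""
    totals = [0.0] * len(CLASSES)
    for feature, w in feature_weight.items():
        row = store.get(feature)
        if row is None:
            continue  # feature is not in the vocabulary, so it contributes nothing
        for i, class_weight in enumerate(row):
            totals[i] += class_weight * w
    best = max(range(len(CLASSES)), key=totals.__getitem__)
    return CLASSES[best]
```

The work per request then grows with the number of active features (around 50 in the example above), not with the size of the vocabulary.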

Source

2016-10-27 03:26:52 greeness
