Separating good text from 'gibber' is not a trivial task, especially when you are dealing with text messages/chats (which is what this looks like to me).
Misspelled words do not make a sample unusable, and even a grammatically broken sentence should not disqualify the whole text. That is a standard you could apply to newspaper text, but not to raw, user-generated content.
I would annotate a corpus in which you separate the good samples from the bad ones, and train a simple classifier on it. The annotation does not have to be very labor-intensive, since the gibberish texts are shorter than the good ones and should be easy to recognize (at least some of them). Also, you could try to start with a corpus of roughly 100 data points (50 good / 50 bad) and expand it once the first model works more or less.
Here is some sample code I always use for text classification. You will need scikit-learn and numpy installed, though:
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of (category, text) tuples.
    Returns a tuple of a list of labels and a list of texts.
    """
    random.shuffle(data)
    labels, texts = zip(*data)
    return list(labels), list(texts)
# Format training data
training_data = [
    ("good", "rain a lot the packs maybe damage."),
    ("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
    ("gibber", "wh. screen"),
    ("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)
# Format test set
test_data = [
    ("gibber", "an quality"),
    ("good", "<datapoint with valid text>"),
    # ...
]
test_labels, test_texts = prepare_data(test_data)
# Create feature vectors
"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels
# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)
# Test performance
X_test = vectorizer.transform(test_texts)
# Predict a label for each test sample
test_predictions = clf.predict(X_test)
# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
# Evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))
# Predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer (see the sketch after this code block).
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words on how it works: the CountVectorizer tokenizes the texts and creates count vectors over all the words in the corpus. Based on these vectors, the classifier tries to recognize patterns that distinguish the two categories. A text containing only a few, uncommon (because misspelled) words will rather end up in the 'gibber' category, while a text with many words that are typical for ordinary sentences (think of all the stop words here: 'I', 'you', 'are'...) is more likely to be a good text.
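To make that concrete, here is a tiny illustration (mine, not part of the snippet above) of the count vectors CountVectorizer produces; get_feature_names_out is available in recent scikit-learn versions:

from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(["I think you are right", "wh. screen"])

# The learned vocabulary; note the default tokenizer drops
# single-character tokens such as 'I'.
print(toy_vectorizer.get_feature_names_out())
# ['are' 'right' 'screen' 'think' 'wh' 'you']

# One row of token counts per text - this is what the classifier sees.
print(toy_X.toarray())
# [[1 1 0 1 0 1]
#  [0 0 1 0 1 0]]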
If this approach works for you, you should also try other classifiers and use the first model to semi-automatically annotate a larger training corpus, as sketched below.
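For the semi-automatic annotation, one simple recipe (my sketch, with a made-up confidence threshold) is to auto-accept only the predictions the model is sure about and review the rest by hand:

unlabeled_texts = ["some new message", "anothr txt msg"]  # your raw, unannotated corpus
X_unlabeled = vectorizer.transform(unlabeled_texts)

# Per-class probabilities; columns are ordered like clf.classes_.
probabilities = clf.predict_proba(X_unlabeled)
predictions = clf.predict(X_unlabeled)

for text, label, confidence in zip(unlabeled_texts, predictions, probabilities.max(axis=1)):
    if confidence >= 0.9:   # confident enough: keep the predicted label
        print("auto:   %s -> %s" % (label, text))
    else:                   # uncertain: send to a human annotator
        print("manual: %s" % text)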
Thank you so much! But how would I make it accept numbers and street addresses? Also, if the whole sentence consists of English words except for one or two, would it return an error? – Arman
Also, in the sample data (last row), 'an quality': both are English words, but for us it is useless because it makes no sense. Yet with the english_words check you suggested, I think it would return true – Arman
Good question, and you're right. Determining whether all sentences are syntactically correct is very different from determining whether the words are real. I'll update the answer to provide more information. – user2263572