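How do I normalize ranking data in scikit-learn?

I am doing some machine learning and need help with my code. In my training data I have a number of web page URLs and some features of those pages. I run TF-IDF on the page text to create more features.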

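One feature I have already extracted is the Google PageRank, which I retrieve for each URL. This value can be any number at all, but the lower the rank, the "better quality" Google considers the page.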

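How should I normalize this number, given that I have 7,000 URLs and the ranks can differ enormously (for example, www.google.com may be ranked #1 while www.bbc.co.uk may be #1,117, and other ranks will go far beyond our 7,000 URLs)?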

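How can I use scikit-learn to normalize this data effectively so that it can be used in my machine learning algorithm? I am running a logistic regression that simply tries to predict whether a web page is "good". The only features I currently use are the ones created by running TF-IDF on the page text. Ideally, I would like to combine these with my page rank feature in whatever way gives me the highest cross-validation score.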

Many thanks!
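For the rank feature described above, one common option (not the only one) is to compress the long tail with a logarithm and then standardize. The sketch below assumes made-up rank values and uses `np.log1p` followed by `StandardScaler`; whether this beats other scalings for a given dataset would need to be checked with cross-validation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ranks: smaller means "better", and the range spans
# several orders of magnitude.
ranks = np.array([1, 1117, 52000, 3400000, 18000000], dtype=float)

# log1p compresses the long tail so a single huge rank does not
# dominate the scale; reshape because scikit-learn expects 2-D input.
log_ranks = np.log1p(ranks).reshape(-1, 1)

# Standardize to zero mean and unit variance.
# In a real setup, fit the scaler on the training rows only.
scaler = StandardScaler()
scaled = scaler.fit_transform(log_ranks)
```

The ordering of the ranks is preserved, so a lower rank still maps to a lower scaled value.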

So we can assume my data is a TSV of the form:

URL GooglePageRank WebsiteText 

An example row:

http://www.google.com 1 This would be the text of the google webpage. 

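I want to normalize my rank data and use it in my logistic regression. At the moment I only use the "WebsiteText" column: I run TF-IDF on it and feed the result into the logistic regression. I would like to understand how to combine this column with my normalized GooglePageRank column and use both in the logistic regression. How can I do that?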

Here is my code so far:

    import numpy as np
    from sklearn import metrics,preprocessing,cross_validation 
    from sklearn.feature_extraction.text import TfidfVectorizer 
    import sklearn.linear_model as lm 
    import pandas as p 
    loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ') 

    print "loading data.." 
    traindata = list(np.array(p.read_table('train.tsv'))[:,2]) 
    testdata = list(np.array(p.read_table('test.tsv'))[:,2]) 
    y = np.array(p.read_table('train.tsv'))[:,-1] 

    tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', 
     analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) 

    rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
          C=1, fit_intercept=True, intercept_scaling=1.0, 
          class_weight=None, random_state=None) 

    X_all = traindata + testdata 
    lentrain = len(traindata) 

    print "fitting pipeline" 
    tfv.fit(X_all) 
    print "transforming data" 
    X_all = tfv.transform(X_all) 

    X = X_all[:lentrain] 
    X_test = X_all[lentrain:] 

    print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc')) 

    print "training on full data" 
    rd.fit(X,y) 
    pred = rd.predict_proba(X_test)[:,1] 
    testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1) 
    pred_df = p.DataFrame(pred, index=testfile.index, columns=['label']) 
    pred_df.to_csv('benchmark.csv') 
    print "submission file created.." 

*Edit:*

This is what I am currently running:

from sklearn import metrics,preprocessing,cross_validation 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.feature_extraction import DictVectorizer 
import sklearn.preprocessing 
import sklearn.linear_model as lm 
import pandas as p 
loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=',') 
print "loading data.." 

#load train/test data for TF-IDF -- I know this is bad practice, but keeping it this way for the moment! 
traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2]) 
testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2]) 

#load labels 
y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2] 

#Load Integer values and append together 
AllAlexaInfo = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-1] 

#make tfidf object 
tfv = TfidfVectorizer(min_df=1, max_features=None, strip_accents='unicode', 
         analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), 
         use_idf=1,smooth_idf=1,sublinear_tf=1) 
div = DictVectorizer() 
X = [] 
X_all = traindata + testdata 
lentrain = len(traindata) 
# fit/transform the TfidfVectorizer on the training data 
vect = tfv.fit_transform(X_all) #bad practice, but using this for the moment! 

for i, alexarank in enumerate(AllAlexaInfo): 
    feature_dict = {'alexarank': AllAlexaInfo} 
    # get ith row of the tfidf matrix (corresponding to sample) 
    row = vect.getrow(i)  

    # filter the feature names corresponding to the sample 
    all_words = tfv.get_feature_names() 
    words = [all_words[ind] for ind in row.indices] 

    # associate each word (feature) with its corresponding score 
    word_score = dict(zip(words, row.data)) 

    # concatenate the word feature/score with the datamining feature/value 
    X.append(dict(word_score.items() + feature_dict.items())) 

div.fit_transform(X) # training data based on both Tfidf features and pagerank 
sc = preprocessing.StandardScaler().fit(X) 
X = sc.transform(X) 
X_test = X_all[lentrain:] 
X_test = sc.transform(X_test) 

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc')) 

print "training on full data" 
rd.fit(X,y) 
pred = rd.predict_proba(X_test)[:,1] 
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1) 
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label']) 
pred_df.to_csv('benchmark.csv') 
print "submission file created.." 

This seems to run forever, and I also suspect that I am not passing in the "alexarank" value correctly. How can I fix this?

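IIRC, you want to combine the features from the TfidfVectorizer with the PR value, so that your regression classifier makes its choice based on both the text features and the PageRank value?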

@BarthazarRouberol That is correct, yes :)

Answer

Following your reply to my comment, I would implement it like this:

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2),
    use_idf=1,
    smooth_idf=1,
    sublinear_tf=1)
div = DictVectorizer()

X = []

# fit/transform the TfidfVectorizer on the training data
vect = tfv.fit_transform(traindata)

# look up the vocabulary once, outside the loop (rebuilding it per row is slow);
# newer scikit-learn releases call this get_feature_names_out()
all_words = tfv.get_feature_names()

for i, pagerank in enumerate(pageranks):
    feature_dict = {'pagerank': pagerank}
    # get ith row of the tfidf matrix (corresponding to sample)
    row = vect.getrow(i)

    # filter the feature names corresponding to the sample
    words = [all_words[ind] for ind in row.indices]

    # associate each word (feature) with its corresponding score
    word_score = dict(zip(words, row.data))

    # concatenate the word feature/score with the datamining feature/value
    word_score.update(feature_dict)
    X.append(word_score)

# training data based on both Tfidf features and pagerank
X = div.fit_transform(X)
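As an alternative to building one dict per row, the rank can be appended to the TF-IDF matrix as one extra sparse column. This is a different technique from the DictVectorizer loop above, sketched here with `scipy.sparse.hstack`; the texts, ranks and labels are invented toy values, and the log-plus-standardize scaling is only one plausible choice.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the text, rank and label columns.
texts = ["search engine home page", "news and weather reports",
         "cheap pills buy now", "buy cheap watches now"]
ranks = np.array([1, 1117, 900000, 2500000], dtype=float)
y = np.array([1, 1, 0, 0])

tfv = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_text = tfv.fit_transform(texts)  # sparse matrix, one row per page

# Log-compress and standardize the rank, then turn it into a
# one-column sparse matrix so it can be stacked next to the text features.
rank_col = StandardScaler().fit_transform(np.log1p(ranks).reshape(-1, 1))
X = hstack([X_text, csr_matrix(rank_col)]).tocsr()

clf = LogisticRegression(C=1.0)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of the "good" class
```

Because the matrix stays sparse and there is no Python loop over rows, this tends to scale better to thousands of pages. For test data, reuse the fitted vectorizer and scaler with `transform` rather than fitting them again.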

Does that help at all?

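Thank you very much for your reply. In this case, how do you enumerate the page ranks? How do you read them in? Your answer is very helpful, I am just struggling to get it running at the moment. I am a Python beginner, so please bear with me! :) Thanks :)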
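On reading and enumerating the ranks: a minimal sketch, assuming the three-column TSV layout from the question. The inline two-row file and the second row's text are made up for illustration; with a real file, the path would be passed to `pd.read_csv` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical two-row file using the column layout from the question.
tsv = io.StringIO(
    "URL\tGooglePageRank\tWebsiteText\n"
    "http://www.google.com\t1\tThis would be the text of the google webpage.\n"
    "http://www.bbc.co.uk\t1117\tThis would be the text of the bbc webpage.\n"
)
df = pd.read_csv(tsv, sep="\t")

traindata = df["WebsiteText"].tolist()     # input for the TfidfVectorizer
pageranks = df["GooglePageRank"].tolist()  # one number per row, same order

for i, pagerank in enumerate(pageranks):
    # i is the row index; pagerank is that row's single value,
    # not the whole column.
    feature_dict = {"pagerank": pagerank}
```

Selecting columns by name keeps the texts and ranks aligned row by row, which is what the loop in the answer relies on.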

I have updated my question to show the additions I made to my code using your suggestion. Unfortunately, I still cannot get it to run :(