2017-09-11

Dataset: matching words

> df 
Id  Clean_Data 
1918916 Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table Chess Billiards room Carom Table Tennis indoor games 
1495638 near medavakkam junction calm area near global hospital 
1050651 No Pre Emi No Booking Amount No Floor Rise Charges No Processing Fee HLPROJECT HIGHLIGHTS 

The code below successfully returns the words and n-grams that match the value list in Category.py:

df['one_word_tokenized_text'] =df["Clean_Data"].str.split() 
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2))) 
df['trigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 3))) 
df['four_words'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 4))) 
token=pd.Series(df["one_word_tokenized_text"]) 
Lid=pd.Series(df["Id"]) 
matches= token.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.HealthCare]))) 
match_list= [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches] 
match_df = pd.DataFrame({"ID":Lid,"jc1":match_list}) 


def match_word(feature, row): 
    # Collect every bigram, trigram and four-word n-gram of the row
    # whose space-joined text appears in the feature list.
    categories = [] 

    for bigram in row.bigram: 
        joined = ' '.join(bigram) 
        if joined in feature: 
            categories.append(joined) 
    for trigram in row.trigram: 
        joined = ' '.join(trigram) 
        if joined in feature: 
            categories.append(joined) 
    for fourwords in row.four_words: 
        joined = ' '.join(fourwords) 
        if joined in feature: 
            categories.append(joined) 
    return categories 

match_df['Health1'] = df.apply(partial(match_word, HealthCare), axis=1) 
match_df['HealthCare'] = match_df[match_df.columns[[1,2]]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1) 
Code:

Category.py

category = [('steam room','IN','HealthCare'), 
     ('sauna','IN','HealthCare'), 
     ('Jacuzzi','IN','HealthCare'), 
     ('Aerobics','IN','HealthCare'), 
     ('yoga room','IN','HealthCare'),] 
HealthCare = [e1 for (e1, rel, e2) in category if e2 == 'HealthCare'] 

Output:

ID HealthCare 
1918916 Jacuzzi 
1495638 
1050651 Aerobics, Jacuzzi, yoga room 

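Here, the code identifies and returns a value only when the entry in the category list uses exactly the same letter case as the dataset; otherwise it returns nothing. I want my code to be case-insensitive, so that it also picks up "Steam Room" and "Sauna" under the HealthCare category. I tried using the `.lower()` function, but I don't know how to apply it here.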

Answer

Edit 2: only Category.py is updated.

Category.py

category = [('steam room','IN','HealthCare'), 
     ('sauna','IN','HealthCare'), 
     ('jacuzzi','IN','HealthCare'), 
     ('aerobics','IN','HealthCare'), 
     ('Yoga room','IN','HealthCare'), 
     ('booking','IN','HealthCare'),   
     ] 
category1 = [value[0].capitalize() for index, value in enumerate(category)] 
category2 = [value[0].lower() for index, value in enumerate(category)] 

test = [] 
test2 =[] 

for index, value in enumerate(category1): 
    test.append((value, category[index][1],category[index][2])) 

for index, value in enumerate(category2): 
    test2.append((value, category[index][1],category[index][2])) 

category = category + test + test2 


HealthCare = [e1 for (e1, rel, e2) in category if e2=='HealthCare'] 
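
One caveat on generating case variants this way: `str.capitalize()` upper-cases only the first character of the whole string, so a phrase written in the dataset as "Steam Room" is not covered by the capitalized variant. The sketch below illustrates the difference with `str.title()`; the phrase list and the sample text are made up for illustration and are not taken from Category.py.

```python
# Hypothetical phrase list, lower-cased for the comparison.
phrases = ['steam room', 'sauna', 'yoga room']

# capitalize(): only the first character of the string is upper-cased.
capitalized = [p.capitalize() for p in phrases]   # 'Steam room', ...
# title(): the first character of every word is upper-cased.
title_cased = [p.title() for p in phrases]        # 'Steam Room', ...

# Sample text in the InitialCaps style used by the dataset.
text = 'Health Club Steam Room Sauna Jacuzzi'

# Plain substring checks are case-sensitive, so the variant must match exactly.
found_capitalized = [p for p in capitalized if p in text]
found_title = [p for p in title_cased if p in text]

print(found_capitalized)
print(found_title)
```

Even with `title()` added, mixed spellings such as "STEAM ROOM" would still be missed, which is why enumerating variants tends to be more fragile than comparing in a case-insensitive way.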

Your dataset, unchanged:

import pandas as pd 
from nltk import ngrams, word_tokenize 
import Categories 
from Categories import * 
from functools import partial 


data = {'Clean_Data':['Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table Chess Billiards room Carom Table Tennis indoor games', 
        'near medavakkam junction calm area near global hospital', 
        'No Pre Emi No Booking Amount No Floor Rise Charges No Processing Fee HLPROJECT HIGHLIGHTS '], 
'Id' : [1918916, 1495638,1050651]} 

df = pd.DataFrame(data) 


df['one_word_tokenized_text'] =df["Clean_Data"].str.split() 
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2))) 
df['trigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 3))) 
df['four_words'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 4))) 
token=pd.Series(df["one_word_tokenized_text"]) 
Lid=pd.Series(df["Id"]) 
matches= token.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.HealthCare]))) 
match_list= [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches] 
match_df = pd.DataFrame({"ID":Lid,"jc1":match_list}) 


def match_word(feature, row): 
    # Collect every bigram, trigram and four-word n-gram of the row
    # whose space-joined text appears in the feature list.
    categories = [] 

    for bigram in row.bigram: 
        joined = ' '.join(bigram) 
        if joined in feature: 
            categories.append(joined) 
    for trigram in row.trigram: 
        joined = ' '.join(trigram) 
        if joined in feature: 
            categories.append(joined) 
    for fourwords in row.four_words: 
        joined = ' '.join(fourwords) 
        if joined in feature: 
            categories.append(joined) 
    return categories 

match_df['Health1'] = df.apply(partial(match_word, HealthCare), axis=1) 
match_df['HealthCare'] = match_df[match_df.columns[[1,2]]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1) 

Output

print(match_df) 

+---------+------------------+---------+-------------------------------------+
| ID      | jc1              | Health1 | HealthCare                          |
+---------+------------------+---------+-------------------------------------+
| 1918916 | [sauna, jacuzzi] |         | ['sauna', 'jacuzzi'],['steam room'] |
+---------+------------------+---------+-------------------------------------+
| 1495638 |                  |         |                                     |
+---------+------------------+---------+-------------------------------------+
| 1050651 | [Booking]        |         | ['Booking'],[]                      |
+---------+------------------+---------+-------------------------------------+
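
If the goal is to match regardless of case while reporting the words exactly as they appear in the dataset, one alternative is a single case-insensitive regular expression run over the raw text. This is only a sketch under assumptions: the names `health_care`, `pattern` and `find_categories` are hypothetical, the phrase list is a lower-cased copy of the HealthCare values, and it uses only the standard `re` module rather than the n-gram columns from the code above.

```python
import re

# Assumed phrase list (lower-cased copy of the HealthCare values).
health_care = ['steam room', 'sauna', 'jacuzzi', 'aerobics', 'yoga room']

# Try longer phrases first so a multi-word phrase is preferred over a
# shorter alternative that starts at the same position.
alternatives = sorted(health_care, key=len, reverse=True)

# \b keeps matches on word boundaries; re.IGNORECASE makes the comparison
# case-insensitive without changing the text being searched.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(p) for p in alternatives) + r')\b',
    re.IGNORECASE,
)

def find_categories(text):
    """Return the matched phrases with the casing used in the text."""
    return pattern.findall(text)

row = 'Health Club Steam Room Sauna Jacuzzi Pool Table'
print(find_categories(row))
print(find_categories('near medavakkam junction calm area'))
```

Because the pattern is applied to the original string, neither the dataset nor the category list needs extra case variants. With pandas, the same pattern string could be passed to `Series.str.findall` together with `flags=re.IGNORECASE` to get one list of matches per row.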
No, I should not modify the values in my dataset. I just want to match these words against the category values regardless of case. –

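OK. You had already added columns to your dataset, so I edited my answer based on how I read it. You can either: a) create lower-case or upper-case variants of the 3 columns you are building, or b) try to reproduce (with Python code) every possible case format in your Category.py. The latter seems like overkill. – Pelican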

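Sorry if my question was confusing. I understand your point, but my concern is that the case of the values in my final output should not differ from what I received in the dataset. If "Sauna" and "Steam Room" are in InitialCaps, the output must keep them that way. I mean that if my dataset contains similar words in the future, my code must be case-insensitive in order to detect them. :) –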