Dataset: matching words
> df
Id Clean_Data
1918916 Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table Chess Billiards room Carom Table Tennis indoor games
1495638 near medavakkam junction calm area near global hospital
1050651 No Pre Emi No Booking Amount No Floor Rise Charges No Processing Fee HLPROJECT HIGHLIGHTS
Below is the code that successfully returns the words in the n-grams that match the list of values in Category.py
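import pandas as pd
from functools import partial
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
df['one_word_tokenized_text'] = df["Clean_Data"].str.split()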
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
df['four_words'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 4)))
token=pd.Series(df["one_word_tokenized_text"])
Lid=pd.Series(df["Id"])
matches= token.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.HealthCare])))
match_list= [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches]
match_df = pd.DataFrame({"ID":Lid,"jc1":match_list})
def match_word(feature, row):
    # Collect every bigram, trigram and four-word phrase of the row that appears in `feature`
    categories = []
    for bigram in row.bigram:
        joined = ' '.join(bigram)
        if joined in feature:
            categories.append(joined)
    for trigram in row.trigram:
        joined = ' '.join(trigram)
        if joined in feature:
            categories.append(joined)
    for fourwords in row.four_words:
        joined = ' '.join(fourwords)
        if joined in feature:
            categories.append(joined)
    return categories
match_df['Health1'] = df.apply(partial(match_word, HealthCare), axis=1)
match_df['HealthCare'] = match_df[match_df.columns[[1,2]]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
Code
Category.py
category = [('steam room','IN','HealthCare'),
('sauna','IN','HealthCare'),
('Jacuzzi','IN','HealthCare'),
('Aerobics','IN','HealthCare'),
('yoga room','IN','HealthCare'),]
HealthCare= [e1 for (e1, rel, e2) in category if e2=='HealthCare']
Output:
ID HealthCare
1918916 Jacuzzi
1495638
1050651 Aerobics, Jacuzzi, yoga room
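Here, if I write the entries in the category list with exactly the same letter case as they have in the dataset, the code identifies them and returns the values; otherwise it does not. So I want my code to be case-insensitive, so that it also picks up "Steam Room" and "Sauna" under the HealthCare category. I tried using the `.lower()` function, but I am not sure how to implement it.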
No, I am not supposed to modify the values in my dataset. I just want to match these words against the category values regardless of case. –
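OK, you have already added columns to your dataset, and I just edited my answer based on how I see it. You can either: a) create lowercase/uppercase versions of the 3 columns you are creating, or b) try to reproduce (with Python code) every possible casing in your Category.py; the latter seems like overkill. – Pelican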
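Option (a) from the comment above could look roughly like the sketch below. It is only an illustration, not the code from the answer: `lowercase_ngrams` and `match_lowercase` are hypothetical helper names, plain `str.split()` stands in for `word_tokenize`, and only the `HealthCare` list comes from the question. The original text is left untouched; the lowercasing happens on a copy.

```python
# HealthCare is taken from Category.py in the question; everything else is a hypothetical helper.
HealthCare = ['steam room', 'sauna', 'Jacuzzi', 'Aerobics', 'yoga room']

def lowercase_ngrams(text, max_n=4):
    """Return every 1- to max_n-word phrase of `text`, lowercased."""
    words = text.lower().split()
    return [' '.join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def match_lowercase(text, feature):
    """Return the phrases of `text` that appear in `feature`, ignoring case."""
    wanted = {entry.lower() for entry in feature}
    return [phrase for phrase in lowercase_ngrams(text) if phrase in wanted]

row = "Health Club Steam Room Sauna Jacuzzi Pool Table"
print(match_lowercase(row, HealthCare))  # ['sauna', 'jacuzzi', 'steam room']
```

Note that the matches come back in lowercase rather than in the casing used in the dataset, which is the limitation of this approach.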
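Sorry if my question was confusing. I understand your point, but my concern is that the letter case of my final output values must not differ from what I received in the dataset. If "Sauna" and "Steam Room" appear with initial caps, the output must match that. What I mean is that if my dataset contains similar words in the future, my code must be case-insensitive in order to detect them. :) –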
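One possible way to meet both requirements (case-insensitive detection, output in the dataset's own casing) is a regular expression compiled with `re.IGNORECASE`, since a regex match returns the text exactly as it appears in the input. This is a minimal sketch under that assumption, not the method from the thread: `build_pattern` and `match_keep_case` are hypothetical names, and only the `HealthCare` list and the sample rows come from the question.

```python
import re

# HealthCare is taken from Category.py in the question; the helpers below are hypothetical.
HealthCare = ['steam room', 'sauna', 'Jacuzzi', 'Aerobics', 'yoga room']

def build_pattern(feature):
    """Compile one case-insensitive pattern that matches any entry as a whole phrase."""
    # Longest entries first, so a longer phrase is tried before a shorter one it contains.
    alternatives = sorted((re.escape(entry) for entry in feature), key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(alternatives) + r')\b', re.IGNORECASE)

def match_keep_case(text, pattern):
    """Return the matched phrases exactly as they are written in `text`."""
    return pattern.findall(text)

pattern = build_pattern(HealthCare)
row = ("Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool "
       "Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table")
print(match_keep_case(row, pattern))  # ['Steam Room', 'Sauna', 'Jacuzzi']
```

With a DataFrame, the same function could be applied per row, for example `df['Clean_Data'].apply(lambda text: match_keep_case(text, pattern))`, which leaves `Clean_Data` itself unmodified.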