Dataset: unsupervised classification of property/land features, finding words that match n-grams
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]
Id bigram
1952043 [(Swimming,Pool),(Pool,in),(in,the),(the,roof),(roof,top),
1918916 [(Luxury,Apartments),(Apartments,consisting),(consisting,11),
1645751 [(Flat,available),(available,sale),(sale,Medavakkam),
1270503 [(Toddler,Pool),(Pool,with),(with,Jogging),(Jogging,Tracks),
1495638 [(near,medavakkam),(medavakkam,junction),(junction,calm),
I have a Python file (Categories.py).
category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'),
('Swimming Pool', 'IN','Recreation_Ammenities'),
('Toddler Pool', 'IN', 'Recreation_Ammenities'),
('Jogging Tracks', 'IN', 'Recreation_Ammenities')]
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']
To find the words in the bigram column that match the category list:
tokens=pd.Series(df["bigram"])
Lid=pd.Series(df["Id"])
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.Recreation])))
When I run the code above, I get this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
I need help with this.
My desired output is:
Id bigram Recreation_Amenities
1952043 [(Swimming,Pool),(Pool,in),(in,the),.. Swimming Pool
1918916 [(Luxury,Apartments),(Apartments,.. Luxury Apartments
1645751 [(Flat,available),(available,sale)..
1270503 [(Toddler,Pool),(Jogging,Tracks).. Toddler Pool,Jogging Tracks
1495638 [(near,medavakkam),..
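For reference, a minimal sketch of one way to get that output. The `.str` accessor fails because each element of the bigram lists is a tuple rather than a string, so the sketch joins each tuple into a phrase first and then tests membership in the category list. The toy frame, the helper name `match_recreation`, and the inlined `Recreation` list (standing in for `Categories.Recreation`) are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for the real frame; bigrams are written out by hand
# so the example does not depend on nltk.
df = pd.DataFrame({
    "Id": [1952043, 1645751, 1270503],
    "bigram": [
        [("Swimming", "Pool"), ("Pool", "in"), ("in", "the")],
        [("Flat", "available"), ("available", "sale")],
        [("Toddler", "Pool"), ("Pool", "with"),
         ("with", "Jogging"), ("Jogging", "Tracks")],
    ],
})

# Stand-in for Categories.Recreation
Recreation = ["Luxury Apartments", "Swimming Pool",
              "Toddler Pool", "Jogging Tracks"]

def match_recreation(bigrams):
    """Join each (word1, word2) tuple into 'word1 word2' and keep the
    phrases that appear in the Recreation list, comma-separated."""
    joined = (" ".join(pair) for pair in bigrams)
    return ",".join(phrase for phrase in joined if phrase in Recreation)

# Apply to the column itself, so each call receives one list of tuples
df["Recreation_Amenities"] = df["bigram"].apply(match_recreation)
print(df[["Id", "Recreation_Amenities"]])
```

Rows without a matching bigram get an empty string, which corresponds to the blank cells in the desired output.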
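Can you explain what is passed in the 'row' parameter of the function definition? Also, I want to use this function several times, once for each category such as recreation, healthcare, security and so on, so that I can call the same function for n categories. How can I do that?
The function 'match_bigrams' is applied row by row (each row of the data frame is passed into the function). As for your second question, it depends: the function matches against the categories in the 'Recreation' list, so if you extend that list with other categories it should work for n categories.
Yes, but at the moment the condition in the function is 'if joined in Recreation:'. I have several categories like this, and I want to avoid writing the whole function again for each one. Can I call the same function and pass the category name in the call, here: df.apply(match_bigrams, axis=1)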
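One possible way to do this, sketched under assumptions: give the function a second parameter for the list of terms and pass it through the `args` argument of `DataFrame.apply`. The body of `match_bigrams` below is a guess at the shape being discussed, not the original function, and the `category_lists` mapping with its "Security" entry is hypothetical.

```python
import pandas as pd

# Toy data; the bigrams are written out by hand for illustration.
df = pd.DataFrame({
    "Id": [1, 2],
    "bigram": [
        [("Swimming", "Pool"), ("Pool", "in")],
        [("CCTV", "Cameras"), ("Toddler", "Pool")],
    ],
})

# Hypothetical mapping: output column name -> terms to look for.
category_lists = {
    "Recreation_Amenities": ["Swimming Pool", "Toddler Pool"],
    "Security": ["CCTV Cameras"],
}

def match_bigrams(row, terms):
    """row is one DataFrame row (a Series); terms is the category list."""
    joined = (" ".join(pair) for pair in row["bigram"])
    return ",".join(phrase for phrase in joined if phrase in terms)

# The same function is reused for every category; `args` supplies the
# extra positional argument that follows the row.
for column, terms in category_lists.items():
    df[column] = df.apply(match_bigrams, axis=1, args=(terms,))

print(df[["Id", "Recreation_Amenities", "Security"]])
```

With `axis=1`, `row` is a Series holding one row of the frame, which is why the function reads `row["bigram"]`. Adding another category then only means adding another entry to the mapping.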