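I have the following code that extracts sentences from a directory of text files. How do I append the resulting strings to a pandas DataFrame?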
# -*- coding: utf-8 -*-
import os

from nltk.tokenize import sent_tokenize
import pandas as pd

directory_in_str = "E:\\Extracted\\"
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in:
        for line in f_in:
            # Split each line into a list of sentence strings
            sentences = sent_tokenize(line)
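I want to build a pandas DataFrame and append the sentences to it, so that I can compute n-gram frequency counts over the sentences as described in "How to find ngram frequency of a column in a pandas dataframe?"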
That is, I need to append the sentences to df = pd.DataFrame([], columns=['description']) so that I can then do:
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])
What is the code for adding the sentences to the df DataFrame?
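One possible approach, sketched under assumptions: collect the sentences in a plain Python list and build the DataFrame once at the end, instead of appending to the DataFrame inside the loop. The helper name `sentences_to_frame` and its `tokenize` parameter are hypothetical and introduced only for illustration; the column name `description` comes from the question.

```python
import os

import pandas as pd


def sentences_to_frame(directory, tokenize):
    """Collect every sentence from every file in `directory` into one DataFrame.

    `tokenize` is any callable that maps a line of text to a list of
    sentence strings (hypothetical parameter, e.g. nltk's sent_tokenize).
    """
    rows = []
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, encoding="utf8") as f_in:
            for line in f_in:
                # extend() adds each sentence as its own future row
                rows.extend(tokenize(line))
    # Build the frame once; one sentence per row in the 'description' column
    return pd.DataFrame(rows, columns=["description"])
```

With the setup from the question, this could be called as df = sentences_to_frame(directory_in_str, sent_tokenize), after which df['description'] can be passed to CountVectorizer as shown above. Building the list first avoids copying the DataFrame on every append.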
If I do `ngram_freq = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])` and `df.index.name = 'ngram'` and `ngram_freq[ngram_freq.ngram == 'youtube']`, I can't get the frequency count for youtube. Any idea how to do this? – Superdooperhero
@Superdooperhero Do you mean `ngram_freq[ngram_freq.index == 'youtube']`? –
Sorry, that should be `ngram_freq.index.name = 'ngram'` – Superdooperhero
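To illustrate the lookup discussed in these comments, here is a small sketch using made-up frequency values (the numbers and n-grams below are hypothetical, not output from the question's data). Naming the index 'ngram' does not create an `ngram` column, so attribute access such as `ngram_freq.ngram` is not available until the index is turned into a column.

```python
import pandas as pd

# Hypothetical frequencies, standing in for the CountVectorizer result
ngram_freq = pd.DataFrame(
    {"frequency": [3, 1, 2]},
    index=["youtube", "watch youtube", "video"],
)
ngram_freq.index.name = "ngram"

# Option 1: look the n-gram up by its index label
by_label = ngram_freq.loc["youtube", "frequency"]

# Option 2: boolean mask on the index, as suggested in the comment
by_mask = ngram_freq[ngram_freq.index == "youtube"]

# Option 3: move the index into a real 'ngram' column, then filter on it
as_column = ngram_freq.reset_index()
by_column = as_column[as_column.ngram == "youtube"]
```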