import codecs, os
import re
import string
import mysql
import mysql.connector
y_ = ""
'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root, file), "r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines
'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)
tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = ''
for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    tokenized_docs_no_punctuation += new_review
print(tokenized_docs_no_punctuation)
'''Connecting and inserting tokenized documents without punctuation in database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user='root', password='', unix_socket="/tmp/mysql.sock", database='test')
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""", (cursor.lastrowid, tokenized_docs_no_punctuation[i]))
        conn.commit()
        conn.close()

if __name__ == '__main__':
    connect()
After running the above code, the result in the database looks like this:
| 2  | S | N |
| 3  | S | o |
| 4  | S |   |
| 5  | S | d |
| 6  | S | o |
| 7  | S | u |
| 8  | S | b |
| 9  | S | t |
| 10 | S |   |
| 11 | S | m |
| 12 | S | y |
| 13 | S |   |
| 14 | S | d |
That is, I am getting single letters in the database instead of whole sentences after applying nltk's sentence tokenizer in Python 3.5.1.
It should look like this:

| 1 | S | No doubt, my dear friend. |
| 2 | S | no doubt.                 |
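For reference, here is a minimal, untested sketch of how the tokenizing and insert steps could be reworked so that one row per sentence is stored instead of one row per character. It assumes the same splitted_sentences table, column names, and connection settings as in the code above; the sample text assigned to y_ and the explicit counter used for sentence_id (replacing cursor.lastrowid) are my own assumptions, not part of the original code.

import re
import string
import mysql.connector
from nltk.tokenize import sent_tokenize

y_ = "No doubt, my dear friend. no doubt."   # sample text; in the real script y_ is built from the .txt files

regex = re.compile('[%s]' % re.escape(string.punctuation))

sentences = sent_tokenize(y_)                      # one list entry per sentence
cleaned = [regex.sub('', s) for s in sentences]    # strip punctuation inside each sentence
cleaned = [s for s in cleaned if s]                # drop sentences that became empty

conn = mysql.connector.connect(user='root', password='',
                               unix_socket="/tmp/mysql.sock", database='test')
cursor = conn.cursor()
for i, sentence in enumerate(cleaned, start=1):    # explicit counter for sentence_id (assumption)
    cursor.execute(
        "INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)",
        (i, sentence))
conn.commit()
conn.close()

Because the insert loop walks over a list of sentences rather than over the characters of one concatenated string, each execute() call receives a whole sentence as the second parameter.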