import codecs, os
import re
import string
import mysql
import mysql.connector

y_ = ""

'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan  Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root, file), "r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines

'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)

tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]

'''Removing punctuation marks.'''

regex = re.compile('[%s]' % re.escape(string.punctuation))

tokenized_docs_no_punctuation = ''

for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    tokenized_docs_no_punctuation += new_review
print(tokenized_docs_no_punctuation)

'''Connecting and inserting tokenized documents without punctuation into a database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user='root', password='', unix_socket="/tmp/mysql.sock", database='test')
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""", (cursor.lastrowid, tokenized_docs_no_punctuation[i]))
        conn.commit()
        conn.close()

if __name__ == '__main__':
    connect()


After running the above code, the result in the database looks like this:

|  2 | S | N |
|  3 | S | o |
|  4 | S |   |
|  5 | S | d |
|  6 | S | o |
|  7 | S | u |
|  8 | S | b |
|  9 | S | t |
| 10 | S |   |
| 11 | S | m |
| 12 | S | y |
| 13 | S |   |
| 14 | S | d |
That is, after applying nltk's sentence tokenizer I get single letters in the database instead of sentences (Python 3.5.1).

It should be like: 
    1 | S  | No doubt, my dear friend. 
    2 | S  | no doubt.                                         

Answers

nw = []
for review in tokenized_docs[0]:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    nw.append(new_review)  # collect each cleaned sentence

'''Inserting into database'''
def connect():
    for j in nw:
        conn = mysql.connector.connect(user='root', password='', unix_socket="/tmp/mysql.sock", database='Thesis')
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""", (cursor.lastrowid, j))
        conn.commit()
        conn.close()

if __name__ == '__main__':
    connect()

The above works. –


I would suggest the following edits (use whichever you like); this is what I used to get your code running. Your problem is that review in "for review in tokenized_docs:" is already a string, which makes token in "for token in review:" a single character. A short illustration of that pitfall follows, and then the fix.
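A minimal sketch of the pitfall (added here for illustration; the sample string is hypothetical, not from the original post):

review = "No doubt."
for token in review:
    print(token)  # iterating over a string yields one character at a time: 'N', 'o', ' ', 'd', ...

So, to fix this, I did the following -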

tokenized_docs = ['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."'] 

'''Removing punctuation marks.'''

regex = re.compile('[%s]' % re.escape(string.punctuation))

tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_token = regex.sub(u'', review)
    if not new_token == u'':
        tokenized_docs_no_punctuation.append(new_token)

print(tokenized_docs_no_punctuation)

and got this -

['No doubt my dear friend no doubt but in the meanwhile suppose we talk of this annuity', 'Shall we say one thousand francs a year', 'What', 'asked Bonelle looking at him very fixedly', 'My dear friend I mistook I meant two thousand francs per annum hurriedly rejoined Ramin', 'Monsieur Bonelle closed his eyes and appeared to fall into a gentle slumber', 'The mercer coughed\nthe sick man never moved', 'Monsieur Bonelle'] 

The final format of the output is up to you. I prefer working with a list, but you can also join it into a single string.
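For example, a one-line sketch of the joined-string alternative (assuming the tokenized_docs_no_punctuation list built above):

joined = ' '.join(tokenized_docs_no_punctuation)  # one string, sentences separated by spaces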


The output of tokenized_docs is: [['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed; the sick man never moved.', '"Monsieur Bonelle."', 'No reply.', .... –


Thank you. But how can we pass the objects of the list into the database? –


Assuming the splitted_sentences column is a string type, this line is accurate - cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""", (cursor.lastrowid, tokenized_docs_no_punctuation[i])). Otherwise, join the list into one string with "".join(tokenized_docs_no_punctuation). –
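For completeness, a minimal sketch of inserting each list item as its own row (an illustration added here, not code from the original thread; it reuses the connection settings from the question and assumes sentence_id is an AUTO_INCREMENT column, so only the text is supplied):

import mysql.connector

def insert_sentences(sentences):
    # Open one connection and reuse it for every row.
    conn = mysql.connector.connect(user='root', password='',
                                   unix_socket='/tmp/mysql.sock',
                                   database='test')
    cursor = conn.cursor()
    # executemany runs the INSERT once per (sentence,) tuple.
    cursor.executemany(
        "INSERT INTO splitted_sentences(splitted_sentences) VALUES (%s)",
        [(s,) for s in sentences])
    conn.commit()
    conn.close()

if __name__ == '__main__':
    insert_sentences(['No doubt my dear friend no doubt', 'Shall we say one thousand francs a year'])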
