2017-04-24 94 views
16

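Python: parsing text from multiple txt files

Seeking advice on how to mine items from multiple text files to build a dictionary.

This text file: https://pastebin.com/Npcp3HCM

was manually transformed into this desired data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view

There are thousands of such text files, and they can have different section headings, as shown in these examples: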

  1. https://pastebin.com/wWSPGaLX
  2. https://pastebin.com/9Up4RWHu

I started by reading the files:

from glob import glob 

txtPth = '../tr-txt/*.txt' 
txtFiles = glob(txtPth) 

with open(txtFiles[0],'r') as tf: 
    allLines = [line.rstrip() for line in tf] 

sectionHeading = ['Corporate Participants',
                  'Conference Call Participants',
                  'Presentation',
                  'Questions and Answers']

# print the line number of every line that is exactly a known section heading
for lineNum, line in enumerate(allLines):
    if line in sectionHeading:
        print(lineNum, allLines[lineNum])

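My idea was to find the line numbers where section headings occur, extract the content between those line numbers, and then strip out separators such as the dashed lines. That did not work, and I got stuck trying to create a dictionary of the following kind, so that I can later run various natural language processing algorithms on the mined items.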

{file-name1:{ 
    {date-time:[string]}, 
    {corporate-name:[string]}, 
    {corporate-participants:[name1,name2,name3]}, 
    {call-participants:[name4,name5]}, 
    {section-headings:{ 
     {heading1:[ 
      {name1:[speechOrderNum, text-content]}, 
      {name2:[speechOrderNum, text-content]}, 
      {name3:[speechOrderNum, text-content]}], 
     {heading2:[ 
      {name1:[speechOrderNum, text-content]}, 
      {name2:[speechOrderNum, text-content]}, 
      {name3:[speechOrderNum, text-content]}, 
      {name2:[speechOrderNum, text-content]}, 
      {name1:[speechOrderNum, text-content]}, 
      {name4:[speechOrderNum, text-content]}], 
     {heading3:[text-content]}, 
     {heading4:[text-content]} 
     } 
    } 
} 

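The challenge is that different files may have different headings and a different number of headings. But there will always be a section called "Presentation", and very likely a "Question and Answer" section. These section headings are always delimited by a string of equals signs. The content of different speakers is always separated by a string of dashes. The "speech order" in the Q&A section is indicated by a number in square brackets. The participants are always marked with an asterisk at the beginning of the file, and their names are always on the next line.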
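The markers described above can be told apart one line at a time. The following is a minimal sketch, assuming that delimiter lines consist solely of `=` or `-` characters and that a speaker line ends with the speech order as `[n]`. `classify_line` is a hypothetical helper, and the speaker name and company in the sample are made up.

```python
import re

# Assumed layout: a speaker line ends with the speech order in square brackets.
ORDER_RE = re.compile(r"\[(\d+)\]\s*$")


def classify_line(line):
    """Return a (kind, value) pair for one line of a transcript."""
    text = line.strip()
    if not text:
        return ("blank", None)
    if set(text) == {"="}:
        return ("section_rule", None)   # a run of '=' delimits section headings
    if set(text) == {"-"}:
        return ("speaker_rule", None)   # a run of '-' separates speakers
    match = ORDER_RE.search(text)
    if match:
        return ("speaker", int(match.group(1)))
    return ("text", text)


# Made-up sample lines that follow the assumed layout.
sample = [
    "=" * 20,
    "Questions and Answers",
    "-" * 20,
    "Jane Doe, Example Corp - Analyst [2]",
    "-" * 20,
    "Could you comment on margins?",
]
kinds = [classify_line(line)[0] for line in sample]
print(kinds)
```

A classifier like this only labels lines; grouping them into sections and speeches still needs a loop that remembers the most recent heading and speaker.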

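Any suggestions on how to parse the text files are appreciated. The ideal help would provide guidance on how to generate such a dictionary (or another suitable data structure) for each file, which can then be written to a database.

Thanks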

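-- EDIT --

One of the files looks like this: https://pastebin.com/MSvmHb2e

In it, the "Question & Answer" section is mislabeled as "Presentation", and there is no other "Question & Answer" section.

And a final sample text: https://pastebin.com/jr9WfpV8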

+3

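I would not suggest storing all the text data in a single `dict` object. As you mentioned, there may be a large number of text files to parse, so at run time the Python process would take more and more time to update the `dict` object as it grows, and you may hit an OutOfMemory error if you have some really huge files to process. I would bet on some DBMS to store this kind of data. – ZdaR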

+0

@ZdaR Thanks for the suggestion. After reading your comment I have decided to use a database. I am currently looking into sqlalchemy. – samkhan13

+0

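The mislabeling won't be easy to solve. You would have to use ML techniques to build a classifier that labels a section as either "Presentation" or "Question & Answer", because no reliable cues are present in the text (hand-crafted rules cannot get this kind of pattern recognition 100% right). – entrophy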

Answers

8

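The comments in the code should explain everything. Let me know if anything is underspecified and needs more comments.

In short, I use regular expressions to find the '=' delimiter lines and subdivide the entire text into sections, then handle each type of section separately (for clarity, so that you can tell how each case is handled).

Side note: I use the words 'attendee' and 'author' interchangeably.

EDIT: Updated the code to sort based on the '[x]' pattern found right next to the attendee/author in the presentation/Q&A sections. Also changed the pretty-printing part, since pprint does not handle OrderedDict well.

To strip leading and trailing whitespace, including \n, from a string, simply call str.strip(). If you specifically need to strip only \n, then call str.strip('\n'). Note that both affect only the ends of the string, not whitespace inside it.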
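A small sketch of the difference between trimming the ends of a string and collapsing whitespace inside it. The sample string is made up, and the `" ".join(s.split())` idiom is an addition here, not part of the answer's code.

```python
raw = "\n  Thank you,\n   operator.  \n"

ends_only = raw.strip()             # trims whitespace at both ends only
newlines_only = raw.strip("\n")     # trims only newlines at both ends
collapsed = " ".join(raw.split())   # collapses every whitespace run to one space

print(repr(ends_only))
print(repr(newlines_only))
print(repr(collapsed))
```

The third form is the one to reach for when newlines in the middle of a speech should disappear as well.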

I have modified the code to strip any whitespace around the talks.

import json 
import re 
from collections import OrderedDict 
from pprint import pprint 


# Subdivides a block of text on the lines matched by the delimiting regular expression.
# Leading and trailing newlines are removed from each piece. The text before the first
# delimiter and after the last one (possibly empty) is returned as well.
# >>> example_string = ("=============================\n"
# ...                   "asdfasdfasdf\n"
# ...                   "sdfasdfdfsdfsdf\n"
# ...                   "=============================\n"
# ...                   "asdfsdfasdfasd\n"
# ...                   "=============================\n")
# >>> subdivide(example_string, "^=+")
# ['', 'asdfasdfasdf\nsdfasdfdfsdfsdf', 'asdfsdfasdfasd', '']
def subdivide(lines, regex):
    pattern = re.compile(regex, re.MULTILINE)
    sections = pattern.split(lines)
    sections = [section.strip('\n') for section in sections]
    return sections


# For processing sections with dashes in them. Returns the heading of the section along with
# an ordered dictionary where each key is a subsection's header and each value is its text.
def process_dashed_sections(section):
    subsections = subdivide(section, "^-+")
    heading = subsections[0]  # header of the section.
    d = {key: value for key, value in zip(subsections[1::2], subsections[2::2])}
    index_pattern = re.compile(r"\[(\d+)\]")

    # Sort key: capture the '[x]' pattern in the subsection header and return 'x' as a number.
    def speech_order(item):
        mat = index_pattern.findall(item[0])
        if mat:
            return int(mat[0])
        # Subsections delimited by dashes may lack the '[x]' pattern; those sort first.
        return 0

    o_d = OrderedDict(sorted(d.items(), key=speech_order))
    return heading, o_d


# This renames the keys of the 'd' dictionary to the proper names present in the attendees.
# It searches the 'attendees' list for names contained in the key and replaces the key when
# exactly one name matches; otherwise the original key is kept.
# >>> d = {'mr. man ceo of company [1]' : ' This is talk a' ,
# ...  'ms. woman ceo of company [2]' : ' This is talk b'}
# >>> l = ['mr. man', 'ms. woman']
# >>> new_d = assign_attendee(d, l)
# new_d = {'mr. man': 'This is talk a', 'ms. woman': 'This is talk b'}
# NOTE: an attendee with several turns keeps only the last one under their name.
def assign_attendee(d, attendees):
    new_d = OrderedDict()
    for key, value in d.items():
        matches = [a for a in attendees if a in key]
        # strip leading and trailing whitespace (including '\n') from the text.
        if len(matches) == 1:
            new_d[matches[0]] = value.strip()
        else:
            # no match, or an ambiguous match: keep the original key.
            new_d[key] = value.strip()
    return new_d


if __name__ == '__main__':
    with open('input.txt', 'r') as infile:
        lines = infile.read()

    # regex pattern for matching the sections that contain
    # the list of attendees (those that start with asterisks)
    ppl_pattern = re.compile(r"^(\s+\*)(.+)(\s.*)", re.MULTILINE)

    # regex pattern for matching sections with subsections in them.
    dash_pattern = re.compile(r"^-+", re.MULTILINE)

    ppl_d = dict()
    talks_d = dict()

    # Step 1. Divide the entire document into sections using the '=' divider
    sections = subdivide(lines, "^=+")
    header = ''

    # Step 2. Handle each section like a switch case
    for section in sections:

        # Skip empty sections (e.g. before the first or after the last divider)
        if not section.strip():
            continue

        # Handle headers
        if len(section.split('\n')) == 1:  # likely to match only a header (assuming)
            header = section

        # Handle attendees/authors
        elif ppl_pattern.match(section):
            ppls = ppl_pattern.findall(section)
            d = {key.strip(): value.strip() for (_, key, value) in ppls}
            ppl_d.update(d)

            # assuming that if the previous section was detected as a header,
            # then this section relates to that header
            if header:
                talks_d[header] = d

        # Handle subsections
        elif dash_pattern.findall(section):
            heading, d = process_dashed_sections(section)
            talks_d[heading] = d

        # Else it's just some random text.
        elif header:
            # same assumption: this text belongs to the previous header
            talks_d[header] = section

    # pprint(talks_d)
    # To assign the talks material to the appropriate attendee/author. Still works if no match found.
    for key, value in talks_d.items():
        # plain-text sections are stored as strings and have no speakers to assign
        if isinstance(value, dict):
            talks_d[key] = assign_attendee(value, ppl_d.keys())

    # OrderedDict does not pretty print using 'pprint', so use the json output instead.
    print(json.dumps(talks_d, indent=4))
+0

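I can accept this answer if you can include the speech order along with the speeches in speech_d. The speech order is indicated in square brackets. It would be useful if talk_d were an ordered dictionary. – samkhan13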

+0

How do I strip '\n' from the text in talks_d? – samkhan13

+0

Updated the answer with the requested changes. – entrophy

3

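Could you confirm whether you only need the "Presentation" and "Questions and Answers" sections? Also, the output can be dumped in a CSV format similar to what you have "manually transformed".

The updated solution works for every sample file you provided.

The output corresponds to cells "D:H" of the shared "Parsed-transcript" file.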

# States: 0 = other, 1 = head, 2 = present, 3 = qa, 4 = speaker, 5 = data
def writecell(out, data):
    out.write(data)
    out.write(",")


def only_chars(line, chars):
    # True when every character of the line is one of the delimiter characters.
    return all(c in chars for c in line)


def readfile(fname, outname):
    initstate = 0
    head = ""
    quotes = 0       # 1 while a quoted speech cell is still open
    had_speaker = 0  # 1 once a speaker line has been written for the current block
    with open(fname, "r") as f, open(outname, "w") as out:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if initstate in [0, 5] and only_chars(line, "="):
                if initstate == 5:
                    # close the speech cell and end the row
                    out.write('"')
                    quotes = 0
                    out.write("\n")
                initstate = 1
            elif initstate in [0, 5] and only_chars(line, "-"):
                if initstate == 5:
                    out.write('"')
                    quotes = 0
                    out.write("\n")
                    initstate = 4
            elif initstate == 1 and line == "Presentation":
                initstate = 2
                head = "Presentation"
            elif initstate == 1 and line == "Questions and Answers":
                initstate = 3
                head = "Questions and Answers"
            elif initstate == 1 and only_chars(line, "="):
                initstate = 0
            elif initstate in [2, 3] and only_chars(line, "=-"):
                initstate = 4
            elif initstate == 4 and '[' in line and ']' in line:
                # speaker line of the form "name, firm [order]"
                comma = line.find(',')
                speech_st = line.find('[')
                speech_end = line.find(']')
                if comma == -1:
                    firm = ""
                    speaker = line[:speech_st].strip()
                else:
                    speaker = line[:comma].strip()
                    firm = line[comma+1:speech_st].strip()
                order = line[speech_st+1:speech_end]
                writecell(out, head)  # every row starts with its section heading
                writecell(out, speaker)
                writecell(out, firm)
                writecell(out, order)
                had_speaker = 1
            elif initstate == 4 and only_chars(line, "=-"):
                if had_speaker:
                    # the speech text follows: open a quoted cell
                    initstate = 5
                    out.write('"')
                    quotes = 1
                had_speaker = 0
            elif initstate == 5:
                # speech text; double any quotes so the cell stays valid CSV
                line = line.replace('"', '""')
                out.write(line)
            # any other line is ignored
        if quotes:
            out.write('"')

readfile("Sample1.txt", "out1.csv") 
readfile("Sample2.txt", "out2.csv") 
readfile("Sample3.txt", "out3.csv") 

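Details of the solution

There is a state machine that works as follows:

  1. It detects whether a heading is present and, if so, writes it.
  2. It detects the speaker once the heading has been written.
  3. It writes the notes for that speaker.
  4. It switches to the next speaker, and so on...

You can process the CSV files later. You can also populate the data into any format you want once the basic processing is done.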
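A sketch of such post-processing, assuming each CSV row holds five cells in the order heading, speaker, firm, speech order, speech text (the order in which the code above writes them). `load_rows` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_rows(handle):
    """Group CSV rows by section heading, sorted by speech order."""
    sections = defaultdict(list)
    for row in csv.reader(handle):
        # Skip rows that do not match the assumed five-cell layout.
        if len(row) < 5 or not row[3].strip().isdigit():
            continue
        heading, speaker, firm, order, text = row[:5]
        sections[heading].append((int(order), speaker, firm, text))
    return {heading: sorted(items) for heading, items in sections.items()}


# Made-up rows standing in for one of the generated CSV files.
sample = io.StringIO(
    'Presentation,Jane Doe,Example Corp,1,"Welcome, everyone."\n'
    'Questions and Answers,John Roe,Example Bank,2,"Thanks."\n'
    'Questions and Answers,Jane Doe,Example Corp,1,"Next question, please."\n'
)
parsed = load_rows(sample)
print(parsed["Questions and Answers"])
```

When reading a real file instead of the in-memory sample, open it with `newline=""` as the `csv` module documentation recommends.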

EDIT:

Please replace the function "writecell" with:

def writecell(out, data): 
    data = data.replace('"', '""') 
    out.write('"') 
    out.write(data) 
    out.write('"') 
    out.write(",") 
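The same quoting is also available from the standard library's `csv` module, which may be less error-prone than writing the quotes by hand. A sketch, assuming one row per speech; the row values are made up for illustration.

```python
import csv
import io

# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")

row = ["Presentation", "Jane Doe", 'Example Corp, "Inc."', "1", "Good morning."]
writer.writerow(row)  # commas and quotes inside cells are escaped automatically

line = buffer.getvalue()
round_trip = next(csv.reader(io.StringIO(line)))
print(line)
```

With `csv.QUOTE_ALL` every cell is wrapped in quotes, so a comma inside a company name no longer shifts the columns.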
+0

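Your approach is structurally the closest to my requirement, and it handles all the provided sample files well. But sometimes a comma appears after the company name, which messes up the output structure. I can accept the answer that best handles the samples in the -- EDIT -- section of the question. – samkhan13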

+0

It is fine to write directly to a CSV file, a dictionary, or a database. – samkhan13

+0

Hello, I have updated my answer based on your feedback. Thanks for the feedback. – mangupt