2013-05-15 42 views
0

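Search for and extract WH-words from the lines of a file using Python and regular expressions

I have a file with one sentence per line. I am trying to read the file, use a regular expression to check whether each sentence is a question, extract the wh-word from the sentence, and save the extracted words to another file in the order in which they appear in the first file.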

This is what I have so far:

def whWordExtractor(inputFile): 
    try: 
     openFileObject = open(inputFile, "r") 
     try: 

      whPattern = re.compile(r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE) 
      with openFileObject as infile: 
       for line in infile: 

        whWord = whPattern.search(line) 
        print whWord 

# Save the whWord extracted from inputFile into another whWord.txt file 
#     writeFileObject = open('whWord.txt','a')     
#     if not whWord: 
#      writeFileObject.write('None' + '\n') 
#     else: 
#      whQuestion = whWord 
#      writeFileObject.write(whQuestion+ '\n') 

     finally: 
      print 'Done. All WH-word extracted.' 
      openFileObject.close() 
    except IOError: 
     pass 

The result after running the code above: set([]) 

Is there something I am doing wrong? I would be grateful if someone could point it out.

+0

Does the program work correctly? –

+0

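Not the way I want it to. It returns an empty list when it should return or print the wh-words extracted from the file. I am using print to test whether I get the right words. – Cryssie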

+0

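Do you only want to match the first WH-word? For example, "What is the name of the person who is president?" would return 'What' even though it also contains 'who'. Just something to consider. –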
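To make the point in this comment concrete, here is a minimal sketch of the two behaviours: `search` stops at the first WH-word, while `findall` collects all of them in order. The helper names `first_wh_word` and `all_wh_words` are hypothetical, and the pattern is just the word list from the question written as a plain alternation.

```python
import re

# Word list taken from the question, as a plain alternation.
WH_PATTERN = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)

def first_wh_word(sentence):
    """Return the first WH-word found in the sentence, or None."""
    match = WH_PATTERN.search(sentence)
    return match.group() if match else None

def all_wh_words(sentence):
    """Return every WH-word found in the sentence, in order of appearance."""
    return WH_PATTERN.findall(sentence)

sentence = "What is the name of the person who is president?"
print(first_wh_word(sentence))  # What
print(all_wh_words(sentence))   # ['What', 'who']
```

Which of the two you want depends on whether a sentence should map to exactly one line in the output file.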

Answers

0

Not sure if this is what you're looking for, but you could try something like this:

import re

def whWordExtractor(inputFile):
    try:
        # Plain alternation: search() finds the first WH-word anywhere in the line
        whPattern = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
        with open(inputFile, "r") as infile:
            for line in infile:
                whMatch = whPattern.search(line)
                if whMatch:
                    # search() returns a match object; group() gives the matched text
                    whWord = whMatch.group()
                    print(whWord)
                    # save to file
                else:
                    # no match
                    pass
    except IOError:
        pass
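One thing to watch with a plain alternation like the one in this answer: it also matches inside longer words, and `who` is tried before `whom` and `whose`. The sketch below compares it with a variant that adds `\b` word boundaries. The boundaries are my own assumption about what is wanted here, not part of the answer, and `extract` is a hypothetical helper.

```python
import re

# Pattern as written in the answer.
plain = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
# Assumed variant: same words, but only matched as whole words.
bounded = re.compile(r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)

def extract(pattern, sentence):
    """Return the first match of pattern in sentence, or None."""
    match = pattern.search(sentence)
    return match.group() if match else None

print(extract(plain, "Please show me the report."))    # how  (found inside "show")
print(extract(bounded, "Please show me the report."))  # None
print(extract(plain, "Whose book is this?"))           # Who  ("who" is tried first)
print(extract(bounded, "Whose book is this?"))         # Whose
```

With the boundaries in place, the order of the alternatives no longer matters, because a partial match such as `Who` in `Whose` fails the trailing `\b` and the engine moves on to the next alternative.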
0

Change '(.*)who|what|how|where|when|why|which|whom|whose(\.*)' to ".*(?:who|what|how|where|when|why|which|whom|whose).*\."
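The reason the grouping matters is that `|` has the lowest precedence in a regular expression, so the original pattern is read as `(.*)who`, or `what`, or ..., or `whose(\.*)`. Here is a rough sketch of the difference. Note that the `grouped` pattern below is my own illustration (it uses a capturing group and a lazy prefix so the word can be pulled out with `group(1)`); it is not the exact pattern suggested in this answer.

```python
import re

words = 'who|what|how|where|when|why|which|whom|whose'

# Pattern from the question: the (.*) belongs only to the "who" alternative.
ungrouped = re.compile(r'(.*)' + words + r'(\.*)', re.IGNORECASE)
# Assumed illustration: the alternation is wrapped in a group of its own.
grouped = re.compile(r'.*?(' + words + r').*', re.IGNORECASE)

line = "Tell me who did it."

print(repr(ungrouped.search(line).group()))  # 'Tell me who'
print(repr(grouped.search(line).group(1)))   # 'who'
```

In the ungrouped case the whole match drags along everything before `who`, which is why writing `group()` to the output file would not give just the WH-word.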

1

Something like this:

import re

def whWordExtractor(inputFile):
    try:
        with open(inputFile) as f1:
            # Plain alternation, so that group() returns only the WH-word
            whPattern = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
            with open('whWord.txt', 'a') as f2:  # open the file only once, to reduce I/O operations
                for line in f1:
                    whWord = whPattern.search(line)
                    print(whWord)
                    if not whWord:
                        f2.write('None' + '\n')
                    else:
                        # re.search returns a match object, not a string, so use either
                        # whWord.group() or, better, whPattern.findall(line)
                        whQuestion = whWord.group()
                        f2.write(whQuestion + '\n')
                print('Done. All WH-word extracted.')
    except IOError:
        pass
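For anyone on Python 3, the same idea could be sketched roughly as below. The function name `extract_wh_words`, the `output_path` parameter, and the `\b` word boundaries are my own assumptions rather than part of the answer. It also opens the output in `'w'` mode instead of `'a'`, so that rerunning it does not keep appending to old results; switch back to `'a'` if appending is what you want.

```python
import re

# Assumed pattern: the question's word list, matched as whole words only.
WH_PATTERN = re.compile(r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)

def extract_wh_words(input_path, output_path):
    """Write the first WH-word of each input line (or 'None') to output_path.

    Both files are opened once, and the words are written in input order.
    The list of written values is also returned for convenience.
    """
    results = []
    with open(input_path) as infile, open(output_path, 'w') as outfile:
        for line in infile:
            match = WH_PATTERN.search(line)
            word = match.group() if match else 'None'
            outfile.write(word + '\n')
            results.append(word)
    return results
```

Calling `extract_wh_words('sentences.txt', 'whWord.txt')` (file names are placeholders) produces one output line per input line, so line N of the output always corresponds to line N of the input.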
+0

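Both you and grc answered my question, but I can only accept one answer, so I chose on a first-come, first-served basis. I did give you a +1, though, for the extra step of reducing I/O operations. – Cryssie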
