2016-03-05 102 views
1

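Separating joined words at capital letters

Using Python, I have to write a script that essentially "cleans" a text data file. So far I have stripped out all the unwanted characters or replaced them with acceptable ones (for example, a dash - can be replaced with a space). Now I have reached the point where I have to separate words that are joined together. Here is a snippet of the first 15 lines of the text file: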

AccessibleComputing Computer accessibility 
AfghanistanHistory History of Afghanistan 
AfghanistanGeography Geography of Afghanistan 
AfghanistanPeople Demographics of Afghanistan 
AfghanistanCommunications Communications in Afghanistan 
AfghanistanMilitary Afghan Armed Forces 
AfghanistanTransportations Transport in Afghanistan 
AfghanistanTransnationalIssues Foreign relations of Afghanistan 
AssistiveTechnology Assistive technology 
AmoeboidTaxa Amoeba 
AsWeMayThink As We May Think 
AlbaniaHistory History of Albania 
AlbaniaPeople Demographics of Albania 
AlbaniaEconomy Economy of Albania 
AlbaniaGovernment Politics of Albania 

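What I want to do is separate the joined words at the point where a capital letter appears. For example, I would like the first line to look like this: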

Accessible Computing Computer accessibility 

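The script must accept a file as input and write the result to an output file. This is what I have so far, and it does not work at all! (I am not sure whether I am on the right track or not.)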

import re 

input_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt",'r') 
output_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt",'w') 

for line in input_file: 
    if line.contains('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'): 
     newline = line. 

output_file.write(newline) 

input_file.close() 
output_file.close() 

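What I want to do is insert a space before any capital letter that is attached to the preceding word. I saw this topic earlier, but I could not figure out the file input part :( – lsch91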

Answers

1

I suggest using the following regular expression to split the words:

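import re

input_file = 'input.txt'
output_file = 'output.txt'

# A capitalized word (one uppercase letter followed by lowercase letters),
# or, failing that, any run of non-whitespace characters.
pattern = re.compile(r'[A-Z][a-z]+|\S+')

with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        for line in f_in:
            matches = pattern.findall(line)
            # Write '\n' rather than os.linesep: in text mode Python
            # already translates '\n' to the platform line ending.
            f_out.write(' '.join(matches) + '\n')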

Assuming input.txt contains the text you pasted in your post, it will write the following to output.txt:

Accessible Computing Computer accessibility 
Afghanistan History History of Afghanistan 
Afghanistan Geography Geography of Afghanistan 
Afghanistan People Demographics of Afghanistan 
Afghanistan Communications Communications in Afghanistan 
Afghanistan Military Afghan Armed Forces 
Afghanistan Transportations Transport in Afghanistan 
Afghanistan Transnational Issues Foreign relations of Afghanistan 
Assistive Technology Assistive technology 
Amoeboid Taxa Amoeba 
As We May Think As We May Think 
Albania History History of Albania 
Albania People Demographics of Albania 
Albania Economy Economy of Albania 
Albania Government Politics of Albania 
... 
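
To make the behaviour of the pattern easier to see, here is a small sketch that applies it to single strings instead of a file. The helper name `split_line` and the inputs `HTMLParser` and `Area51Base` are made up for illustration and do not come from the sample data; they are only there to show that tokens which do not start with "one capital followed by lowercase letters" fall through to the `\S+` branch.

```python
import re

# Same pattern as in the answer above.
pattern = re.compile(r'[A-Z][a-z]+|\S+')

def split_line(line):
    """Return the line with its matched tokens re-joined by single spaces."""
    return ' '.join(pattern.findall(line))

print(pattern.findall('AsWeMayThink As We May Think'))
# ['As', 'We', 'May', 'Think', 'As', 'We', 'May', 'Think']

# Hypothetical inputs, not taken from the sample file:
print(split_line('HTMLParser'))   # HTMLParser (left untouched)
print(split_line('Area51Base'))   # Area 51Base
```

So the pattern works well for the sample lines, but a run of capitals or a digit inside a token may not be split the way you expect.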

This works! Thank you so much! – lsch91

0

You can do this:

re.sub(r'(?P<end>[a-z])(?P<start>[A-Z])', r'\g<end> \g<start>', line)

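This inserts a space between every lowercase letter and uppercase letter that are adjacent to each other (assuming you only have English characters).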
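
As a quick check on two of the sample lines, the substitution might be wrapped like this. It is only a sketch: the function name `split_camel` is hypothetical, and numbered groups are used in place of the named ones purely for brevity.

```python
import re

# A lowercase ASCII letter immediately followed by an uppercase ASCII letter.
BOUNDARY = re.compile(r'([a-z])([A-Z])')

def split_camel(line):
    """Insert a space at every lowercase-to-uppercase boundary."""
    return BOUNDARY.sub(r'\1 \2', line)

print(split_camel('AfghanistanTransnationalIssues Foreign relations of Afghanistan'))
# Afghanistan Transnational Issues Foreign relations of Afghanistan
print(split_camel('AsWeMayThink As We May Think'))
# As We May Think As We May Think
```

Unlike the `findall` approach, this leaves the rest of the line, including its whitespace and trailing newline, exactly as it was.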


There is also Unicode in the file (it is 300,000 lines long) – lsch91
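
If the file does contain non-English letters, one possible approach is to rely on `str.islower()` and `str.isupper()`, which are Unicode-aware in Python 3, instead of the `[a-z]` and `[A-Z]` character classes. The sketch below assumes Python 3 and a UTF-8 encoded file; the names `split_on_capitals` and `clean_file` are invented for this example. It reads the file one line at a time, so 300,000 lines should not be a problem for memory.

```python
def split_on_capitals(line):
    """Insert a space before an uppercase letter that follows a lowercase one."""
    out = []
    prev = ''
    for ch in line:
        # ''.islower() is False, so the first character never gets a space.
        if prev.islower() and ch.isupper():
            out.append(' ')
        out.append(ch)
        prev = ch
    return ''.join(out)

def clean_file(src, dst, encoding='utf-8'):
    """Stream src line by line and write the split lines to dst."""
    # The encoding is an assumption; use whatever the file is really stored in.
    with open(src, 'r', encoding=encoding) as fin:
        with open(dst, 'w', encoding=encoding) as fout:
            for line in fin:
                fout.write(split_on_capitals(line))

print(split_on_capitals('AlbaniaHistory History of Albania'))
# Albania History History of Albania
```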

1

This is not the best way, but it is simple.

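>>> from string import ascii_uppercase  # works in both Python 2 and 3
>>> s = 'AccessibleComputing Computer accessibility'
>>> # Prefix every capital that is not the first letter of a word with a space.
>>> ' '.join(''.join(' ' + c if n and c in ascii_uppercase else c
...                  for n, c in enumerate(word))
...          for word in s.split())
'Accessible Computing Computer accessibility'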

By the way, this is how you should do your file reading and writing:

f_in = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt"
f_out = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt"

def func(line):
    processed_line = ...  # your line processing function
    return processed_line

with open(f_in, 'r') as fin:
    with open(f_out, 'w') as fout:
        # Iterate over the file object directly so lines are read one at a time.
        for line in fin:
            fout.write(func(line))

Thanks! I will try this and let you know how it goes – lsch91


You are welcome, glad it helped. – Saleem
