2013-10-25 31 views

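Removing embedded newlines from a delimited file?

I have a caret-delimited file. The only carets in the file are delimiters; none appear in the text. Several fields are free-text fields and contain embedded newlines, which makes the file very hard to parse. I need to keep the newline at the end of each record, but remove the newlines inside the text fields.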

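This is open-source maritime piracy data from the Global Integrated Shipping Information System. Below are three records, preceded by a header row. The ship name is NORMANNIA in the first record, Unkown in the second, and KOTA BINTANG in the third.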

ship_name^ship_flag^tonnage^date^time^imo_num^ship_type^ship_released_on^time_zone^incident_position^coastal_state^area^lat^lon^incident_details^crew_ship_cargo_conseq^incident_location^ship_status_when_attacked^num_involved_in_attack^crew_conseq^weapons_used_by_attackers^ship_parts_raided^lives_lost^crew_wounded^crew_missing^crew_hostage_kidnapped^assaulted^ransom^master_crew_action_taken^reported_to_coastal_authority^reported_to_which_coastal_authority^reporting_state^reporting_intl_org^coastal_state_action_taken 
NORMANNIA^Liberia^24987^2009-09-19^22:30^9142980^Bulk carrier^^^Off Pulau Mangkai,^^South China Sea^3° 04.00' N^105° 16.00' E^Eight pirates armed with long knives and crowbars boarded the ship underway. They broke into 2/O cabin, tied up his hands and threatened him with a long knife at his throat. Pirates forced the 2/O to call the Master. While the pirates were waiting next to the Master’s door, they seized C/E and tied up his hands. The pirates rushed inside the Master’s cabin once it was opened. They threatened him with long knives and crowbars and demanded money. Master’s hands were tied up and they forced him to the aft station. The pirates jumped into a long wooden skiff with ship’s cash and crew personal belongings and escaped. C/E and 2/O managed to free themselves and raised the alarm^Pirates tied up the hands of Master, C/E and 2/O. The pirates stole ship’s cash and master’s, C/E & 2/O cash and personal belongings^In international waters^Steaming^5-10 persons^Threat of violence against the crew^Knives^^^^^^^^SSAS activated and reported to owners^^Liberian Authority^^ICC-IMB Piracy Reporting Centre Kuala Lumpur^- 
Unkown^Marshall Islands^19846^2013-08-28^23:30^^General cargo ship^^^Cam Pha Port^Viet Nam^South China Sea^20° 59.92' N^107° 19.00' E^While at anchor, six robbers boarded the vessel through the anchor chain and cut opened the padlock of the door to the forecastle store. They removed the turnbuckle and lashing of the forecastle store's rope hatch. The robbers escaped upon hearing the alarm activated when they were sighted by the 2nd officer during the turn-over of duty watch keepers.^"There was no injury to the crew however, the padlock of the door to the forecastle store and the rope hatch were cut-opened. 

Two centre shackles and one end shackle were stolen"^In port area^At anchor^5-10 persons^^None/not stated^Main deck^^^^^^^-^^^Viet Nam^"ReCAAP ISC via ReCAAP Focal Point (Vietnam) 

ReCAAP ISC via Focal Point (Singapore)"^- 
KOTA BINTANG^Singapore^8441^2002-05-12^15:55^8021311^Bulk carrier^^UTC^^^South China Sea^^^Seven robbers armed with long knives boarded the ship, while underway. They broke open accommodation door, held hostage a crew member and forced the Master to open his cabin door. They then tied up the Master and crew member, forced them back onto poop deck from where the robbers jumped overboard and escaped in an unlit boat^Master and cadet assaulted; Cash, crew belongings and ship's cash stolen^In territorial waters^Steaming^5-10 persons^Actual violence against the crew^Knives^^^^^^2^^-^^Yes. SAR, Djakarta and Indonesian Naval Headquarters informed^^ICC-IMB PRC Kuala Lumpur^- 

You will notice that the first and third records are fine and easy to parse. The second record, "Unkown", contains several embedded newlines.

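How should I remove the embedded newlines (but not the ones at the end of each record) in a Python script, or by some simpler method if there is one, so that I can import this data into SAS?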

Answers

1

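I solved this by counting the delimiters encountered and manually starting a new record once I reached the number of delimiters that a complete record contains. I then removed all newline characters and wrote the data back to a new file. In essence, the result is the original file with the newlines removed from the fields and a newline added at the end of each record. Here is the code: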

# 34 fields per record, so a complete record contains 33 carets
carets_per_record = 33

records = []
temp_str = ''
building = False

with open("events.csv", "r") as f:
    for line in f:
        # If we are not building a record, start a new one from this line;
        # otherwise append the line to the partial record
        if not building:
            temp_str = line
        else:
            temp_str = temp_str + " " + line

        # Count the carets accumulated so far
        temp_cnt = temp_str.count('^')

        # Too few carets: the record continues on the next line
        if temp_cnt < carets_per_record:
            building = True

        # Exactly the right number of carets: the record is complete
        # and we can push it to the list
        elif temp_cnt == carets_per_record:
            building = False
            records.append(temp_str)

# Strip the newline characters from each assembled record
final_file = [record.replace('\n', '') for record in records]

# Write the records to the new file, one record per line
# (text mode "w" so that str can be written)
with open("new_events.csv", "w") as g:
    for record in final_file:
        g.write(record + '\n')
1

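Load the data into a string a, then do

import re 
newa=re.sub('\n','',a) 

There will be no newlines in newa.

newa=re.sub('\n(?!$)','',a) 

This one leaves the newline at the end of the line but strips the rest.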

+1

Wouldn't this also remove the newlines at the end of each record? – Clay

+0

I tried your second example, and it also removed the end-of-line newlines, not just the embedded ones. – Clay
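The behaviour described in this comment can be reproduced with a small check. This is only a sketch on a made-up string (the sample text is an assumption, not the real data): without `re.MULTILINE`, `$` matches only at the very end of the string, so the lookahead protects just the final newline and the record-ending newlines in between are removed as well.

```python
import re

# Hypothetical sample: two records, the first with an embedded newline.
a = "rec1 field^free\ntext^end\nrec2 field^text^end\n"

# Without re.MULTILINE, `$` matches only at the end of the string (or just
# before a final newline), so only the very last "\n" is left in place.
newa = re.sub('\n(?!$)', '', a)

print(repr(newa))
```

The pattern has no notion of where a record ends, which is why an approach that counts delimiters or parses fields is needed here.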

2

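I see you've tagged this as regex, but I would suggest using the built-in CSV library to parse this. The CSV library will parse the file correctly and keep the newlines where it should.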

Python CSV examples: http://docs.python.org/2/library/csv.html
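As a rough illustration of this suggestion, the sketch below reads a made-up miniature of the file with `csv.reader`, using `^` as the delimiter. The sample text and its three-column layout are assumptions for the example. It relies on the multi-line fields being wrapped in double quotes, as they are in the "Unkown" record; fields with stray, unbalanced quotes may not parse this cleanly.

```python
import csv
import io

# Hypothetical miniature of the file: three fields per record, with the
# second record's last field quoted and spanning several lines.
sample = (
    'ship_name^area^details\n'
    'NORMANNIA^South China Sea^Boarded underway\n'
    'Unkown^South China Sea^"Padlock cut\n'
    '\n'
    'Shackles stolen"\n'
)

# newline='' keeps embedded newlines untouched, as the csv docs recommend
# for file objects passed to the reader.
reader = csv.reader(io.StringIO(sample, newline=''), delimiter='^', quotechar='"')
rows = list(reader)

for row in rows:
    print(len(row), row)
```

Each record comes back as one row, and the embedded newlines stay inside the quoted field instead of splitting the record.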

+0

I agree, the csv library is easy to use and seems to fit your problem – Vorsprung

+0

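Well, what I really need is a csv file with no newlines inside the fields, so that I can import it into SAS. The regex approach to removing those newlines actually seems to take fewer steps. After parsing the data, how would I re-export it to csv to get a well-formed csv file? Some of the internal text fields also have embedded quotes, while others do not. – Clay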

+0

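@Clay: Maybe you could upload a sample file to a gist, and we can show you how to parse it as CSV with the csv module and then re-output it properly. What you really need is field quoting, which your input does not appear to include. – VooDooNOFX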
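One possible shape for that parse-and-re-output round trip is sketched below. The function name `flatten_newlines` and the sample input are hypothetical, and the sketch assumes that multi-line fields are double-quoted in the source. It reads with `csv.reader`, replaces each run of line breaks inside a field with a single space, and writes with `csv.writer`, which quotes a field only when it still contains the delimiter, a quote character, or a line break.

```python
import csv
import io
import re


def flatten_newlines(src, dst, delimiter='^'):
    """Copy delimited records from src to dst, replacing line breaks
    inside fields (and the whitespace around them) with one space."""
    reader = csv.reader(src, delimiter=delimiter)
    # lineterminator='\n' writes exactly one line per record.
    writer = csv.writer(dst, delimiter=delimiter,
                        quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
    for row in reader:
        writer.writerow([re.sub(r'\s*\n\s*', ' ', field) for field in row])


# Usage on an in-memory sample; real files would be opened with newline=''.
src = io.StringIO('a^b\nx^"line one\n\nline two"\n', newline='')
dst = io.StringIO(newline='')
flatten_newlines(src, dst)
out = dst.getvalue()
print(out)
```

Fields that contain a double quote are still written quoted, with the quote doubled, so whether SAS reads them as intended would need to be checked against the import options used.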