Writing to XML - missing lines

Python/programming beginner. I am working on parsing/extracting data from XML.

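Goal: Write a malformed XML file (multiple XML documents inside one .data file) out to separate XML files. FYI: each XML starts with the same declaration in the file, and there are 4 in total.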

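Approach: (1) I use readlines() on the file, (2) find the index of each XML declaration, (3) loop through slices of the list of lines, writing each line to a file. Code below, apologies if it sucks :)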

for i, x in enumerate(decl_indxs):
    xml_file = open(file, 'w')
    if i == 4:
        for line in file_lines[x:]:
            xml_file.write(line)
    else:
        for line in file_lines[x:decl_indxs[i+1]]:
            xml_file.write(line)

Problem: The first 3 XML files are produced without any issue. The 4th XML only gets the first 238 of its 396 lines written.

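Troubleshooting: I modified the code to print out the list slice used for the last loop, and it was fine. I also looped through the 4th list slice on its own, and it output correctly.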

Help: Can anyone explain why this would happen? Suggestions for improving my approach would also be great. The more information the better. Thanks.

Source

2017-01-31 Jordan M

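I don't think finding indexes is a good approach; most likely the indexes got messed up somewhere. Unfortunately, this is not easy to debug, because there are many integer values that carry no obvious meaning. I will try to give you some useful approaches here.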

As I understand your problem, you need to:

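Open the original file that contains multiple XMLs using the with context manager.
Split the content of the original file into several string variables, based on finding the known declaration header string <?xml.
Write each individual valid XML string to its own file, also using the with context manager.
If you need to do further work with these XMLs, you should definitely look at dedicated XML parsers (xml.etree, lxml) and never work with them as plain strings.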

Code example:

def split_to_several_xmls(original_file_path):
    # Assumes the original source is correctly formatted, i.e. each line
    # starts with "<" (ignoring leading spaces); otherwise you need to
    # parse by characters, not by lines.
    with open(original_file_path) as f:
        data = f.read()
    resulting_lists = []
    for line in data.split('\n'):
        if not line.strip():  # ignore empty lines
            continue
        if line.strip().startswith('<?xml '):
            resulting_lists.append([])  # start a new list for the next xml chunk
        if not resulting_lists:  # ignore everything before the first xml declaration
            continue
        resulting_lists[-1].append(line)  # add the current line to the current xml chunk
    # Join the lines back into strings: each string is one valid xml chunk.
    resulting_strings = ['\n'.join(e) for e in resulting_lists]
    return resulting_strings


def save_xmls(xml_strings, filename_base):
    for i, xml_string in enumerate(xml_strings):
        filename = '{base}{i}.xml'.format(base=filename_base, i=i)
        with open(filename, mode='w') as f:
            f.write(xml_string)


def main():
    xml_strings = split_to_several_xmls('original.txt')  # one file holding multiple xmls
    save_xmls(xml_strings, 'result')


if __name__ == '__main__':
    main()
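
If you would rather keep the index-based slicing from the question, the same with advice applies. The sketch below is an assumption about how that loop could look, not a confirmed diagnosis of the missing lines: it gives each chunk its own output file and closes it through with, since a file that is never closed can leave its last buffered lines unwritten. The names write_slices and filename_base are hypothetical; file_lines and decl_indxs are taken from the question.

```python
def write_slices(file_lines, decl_indxs, filename_base):
    """Write each slice of file_lines that starts at a declaration index to its own file."""
    # Pair every start index with the next one; the last chunk runs to the end (None).
    ends = list(decl_indxs[1:]) + [None]
    filenames = []
    for i, (start, end) in enumerate(zip(decl_indxs, ends)):
        filename = '{base}{i}.xml'.format(base=filename_base, i=i)
        # `with` closes the file on exit, which flushes buffered lines to disk.
        with open(filename, mode='w') as xml_file:
            xml_file.writelines(file_lines[start:end])
        filenames.append(filename)
    return filenames
```

Because the end index of the last slice is None, no special case for the final declaration is needed.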

Source

2017-01-31 13:01:53

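Thanks @Nikolay. I double-checked that my indexes are correct, and when I printed the slice it included all the data. Either way, I am most interested in improving the strategy. I looked at the XML parsers you mentioned but could not figure out how to create multiple trees from a single file; I could only get a tree for the first root. While investigating, I came across suggestions on Stack to simplify things and write line by line.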

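Is joining the xml lines into strings (resulting_strings) meant to improve performance when writing? Would skipping that step and using a nested for over resulting_lists inside the save_xmls function be less efficient?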

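Yes, parsers are all about a single XML. Your task is pure string work. That is, once this task is done and you want to process the results, 95% of the time you will need a parser.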
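
To illustrate the point about parsers handling one XML at a time: once the file has been split into strings, each string can be handed to a parser separately, which gives one tree per chunk. This is a minimal sketch, assuming the standard-library xml.etree.ElementTree and that every chunk is well-formed on its own. parse_chunks is a hypothetical helper name, and xml_strings stands for the list returned by split_to_several_xmls in the answer.

```python
import xml.etree.ElementTree as ET


def parse_chunks(xml_strings):
    """Parse each XML string separately and return one root element per chunk."""
    roots = []
    for i, xml_string in enumerate(xml_strings):
        try:
            roots.append(ET.fromstring(xml_string))
        except ET.ParseError as exc:
            # Report which chunk failed instead of silently dropping it.
            raise ValueError('chunk {} is not well-formed XML: {}'.format(i, exc))
    return roots
```

For example, [root.tag for root in parse_chunks(xml_strings)] would list the root tag of each split document.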
