2017-06-29 76 views
0

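Parsing a large number of XML files to JSON

I am working on a project that requires me to parse a large number of XML files into JSON. I have written code for it, but it is too slow. I have looked at using lxml and BeautifulSoup, but I am not sure how to proceed.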

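I have included my code below. It works as it should, except that it is far too slow: it takes about 24 hours to parse 100,000 records from a file smaller than 100 MB.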

# NOTE: mapped_nodes, single_record_dict, final_list, break_counter,
# flush_counter and file_counter are used below, but their definitions
# were not included in the question.

with open('productdata_29.xml', 'r') as product_data:
    read_product_data = product_data.read()


def record_string_to_dict(record_string):
    '''This function takes a single record in string form and iterates through
    it, and sorts it as a dictionary. Only the nodes present in the mapped_nodes
    dict are added to the new dict (single_record_dict). After each record,
    single_record_dict is flushed to final_list and is then emptied.'''

    # Iterating through the string to find keys and values to put in to
    # single_record_dict.
    while record_string != record_string[::-1]:

        try:
            k = record_string.index('<')

            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l + 1:]
            m = record_string.index('<')
            temp_value = record_string[:m]

            # Cleaning the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]

            n = record_string.index('\n')
            record_string = record_string[n + 2:]

            # Checking the mapped_nodes dict to see if the key from the record is
            # present. If it is, the key is replaced with the mapped key and added
            # to single_record_dict.
            if temp_key in mapped_nodes:
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value

        except Exception:
            break


while len(read_product_data) > 10:

    # Goes through read_product_data to create blocks, each of which is a single
    # record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]

    # Runs the previous function with the single record string found above.
    record_string_to_dict(single_record_string)

    # Flushes single_record_dict to final_list, and empties the dict for the next
    # record.
    final_list.append(single_record_dict)
    single_record_dict = {}

    # Removes the record that was just processed.
    read_product_data = read_product_data[j:]

    # For keeping track/ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')

    # Keeps track of the number of records. Once the set value is reached
    # in the if statement, the list is flushed to a new file.
    break_counter += 1
    flush_counter += 1

    if break_counter == 100 or flush_counter == break_counter:
        # file_counter keeps track of how many files have been created, so the
        # next file has a different int at the end.
        with open('record_list_' + str(file_counter) + '.txt', 'w') as record_list:
            record_list.write(str(final_list))
        file_counter += 1

        # Resets the break counter.
        break_counter = 0
        final_list = []

    # For testing purposes. Causes execution to stop once the number of files
    # written matches the integer.
    if file_counter == 2:
        break

print('All records have been appended.')
+0

Please include the input XML and the desired output JSON so that the example is [reproducible](https://stackoverflow.com/help/mcve). – Parfait

Answers

2

Is there any reason why you are not considering packages such as xml2json or xmltodict? See this post for working examples: How can i convert an xml file into JSON using python?

Relevant code, reproduced from the post above:

xml2json

import xml2json 
s = '''<?xml version="1.0"?> 
    <note> 
     <to>Tove</to> 
     <from>Jani</from> 
     <heading>Reminder</heading> 
     <body>Don't forget me this weekend!</body> 
    </note>''' 
print(xml2json.xml2json(s))

xmltodict

import xmltodict, json 
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>') 
json.dumps(o) # '{"e": {"a": ["text", "text"]}}' 

See this post if you are working in Python 3: https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/

import json 
import xmltodict 

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:  # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)
+0

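I would definitely try the item_callback parameter here, so that each element is appended to the end of the JSON file as it is parsed. In fact, I am not sure the whole file can be held in memory as a dict. See help(xmltodict.parse) for more information. –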
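As a rough sketch of that suggestion, the snippet below uses the `item_depth` and `item_callback` arguments of `xmltodict.parse` to write each record out as one JSON line as soon as it is parsed, instead of building one large dict. The `<records>`/`<record>` layout, the field names and the helper name `stream_records_to_jsonl` are assumptions made for illustration, since the real input file was not shown.

```python
import io
import json

import xmltodict


def stream_records_to_jsonl(xml_file, out_file, item_depth=2):
    """Write one JSON object per line for every element found at item_depth."""
    count = 0

    def handle_item(path, item):
        nonlocal count
        # path is a list of (tag, attributes) pairs from the root down to this item;
        # item is the parsed content of the element (a dict for nested elements).
        out_file.write(json.dumps(item) + "\n")
        count += 1
        return True  # a falsy return value stops the parse early

    xmltodict.parse(xml_file, item_depth=item_depth, item_callback=handle_item)
    return count


# Hypothetical sample data standing in for the real product file.
sample = b"""<records>
  <record><id>1</id><name>Widget</name></record>
  <record><id>2</id><name>Gadget</name></record>
</records>"""

out = io.StringIO()
n_written = stream_records_to_jsonl(io.BytesIO(sample), out)
lines = out.getvalue().splitlines()
print(n_written, "records written")
```

With a real file, `xml_file` would be opened in `"rb"` mode and `out_file` would be a text file opened for writing; `item_depth` has to match how deeply the record elements are nested.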

0

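You definitely do not want to parse the XML by hand. As well as the libraries mentioned by others, you could use an XSLT 3.0 processor. To go above 100 MB you would benefit from a streaming processor such as Saxon-EE, but up to that sort of size the open-source Saxon-HE should be able to cope. You have not shown the source XML or the target JSON, so I cannot give you specific code. The assumption in XSLT 3.0 is that you probably want a customized transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how the different parts of your input XML should be handled.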
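This is not XSLT, but the same principle (let a real parser do the work and decide per element what to keep) can be sketched in Python with the standard library's `xml.etree.ElementTree.iterparse`. The `<record>` tag and the `mapped_nodes` name come from the question's code; the contents of `mapped_nodes`, the `<products>` root and the field names are hypothetical, since the real data was not shown.

```python
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML tag names to output keys; the question's
# real mapped_nodes dict was not shown.
mapped_nodes = {"title": "name", "price": "cost"}


def iter_records(xml_source, record_tag="record"):
    """Yield one dict per record element, keeping only tags listed in mapped_nodes."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != record_tag:
            continue
        yield {
            mapped_nodes[child.tag]: (child.text or "").strip()
            for child in elem
            if child.tag in mapped_nodes
        }
        elem.clear()  # free the record's children once they have been used


# Hypothetical sample data standing in for the real product file.
sample = b"""<products>
  <record><title>Widget</title><price>9.99</price><sku>A1</sku></record>
  <record><title>Gadget</title><price>19.50</price></record>
</products>"""

records = list(iter_records(io.BytesIO(sample)))
print(json.dumps(records, indent=2))
```

Because `iter_records` is a generator, the records could also be written out in batches of 100, as the question's loop does, without holding the whole file in memory.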