Parsing an entire directory with etree.parse (lxml)


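I need to parse the txt files in a directory, all of which contain XML markup (I have already built a corpus with glob), but etree's parse only accepts one file at a time. How do I set up a loop that parses all of the files? The goal is to add these files to Elasticsearch using requests. This is what I have so far: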

import json 
import os 
import re 
from lxml import etree 
import xmltodict 
import glob 

corpus=glob.glob('path/*.txt') 
ns=dict(tei="http://www.tei-c.org/ns/1.0") 
tree = etree.ElementTree(file='path/file.txt') 
doc = { 
    "author": tree.xpath('//tei:author/text()', namespaces=ns)[0], 
    "title": tree.xpath('//tei:title/text()', namespaces=ns)[0], 
    "content": "".join(tree.xpath('//tei:text/text()', namespaces=ns)) 
    }


2016-08-05 adw

Are you asking how to write a for loop?

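Just iterate over the corpus list. You will, however, want a container (such as a list or a dictionary) to hold the data parsed from each file. The code below assumes every .txt file is well-formed XML and that they all share the same structure, including the tei namespace: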

import os, glob 
from lxml import etree 

corpus = glob.glob('path/*.txt') 
ns = dict(tei="http://www.tei-c.org/ns/1.0") 

xmlList = []; xmlDict = {} 

for path in corpus: 
    # PARSE ONE FILE PER ITERATION 
    tree = etree.parse(path) 
    doc = { 
      "author": tree.xpath('//tei:author/text()', namespaces=ns)[0], 
      "title": tree.xpath('//tei:title/text()', namespaces=ns)[0], 
      "content": "".join(tree.xpath('//tei:text/text()', namespaces=ns)) 
      } 
    # LIST OF DOC DICTS 
    xmlList.append(doc) 

    # DICTIONARY OF DOC DICTS, KEY IS FILE NAME WITHOUT ITS EXTENSION 
    key = os.path.splitext(os.path.basename(path))[0] 
    xmlDict[key] = doc
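
Since the stated goal is to get these documents into Elasticsearch, here is a rough sketch of one way to turn a dictionary like xmlDict into a request body for the bulk endpoint. The helper name build_bulk_payload, the index name tei_corpus, and the sample documents are all made up for illustration. The code only builds the newline-delimited JSON body and does not contact a server.

```python
import json


def build_bulk_payload(docs_by_key, index_name):
    """Serialize {doc_id: doc} into newline-delimited JSON for a bulk request.

    Hypothetical helper: each document becomes two lines, an "index" action
    line followed by the document itself.
    """
    lines = []
    for doc_id, doc in docs_by_key.items():
        # Action line: which index to write to and which id to use.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the parsed document (keep non-ASCII text readable).
        lines.append(json.dumps(doc, ensure_ascii=False))
    # A bulk body must end with a newline; an empty input yields an empty body.
    return "\n".join(lines) + "\n" if lines else ""


# Stand-in for the xmlDict built by the loop above (values are invented).
sample = {
    "file1": {"author": "A. Author", "title": "First", "content": "Text one"},
    "file2": {"author": "B. Author", "title": "Second", "content": "Text two"},
}

payload = build_bulk_payload(sample, "tei_corpus")
print(payload)
```

Assuming a node listening on localhost:9200 (an assumption, adjust to your setup), the body could then be sent with something like requests.post('http://localhost:9200/_bulk', data=payload.encode('utf-8'), headers={'Content-Type': 'application/x-ndjson'}). Using the file name as the document id means re-running the script overwrites documents rather than duplicating them.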


2016-08-05 14:34:53 Parfait
