Parsing many HTML files with BeautifulSoup and Python

I have many instances of HTML text structured like this:

<DOC> 
<DOCNO> XXX-2222 </DOCNO> 
<FIRST>Reports Former Saigon Officials Released from Re-education Camp</FIRST> 
<TEXT> 
Lots of text here 
</TEXT> 
</DOC> 
<DOC> 
<DOCNO> YYYY-0001 </DOCNO> 
<FIRST>AP-ONU-ISRAEL -URGENT-</FIRST> 
<TEXT> 
Text 
</TEXT> 
</DOC> 
etc, etc...

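What I need to do is index each structure by its DOCNO, FIRST, and TEXT so that I can analyze it later (tokenization, etc.).

I was thinking of using BeautifulSoup, but I need to extract several things together. How do I do that and keep them linked to each other?

I would like the result in a format such as: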

[("XXX-2222", "Reports Former Saigon Officials Released from Re-education Camp", "Lots of text here"), ("YYYY-0001", "AP-ONU-ISRAEL -URGENT-", "Text"), ...]

Thanks!


2013-02-14 user2070177

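This is not HTML. Not even close! Could it be *XML*? – 2013-02-14 19:38:31

The file format is HTML, and the files themselves are part of a language corpus. – user2070177 2013-02-14 19:42:18

I don't understand. Whatever you posted here is *not* HTML. Are you asking us how to parse HTML that you haven't shown? Also, what code have you tried? – 2013-02-14 19:44:36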

As far as I can tell this is not HTML, so I'm not going to use BeautifulSoup. Here is an ElementTree approach:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from collections import namedtuple 

xml = """ 
<DOC> 
<DOCNO> XXX-2222 </DOCNO> 
<FIRST>Reports Former Saigon Officials Released from Re-education Camp</FIRST> 
<TEXT> 
Lots of text here 
</TEXT> 
</DOC> 
<DOC> 
<DOCNO> YYYY-0001 </DOCNO> 
<FIRST>AP-ONU-ISRAEL -URGENT-</FIRST> 
<TEXT> 
Text 
</TEXT> 
</DOC> 
""" 

Record = namedtuple('DOC', 'DOCNO FIRST TEXT') 

def wrapxmlfragment(fragment): 
    return '<root>{}</root>'.format(fragment) 

def getrecords(xml): 
    """Return list of records contained in an xml string""" 
    docs = ET.fromstring(xml) 
    return [recordfromDOC(doc) for doc in docs.findall('DOC')] 

def recordfromDOC(DOC):
    """Build a Record from one <DOC> element, trimming surrounding whitespace"""
    return Record(
        DOC.find('DOCNO').text.strip(),
        DOC.find('FIRST').text.strip(),
        DOC.find('TEXT').text.strip()
    )

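# Wrap the fragment in a single root element so it parses as one XML document
records = getrecords(wrapxmlfragment(xml))
print(records)
firstrecord = records[0]
print(firstrecord[0])     # access by position
print(firstrecord.DOCNO)  # access by field name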

This can easily be extended to work from a list of files:

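def getrecordsfromfiles(filelist):
    """Return the records from every file in filelist as one flat list"""
    records = []
    for filename in filelist:
        # Text mode, so fp.read() returns str and formats cleanly into the wrapper
        with open(filename) as fp:
            records.extend(getrecords(wrapxmlfragment(fp.read())))
    return records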

records = getrecords(wrapxmlfragment(xml))
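
One caveat with the ElementTree approach: it needs well-formed XML, so a bare & or < inside the text makes ET.fromstring raise a ParseError. The real corpus files aren't shown, so whether that happens here is an assumption. If it does, a regular expression can pull out the same three fields without parsing the markup. This is only a rough sketch: parse_records and DOC_RE are made-up names, the " & more" in the sample is added just to show the problem case, and it assumes each DOC block is flat with exactly one DOCNO, FIRST, and TEXT in that order.

```python
import re

# Assumption: every record is a flat <DOC> block containing exactly one
# DOCNO, FIRST and TEXT element, in that order, with no nesting.
DOC_RE = re.compile(
    r"<DOC>\s*"
    r"<DOCNO>(?P<docno>.*?)</DOCNO>\s*"
    r"<FIRST>(?P<first>.*?)</FIRST>\s*"
    r"<TEXT>(?P<text>.*?)</TEXT>\s*"
    r"</DOC>",
    re.DOTALL,  # let '.' span the line breaks inside TEXT
)


def parse_records(raw):
    """Return a list of (docno, first, text) tuples found in raw."""
    return [
        tuple(part.strip() for part in match.group("docno", "first", "text"))
        for match in DOC_RE.finditer(raw)
    ]


# Sample from the question; the bare '&' is a hypothetical addition that
# would make a strict XML parser fail.
sample = """
<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FIRST>Reports Former Saigon Officials Released from Re-education Camp</FIRST>
<TEXT>
Lots of text here & more
</TEXT>
</DOC>
<DOC>
<DOCNO> YYYY-0001 </DOCNO>
<FIRST>AP-ONU-ISRAEL -URGENT-</FIRST>
<TEXT>
Text
</TEXT>
</DOC>
"""

records = parse_records(sample)
print(records)
```

The result is a plain list of tuples, which matches the format asked for in the question. The trade-off is that the pattern silently skips any DOC block that doesn't match the assumed layout, so it is worth comparing the number of records against the number of <DOC> tags in the input.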

That said, this is a very poor (and duplicate) question.


2013-02-14 20:17:24

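This is not a duplicate, because the other answers give no information about handling _several_ HTML files. I'm sorry you see it that way. But thank you for your answer. – user2070177 2013-02-15 20:29:12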
