2015-08-15 38 views

I am using minidom to parse an XBRL file. I found that getElementsByTagName, used with the XML parser, returns the HTML below. How do I get its text in Python?

<table xmlns="http://www.w3.org/1999/xhtml" style="border-right: 0px; border-top: 0px; border-left: 0px; width: 650px; border-bottom: 0px; border-collapse: collapse" width="100%"><tr><td colspan="1">Independent auditor's report on the financial statements</td></tr></table><br><table xmlns="http://www.w3.org/1999/xhtml" style="border-right: 0px; border-top: 0px; border-left: 0px; width: 650px; border-bottom: 0px; border-collapse: collapse" width="100%"><tr><td colspan="1">We have audited the financial statements of KPMG Statsautoriseret Revisionspartnerselskab for the financial year 11 December 2013 – 31 December 2014. The financial statements comprise income statement, balance sheet, statement of changes in equity, cash flow statement accounting policies and notes. The financial statements are prepared in accordance with the Danish Financial Statements Act.</td></tr></table> 

Now I want only the text out of it. How should I proceed? Should I go with BeautifulSoup from here?

The whole file can be found here, and the field I am looking at is <arr:AuditorsReportOnFinancialStatements

Answer

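    from bs4 import BeautifulSoup

    # auditorsReportOnAuditedFS is the node list from getElementsByTagName;
    # the first child of its first element is a text node holding the HTML.
    soup = BeautifulSoup(auditorsReportOnAuditedFS[0].firstChild.data, 'html.parser')
    items = soup.find_all('td')
    listForString = []
    for item in items:
        # Keep the text as str: joining encoded bytes with a str separator fails on Python 3.
        listForString.append(item.get_text().strip())
    # result is a list defined earlier in the surrounding script.
    result.append(' : '.join(['AuditorsReportOnFinancialStatements', ' - '.join(listForString)]))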

This works.
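
If installing BeautifulSoup is not an option, the same idea can probably be sketched with the standard library alone. The example below is only a sketch under assumptions: the `SAMPLE` document, its `http://example.com/arr` namespace URI, and the names `CellTextExtractor` and `cell_texts` are hypothetical stand-ins, and it assumes (as the answer above does) that the element holds the HTML as escaped text in a single text node.

```python
from html.parser import HTMLParser
from xml.dom import minidom


class CellTextExtractor(HTMLParser):
    """Collect the text of each <td> cell in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []     # text of every finished cell
        self._parts = None  # pieces of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            self.cells.append("".join(self._parts).strip())
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def cell_texts(html_fragment):
    """Return the stripped text of every <td> in html_fragment."""
    parser = CellTextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return parser.cells


# Hypothetical, shortened stand-in for the real XBRL file.
SAMPLE = (
    '<root xmlns:arr="http://example.com/arr">'
    '<arr:AuditorsReportOnFinancialStatements>'
    '&lt;table&gt;&lt;tr&gt;&lt;td colspan="1"&gt;'
    'Independent auditor\'s report&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;'
    '&lt;br&gt;'
    '&lt;table&gt;&lt;tr&gt;&lt;td&gt;'
    'We have audited the financial statements.&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;'
    '</arr:AuditorsReportOnFinancialStatements>'
    '</root>'
)

doc = minidom.parseString(SAMPLE)
# minidom matches on the qualified tag name, prefix included.
nodes = doc.getElementsByTagName("arr:AuditorsReportOnFinancialStatements")
texts = cell_texts(nodes[0].firstChild.data)
line = " : ".join(["AuditorsReportOnFinancialStatements", " - ".join(texts)])
print(line)
```

BeautifulSoup is more forgiving of malformed markup, so this stdlib variant is mainly worth considering when the fragments are as regular as the tables in the question.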