python-docx: getting information from a drop-down list

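I have a docx file containing several tables, and I want to collect all the information from those tables into a list (called 'alletabellen'). With the script below I get almost everything in the tables, except the values of certain variables that sit in some of the table cells. The values of those cells stay empty in my list (for example, the value '1.2' of the variable 'Number:' does not end up in my list, see: https://s30.postimg.org/477j8z6ch/table.png).

Is it possible to get the information from these variables?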

import docx

bestand = docx.Document('somefile.docx')
tabellen = bestand.tables

# collect the text of every paragraph in every cell of every table
alletabellen = []
for tabel in tabellen:
    for row in tabel.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                alletabellen.append(paragraph.text)

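Update

I found the solution (thanks to scanny, who pointed me in the right direction). I had not realized that a docx file is actually a zip archive of XML files that contain all the text. I use the zipfile module to read the docx and the bs4 module to find all the drop-down list tags ('ddList') and put the data in a list. My document has 12 drop-down lists and I only need 3 of them (one of them is the 'Number:' from the screenshot, which is the first drop-down list in the document).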

import docx
import zipfile
from bs4 import BeautifulSoup

doc = 'somefile.docx'

bestand = docx.Document(doc)
tabellen = bestand.tables

# get data from all the "normal" fields

alletabellen = []
for tabel in tabellen:
    for row in tabel.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                alletabellen.append(paragraph.text)

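# get data from all the dropdown lists

# a docx file is a zip archive; the main text lives in word/document.xml
with zipfile.ZipFile(doc) as document:
    xml_data = document.read('word/document.xml')

# the 'xml' parser of BeautifulSoup requires lxml to be installed
soup = BeautifulSoup(xml_data, 'xml')
gegevens = soup.find_all('ddList')  # search dropdownlists (n = 12)

dropdownlist = []
dropdownlistdata = []

# keep the <w:result> tag of each dropdown list (None if it has none)
for i in gegevens:
    dropdownlist.append(i.find('result'))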

# convert to string for if statements
number = str(dropdownlist[0])
job = str(dropdownlist[1])
vehicle = str(dropdownlist[7])

if number == '<w:result w:val="1"/>':
    dropdownlistdata.append('0,3')
elif number == '<w:result w:val="2"/>':
    dropdownlistdata.append('1,2')
elif number == '<w:result w:val="3"/>':
    dropdownlistdata.append('onbekend')
else:
    dropdownlistdata.append('geen')

if job == '<w:result w:val="1"/>':
    dropdownlistdata.append('nee')
else:
    dropdownlistdata.append('ja')

if vehicle == '<w:result w:val="1"/>':
    dropdownlistdata.append('nee')
else:
    dropdownlistdata.append('ja')

# show data
print(alletabellen)
print(dropdownlistdata)
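
The hard-coded if/elif chains above could probably be avoided, because each 'ddList' element normally also carries its own 'listEntry' children. What follows is only a minimal sketch using the standard library instead of bs4. It assumes the drop-downs are legacy form fields, that 'w:result' holds the zero-based index of the selected entry, and that a missing 'w:result' means the 'w:default' entry (or the first one) is selected. The function name selected_dropdown_values is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in the {uri}tag form ElementTree expects
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def selected_dropdown_values(docx_file):
    """Return the selected entry text of every w:ddList, in document order.

    Hypothetical helper: assumes w:result (or, if absent, w:default) is a
    zero-based index into the w:listEntry children of the same w:ddList.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))

    values = []
    for dd in root.iter(W + 'ddList'):
        entries = [e.get(W + 'val') for e in dd.findall(W + 'listEntry')]
        result = dd.find(W + 'result')
        default = dd.find(W + 'default')
        # compare with None explicitly: childless elements are falsy
        chosen = result if result is not None else default
        index = int(chosen.get(W + 'val')) if chosen is not None else 0
        values.append(entries[index] if 0 <= index < len(entries) else None)
    return values
```

With the document from the question, selected_dropdown_values('somefile.docx') should return 12 values, and the ones at positions 0, 1 and 7 would correspond to number, job and vehicle, provided the assumptions above hold for that file.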

Source

2017-01-09 Joost

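The most likely reason the '1.2' does not come back from .text is that it is wrapped in some kind of "container" XML that makes it behave like a form field.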
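
To illustrate what such a container can look like, here is a small sketch. The XML fragment is hand-written and only modeled on how Word typically stores a legacy drop-down form field; it is an assumption, not the XML of the document in the question. The point it shows is that the list entries sit in attributes of 'w:listEntry' elements rather than in 'w:t' text elements, so code that only collects run text never sees them.

```python
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Hypothetical paragraph: a label followed by a drop-down form field
SAMPLE_PARAGRAPH = '''
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t>Number: </w:t></w:r>
  <w:r>
    <w:fldChar w:fldCharType="begin">
      <w:ffData>
        <w:ddList>
          <w:result w:val="2"/>
          <w:listEntry w:val="geen"/>
          <w:listEntry w:val="0,3"/>
          <w:listEntry w:val="1,2"/>
        </w:ddList>
      </w:ffData>
    </w:fldChar>
  </w:r>
  <w:r><w:instrText> FORMDROPDOWN </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
'''

paragraph = ET.fromstring(SAMPLE_PARAGRAPH)

# Text found in w:t elements: only the label, not the selected value
visible_text = ''.join(t.text or '' for t in paragraph.iter(W + 't'))

# The drop-down entries live in attributes inside the form field data
list_entries = [e.get(W + 'val') for e in paragraph.iter(W + 'listEntry')]

print(repr(visible_text))
print(list_entries)
```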

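The first step is to inspect the XML so you can see what you are up against. Then you can write some code to find the hidden content.

opc-diag can help you inspect your XML: http://opc-diag.readthedocs.io/en/latest/index.html

You will want to look in the document.xml part.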
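
If installing opc-diag is not an option, the same part can be dumped with the standard library alone. This is a rough sketch rather than a replacement for opc-diag; the function name dump_document_xml is made up for this example, and it relies only on the fact that the main document part is stored as 'word/document.xml' inside the zip archive.

```python
import zipfile
from xml.dom import minidom


def dump_document_xml(docx_file, part='word/document.xml'):
    """Return one part of a docx package as indented, readable XML."""
    with zipfile.ZipFile(docx_file) as archive:
        raw = archive.read(part)
    # minidom re-indents the XML so nested elements are easy to spot
    return minidom.parseString(raw).toprettyxml(indent='  ')
```

Calling print(dump_document_xml('somefile.docx')) and searching the output for the label next to the missing value (here 'Number:') should show which elements surround it.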

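If you trim the document down to the minimum that still shows this behavior, it becomes much easier to find the part you need to work on.

If you can post the XML for that part of the table, I can guide you further.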

Source

2017-01-09 20:22:43 scanny

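Omg, until today I had no idea that a docx file is actually a zip archive! Thanks for your answer. I added a few lines to my script to unzip the docx (using the zipfile module), read /word/document.xml and use BeautifulSoup to find specific elements. I think I'm almost there. Tomorrow I will try to finish my script. – Joost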
