2015-08-22 130 views
6

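Python - Remove headers and footers from docx files

I need to remove the headers and footers from many docx files. I am currently trying the python-docx library, but at this time it does not support headers and footers in docx documents (work in progress).

Is there any way to do this in Python?

As far as I understand, docx is an XML-based format, but I don't know how to work with it.

P.S. I had the idea of parsing the XML with lxml or BeautifulSoup and replacing some parts, but that looks dirty.

UPD. Thanks to Shawn, that was a good starting point. I made some changes to the script. This is my final version (it is useful for me, because I need to edit many .docx files). I use BeautifulSoup because the standard XML parser could not produce a valid XML tree. Also, my docx documents have no header and footer parts in the XML; the header and footer are just images placed at the top of the page. For more speed, you can use lxml instead of BeautifulSoup.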

import os
import shutil as su
import tempfile
import zipfile

from bs4 import BeautifulSoup


def get_xml_from_docx(docx_filename):
    """Return the content of word/document.xml inside a docx document."""
    with zipfile.ZipFile(docx_filename) as zf:
        return zf.read('word/document.xml')


def write_and_close_docx(docx_filename, edited_xml, output_filename):
    """ Create a temp directory, expand the original docx zip.
        Write the modified xml to word/document.xml
        Zip it up as the new docx
    """
    tmp_dir = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile(docx_filename) as zf:
            zf.extractall(tmp_dir)
            # Get a list of all the files in the original docx zipfile
            filenames = zf.namelist()

        with open(os.path.join(tmp_dir, 'word', 'document.xml'), 'w', encoding='utf-8') as f:
            f.write(str(edited_xml))

        # Now, create the new zip file and add all the files into the archive.
        # The with block closes the archive so its central directory is written.
        with zipfile.ZipFile(output_filename, 'w', zipfile.ZIP_DEFLATED) as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)
    finally:
        # Clean up the temp dir
        su.rmtree(tmp_dir)


if __name__ == '__main__':
    directory = 'your_directory/'
    os.makedirs('edited', exist_ok=True)
    for file in os.listdir(directory):
        if file.endswith('.docx'):
            word_doc = os.path.join(directory, file)
            # splitext drops the suffix; rstrip('.docx') would strip characters instead
            new_word_doc = os.path.join('edited', os.path.splitext(file)[0] + '-edited.docx')
            soup = BeautifulSoup(get_xml_from_docx(word_doc), 'xml')
            for shape in soup.find_all('shape'):
                # In my documents the header/footer images are the shapes at the left margin
                if 'margin-left:0pt' in shape.get('style', ''):
                    shape.parent.decompose()
            write_and_close_docx(word_doc, soup, new_word_doc)

So, that's it :) I know the code isn't clean, sorry about that.

Answers

3

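Hmm, I had never thought about this, but I just created a test.docx with a header and a footer. Once you have the docx, you can unzip it to get the XML files it is made of. For my simple test case, this yielded: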

word/ 
_rels   footer1.xml  styles.xml 
document.xml  footnotes.xml  stylesWithEffects.xml 
endnotes.xml  header1.xml  theme 
fontTable.xml  settings.xml  webSettings.xml 
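
The same listing can be produced from Python without unzipping to disk. This is only a minimal sketch using the standard-library zipfile module; the helper name list_docx_parts and the tiny in-memory archive are made up for illustration, and with a real file you would pass its path instead.

```python
import io
import zipfile


def list_docx_parts(docx, prefix='word/'):
    """Return the sorted member names of a docx (zip) archive under prefix."""
    with zipfile.ZipFile(docx) as zf:
        return sorted(name for name in zf.namelist() if name.startswith(prefix))


# Build a stand-in archive in memory so the sketch runs without a real docx.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    for name in ('[Content_Types].xml', 'word/document.xml',
                 'word/header1.xml', 'word/footer1.xml'):
        zf.writestr(name, '<placeholder/>')

for part in list_docx_parts(buf):
    print(part)
```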

Opening word/document.xml gives you the main problem area. You can see the elements that deal with the header and footer. In my simple case, I got:

<w:headerReference w:type="default" r:id="rId7"/> 
<w:footerReference w:type="default" r:id="rId8"/> 

<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/> 
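
As a rough sketch of what dropping these references could look like, the snippet below removes every w:headerReference and w:footerReference found under a w:sectPr using the standard-library ElementTree. The function name strip_header_footer_refs and the sample XML are made up for illustration. It assumes that removing the references is enough for simple documents, which is not verified here, and it only edits the tree: ElementTree may rename namespace prefixes when the tree is serialized, so writing the result back into a docx may need extra care.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
R_NS = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'


def strip_header_footer_refs(root):
    """Remove header/footer references from every w:sectPr; return the count."""
    targets = {'{%s}headerReference' % W_NS, '{%s}footerReference' % W_NS}
    removed = 0
    # Materialize the iterator first, since the tree is modified in the loop.
    for sect_pr in list(root.iter('{%s}sectPr' % W_NS)):
        for child in list(sect_pr):
            if child.tag in targets:
                sect_pr.remove(child)
                removed += 1
    return removed


# Hypothetical, heavily shortened document.xml used only for this demo.
sample = (
    '<w:document xmlns:w="%s" xmlns:r="%s">'
    '<w:body><w:p/><w:sectPr>'
    '<w:headerReference w:type="default" r:id="rId7"/>'
    '<w:footerReference w:type="default" r:id="rId8"/>'
    '<w:pgSz w:w="12240" w:h="15840"/>'
    '</w:sectPr></w:body></w:document>' % (W_NS, R_NS)
)
root = ET.fromstring(sample)
print(strip_header_footer_refs(root))  # 2
```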

The whole document is actually small:

<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14"> 
<w:body> 
    <w:p w:rsidR="009E6E8F" w:rsidRDefault="009E6E8F"/> 
    <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/> 
    <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/><w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"> 
    <w:r> 
    <w:t>MY BODY</w:t> 
    </w:r> 
    <w:bookmarkStart w:id="0" w:name="_GoBack"/> 
    <w:bookmarkEnd w:id="0"/> 
    </w:p> 
    <w:sectPr w:rsidR="00B53FFA" w:rsidSect="009E6E8F"> 
    <w:headerReference w:type="default" r:id="rId7"/> 
    <w:footerReference w:type="default" r:id="rId8"/> 
    <w:pgSz w:w="12240" w:h="15840"/> 
    <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>

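So the XML manipulation is not going to be a problem, either functionally or in terms of performance, for something of this size. Here is some code that should get your document into Python, parse it into an XML tree, and save it back as a docx. I have to go out now, so this is not your complete solution, but I think it should put you on the right track. If you are still having trouble, I'll come back later and see where you've got to.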

import os
import shutil as su
import tempfile
import zipfile
import xml.etree.ElementTree as ET


def get_word_xml(docx_filename):
    # A docx is a zip archive; ZipFile opens it in binary mode itself
    with zipfile.ZipFile(docx_filename) as zf:
        return zf.read('word/document.xml')


def get_xml_tree(xml_content):
    # zf.read() returns bytes, so parse from a string rather than from a file
    return ET.ElementTree(ET.fromstring(xml_content))


def write_and_close_docx(docx_filename, tree, output_filename):
    """ Create a temp directory, expand the original docx zip.
        Write the modified xml to word/document.xml
        Zip it up as the new docx
    """
    tmp_dir = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile(docx_filename) as zf:
            zf.extractall(tmp_dir)
            # Get a list of all the files in the original docx zipfile
            filenames = zf.namelist()

        # Note: ElementTree may rename namespace prefixes (ns0, ns1, ...) on output
        tree.write(os.path.join(tmp_dir, 'word', 'document.xml'),
                   encoding='utf-8', xml_declaration=True)

        # Now, create the new zip file and add all the files into the archive
        with zipfile.ZipFile(output_filename, 'w', zipfile.ZIP_DEFLATED) as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)
    finally:
        # Clean up the temp dir
        su.rmtree(tmp_dir)


word_doc = 'TEXT.docx'
new_word_doc = 'SLIM.docx'
doc = get_word_xml(word_doc)
tree = get_xml_tree(doc)
write_and_close_docx(word_doc, tree, new_word_doc)
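
The temp-directory round trip above can probably be avoided by copying the archive member by member and swapping in the edited XML on the way. The sketch below is one possible way to do that with the standard-library zipfile module only; the helper name replace_zip_member and the in-memory demo archive are made up for illustration, and with real files you would pass paths instead of BytesIO objects.

```python
import io
import zipfile


def replace_zip_member(src, dst, member, new_bytes):
    """Copy zip archive src to dst, replacing the content of one member."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, 'w', zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if item.filename == member:
                data = new_bytes
            else:
                data = zin.read(item.filename)
            # Passing the ZipInfo keeps the original name and metadata.
            zout.writestr(item, data)


# Stand-in for a real docx so the sketch runs on its own.
src = io.BytesIO()
with zipfile.ZipFile(src, 'w') as zf:
    zf.writestr('word/document.xml', b'<old/>')
    zf.writestr('word/styles.xml', b'<styles/>')

dst = io.BytesIO()
replace_zip_member(src, dst, 'word/document.xml', b'<new/>')
with zipfile.ZipFile(dst) as zf:
    print(zf.namelist())
    print(zf.read('word/document.xml'))
```

Because nothing is extracted to disk, there is no temp directory to clean up if something fails halfway.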
+0

Thanks! This code didn't work as posted, but after some refactoring I got it working! Thanks again! – drjackild

+1

@drackild, nice. What needed correcting? Post it and let's share :) –
