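Updating docx metadata for a large number of files using python-docx

I have about 300 docx files, spread across a folder and its subfolders, whose metadata needs updating. I have a separate csv file of 300-plus rows containing the metadata: each row holds a file name, keywords, and a title.

I want to loop through the docx files, pulling the content from the csv and inserting the metadata into each docx file. The docx files are stored two subfolders down from the root folder.

So far I have sketched out the following. I am struggling to work out how to loop through the csv file and apply the metadata to each file in turn. I'm sure there is a relatively simple way to do this; setting up the loop and getting the csv content is where I get lost. I'm a bit of a noob and, as such, am feeling my way.

Any tips appreciated.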

#running in python 3.5.2 32bit 
import csv 
from docx import Document 
import os 
import sys 

csv_path = ("datasheet_metadata_uplift.csv") 

def update_docx_metadata(document, keywords, title): 
    """ 
    Update the *keywords*, and *title* metadata 
    properties in *document*. 
    """ 
    core_properties = document.core_properties 
    core_properties.keywords = keywords 
    core_properties.title = title 

def read_csv_lines(filename, keywords, title): 
    """ 
    Reads the csv lines, returns *filename*, *keywords*, *title* 
    """ 
    with open(csv_path, 'r') as f:
        csv_file = csv.reader(f)
        for row in csv_file:
            filename = row[0]
            keywords = row[1]
            title = row[2]

def open_docx(filename): 
    """ 
    Search for docx file and open it 
    """ 
    for root, dirs, files in os.walk("."):
        if filename in files:
            doc_path = os.path.join(path, filename)

csv_lines = read_csv_lines(filename, keywords, title) 
for filename, keywords, title in csv_lines: 
    document = Document(doc_path) 
    update_doc_metadata(filename, keywords, title) 
    document.save(doc_path)

2016-11-17 Aidan

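As a next step, Aidan, I'd recommend refactoring your code into cohesive functions. This lets you perform each required operation with a single function call, so that the intent and flow are not obscured.

You might start with something like this: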

def update_doc_metadata(document, author, keywords, title, subject): 
    """ 
    Update the *author*, *keywords*, *title*, and *subject* metadata 
    properties in *document*. 
    """ 
    core_properties = document.core_properties 
    core_properties.author = author 
    core_properties.keywords = keywords 
    core_properties.title = title 
    core_properties.subject = subject

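A few things to note:

It is cohesive, meaning it does only one or two things. This makes it more reusable.
It doesn't depend on anything that doesn't come in as a parameter. This makes it easy to test (if you do that) and generally easier to understand, because all the context needed is right there in those ten lines.
It has a docstring that states explicitly what it does. This is a useful discipline, not only because it helps the reader (quite possibly you, a few weeks or months later) understand the intent, but also because it forces you to explain what you are doing. Often you can detect poor factoring because the explanation is hard or very long. (The asterisks around the parameter names will render as italics in some documentation packages.)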
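
The second point can be illustrated with a quick check. Because the function only touches `document.core_properties`, it can be exercised with a plain stand-in object instead of a real docx file. This is a minimal sketch: `make_fake_document` is a hypothetical helper introduced here for illustration, not part of python-docx, and the sample values are made up. The function body is copied from the snippet above.

```python
from types import SimpleNamespace


def update_doc_metadata(document, author, keywords, title, subject):
    """
    Update the *author*, *keywords*, *title*, and *subject* metadata
    properties in *document*.
    """
    core_properties = document.core_properties
    core_properties.author = author
    core_properties.keywords = keywords
    core_properties.title = title
    core_properties.subject = subject


def make_fake_document():
    # Hypothetical stand-in: any object exposing a core_properties
    # attribute is enough, since that is all the function uses.
    return SimpleNamespace(core_properties=SimpleNamespace())


fake = make_fake_document()
update_doc_metadata(fake, "Aidan", "pump, valve", "Datasheet 42", "Datasheets")
print(fake.core_properties.title)
```

With a real file, the same call would receive the object returned by `Document(path)` instead of the stand-in.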

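If you continue this way, locating cohesive bits and "extracting" them into functions, the core logic of the main code will become clearer.

I think the overall structure is something like this: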

csv_lines = read_csv_lines(csv_path) 
for filename, keywords, title in csv_lines: 
    doc_path, document = open_docx(filename) 
    update_doc_metadata(document, author, keywords, title, subject) 
    document.save(doc_path)
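
The `open_docx` step in that outline first has to locate each file somewhere under the root folder. One way to sketch the search, using only `os.walk`, is shown here. `find_docx_path` and its `root` parameter are names made up for this illustration, and returning `None` for a missing file is an assumed convention rather than anything the outline specifies.

```python
import os
import tempfile


def find_docx_path(filename, root="."):
    """
    Return the path of the first file named *filename* found under
    *root*, or None when no such file exists.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None


# Demonstration on a throwaway folder tree, two subfolders deep.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "a", "b")
    os.makedirs(nested)
    open(os.path.join(nested, "sheet.docx"), "w").close()
    found = find_docx_path("sheet.docx", root)
    print(os.path.relpath(found, root))
    print(find_docx_path("missing.docx", root))
```

`open_docx` could then call a helper like this, pass the resulting path to `Document`, and return both the path and the document.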

2016-11-17 22:20:10 scanny

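Hi Scanny - thanks! Very helpful answer. I have been refactoring to use functions, as you suggested, but something is not quite right. I get a 'NameError: name 'filename' is not defined' error relating to the last part of the code. I have updated the original post with the new code. Any thoughts? – Aidan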

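@Aidan I think you might be confused about how function parameters work in Python. They carry value(s) *into* the function, but generally not *out*. For that you need a return statement. So read_csv_lines should take just csv_path as a parameter and return a sequence (probably a list) of (filename, keywords, title) sequences (probably tuples). I think the return value of read_csv_lines is just `return [row for row in csv_file]`. You might want to look up some Python tutorial resources. I like [this one](https://pymotw.com/3/) and the official Python tutorial is pretty good :) – scanny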
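
To make the point about return values concrete, here is a minimal sketch of a `read_csv_lines` that takes only `csv_path` and returns the rows. The three-column layout (filename, keywords, title) follows the question; skipping blank rows, truncating to three columns, and the sample file contents are assumptions added for the demonstration.

```python
import csv
import os
import tempfile


def read_csv_lines(csv_path):
    """
    Return a list of (filename, keywords, title) tuples, one per
    non-empty row of the csv file at *csv_path*.
    """
    with open(csv_path, newline="") as f:
        # Values leave the function through return, not through parameters.
        return [tuple(row[:3]) for row in csv.reader(f) if row]


# Demonstration with a small throwaway csv file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="") as f:
        f.write("a.docx,pumps,Pump datasheet\n")
        f.write("b.docx,valves,Valve datasheet\n")
    for filename, keywords, title in read_csv_lines(sample_path):
        print(filename, keywords, title)
```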

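OK, thanks for your help scanny. I realized that when I looked at it today. – Aidan

So I figured this out, and it ended up being quite simple. I also made things easier for myself by putting the full file path in the csv. Thanks to scanny for the encouragement. Next stop, the docs and tutorial pages :)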

#runs in python 3.5.2 32-bit 
#docx requires 32 bit operation 
import csv 
from docx import Document 
import os 
import sys 

#path to the csv file - csv file must contain rows as follows:
#full filepath, title, keywords
#ensure there are no commas, other than the csv delimiters

csv_path = "datasheet_metadata_uplift.csv"

#read the csv rows once, skipping blank lines; the with-block closes the file
with open(csv_path, newline="") as f:
    rows = [row for row in csv.reader(f) if row]

#do the updates on every filename in the list, one docx file per csv row
for row in rows:
    filename, title, keywords = row[:3]
    document = Document(filename)
    core_properties = document.core_properties
    core_properties.keywords = keywords
    core_properties.title = title
    core_properties.subject = "YOUR_SUBJECT_HERE"
    core_properties.comments = " "
    #company does not appear among python-docx's documented core properties,
    #so this assignment may have no effect on the saved file
    core_properties.company = "YOUR_COMPANY_HERE"
    document.save(filename)

print("finished!")
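
On the comma caveat in the comments above: `csv.reader` treats a comma inside a double-quoted field as part of the field, so a keywords value such as `pump, valve` can stay in one column as long as the field is quoted. A minimal sketch, with made-up sample data:

```python
import csv
import io

# Made-up sample row: the keywords field contains a comma, so it is quoted.
sample = 'docs/a.docx,Pump datasheet,"pump, valve"\n'

rows = list(csv.reader(io.StringIO(sample)))
filename, title, keywords = rows[0]
print(keywords)  # prints: pump, valve
```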

2016-11-21 15:43:48 Aidan
