2017-10-28 72 views

Searching for certain text in Word tables with python-docx

I have some code that reads a table from a Word document and builds a DataFrame from it.

import numpy as np 
import pandas as pd 
from docx import Document 

#### Time for some old fashioned user functions #### 
def make_dataframe(f_name, table_loc):
    document = Document(f_name)
    tables = document.tables[table_loc]

    data = []  # one dict per body row
    for i, row in enumerate(tables.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            # the first row holds the column headers
            keys = tuple(text)
            continue

        row_data = dict(zip(keys, text))
        data.append(row_data)
    df = pd.DataFrame.from_dict(data)
    return df


SHRD_filename = "SHRD - 12485.docx" 
SHDD_filename = "SHDD - 12485.docx" 

df_SHRD = make_dataframe(SHRD_filename,30) 
df_SHDD = make_dataframe(SHDD_filename,-60) 

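The files differ: for instance, the SHRD has 32 tables and the one I am looking for is the second to last, while the SHDD file has 280 tables and the one I am looking for is 60th from the end. That may not always be the case, though.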

How can I search through the tables in the document and start working on the one where cell[0,0] = 'Tag Numbers'?

Answer


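You can iterate through the tables and check the text in the first cell. I modified the output to return a list of DataFrames in case more than one table is found. If no table meets the criteria, it returns an empty list.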

import pandas as pd
from docx import Document

def make_dataframe(f_name, first_cell_string='tag number'):
    document = Document(f_name)

    # create a list of all of the table objects whose first cell's
    # text is equal to `first_cell_string`
    tables = [t for t in document.tables
              if t.cell(0, 0).text.lower().strip() == first_cell_string]

    # in the case that more than one table is found
    out = []
    for table in tables:
        data = []  # reset for each table so rows do not accumulate
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                # the first row holds the column headers
                keys = tuple(text)
                continue

            row_data = dict(zip(keys, text))
            data.append(row_data)
        out.append(pd.DataFrame.from_dict(data))
    return out
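
Because the function returns a list, the caller has to decide what to do when zero or several tables match. A minimal sketch of one way to handle that, assuming exactly one match is expected per file; `pick_single` is a hypothetical helper name, not part of python-docx or pandas:

```python
def pick_single(frames, source_name):
    """Return the only item in `frames`, or raise if there are zero or several.

    `frames` is the list returned by make_dataframe; `source_name` is only
    used to make the error message readable (hypothetical helper).
    """
    if not frames:
        raise ValueError(f"no matching table found in {source_name}")
    if len(frames) > 1:
        raise ValueError(f"{len(frames)} matching tables found in {source_name}")
    return frames[0]


# Hypothetical usage with a file name from the question:
# df_SHRD = pick_single(make_dataframe("SHRD - 12485.docx"), "SHRD - 12485.docx")
```

Raising on an empty or ambiguous result is only one option; returning `None` or concatenating the frames may suit other documents better.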

Thanks. The only thing I needed to add was 'first_cell_string = first_cell_string.lower().strip()', so that the search string matches the Word string.
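
The normalization mentioned in this comment can be sketched on plain strings, without a Word file. The helper names `normalize` and `first_cell_matches` are hypothetical; the idea is simply to apply the same `lower().strip()` to both sides of the comparison:

```python
def normalize(text):
    """Lower-case the text and strip surrounding whitespace."""
    return text.lower().strip()


def first_cell_matches(cell_text, search_string):
    """Compare a cell's text with the search string after normalizing both."""
    return normalize(cell_text) == normalize(search_string)


# Cell text from Word often carries stray whitespace or different casing.
print(first_cell_matches(" Tag Numbers\n", "tag numbers"))  # True
print(first_cell_matches("Tag", "Tag Numbers"))             # False
```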