2017-04-21 209 views
-1

Extract GPS coordinates from .docx files with Python

I have some heavy tasks to do and need some help with Python. Please look at this document.


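I am extracting the text and the GPS coordinates from each row. There are currently more than 100 coordinates spread over 10 docx files. My "heavy" Python knowledge has taken me this far: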

from docx import Document
import re

main_file = Document("D:/DOCUMENTS/Google_Link/1 Category I/1 Category I.docx")
table = main_file.tables[1]  # this is the same for every document

data = []
keys = None

for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # The first row holds the column headers
    if i == 0:
        keys = tuple(text)
        continue

    row_data = tuple(text)
    data.append(row_data)

# Reference IDs look like "C1-20701-17-1"
regexReference = re.compile(r"(C.-)\w+")
colReference = [item[1] for item in data]  # second column

# Keep only the cells that start with a reference ID
listReference = filter(regexReference.match, colReference)

for i in listReference:
    print i.encode('UTF-8')

With this I can print the 16 reference IDs from column 2. Please guide me on how to print something like this:

C1-20701-17-1 

some site, some region 

The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires 
some repair/maintenance works including electrical wiring and electrical 
lights and appliances like ceiling fans supplies. Detail specification of 
the works are attached 

x = 91°38'28.2"E 
y = 22°40'34.3"N 

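These XY positions and descriptions will be used to create KML files that are attached to each document. I would prefer one variable for each of the parts above (reference ID, location, description, x and y) so that I can automate the process.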
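One way to hold each part in its own variable is a small record type. The following is only a sketch for Python 3, not a tested solution for the real files: `Record`, `DMS_PAIR` and `make_record` are hypothetical names, and the pattern assumes that the latitude comes first and the longitude second, separated by `;` or `,`, as in the sample description above.

```python
import re
from collections import namedtuple

# Hypothetical record type: one field per part of a table row.
Record = namedtuple("Record", "reference location description x y")

# Assumed layout: latitude, separator, longitude,
# for example 22°40'34.3"N; 91°38'28.2"E
DMS_PAIR = re.compile(
    r"""(\d+°\d+'[\d.]+"[NS])\s*[;,]\s*(\d+°\d+'[\d.]+"[EW])"""
)


def make_record(reference, location, description):
    """Build a Record; x is the longitude string, y the latitude string."""
    match = DMS_PAIR.search(description)
    # Rows without a recognisable coordinate pair get x = y = None.
    y, x = match.groups() if match else (None, None)
    return Record(reference, location, description, x, y)


rec = make_record(
    "C1-20701-17-1",
    "some site, some region",
    """The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works""",
)
print("x =", rec.x)
print("y =", rec.y)
```

Each field is then available by name (`rec.reference`, `rec.location`, `rec.description`, `rec.x`, `rec.y`), which keeps later steps such as writing KML independent of the table's column order.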

demo docx

+0

I suggest you add a link to a demo docx file.

+0

I have added a link to the demo docx file.

Answers

1

I am not sure whether this will still work if some files follow a different pattern (PS: I am using Python 2.7.11):

# -*- coding: utf-8 -*-
from docx import Document
import sys
import os
import re

# Python 2 only: make UTF-8 the default for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf8')

# Walk the current folder and all of its subfolders
for root, dirs, files in os.walk("."):
    for name in files:
        doc_file = os.path.join(root, name)
        if doc_file.endswith('.docx'):
            main_file = Document(doc_file)
            table = main_file.tables[1]  # this is the same for every document

            data = []
            keys = None

            for i, row in enumerate(table.rows):
                text = (cell.text for cell in row.cells)

                # The first row holds the column headers
                if i == 0:
                    keys = tuple(text)
                    continue

                row_data = tuple(text)
                data.append(row_data)

            # Reference IDs look like "C1-20701-17-1"
            regexReference = re.compile(r"(C.-[0-9-]+)")
            # Group 1 is the N (latitude) part, group 4 the E (longitude) part
            regexCoordinate = re.compile(r'(N-(.{,12})([0-9]|\')|[0-9].{,12}N)[;, ]+(E-(.{,12})([0-9]|\')|[0-9].{,12}E)')

            result = []
            for item in data:
                tmp = dict()
                matchReference = regexReference.search(item[1])
                matchCoordinate = regexCoordinate.search(unicode(item[2]))
                if matchReference:
                    tmp['reference'] = matchReference.group()
                if matchCoordinate:
                    # As in the question: x is the E value, y is the N value
                    tmp['x'] = matchCoordinate.group(4)
                    tmp['y'] = matchCoordinate.group(1)
                tmp['description'] = unicode(item[2])
                tmp['location'] = unicode(item[3])
                result.append(tmp)

            # Print only the rows that carry a reference ID
            for rs in result:
                if 'reference' in rs:
                    for k, v in rs.iteritems():
                        print('{} = {}'.format(k, v))
                    print  # blank line between records

# Output:
# --------------------------------
# y = 22°40'34.3"N
# x = 91°38'28.2"E
# description = The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached.
# reference = C1-20701-17-1
# location = xxxxx Site, c Region
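
The x and y values above are still degree/minute/second strings, while KML writes coordinates as decimal degrees in longitude,latitude order. The following Python 3 sketch shows one possible conversion. `dms_to_decimal` and `placemark` are hypothetical helper names, and the code assumes that x holds the E (longitude) string and y the N (latitude) string, as in the question.

```python
import re
from xml.sax.saxutils import escape

# Degrees, minutes, seconds and hemisphere, for example 22°40'34.3"N
DMS = re.compile(r"""(\d+)°(\d+)'([\d.]+)"([NSEW])""")


def dms_to_decimal(text):
    """Convert a DMS string to signed decimal degrees (S and W are negative)."""
    match = DMS.search(text)
    if match is None:
        raise ValueError("not a DMS coordinate: %r" % text)
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60.0 + float(seconds) / 3600.0
    return -value if hemisphere in "SW" else value


def placemark(name, description, x, y):
    """Build one KML Placemark element; coordinates are written as lon,lat."""
    return (
        "<Placemark><name>{}</name><description>{}</description>"
        "<Point><coordinates>{:.6f},{:.6f}</coordinates></Point></Placemark>"
    ).format(escape(name), escape(description),
             dms_to_decimal(x), dms_to_decimal(y))


print(placemark("C1-20701-17-1", "CMC Office repair works",
                '91°38\'28.2"E', '22°40\'34.3"N'))
```

A complete KML file would still need these placemarks wrapped in a `<kml>` root element with a `<Document>` inside it. `escape` is used because descriptions taken from Word tables may contain `&` or `<`.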
+0

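Thanks. This seems to almost work, except for the coordinates part. I did not use 'CURRENT_DIR' because the files are not all in the same folder. The file names do not contain underscores either. Please add an 'os.walk' (to cover all files inside the folders) and remove the underscore.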

+0

OK, I added 'os.walk' and replaced the file name check with a check for the .docx extension – jpnkls
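
The file search described in this comment can also be isolated into a small generator. This is a minimal Python 3 sketch under stated assumptions: `find_docx` is a hypothetical helper name, and skipping names that start with `~$` (the lock files Word creates next to open documents) is an addition that is not part of the answer.

```python
import os


def find_docx(top):
    """Yield the path of every .docx file under top, including subfolders."""
    for root, dirs, files in os.walk(top):
        for name in sorted(files):
            # Skip Word's "~$..." lock files; they are not real documents.
            if name.lower().endswith(".docx") and not name.startswith("~$"):
                yield os.path.join(root, name)


for path in find_docx("."):
    print(path)
```

Each yielded path can then be passed to `Document(path)` as in the answer.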

+0

Thank you very much.