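To merge multiple files (even more than two) on the basis of one or more common columns, one of the best and most efficient ways in Python is to use "brewery". You can even specify which fields should be considered for the merge and which fields should be kept.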
import brewery
from brewery import ds
import sys
sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]
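Create the list of all fields and add a file name field to store information about the origin of the data records. Go through the source definitions and collect the fields: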
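# Start with the extra "file" column that records each row's origin
all_fields = ["file"]
for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)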
out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()
for source in sources:
    path = source["file"]
    # Initialize data source: skip reading of headers
    # use XLSDataSource for XLS files
    # We ignore the fields in the header, because we have set up fields
    # previously. We need to skip the header row.
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()
    for record in src.records():
        # Add file reference into output - to know where the row comes from
        record["file"] = path
        out.append(record)
    # Close the source stream
    src.finalize()
cat merged.csv | brewery pipe pretty_printer
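If brewery is not available, the same idea can be sketched with only the standard library `csv` module. This is a minimal sketch assuming Python 3, not a brewery API: the helper names `collect_fields` and `merge_csv` and the `origin_field` parameter are made up for illustration. Like the example above, it appends the rows of every file into one output with the union of all fields; it does not perform a key-based join.

```python
import csv


def collect_fields(sources, origin_field="file"):
    """Return the origin field followed by every source field, in first-seen order."""
    all_fields = [origin_field]
    for source in sources:
        for field in source["fields"]:
            if field not in all_fields:
                all_fields.append(field)
    return all_fields


def merge_csv(sources, target_path, origin_field="file"):
    """Append the rows of every source file to one CSV holding the union of fields."""
    all_fields = collect_fields(sources, origin_field)
    with open(target_path, "w", newline="") as out_file:
        # restval="" leaves a blank cell where a source lacks a field
        writer = csv.DictWriter(out_file, fieldnames=all_fields, restval="")
        writer.writeheader()
        for source in sources:
            with open(source["file"], newline="") as in_file:
                reader = csv.reader(in_file)
                # Skip the header row; field names come from the source definition
                next(reader, None)
                for row in reader:
                    record = dict(zip(source["fields"], row))
                    # Record which file the row came from
                    record[origin_field] = source["file"]
                    writer.writerow(record)
    return all_fields
```

With the `sources` list defined earlier, calling `merge_csv(sources, "merged.csv")` would write the combined file and return the list of output columns.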
What have you tried so far? –
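If this is all you need, take a look at the command-line 'join' tool: http://linux.die.net/man/1/join – eumiro
Thanks for the suggestion, but an example of how to use the join command for this case would be very welcome – user1042891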