Note: This question covers why the script is so slow. However, if you are more the type of person who wants to improve something, you can take a look at my post on CodeReview which aims to improve the performance.
Why is my Python script slower than its R equivalent?
I am working on a project in which I open plain text files (.lst). The file name (fileName) matters, because from it I extract a node (e.g. abessijn) and a component (e.g. WR-P-E-A) that go into a data frame (a tiny sketch of the split follows the examples below). Examples:
abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst
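For instance, splitting the first of these names on periods gives both fields (just a minimal sketch; the actual scripts below do this with regexes):
import re  # not needed for this sketch, but the real scripts use regexes here
fileName = "abessijn.WR-P-E-A.lst"
node, component = fileName.split(".")[:2]
print(node, component)  # abessijn WR-P-E-A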
Each file consists of one or more lines. Each line contains a sentence inside <sentence> tags (see the extraction sketch after the example). Example (abessijn.WR-P-E-A.lst):
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. :))</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Mijn abessijn denkt daar heel anders over .. :)) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .</sentence>
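Pulling the sentence out of such a line comes down to a regex search. A minimal sketch with a shortened line, using the same pattern both scripts below use:
import re

line = 'WR-P-E-A0000364.data.ids.xml: <sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. :))</sentence>'
sentence = re.search(r"<sentence>(.*?)</sentence>", line).group(1)
print(sentence)  # Vooral mijn abessijn ruikt heerlijk kruidig .. :))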
From each line I extract the sentence, apply some small modifications to it, and call it sentence. Next comes an element called leftContext, which takes the part of the sentence that comes before the node (e.g. abessijn). Finally, from leftContext I get the precedingWord, i.e. the word that precedes node in sentence, or the rightmost word of leftContext (with some restrictions, such as the option of a compound formed with a hyphen). Example (see the sketch after the table for how these fields are derived):
ID | filename | node | component | precedingWord | leftContext | sentence
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 adapter.WR-P-P-F.lst adapter WR-P-P-F aanpassingseenheid Een aanpassingseenheid ( Een aanpassingseenheid (adapter) ,
2 adapter.WR-P-P-F.lst adapter WR-P-P-F toestel Het toestel ( Het toestel (adapter) draagt zorg voor de overbrenging van gegevens
3 adapter.WR-P-P-F.lst adapter WR-P-P-F de de aansluiting tussen de sensor en de de aansluiting tussen de sensor en de adapter ,
4 airbag.WS-U-E-A.lst airbag WS-U-E-A den ja voor den ja voor den airbag op te pompen eh :p
5 airbag.WS-U-E-A.lst airbag WS-U-E-A ne Dobby , als ze valt heeft ze dan wel al ne Dobby , als ze valt heeft ze dan wel al ne airbag hee
That data frame is exported as dataset.csv.
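As a minimal sketch of how leftContext and precedingWord fall out of one (already cleaned and lowercased) sentence, reusing the exact patterns from the scripts below on the values of row 2:
import re

node = "adapter"
sentence = "het toestel (adapter) draagt zorg voor de overbrenging van gegevens"

# leftContext: everything in the sentence before the node
leftContext = re.split(r"(^|)" + node + r"(|[!\",.:;?})\]])", sentence)[0]
# precedingWord: the rightmost word of leftContext (hyphenated compounds allowed)
precedingWord = re.sub(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", r"\1", leftContext)

print(leftContext)    # het toestel (
print(precedingWord)  # toestel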
After that, the goal of my project comes within reach: I create a frequency table that takes node and precedingWord into account. From a variable I define neuter and non_neuter, e.g. (in Python):
neuter = ["het", "Het"]
non_neuter = ["de","De"]
and a rest category unspecified. When precedingWord is an item in one of the lists, it is assigned to that class (a short sketch of this rule follows the example below). Example of the frequency table output:
node | neuter | nonNeuter | unspecified
-------------------------------------------------
A-bom 0 4 2
acroniem 3 0 2
act 3 2 1
The frequency list is exported as frequencies.csv.
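The classification rule itself is simple. A minimal sketch (note that the rest class is called unspecified here and rest in the Python script further down):
neuter = ["het", "Het"]
non_neuter = ["de", "De"]

def classify(precedingWord):
    """Map a preceding word onto one of the three frequency classes."""
    if precedingWord in neuter:
        return "neuter"
    if precedingWord in non_neuter:
        return "non_neuter"
    return "unspecified"

print(classify("Het"))  # neuter
print(classify("een"))  # unspecified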
I started out with R, considering that later on I would run some statistical analyses on the frequencies. My current R script (also available as a paste):
# ---
# STEP 0: Preparations
start_time <- Sys.time()
## 1. Set working directory in R
setwd("")
## 2. Load required library/libraries
library(dplyr)
library(mclm)
library(stringi)
## 3. Create directory where we'll save our dataset(s)
dir.create("../R/dataset", showWarnings = FALSE)
# ---
# STEP 1: Loop through files, get data from the filename
## 1. Create first dataframe, based on filename of all files
files <- list.files(pattern="*.lst", full.names=T, recursive=FALSE)
d <- data.frame(fileName = unname(sapply(files, basename)), stringsAsFactors = FALSE)
## 2. Create additional columns (word & component) based on filename
d$node <- sub("\\..+", "", d$fileName, perl=TRUE)
d$node <- tolower(d$node)
d$component <- gsub("^[^\\.]+\\.|\\.lst$", "", d$fileName, perl=TRUE)
# ---
# STEP 2: Loop through files again, but now also through its contents
# In other words: get the sentences
## 1. Create second set which is an rbind of multiple frames
## One two-column data.frame per file
## First column is fileName, second column is data from each file
e <- do.call(rbind, lapply(files, function(x) {
  data.frame(fileName = x, sentence = readLines(x, encoding="UTF-8"), stringsAsFactors = FALSE)
}))
## 2. Clean fileName
e$fileName <- sub("^\\.\\/", "", e$fileName, perl=TRUE)
## 3. Get the sentence and clean
e$sentence <- gsub(".*?<sentence>(.*?)</sentence>", "\\1", e$sentence, perl=TRUE)
e$sentence <- tolower(e$sentence)
# Remove floating space before/after punctuation
e$sentence <- gsub("\\s(?:(?=[.,:;?!) ])|(?<=\\())", "\\1", e$sentence, perl=TRUE)
# Add space after triple dots ...
e$sentence <- gsub("\\.{3}(?=[^\\s])", "... ", e$sentence, perl=TRUE)
# Transform HTML entities into characters
# It is unfortunate that there's no easier way to do this
# E.g. Python provides the HTML package which can unescape (decode) HTML
# characters
e$sentence <- gsub("'", "'", e$sentence, perl=TRUE)
e$sentence <- gsub("&", "&", e$sentence, perl=TRUE)
# Avoid R from wrongly interpreting ", so replace by single quotes
e$sentence <- gsub(""|\"", "'", e$sentence, perl=TRUE)
# Get rid of some characters we can't use such as ³ and ¾
e$sentence <- gsub("[^[:graph:]\\s]", "", e$sentence, perl=TRUE)
# ---
# STEP 3:
# Create final dataframe
## 1. Merge d and e by common column name fileName
df <- merge(d, e, by="fileName", all=TRUE)
## 2. Make sure that only those sentences in which df$node is present in df$sentence are taken into account
matchFunction <- function(x, y) any(x == y)
matchedFrame <- with(df, mapply(matchFunction, node, stri_split_regex(sentence, "[ :?.,]")))
df <- df[matchedFrame, ]
## 3. Create leftContext based on the split of the word and the sentence
# Use paste0 to make sure we are looking for the node, not a compound
# node can only be preceded by a space, but can be followed by punctuation as well
contexts <- strsplit(df$sentence, paste0("(^|)", df$node, "(|[!\",.:;?})\\]])"), perl=TRUE)
df$leftContext <- sapply(contexts, `[`, 1)
## 4. Get the word preceding the node
df$precedingWord <- gsub("^.*\\b(?<!-)(\\w+(?:-\\w+)*)[^\\w]*$","\\1", df$leftContext, perl=TRUE)
## 5. Improve readability by sorting columns
df <- df[c("fileName", "component", "precedingWord", "node", "leftContext", "sentence")]
## 6. Write dataset to dataset dir
write.dataset(df,"../R/dataset/r-dataset.csv")
# ---
# STEP 4:
# Create dataset with frequencies
## 1. Define neuter and nonNeuter classes
neuter <- c("het")
non.neuter <- c("de")
## 2. Mutate df to fit into usable frame
freq <- mutate(df, gender = ifelse(!df$precedingWord %in% c(neuter, non.neuter), "unspecified",
                   ifelse(df$precedingWord %in% neuter, "neuter", "non_neuter")))
## 3. Transform into table, but still usable as data frame (i.e. matrix)
## Also add column name "node"
freqTable <- table(freq$node, freq$gender) %>%
as.data.frame.matrix %>%
mutate(node = row.names(.))
## 4. Small adjustements
freqTable <- freqTable[,c(4,1:3)]
## 5. Write dataset to dataset dir
write.dataset(freqTable,"../R/dataset/r-frequencies.csv")
diff <- Sys.time() - start_time # calculate difference
print(diff) # print in nice format
However, since I am using a big dataset (16,500 files, all consisting of multiple lines), it seemed to take quite long. On my system, the whole process took about an hour and a quarter. I thought to myself that there ought to be a language that can do this faster, so I went and taught myself some Python, asking a lot of questions here along the way.
Finally I came up with the following script (paste).
import os, pandas as pd, numpy as np, regex as re
from glob import glob
from datetime import datetime
from html import unescape
start_time = datetime.now()
# Create empty dataframe with correct column names
columnNames = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0,len(columnNames))), columns=columnNames)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))
# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\())")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
# Loop files in folder
for file in glob(path+"\\*.lst"):
    with open(file, encoding="utf-8") as f:
        [n, c] = p_filename.split(file.lower())[-3:-1]
        fn = ".".join([n, c])
        for line in f:
            s = p_sentence.search(unescape(line)).group(1)
            s = s.lower()
            s = p_typography.sub("", s)
            s = p_non_graph.sub("", s)
            s = p_quote.sub("'", s)
            s = p_ellipsis.sub("... ", s)
            if n in re.split(r"[ :?.,]", s):
                lc = re.split(r"(^|)" + n + r"(|[!\",.:;?})\]])", s)[0]
                pw = p_last_word.sub("\\1", lc)
                df = df.append([dict(fileName=fn, component=c,
                                     precedingWord=pw, node=n,
                                     leftContext=lc, sentence=s)])
            continue
# Reset indices
df.reset_index(drop=True, inplace=True)
# Export dataset
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
# Let's make a frequency list
# Create new dataframe
# Define neuter and non_neuter
neuter = ["het"]
non_neuter = ["de"]
# Create crosstab
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter)==0, "gender"] = "rest"
freqDf = pd.crosstab(df.node, df.gender)
freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
# How long has the script been running?
time_difference = datetime.now() - start_time
print("Time difference of", time_difference)
After verifying that the output of both scripts is identical, I thought I'd put them to the test.
I am on Windows 10 64-bit with a quad-core processor and 8 GB of RAM. For R I use RGui 64-bit 3.2.2; Python runs as version 3.4.3 (Anaconda), executed in Spyder. Note that I am running Python in 32-bit because I would like to use the nltk module in the future, and they discourage users from using 64-bit.
What I found was that R finished in about 55 minutes. But Python had already been running for two hours straight, and in the variable explorer I could see that it was only at business.wr-p-p-g.lst (files are sorted alphabetically). It is waaaaayyyy slower!
So I made a test case to see how both scripts perform on a much smaller dataset. I took about 100 files (instead of 16,500) and ran both scripts. Again, R was much faster: R finished in roughly 2 seconds, Python in 17!
Seeing that Python's goal is to make everything go smoothly, I was confused. I had read that Python was fast (and R rather slow), so where did I go wrong? What is the problem? Is Python slower at reading files and lines, or at running regexes? Or is R simply better equipped to deal with data frames and can't it be beaten by pandas? Or is my code simply badly optimised, and should Python indeed be the victor?
My question is thus: why is Python slower than R in this case, and, if possible, how can we improve Python to shine?
Anyone who is willing to give either script a try can download the test data I used here. Please give me a heads-up when you have downloaded the files.
A quick scan points to the fact that you open each file inside the Python loop: 'open(file, encoding="utf-8") as f' won't perform like the R equivalent 'e <- do.call(rbind, lapply(files, function(x) {...'. – jeremycg
Your R code is simply far better optimised for the language: there are no for loops and it leans heavily on vectorised operations and built-in functions that are actually written in C/Fortran. Your Python code is just very inefficient, that's all. –
@jeremycg Is there any way to do something similar in Python? For example, stitching all the text files together somehow? –
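For what it's worth, a rough sketch of what jeremycg's suggestion could look like in Python (my own illustration, not code from the question): read each file in one go, collect plain dicts in a list, and build the DataFrame once at the end, instead of calling df.append inside the loop, which copies the entire frame on every call. The helper name build_rows and the elided per-line cleaning are assumptions.
import os
import pandas as pd
from glob import glob

def build_rows(path):
    """Collect one dict per line; no DataFrame work inside the loop."""
    rows = []
    for file in glob(os.path.join(path, "*.lst")):
        with open(file, encoding="utf-8") as f:
            lines = f.read().splitlines()  # slurp the whole file at once
        for line in lines:
            # ... per-line cleaning/matching as in the original script ...
            rows.append({"fileName": os.path.basename(file), "sentence": line})
    return rows

df = pd.DataFrame(build_rows("rawdata"))  # one DataFrame construction at the end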