
Why is my Python script so much slower than its R equivalent?

Note: this question covers why the script is so slow. However, if you would rather improve it, you can have a look at my post on CodeReview which aims to improve the performance.

I am working on a project in which I open plain text files (.lst).

The name of the file (fileName) is important, because from it I extract the node (e.g. abessijn) and the component (e.g. WR-P-E-A) and put them into a data frame. Examples:

abessijn.WR-P-E-A.lst 
A-bom.WR-P-E-A.lst 
acroniem.WR-P-E-C.lst 
acroniem.WR-P-E-G.lst 
adapter.WR-P-E-A.lst 
adapter.WR-P-E-C.lst 
adapter.WR-P-E-G.lst 
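
For illustration, a minimal sketch of how such a filename could be taken apart (plain Python; split_filename is just a hypothetical helper for this example):

import os

def split_filename(path):
    """Split e.g. 'abessijn.WR-P-E-A.lst' into (node, component)."""
    base = os.path.basename(path)                    # drop any directory part
    node, component, _extension = base.split(".", 2)
    return node.lower(), component

print(split_filename("abessijn.WR-P-E-A.lst"))       # ('abessijn', 'WR-P-E-A')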

Each file consists of one or more lines. Each line contains a sentence (inside <sentence> tags). Example (abessijn.WR-P-E-A.lst):

/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. :))</sentence> 
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Mijn abessijn denkt daar heel anders over .. :)) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .</sentence> 

From each line I extract the sentence, make some small modifications to it, and call it sentence. Next up is an element called leftContext, which is the part of the sentence that comes before the node (e.g. abessijn). Finally, from leftContext I derive the precedingWord, which is the word immediately preceding the node in the sentence, i.e. the right-most word in leftContext (with some restrictions, such as allowing compounds formed with a hyphen). Example:

ID | filename             | node    | component | precedingWord      | leftContext                                | sentence
---|----------------------|---------|-----------|--------------------|--------------------------------------------|---------------------------------------------------------------------
1  | adapter.WR-P-P-F.lst | adapter | WR-P-P-F  | aanpassingseenheid | Een aanpassingseenheid (                   | Een aanpassingseenheid (adapter) ,
2  | adapter.WR-P-P-F.lst | adapter | WR-P-P-F  | toestel            | Het toestel (                              | Het toestel (adapter) draagt zorg voor de overbrenging van gegevens
3  | adapter.WR-P-P-F.lst | adapter | WR-P-P-F  | de                 | de aansluiting tussen de sensor en de      | de aansluiting tussen de sensor en de adapter ,
4  | airbag.WS-U-E-A.lst  | airbag  | WS-U-E-A  | den                | ja voor den                                | ja voor den airbag op te pompen eh :p
5  | airbag.WS-U-E-A.lst  | airbag  | WS-U-E-A  | ne                 | Dobby , als ze valt heeft ze dan wel al ne | Dobby , als ze valt heeft ze dan wel al ne airbag hee
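
For illustration, a minimal sketch of that extraction for a single line, with the regexes slightly simplified compared to the full scripts further below:

import re

line = ("/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: "
        "<sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. :))</sentence>")
node = "abessijn"

# sentence: the text between the <sentence> tags, lower-cased
sentence = re.search(r"<sentence>(.*?)</sentence>", line).group(1).lower()

# leftContext: everything before the first occurrence of the node
left_context = re.split(r"(?:^| )" + re.escape(node), sentence)[0]

# precedingWord: the right-most word of leftContext (hyphenated compounds allowed)
match = re.search(r"\b(?<!-)(\w+(?:-\w+)*)\W*$", left_context)
preceding_word = match.group(1) if match else ""

print(sentence)        # vooral mijn abessijn ruikt heerlijk kruidig .. :))
print(left_context)    # vooral mijn
print(preceding_word)  # mijn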

This data frame is exported as dataset.csv.

After that comes the actual aim of my project: I create a frequency table that takes node and precedingWord into account. I defined the variables neuter and non_neuter, e.g. (in Python):

neuter = ["het", "Het"] 
non_neuter = ["de","De"] 

and a rest category unspecified. When precedingWord is an item from one of the lists, it is assigned to that category. Example of the frequency table output:

node     | neuter | nonNeuter | unspecified
---------|--------|-----------|------------
A-bom    | 0      | 4         | 2
acroniem | 3      | 0         | 2
act      | 3      | 2         | 1

The frequency list is exported as frequencies.csv.
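
A minimal pandas sketch of that classification and cross-tabulation, on a small made-up data frame:

import pandas as pd

neuter = ["het"]
non_neuter = ["de"]

# Toy data frame with the two relevant columns
df = pd.DataFrame({
    "node":          ["A-bom", "A-bom", "acroniem", "act", "act"],
    "precedingWord": ["de",    "een",   "het",      "de",  "het"],
})

# Assign each row to neuter / non_neuter / unspecified
df["gender"] = "unspecified"
df.loc[df["precedingWord"].isin(neuter), "gender"] = "neuter"
df.loc[df["precedingWord"].isin(non_neuter), "gender"] = "non_neuter"

# Cross-tabulate node against gender to get the frequency table
freq = pd.crosstab(df["node"], df["gender"])
print(freq)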


I started out in R, considering that I would do some statistical analyses on the frequencies later on. My current R script (also available as a paste):

# --- 
# STEP 0: Preparations 
    start_time <- Sys.time() 
    ## 1. Set working directory in R 
    setwd("") 

    ## 2. Load required library/libraries 
    library(dplyr) 
    library(mclm) 
    library(stringi) 

    ## 3. Create directory where we'll save our dataset(s) 
    dir.create("../R/dataset", showWarnings = FALSE) 


# --- 
# STEP 1: Loop through files, get data from the filename 

    ## 1. Create first dataframe, based on filename of all files 
    files <- list.files(pattern="*.lst", full.names=T, recursive=FALSE) 
    d <- data.frame(fileName = unname(sapply(files, basename)), stringsAsFactors = FALSE) 

    ## 2. Create additional columns (word & component) based on filename 
    d$node <- sub("\\..+", "", d$fileName, perl=TRUE) 
    d$node <- tolower(d$node) 
    d$component <- gsub("^[^\\.]+\\.|\\.lst$", "", d$fileName, perl=TRUE) 


# --- 
# STEP 2: Loop through files again, but now also through its contents 
# In other words: get the sentences 

    ## 1. Create second set which is an rbind of multiple frames 
    ## One two-column data.frame per file 
    ## First column is fileName, second column is data from each file 
    e <- do.call(rbind, lapply(files, function(x) { 
     data.frame(fileName = x, sentence = readLines(x, encoding="UTF-8"), stringsAsFactors = FALSE) 
    })) 

    ## 2. Clean fileName 
    e$fileName <- sub("^\\.\\/", "", e$fileName, perl=TRUE) 

    ## 3. Get the sentence and clean 
    e$sentence <- gsub(".*?<sentence>(.*?)</sentence>", "\\1", e$sentence, perl=TRUE) 
    e$sentence <- tolower(e$sentence) 
     # Remove floating space before/after punctuation 
     e$sentence <- gsub("\\s(?:(?=[.,:;?!) ])|(?<=\\())", "\\1", e$sentence, perl=TRUE) 
    # Add space after triple dots ... 
     e$sentence <- gsub("\\.{3}(?=[^\\s])", "... ", e$sentence, perl=TRUE) 

    # Transform HTML entities into characters 
    # It is unfortunate that there's no easier way to do this 
    # E.g. Python provides the HTML package which can unescape (decode) HTML 
    # characters 
     e$sentence <- gsub("&apos;", "'", e$sentence, perl=TRUE) 
     e$sentence <- gsub("&amp;", "&", e$sentence, perl=TRUE) 
     # Avoid R from wrongly interpreting ", so replace by single quotes 
     e$sentence <- gsub("&quot;|\"", "'", e$sentence, perl=TRUE) 

     # Get rid of some characters we can't use such as ³ and ¾ 
     e$sentence <- gsub("[^[:graph:]\\s]", "", e$sentence, perl=TRUE) 


# --- 
# STEP 3: 
# Create final dataframe 

    ## 1. Merge d and e by common column name fileName 
    df <- merge(d, e, by="fileName", all=TRUE) 

    ## 2. Make sure that only those sentences in which df$node is present in df$sentence are taken into account 
    matchFunction <- function(x, y) any(x == y) 
    matchedFrame <- with(df, mapply(matchFunction, node, stri_split_regex(sentence, "[ :?.,]"))) 
    df <- df[matchedFrame, ] 

    ## 3. Create leftContext based on the split of the word and the sentence 
    # Use paste0 to make sure we are looking for the node, not a compound 
    # node can only be preceded by a space, but can be followed by punctuation as well 
    contexts <- strsplit(df$sentence, paste0("(^|)", df$node, "(|[!\",.:;?})\\]])"), perl=TRUE) 
    df$leftContext <- sapply(contexts, `[`, 1) 

    ## 4. Get the word preceding the node 
    df$precedingWord <- gsub("^.*\\b(?<!-)(\\w+(?:-\\w+)*)[^\\w]*$","\\1", df$leftContext, perl=TRUE) 

    ## 5. Improve readability by sorting columns 
    df <- df[c("fileName", "component", "precedingWord", "node", "leftContext", "sentence")] 

    ## 6. Write dataset to dataset dir 
    write.dataset(df,"../R/dataset/r-dataset.csv") 


# --- 
# STEP 4: 
# Create dataset with frequencies 

    ## 1. Define neuter and nonNeuter classes 
    neuter <- c("het") 
    non.neuter<- c("de") 

    ## 2. Mutate df to fit into usable frame 
    freq <- mutate(df, gender = ifelse(!df$precedingWord %in% c(neuter, non.neuter), "unspecified", 
     ifelse(df$precedingWord %in% neuter, "neuter", "non_neuter"))) 

    ## 3. Transform into table, but still usable as data frame (i.e. matrix) 
    ## Also add column name "node" 
    freqTable <- table(freq$node, freq$gender) %>% 
     as.data.frame.matrix %>% 
     mutate(node = row.names(.)) 

    ## 4. Small adjustements 
    freqTable <- freqTable[,c(4,1:3)] 

    ## 5. Write dataset to dataset dir 
    write.dataset(freqTable,"../R/dataset/r-frequencies.csv") 


    diff <- Sys.time() - start_time # calculate difference 
    print(diff) # print in nice format 

However, considering that I am using a large dataset (16,500 files, all of them with multiple lines), it seemed to take rather long: on my system the whole process took about an hour and a quarter. I thought to myself that there ought to be a language that can do this more quickly, so I went and taught myself some Python and asked plenty of questions here.

Eventually I came up with the following script (paste).

import os, pandas as pd, numpy as np, regex as re 

from glob import glob 
from datetime import datetime 
from html import unescape 

start_time = datetime.now() 

# Create empty dataframe with correct column names 
columnNames = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ] 
df = pd.DataFrame(data=np.zeros((0,len(columnNames))), columns=columnNames) 

# Create correct path where to fetch files 
subdir = "rawdata" 
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir)) 

# "Cache" regex 
# See http://stackoverflow.com/q/452104/1150683 
p_filename = re.compile(r"[./\\]") 

p_sentence = re.compile(r"<sentence>(.*?)</sentence>") 
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\())") 
p_non_graph = re.compile(r"[^\x21-\x7E\s]") 
p_quote = re.compile(r"\"") 
p_ellipsis = re.compile(r"\.{3}(?=[^ ])") 

p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U) 

# Loop files in folder 
for file in glob(path+"\\*.lst"):
    with open(file, encoding="utf-8") as f:
        [n, c] = p_filename.split(file.lower())[-3:-1]
        fn = ".".join([n, c])
        for line in f:
            s = p_sentence.search(unescape(line)).group(1)
            s = s.lower()
            s = p_typography.sub("", s)
            s = p_non_graph.sub("", s)
            s = p_quote.sub("'", s)
            s = p_ellipsis.sub("... ", s)

            if n in re.split(r"[ :?.,]", s):
                lc = re.split(r"(^|)" + n + "(|[!\",.:;?})\]])", s)[0]

                pw = p_last_word.sub("\\1", lc)

                df = df.append([dict(fileName=fn, component=c,
                                     precedingWord=pw, node=n,
                                     leftContext=lc, sentence=s)])
            continue

# Reset indices 
df.reset_index(drop=True, inplace=True) 

# Export dataset 
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8") 

# Let's make a frequency list 
# Create new dataframe 

# Define neuter and non_neuter 
neuter = ["het"] 
non_neuter = ["de"] 

# Create crosstab 
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter" 
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter" 
df.loc[df.precedingWord.isin(neuter + non_neuter)==0, "gender"] = "rest" 

freqDf = pd.crosstab(df.node, df.gender) 

freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8") 

# How long has the script been running? 
time_difference = datetime.now() - start_time 
print("Time difference of", time_difference) 

After making sure that the output of the two scripts is identical, I thought I would put them to the test.

I am running Windows 10 64 bit on a quad-core processor with 8 GB of RAM. For R I use RGui 64 bit 3.2.2; Python runs as version 3.4.3 (Anaconda) and is executed in Spyder. Note that I run Python in 32 bit because I would like to use the nltk module in the future, and they discourage users from using 64 bit.

What I found is that R finishes in approximately 55 minutes. But Python had already been running for two hours, and I could see in the variable explorer that it was only at business.wr-p-p-g.lst (the files are sorted alphabetically). It is waaaaayyyy slower!

So I made a test case to see how both scripts perform on a much smaller dataset. I took about 100 files (instead of 16,500) and ran both scripts. Again, R was much faster: R finished in roughly 2 seconds, Python in 17!

Seeing as the goal of Python was to make everything go more smoothly, I got confused. I had read that Python is fast (and R rather slow), so where did I go wrong? What is the problem? Is Python slower at reading files and lines, or at running the regexes? Or is R simply better equipped to handle data frames, so that it cannot be beaten by pandas? Or is my code simply badly optimised, and should Python indeed be the winner?

My question therefore is: why is Python slower than R in this case, and, if possible, how can we improve Python to shine?

Anyone who is willing to give either script a try can download the test data I used here. Please give me a heads-up when you download the files.


A quick scan suggests the problem is the fact that you open each file inside the Python loop: 'open(file, encoding="utf-8") as f' is not equivalent to the R 'e <- do.call(rbind, lapply(files, function(x) {...' – jeremycg


Your R code is simply better optimised for that language: no for loops, and heavy use of vectorised operations and built-in functions that are actually written in C/Fortran. Your Python code is just very inefficient, that's all. –


@jeremycg Is there any way to do something similar in Python? For instance, stitching all the text files together somehow? –
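
One way to read all the files in a single pass with just the standard library would be fileinput, which chains them into one stream; a minimal sketch ("rawdata" is the directory assumed in the question's script):

import fileinput
from glob import glob

files = sorted(glob("rawdata/*.lst"))

# fileinput chains all files into one iterable, roughly like R's lapply + rbind
records = []
with fileinput.input(files=files, openhook=fileinput.hook_encoded("utf-8")) as stream:
    for line in stream:
        records.append((stream.filename(), line.rstrip("\n")))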

Answer


The most inefficient thing you do is calling the DataFrame.append method in a loop, i.e.

df = pandas.DataFrame(...)
for file in files:
    ...
    for line in file:
        ...
        df = df.append(...)

NumPy data structures are designed with functional programming in mind, so this operation is not meant to be used in an iterative, imperative way: the call does not change your data frame in place, but creates a new one, resulting in an enormous increase in time and memory complexity. If you really want to use data frames, accumulate your rows in a list and pass it to the DataFrame constructor, e.g.

pre_df = []
for file in files:
    ...
    for line in file:
        ...
        pre_df.append(processed_line)

df = pandas.DataFrame(pre_df, ...)

This is the easiest way, because it introduces minimal changes to your code. But a better (and computationally beautiful) way is to work out how to generate your dataset lazily. This can easily be achieved by splitting your workflow into discrete functions (in the functional-programming sense) and composing them with lazy generator expressions and/or higher-order functions. You can then use the resulting generator to build your data frame, e.g.

df = pandas.DataFrame.from_records(processed_lines_generator, columns=column_names, ...) 
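
Applied to the question's pipeline, such a generator-based version might look roughly like this (a sketch only: the "rawdata" directory is taken from the question, and the parsing regexes are slightly simplified):

import os
import re
from glob import glob
from html import unescape

import pandas as pd

column_names = ["fileName", "component", "precedingWord",
                "node", "leftContext", "sentence"]

def parse_file(path):
    """Yield one record dict per matching line of a single .lst file."""
    node, component = os.path.basename(path).lower().split(".")[:2]
    file_name = ".".join([node, component])
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            found = re.search(r"<sentence>(.*?)</sentence>", unescape(line))
            if not found:
                continue
            sentence = found.group(1).lower()
            if node not in re.split(r"[ :?.,]", sentence):
                continue
            left_context = re.split(r"(?:^| )" + re.escape(node), sentence)[0]
            preceding = re.sub(r"^.*\b(?<!-)(\w+(?:-\w+)*)\W*$", r"\1", left_context)
            yield dict(fileName=file_name, component=component, node=node,
                       precedingWord=preceding, leftContext=left_context,
                       sentence=sentence)

def parse_all(pattern):
    """Lazily chain the records of every file matching the glob pattern."""
    for path in glob(pattern):
        yield from parse_file(path)

records = parse_all(os.path.join("rawdata", "*.lst"))
df = pd.DataFrame.from_records(records, columns=column_names)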

As for reading multiple files in one run, you might want to read this.

P.S.

If you have performance issues, you should profile your code before trying to optimise it.
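
For example, the standard-library cProfile module shows where the time goes; a minimal sketch (slow_job is only a stand-in for the real processing):

import cProfile

def slow_job():
    """Stand-in for the real work; replace the body with the actual processing."""
    return sum(i * i for i in range(10 ** 6))

# Run the call under the profiler and print stats sorted by cumulative time.
# A whole script can be profiled with:  python -m cProfile -s cumulative myscript.py
cProfile.run("slow_job()", sort="cumulative")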


I will reply in detail once I have had a look at laziness, imap and ifilter (Apple hasn't patented those, has it? ;-) ). But what do you mean by 'profile' in your last sentence? –


@BramVanroy I mean code profiling (https://en.wikipedia.org/wiki/Profiling_(computer_programming)). Get a copy of PyCharm (there is a free edition); among many other goodies it has built-in profiling tools. When you profile the code you will find the 'DataFrame.append' bottleneck. –


I have been trying my luck with some of the expressions you mentioned, but no luck so far. I will probably post some new questions soon... –
