2017-09-26 95 views

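Error when adding a column to a data frame in R

I am processing OCR'd PDF files, extracting text from them and building a data frame from it. What I get are vectors, and I cannot concatenate them into a single row so that the result can be added as a column to the data frame. This is the block of code that produces the column: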

chk_words=c("Swimming pool","Gym","west","para") 
tp_big=c("swimming pool in a farm","gym","west","north","south") 
ps=c() 
x=list() 
for(i in chk_words){ 
    br=if(length(which(stri_detect_fixed(tolower(tp_big),tolower(i)))) <= 0){ print("Not Present") } else {print("Present")} 

    if(br == "Present") 
    ps=i 
    x[[i]]=ps 
    tc=unlist(unique(x)) 
    x=paste(tc,collapse=" ") 
    } 


df11=data.frame(x) 

The output (data frame) I get is:

x 
Swimming pool Gym west 

But when I tried to use the code above inside the larger script, I could not get the desired column "x". Here is the whole script:

library(pdftools) 
library(tesseract) 
library(stringi) 
library(TraMineRextras) 
All_files=Sys.glob("*.pdf") 
v1 <- numeric(length(All_files)) 
chk_words=c("Swimming pool","Gym","west","para") 
word <- "Gym" 
tc=c() 
ps=c() 
x=list() 
df <- data.frame() 
df11 <- data.frame() 
Status="Present" 

for (i in seq_along(All_files)){ 


    file_name <- All_files[i] 

    cnt <- pdf_info(All_files[i])$pages 
    print(cnt) 

    for(j in seq_len(cnt)){ 
    img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400) 
    text <- ocr(img_file) 
    ocr_text <- capture.output(cat(text)) 
    check <- sapply(ocr_text, paste, collapse="") 
    junk <- dir(path="D:/Deepesh/R Script/All_PDF_Files/Registration_Certificates_OCR", pattern="tiff") 
    file.remove(junk) 
    br <-if(length(which(stri_detect_fixed(tolower(check),tolower(word)))) <= 0) "Not Present" 
    else "Present" 
    print(br)  
    if(br=="Present") { 
     v1[i] <- j 
     break} 

    for(k in chk_words){ 
     sr=if(length(which(stri_detect_fixed(tolower(check),tolower(k)))) <= 0){ print("Not Present") } else {print("Present")} 
     if(sr == "Present") 
     ps=k 
     x[[k]]=ps 
     tc=unlist(unique(x)) 

    } 




    } 
    y=paste(tc,collapse=" ") 
    #tc=paste(tc,collapse=" ") 
    Status <- if(v1[i] == 0) "Not Present" else "Present" 
    pages <- if(v1[i] == 0) "-" else 
    paste0(tools::file_path_sans_ext(basename(file_name)), "_", v1[i]) 
    words <- if(v1[i] == 0) "-" else word 
    df <- rbind(df, cbind(file_name = basename(file_name), 
         Status, pages = pages, words = words,y)) 


} 

Now I get output like this (y is assigned NULL):

file_name status   pages    words  y 
test1.pdf Present  test1_1    gym 
test2.pdf Not Present  - 

What I expect is:

file_name status   pages    words  y 
test1.pdf Present  test1_1    gym   swimming pool, gym 
test2.pdf Not Present  - 

Any suggestions on where I am going wrong? Thanks in advance.

P.S. Sample PDF files can be accessed here; more details are in this post.

Answer

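checkList <- list() 
j <- 0 
for (i in chk_words) { 
    # TRUE if any OCR line contains the keyword (case-insensitive, literal match) 
    chk <- any(grepl(tolower(i), tolower(ocr_text), fixed = TRUE)) 
    if (isTRUE(chk)) { 
        j <- j + 1 
        checkList[[j]] <- i 
    } 
} 
# paste() returns the string; cat() would only print it and return NULL 
THIRD_COL <- paste(shQuote(unlist(checkList), type = "cmd"), collapse = ", ") 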

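This will give you "swimming pool", "gym". What I do is: if the condition is met, the word from chk_words is stored in checkList (which is a list). Then I use shQuote inside paste to return the desired output.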
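To make the idea concrete, here is a minimal base-R sketch of collapsing the matched keywords into one string so that it fits in a single data frame cell. It reuses `chk_words` and `tp_big` from the question; the helper name `find_keywords` and the "-" placeholder for no matches are assumptions made for illustration, not part of the original code.

```r
# Hypothetical helper (not from the original post): returns the keywords
# found in the OCR text as one comma-separated string, or "-" if none match.
find_keywords <- function(ocr_lines, keywords) {
  # Join all lines so a keyword is found regardless of which line holds it
  text <- tolower(paste(ocr_lines, collapse = " "))
  # Literal (non-regex), case-insensitive substring test for each keyword
  found <- vapply(keywords,
                  function(k) grepl(tolower(k), text, fixed = TRUE),
                  logical(1))
  hits <- keywords[found]
  if (length(hits) == 0) "-" else paste(hits, collapse = ", ")
}

chk_words <- c("Swimming pool", "Gym", "west", "para")
tp_big <- c("swimming pool in a farm", "gym", "west", "north", "south")

# A single string, so it occupies exactly one cell of the data frame
y <- find_keywords(tp_big, chk_words)
df11 <- data.frame(y = y, stringsAsFactors = FALSE)
print(df11)
```

Because the result is always a length-one character vector, it can be passed to `cbind()` or `data.frame()` alongside the other per-file values. In the question's full script it may also be worth checking the position of `break`: it sits before the `chk_words` loop, so on the page where `word` is found the keyword scan never runs.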

I updated my code according to the answer, but I still cannot get the "THIRD_COL" column correctly. – deepesh

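You need to debug your code, since you have placed this under two nested for loops and an if statement. I don't see any reason why my code block would not work. – Santosh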

I could not find anything wrong in the code I ran. – deepesh
