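Error when adding a column to a data frame in R
I am processing OCR'd PDF files, extracting text from them, and building a data frame from it. What I get are vectors, and I cannot collapse them into a single row so that the result can be added as a column to the data frame. This is the block of code that extracts the data for that column: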
library(stringi)
chk_words=c("Swimming pool","Gym","west","para")
tp_big=c("swimming pool in a farm","gym","west","north","south")
ps=c()
x=list()
for(i in chk_words){
  br=if(length(which(stri_detect_fixed(tolower(tp_big),tolower(i)))) <= 0){ print("Not Present") } else {print("Present")}
  if(br == "Present")
    ps=i
  x[[i]]=ps
  tc=unlist(unique(x))
}
# collapse once, after the loop, so that x stays a list while it is being filled
x=paste(tc,collapse=" ")
df11=data.frame(x)
The output I get (a data frame) is:
x
Swimming pool Gym west
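A minimal sketch of the same check without the explicit loop, assuming a case-insensitive fixed-substring match is what is wanted. It uses base R's grepl(..., fixed = TRUE) in place of stri_detect_fixed so that it runs without extra packages; is_present and found are names introduced here for illustration only.

```r
chk_words <- c("Swimming pool", "Gym", "west", "para")
tp_big <- c("swimming pool in a farm", "gym", "west", "north", "south")

# TRUE for each check word that occurs in at least one element of tp_big
is_present <- vapply(
  chk_words,
  function(w) any(grepl(tolower(w), tolower(tp_big), fixed = TRUE)),
  logical(1)
)

# Collapse the matched words into a single string, i.e. one data frame cell
found <- paste(chk_words[is_present], collapse = " ")
df11 <- data.frame(x = found, stringsAsFactors = FALSE)
print(df11)
```

Because each word is tested on its own, a word that is not found cannot inherit the value left over from the previous iteration, which is what the ps variable in the loop version would do.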
But when I try to use the code above inside this larger script, I cannot get the desired column "x". Here is the whole code:
library(pdftools)
library(tesseract)
library(stringi)
library(TraMineRextras)
All_files=Sys.glob("*.pdf")
v1 <- numeric(length(All_files))
chk_words=c("Swimming pool","Gym","west","para")
word <- "Gym"
tc=c()
ps=c()
x=list()
df <- data.frame()
df11 <- data.frame()
Status="Present"
for (i in seq_along(All_files)){
  file_name <- All_files[i]
  cnt <- pdf_info(All_files[i])$pages
  print(cnt)
  for(j in seq_len(cnt)){
    img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400)
    text <- ocr(img_file)
    ocr_text <- capture.output(cat(text))
    check <- sapply(ocr_text, paste, collapse="")
    junk <- dir(path="D:/Deepesh/R Script/All_PDF_Files/Registration_Certificates_OCR", pattern="tiff")
    file.remove(junk)
    br <- if(length(which(stri_detect_fixed(tolower(check),tolower(word)))) <= 0) "Not Present" else "Present"
    print(br)
    if(br=="Present") {
      v1[i] <- j
      break
    }
    for(k in chk_words){
      sr=if(length(which(stri_detect_fixed(tolower(check),tolower(k)))) <= 0){ print("Not Present") } else {print("Present")}
      if(sr == "Present")
        ps=k
      x[[k]]=ps
      tc=unlist(unique(x))
    }
  }
  y=paste(tc,collapse=" ")
  #tc=paste(tc,collapse=" ")
  Status <- if(v1[i] == 0) "Not Present" else "Present"
  pages <- if(v1[i] == 0) "-" else
    paste0(tools::file_path_sans_ext(basename(file_name)), "_", v1[i])
  words <- if(v1[i] == 0) "-" else word
  df <- rbind(df, cbind(file_name = basename(file_name),
                        Status, pages = pages, words = words,y))
}
Now I get output like this (the value assigned to y is NULL):
file_name Status pages words y
test1.pdf Present test1_1 Gym
test2.pdf Not Present -
What I expect is:
file_name status pages words y
test1.pdf Present test1_1 gym swimming pool, gym
test2.pdf Not Present -
Any suggestions on where I am going wrong? Thanks in advance.
P.S. The sample PDF files can be accessed here; there is more detail in this post.
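Reading the full script, one likely cause of the empty y is that break runs as soon as word is found on a page, which is before the for(k in chk_words) loop is reached for that page; x and tc are also never reset between files. The sketch below shows one possible restructuring under stated assumptions, not a definitive fix: the OCR step is replaced by a hypothetical list pages_by_file of already-extracted page text (its contents are made up for illustration), find_words and summarise_file are helper names introduced here, base grepl(..., fixed = TRUE) stands in for stri_detect_fixed, and y is assumed to hold the check words found on the first page that contains word, or "-" otherwise.

```r
chk_words <- c("Swimming pool", "Gym", "west", "para")
word <- "Gym"

# Hypothetical stand-in for the OCR output: one character vector of page text per file
pages_by_file <- list(
  "test1.pdf" = c("Swimming pool and gym on the first floor", "west wing"),
  "test2.pdf" = c("nothing relevant here")
)

# Return the words that occur in a page's text (case-insensitive, fixed match)
find_words <- function(page_text, words) {
  hit <- vapply(
    words,
    function(w) any(grepl(tolower(w), tolower(page_text), fixed = TRUE)),
    logical(1)
  )
  words[hit]
}

# Build one data frame row for a file; all state is local, so nothing leaks between files
summarise_file <- function(file_name, pages) {
  first_page <- 0
  for (j in seq_along(pages)) {
    if (length(find_words(pages[[j]], word)) > 0) {
      first_page <- j
      break
    }
  }
  found <- first_page > 0
  base <- tools::file_path_sans_ext(basename(file_name))
  data.frame(
    file_name = basename(file_name),
    Status = if (found) "Present" else "Not Present",
    pages = if (found) paste0(base, "_", first_page) else "-",
    words = if (found) word else "-",
    # check words are collected on the matching page, not skipped by the break
    y = if (found) paste(find_words(pages[[first_page]], chk_words), collapse = ", ") else "-",
    stringsAsFactors = FALSE
  )
}

rows <- lapply(names(pages_by_file), function(f) summarise_file(f, pages_by_file[[f]]))
df <- do.call(rbind, rows)
print(df)
```

With the real pipeline, the text returned by ocr() for each page would take the place of pages_by_file. Building each row with data.frame() rather than cbind() also keeps the column names and types explicit.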
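I updated the question with respect to the answer, but I still cannot get the "THIRD_COL" column correctly. – deepesh
You need to debug your code, because you have placed it inside two nested for loops and an if statement. I don't see any reason why my code block would not work. – Santosh
I could not find anything wrong in the code I ran. – deepesh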