Scraping a PDF inside an iframe into R

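I am trying to scrape the text of United Nations Security Council (UNSC) resolutions into R. The UN maintains an online archive of all UNSC resolutions in PDF format (here), so in theory this should be doable.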

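If I click the hyperlink for a specific year and then click the link for a specific document (e.g., this one), I can see the PDF in the browser. When I try to download that PDF by pointing download.file at the link shown in the URL bar, it appears to work. However, when I try to read the contents of that file into R with the pdf_text function from the pdftools package, I get a pile of error messages.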

Here is what I tried, unsuccessfully. If you run it, you will see the error messages I am talking about.

library(pdftools) 
pdflink <- "http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)" 
tmp <- tempfile() 
download.file(pdflink, tmp, mode = "wb") 
doc <- pdf_text(tmp)

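What am I missing? I think it has to do with the address of the downloadable version of these files being different from the address shown in the browser, but I can't figure out how to get the path to the former. I tried right-clicking the download icon, using the "Inspect" option in Chrome to find the URL identified as 'src' (this link), and pointing the rest of my process at that. Again, the download.file part executes, but I get the same error messages when I run pdf_text. I also tried a) changing the mode argument in the call to download.file and b) appending ".pdf" to the end of the path in tmp, but neither helped.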

Source

2017-02-22 ulfelder

What error messages are you getting? Can you open the file after downloading it with 'download.file()'? – MrFlick

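@MrFlick, the stack of error messages starts with 'error: May not be a PDF file (continuing)', contains a bunch of "illegal character" messages, and ends with 'Error: PDF parsing failure'. – ulfelder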

I tried your code and the downloaded PDF file is corrupted. When I try the link in a browser, it also throws some errors. Is the link correct? – anonR

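The PDF you want to download sits in an iframe on the main page, so the link you are downloading contains only HTML. You need to follow the link in the iframe to get the actual link to the PDF. Before you reach the direct link for downloading the PDF, you have to hop through a few pages to pick up cookies and temporary URLs.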
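One way to confirm this diagnosis before calling pdf_text is to look at the first bytes of the downloaded file: a real PDF begins with the signature "%PDF-", whereas an HTML page does not. The following is a minimal sketch in base R. The helper name looks_like_pdf is made up for illustration and is not part of pdftools, and the two files are tiny local stand-ins rather than actual downloads.

```r
# Return TRUE if the file at `path` starts with the PDF signature "%PDF-".
# `looks_like_pdf` is a hypothetical helper, not a pdftools function.
looks_like_pdf <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Stand-in for what download.file() saved in the question: HTML, not a PDF.
html_file <- tempfile(fileext = ".pdf")
writeLines("<html><frameset></frameset></html>", html_file)

# Stand-in for a genuine PDF: only the header matters for this check.
pdf_file <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n"), pdf_file)

looks_like_pdf(html_file)  # FALSE
looks_like_pdf(pdf_file)   # TRUE
```

Running looks_like_pdf(tmp) right after download.file would show whether the file is worth handing to pdf_text at all. Note that a file extension such as ".pdf" plays no part in the check.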

Here is a worked example for the link you posted:

library(rvest)
library(pdftools)

# Note: in rvest >= 1.0.0, html_session(), jump_to() and back() were renamed
# to session(), session_jump_to() and session_back(); the old names are deprecated.

# Start a session on the viewer page
s <- html_session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")

# Get the link in the mainFrame frame that holds the PDF
frame_link <- s %>% read_html() %>%
    html_nodes(xpath = "//frame[@name='mainFrame']") %>%
    html_attr("src")

# Go to that link
s <- s %>% jump_to(url = frame_link)

# There is a meta refresh with a link to another page; get it and go there
temp_url <- s %>% read_html() %>%
    html_nodes("meta") %>%
    html_attr("content") %>% {gsub(".*URL=", "", .)}

s <- s %>% jump_to(url = temp_url)

# Visit the login URL to get the LtpaToken cookie, then come back
s %>% jump_to(url = "https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234") %>%
    back()

# Get the PDF link from the second meta refresh and download it
pdf_link <- s %>% read_html() %>%
    html_nodes(xpath = "//meta[@http-equiv='refresh']") %>%
    html_attr("content") %>% {gsub(".*URL=", "", .)}

s <- s %>% jump_to(pdf_link)

# Write the raw response body to disk and extract the text
tmp <- tempfile()
writeBin(s$response$content, tmp)
doc <- pdf_text(tmp)
doc
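The gsub(".*URL=", "", .) step in the code above pulls the redirect target out of a meta refresh tag, whose content attribute has the form "seconds; URL=target". As a rough sketch, that step could be isolated into a small base R helper that also tolerates lowercase "url=" and quoted targets. The name refresh_target and the sample strings are made up for illustration; they are not taken from the UN site.

```r
# Extract the redirect target from the `content` attribute of a
# <meta http-equiv="refresh"> tag, e.g. "0; URL=/some/path".
# `refresh_target` is a hypothetical helper name.
refresh_target <- function(content) {
  # Drop everything up to and including the first "url=" (case-insensitive).
  target <- sub("^.*?url\\s*=\\s*", "", content, ignore.case = TRUE, perl = TRUE)
  # Strip optional surrounding quotes and whitespace.
  target <- gsub("^[\"'\\s]+|[\"'\\s]+$", "", target, perl = TRUE)
  # A refresh without a URL part (e.g. "5") has no target: return NA.
  ifelse(grepl("url\\s*=", content, ignore.case = TRUE), target, NA_character_)
}

refresh_target("0; URL=/TMP/example.html")          # made-up sample value
refresh_target("0;url='http://example.org/a.pdf'")  # quoted, lowercase variant
refresh_target("5")                                 # no URL part
```

Returning NA for a content value without a URL makes it easier to notice when a page has several meta tags and only one of them is the refresh.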

Source

2017-02-25 09:04:03 NicE

Thanks. Now, to loop over the myriad resolutions from the past half century... – ulfelder
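For the looping step, one possible starting point is sketched below. It assumes that every resolution follows the same view_doc.asp?symbol=S/RES/number(year) pattern as the link in the question, which is not verified here. The names resolution_url and fetch_all are hypothetical, and fetch is a placeholder for the session-based steps from the answer wrapped in a function.

```r
# Build the viewer URL for a resolution, following the pattern of the
# question's link. That all resolutions share this pattern is an assumption.
resolution_url <- function(number, year) {
  sprintf("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/%d(%d)",
          as.integer(number), as.integer(year))
}

# Apply `fetch` to each URL, storing NA instead of stopping when one fails.
# `fetch` is a placeholder for a function wrapping the answer's steps.
fetch_all <- function(urls, fetch, pause = 1) {
  results <- vector("list", length(urls))
  names(results) <- urls
  for (i in seq_along(urls)) {
    results[i] <- list(tryCatch(fetch(urls[i]),
                                error = function(e) NA_character_))
    Sys.sleep(pause)  # be polite to the server between requests
  }
  results
}

# Offline demonstration with a stub in place of the real scraper.
stub_fetch <- function(u) {
  if (!grepl("^http", u)) stop("not a URL")
  "stub text"
}
demo <- fetch_all(c(resolution_url(2341, 2017), "not-a-url"),
                  fetch = stub_fetch, pause = 0)
```

Catching errors per document keeps one bad link from aborting a long run, and the NA entries mark which resolutions need a second look.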
