R: htmlTreeParse results are partially translated for no apparent reason

So, I'm in Italy, playing around in R with the IMDB list of "Best Picture" Oscar winners. Running this code:

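library(XML)
# Keep the URL on one line: stray whitespace inside it breaks the request
fileUrl <- "http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc_3"
doc <- htmlTreeParse(fileUrl, useInternalNodes = TRUE)
# Text content of every <td class="title"> cell
scores <- xpathSApply(doc, "//td[@class='title']", xmlValue)
head(scores, 3)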

produces the following output:

[1] "\n \n\n\n\n 12 anni schiavo\n (2013)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n8.2/10\nX\n \n\n\nIn the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.\n\n Dir: Steve McQueen\n With: Chiwetel Ejiofor, Michael K. Williams, Michael Fassbender\n\n Biography | Drama | History\n \n 134 mins.\n"              
[2] "\n \n\n\n\n Argo\n (2012)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n7.8/10\nX\n \n\n\nActing under the cover of a Hollywood producer scouting a location for a science fiction film, a CIA agent launches a dangerous operation to rescue six Americans in Tehran during the U.S. hostage crisis in Iran in 1980.\n\n Dir: Ben Affleck\n With: Ben Affleck, Bryan Cranston, John Goodman\n\n Drama | Thriller\n \n 120 mins.\n" 
[3] "\n \n\n\n\n The Artist\n (2011)\n\n\n\n \n \n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n8.0/10\nX\n \n\n\nA silent movie star meets a young dancer, but the arrival of talking pictures sends their careers in opposite directions.\n\n Dir: Michel Hazanavicius\n With: Jean Dujardin, Bérénice Bejo, John Goodman\n\n Comedy | Drama | Romance\n \n 100 mins.\n"

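Look at the first field after the line breaks. Notice how film 1 has its title translated into Italian (the English title is '12 Years a Slave'), while film 3 is given only in English? Fast-forwarding a bit, here is a snippet from further along, just to give an idea (intermediate steps omitted):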

> head(scores.df[,1],10) 
[1] "12 anni schiavo"     "Argo"        
[3] "The Artist"      "Il discorso del re"    
[5] "The Hurt Locker"     "The Millionaire"     
[7] "Non è un paese per vecchi"  "The Departed - Il bene e il male" 
[9] "Million Dollar Baby"    "Crash: Contatto fisico"
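The intermediate steps are not shown in the question. As a rough sketch of how the title could be pulled out of each raw string, assuming the title is always the first non-empty field after splitting on newlines (the helper name `first_field` and the shortened sample strings are made up for illustration):

```r
# Shortened samples in the same shape as the raw xmlValue() output above
raw <- c(
  "\n \n\n\n\n 12 anni schiavo\n (2013)\n\n\n\n \n \n\n1\n2\n\n8.2/10\nX\n",
  "\n \n\n\n\n Argo\n (2012)\n\n\n\n \n \n\n1\n2\n\n7.8/10\nX\n"
)

# Hypothetical helper: split on newlines, trim each piece,
# and return the first piece that is not empty
first_field <- function(x) {
  parts <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])
  parts[nzchar(parts)][1]
}

titles <- vapply(raw, first_field, character(1), USE.NAMES = FALSE)
print(titles)
```

This only recovers the title column; the year, rating and other fields would need their own extraction rules.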

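I run a web proxy, so naturally when I visit the site in Chrome it gives me everything in English. But even in incognito mode and in Internet Explorer it serves everything in English, so why is it partially translating some titles here, and how can I force it to stop?

Thanks!

Source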

2014-10-01 Amit Kohli

So the parsed values are not present at the URL passed to the parser? – 2014-10-01 00:20:57

Could you add the URL to the question? – 2014-10-01 00:21:31

Correct, if I go to the URL I see all the names in English. The URL is the one given in fileUrl, so 'http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc_3' – 2014-10-02 11:27:29

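It looks as though IMDB must be assuming something based on the IP address your request originates from. Most likely either Chrome has a default locale set that requests the en-US version of the page, or your proxy has a more "English-looking" IP, whereas htmlTreeParse does not download the file through the same mechanism. I don't see any obvious way to change the headers used by the XML library. However, here is a way to do it with the httr library, which handles the HTTP request: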

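library(XML)
library(httr)
fileUrl <- "http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc_3"
# Accept-Language takes a comma-separated list of language tags.
# Fetching the body as text and parsing it with htmlParse() keeps the result
# an XML document that xpathSApply() accepts, whatever httr parses by default.
en <- htmlParse(content(GET(fileUrl, add_headers("Accept-Language" = "en-US,en")), as = "text", encoding = "UTF-8"), asText = TRUE)
it <- htmlParse(content(GET(fileUrl, add_headers("Accept-Language" = "it-IT,it")), as = "text", encoding = "UTF-8"), asText = TRUE)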
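For reference, an Accept-Language value is conventionally a comma-separated list of language tags, each optionally weighted with a `q` value between 0 and 1. A minimal sketch of building such a value from a preference-ordered vector follows; the helper name `accept_language` and the 0.1 step between weights are assumptions for illustration, not part of httr or XML:

```r
# Hypothetical helper: turn c("en-US", "en") into "en-US, en;q=0.9".
# The first tag gets the implicit weight 1; each later tag drops by 0.1.
accept_language <- function(langs) {
  stopifnot(length(langs) >= 1, length(langs) <= 10)
  q <- seq(1, by = -0.1, length.out = length(langs))
  weighted <- sprintf("%s;q=%.1f", langs, q)
  # Leave the weight off the first (most preferred) tag
  weighted[1] <- langs[1]
  paste(weighted, collapse = ", ")
}

header_en <- accept_language(c("en-US", "en"))
header_it <- accept_language(c("it-IT", "it", "en"))
print(header_en)
print(header_it)
```

The resulting string could then be passed as the value in `add_headers("Accept-Language" = header_en)`.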

Now we can compare the results:

head(xpathSApply(en,"//td[@class='title']//a[1]", xmlValue)) 
# [1] "12 Years a Slave" "Argo"    "The Artist"   
# [4] "The King's Speech" "The Hurt Locker"  "Slumdog Millionaire" 

head(xpathSApply(it,"//td[@class='title']//a[1]", xmlValue)) 
# [1] "12 anni schiavo" "Argo"    "The Artist"   
# [4] "Il discorso del re" "The Hurt Locker" "The Millionaire"

So we can see that IMDB honors the language requested in the request headers.

Source
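For anyone unsure about the XPath used above, here is a small offline sketch on a hand-written HTML fragment (the markup is invented for illustration and is much simpler than the real IMDB page). Strictly, `//a[1]` matches every `<a>` that is the first `<a>` child of its parent, which in this fragment is the title link in each cell:

```r
library(XML)

# Invented fragment: each title cell holds a title link and a director link
html <- paste0(
  "<table>",
  "<tr><td class='title'><a href='/t1'>12 Years a Slave</a> ",
  "<span>(2013)</span> <a href='/d1'>Steve McQueen</a></td></tr>",
  "<tr><td class='title'><a href='/t2'>Argo</a> ",
  "<span>(2012)</span> <a href='/d2'>Ben Affleck</a></td></tr>",
  "</table>"
)

doc <- htmlParse(html, asText = TRUE)

# First anchor in each title cell
first_links <- xpathSApply(doc, "//td[@class='title']//a[1]", xmlValue)
# All anchors in the title cells, for comparison
all_links <- xpathSApply(doc, "//td[@class='title']//a", xmlValue)
print(first_links)
print(all_links)
```

If the expression returns NULL or an empty list on the real page, it is worth checking first that the download itself succeeded.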

2014-10-01 00:38:15 MrFlick

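I can't reproduce your example... what is 'a[1]'? The head commands you provided return NULL, and if I remove the '//a[1]' it returns an empty list – 2014-10-02 11:26:13

@AmitKohli I had wrapped the URL to make it look nicer, but if you didn't remove the white space you would probably get that error. I've removed the space from the example. All 'a[1]' means is "find the first anchor tag under the title TD tag". I only used it to extract the names. You can use any XPath you like; that's not the important part. – MrFlick 2014-10-02 13:14:51

I reinstalled XML and now I can reproduce your example... absolutely right, thank you! – 2014-10-02 14:24:39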
