HTML character entity replacement in R

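I have a large set of HTML files that contain magazine text in span nodes. My PDF-to-HTML converter inserts the character entity &nbsp; throughout the HTML. The problem is that in R I use the xmlValue function (in the XML package) to extract the text, but wherever an &nbsp; occurs, the space between the words is eliminated. For example: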

<span class="ft6">kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span>

comes out of the xmlValue function as:

"kids,and kids in your community,in DIYprojects."

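I was thinking that the easiest way to solve this would be to find every &nbsp; and replace it with " " (a space) before running the span nodes through xmlValue. How would I go about that?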


2013-01-15 Gene Burinsky

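I have rewritten the answer to reflect the original poster's problem of not being able to get at the text from xmlValue. There are probably several ways to tackle this, but one is to open the HTML files themselves, do the replacement, and write them back out. Processing XML/HTML with regular expressions is usually a bad idea, but here the problem is simply unwanted non-breaking spaces, so it is probably not much of an issue. The following code is an example of how to build a list of matching files and run gsub over their contents. It should be easy to modify or extend as needed.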

setwd("c:/test/") 
# Create 'html' file to use with test 
txt <- "<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span> 
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span> 
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span>" 
writeLines(txt, "file1.html") 

# Now read files - in this case only one 
html.files <- list.files(pattern = "\\.html$") # escape the dot and anchor at the end of the name
html.files 

# Loop through the list of files 
retval <- lapply(html.files, function(x) { 
      in.lines <- readLines(x, n = -1) 
      # Replace non-breaking space with space 
      out.lines <- gsub("&nbsp;"," ", in.lines) 
      # Write out the corrected lines to a new file 
      writeLines(out.lines, paste("new_", x, sep = "")) 
})


2013-01-15 00:23:10 SlowLearner

It's '&nbsp;' not '$nbsp' by the way, so 'gsub("&nbsp;", " ", test)' should work. – thelatemail

@thelatemail Thanks for spotting that - typo now fixed. I must avoid posting before I am properly awake... – SlowLearner

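I tried gsub. The problem is that the input to xmlValue is not a character vector, it is an "XMLInternalNode". gsub needs a character vector or something that can be coerced to one, and this is neither. –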
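One possible way around the point raised in the comment above is to do the replacement while the HTML is still a character vector, before it is parsed into nodes. The following is only a minimal sketch under that assumption: `squish` is a hypothetical helper that is not part of the thread, and the parsing step runs only if the XML package happens to be installed.

```r
# Raw HTML held as an ordinary character string (what readLines() would return)
html <- '<span class="ft6">kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span>'

# Replace the entity while the input is still a character vector,
# so gsub() never sees an XMLInternalNode
html_fixed <- gsub("&nbsp;", " ", html, fixed = TRUE)

# Hypothetical helper (an assumption, not from the thread):
# collapse runs of whitespace to one space and trim both ends
squish <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# Parse and extract only if the XML package is available
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(html_fixed, asText = TRUE)
  span_text <- squish(XML::xpathSApply(doc, "//span[@class='ft6']", XML::xmlValue))
  print(span_text)
}
```

If the text has already been extracted with xmlValue, the parser will normally have decoded the entity to the non-breaking space character U+00A0, so in that case the pattern to replace would be that character rather than the literal `&nbsp;`.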
