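Where is this whitespace hiding?
I have a character vector holding the text of some PDF files scraped with pdftotext (the command-line tool).
Everything is (blissfully) nicely lined up. However, the vector is riddled with a kind of whitespace that eludes my regular expressions: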
> test
[1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
[6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
"Pewaukee")
> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
+ "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
+ "Pewaukee")
> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"
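Clearly there is some character here that does not survive dput, as in the question below:
How to properly dput internationalized text?
I can't copy/paste the entire vector.... How do I search for and destroy this non-whitespace whitespace?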
Edit
Apparently I wasn't even close to clear, because the answers are all over the place. Here is a simpler test case:
> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE
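A single space appears between the words "Clinic" and "Information" both on the screen and in the dput output, but whatever is actually in the string is not a standard space. My goal is to eliminate it so that I can grep this element out correctly.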
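One way to check whether a string that prints normally contains something other than an ordinary space is to inspect its code points. The following is only a minimal sketch, assuming the strings are valid UTF-8. The vector `x` is a hypothetical stand-in for the scraped data, and the non-breaking space (`\u00a0`) in it is an assumed example of an invisible character, not a confirmed diagnosis.

```r
# Hypothetical stand-in for the scraped vector; "\u00a0" (non-breaking space)
# is an assumed example of a character that prints like a space.
x <- c("Address:", "Clinic\u00a0Information:", "Pewaukee")

# TRUE for every element that holds a code point outside the ASCII range.
has_non_ascii <- vapply(
  x,
  function(s) any(utf8ToInt(s) > 127L),
  logical(1),
  USE.NAMES = FALSE
)
has_non_ascii

# For the flagged elements, list only the non-ASCII code points.
suspects <- lapply(x[has_non_ascii], function(s) {
  cp <- utf8ToInt(s)
  cp[cp > 127L]
})
suspects
```

Any code point above 127 in an element that looks like plain English text is a candidate for the character that the regular expression fails to match.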
The whitespace isn't in the vector itself; it's only in the way it is displayed. – 2012-07-28 17:07:43
Look at 'lapply(test[4], utf8ToInt)' and see whether there are any large numbers in there. – 2012-07-28 17:37:39
@AlanCurry '> lapply(test[4], utf8ToInt) [1] 51 53 49 160 83 111 117 116 104 160 87 97 115 104 98 117 114 110' – 2012-07-28 20:35:37
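The 160 in that output is code point U+00A0, the non-breaking space, where an ordinary space would be 32. That would explain why a pattern containing a literal space fails to match. A minimal sketch of one possible fix follows, assuming the strings are valid UTF-8 and that U+00A0 is the only offending character. The names `x` and `cleaned` are hypothetical, and `x` is rebuilt from escapes so that the special character survives copy/paste.

```r
# Hypothetical stand-in for test[4], written with explicit escapes so the
# non-breaking spaces are preserved when this snippet is copied.
x <- "351\u00a0South\u00a0Washburn"

# Inspect the code points: 160 appears where a plain space (32) is expected.
utf8ToInt(x)

# Replace every non-breaking space with an ordinary space.
# fixed = TRUE treats the pattern as a literal string, not a regex.
cleaned <- gsub("\u00a0", " ", x, fixed = TRUE)

# The original pattern from the question now matches.
matched <- grepl("[0-9]+ [A-Za-z ]+", cleaned)
matched
```

The same `gsub` call can be applied to the whole vector at once, since `gsub` is vectorised over its input. If other invisible characters turn up, they would need their own replacements.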