2015-05-06 24 views
3

Remove punctuation from text - Spark

This is a sample of my data:

case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25). 

I want to remove all punctuation except the dot, and also remove words of length <= 2. My expected output is:

case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time 
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 . 

This should be implemented in Scala. I have tried:

replaceAll("""\\W\s""", "") 
replaceAll(""""[^a-zA-Z\.]""", "") 

But this does not work. Can anyone help me?

+0

'$25' contains a special character that you did not remove. – tuxdna

Answers

13

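Looking at the regex javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct}, and that characters can be subtracted from a character class with the intersection syntax, as in [a-z&&[^def]]. From there, it is easy to define a regex that removes all punctuation except the dot: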

s.replaceAll("""[\p{Punct}&&[^.]]""", "") 

Removing words of length <= 2 can be done like this:

s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "") 

Combining the two gives:

s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "") 

Note how I added \s* to remove the redundant whitespace.
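As a rough illustration of how the combined pattern behaves, here is a minimal sketch applied to one sentence from the sample data. The object and method names (`SinglePass`, `clean`) are made up for this example and are not part of the question.

```scala
object SinglePass {
  // Either a punctuation character other than '.', or a whole word of
  // one or two letters; in both cases any trailing whitespace is swallowed too.
  private val Junk = """([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*"""

  def clean(s: String): String = s.replaceAll(Junk, "")
}

// SinglePass.clean("sound great altec speaker dock it! chance back base (xm3020) .")
// returns "sound great altec speaker dock chance back base xm3020."
```

Because \s* also follows the closing parenthesis, the space before the final dot is removed along with it; "xm3020" survives because the digit after "xm" means there is no word boundary there.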

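Also, you can see that the above regex removes '$' entirely, because \p{Punct} treats it as punctuation (that class covers the ASCII punctuation set, which includes $). If this is not desirable (as your expected output seems to indicate), define punctuation more precisely. For example, you might want to treat only the following characters as punctuation to remove: ?!:()

s.replaceAll("""([?!:()]|\b\p{IsLetter}{1,2}\b)\s*""", "") 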

Alternatively, you can simply add '$' to your list of "not punctuation" characters, alongside the dot:

s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "") 
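
One thing to watch with the single-pass form, as far as I can tell: the apostrophe in a contraction such as "won't" creates a word boundary, so the trailing "t" is matched as a one-letter word and removed together with the space after it ("won't long" becomes "wonlong"). A minimal sketch that strips punctuation first and short words second avoids this. The names `TwoPass`, `Punctuation` and `ShortWord` are invented for the example.

```scala
object TwoPass {
  // Pass 1: every punctuation character except '.' and '$'.
  private val Punctuation = """[\p{Punct}&&[^.$]]"""
  // Pass 2: whole words of one or two letters, plus trailing whitespace.
  private val ShortWord = """\b\p{IsLetter}{1,2}\b\s*"""

  def clean(s: String): String =
    s.replaceAll(Punctuation, "").replaceAll(ShortWord, "")
}

// TwoPass.clean("guessing won't long . amount paid ($25).")
// returns "guessing wont long . amount paid $25."
```

The cost is a second scan of the string, which is unlikely to matter for text of this size.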
0

How about:

replaceAll("(\\(|\\)|'|/)", "") 

Then you just need to add more punctuation to remove using |, making sure to escape characters like ( and ) with a double backslash.
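
If the list grows, escaping by hand gets error-prone. A small sketch, assuming a hypothetical list `toRemove`, that builds the alternation with `java.util.regex.Pattern.quote` so that no manual backslashes are needed:

```scala
import java.util.regex.Pattern

object StripChars {
  // Hypothetical list of characters to drop; extend as needed.
  val toRemove: Seq[String] = Seq("(", ")", "'", "/", ",", "!")

  // Pattern.quote turns each entry into a literal, so "(" and ")" are safe.
  val regex: String = toRemove.map(c => Pattern.quote(c)).mkString("|")

  def strip(s: String): String = s.replaceAll(regex, "")
}

// StripChars.strip("(especially battery/memory card door open time)")
// returns "especially batterymemory card door open time"
```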

0

You can try filtering the string like this:

val example = "Hey there! It's me, myself and I." 
example.filterNot(x => x == ',' || x == '!' || x == 'm') 
res3: String = Hey there It's e yself and I. 
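
The chain of == comparisons can also be written with a set, since a `Set[Char]` can be used directly as a `Char => Boolean` predicate. A minimal sketch; the contents of `unwanted` are only an example:

```scala
object FilterChars {
  // Example set of characters to drop; '.' and '$' are deliberately absent.
  val unwanted: Set[Char] = Set(',', '!', '(', ')', '\'', '"', '/')

  // A Set[Char] is itself a Char => Boolean, so it can be passed as the predicate.
  def strip(s: String): String = s.filterNot(unwanted)
}

// FilterChars.strip("Hey there! It's me, myself and I.")
// returns "Hey there Its me myself and I."
```

This handles punctuation only; dropping short words still needs a regex or a token filter.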
0

Try this; it should work:

val str = """ 
    |case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
    |xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25). 
    """.stripMargin('|') 

println(str) 
val pat = """[^\w\s\.\$]""" 
val pat2 = """\s\w{2}\s""" 
println(str.replaceAll(pat, "").replaceAll(pat2, "")) 

OUTPUT:

case time especially its purse read manual care follow care instructions make stays waterproof example inspect rubber seals doors especially batterymemory card door open time 
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dockchance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25.
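
Note that \s\w{2}\s consumes the whitespace on both sides of a two-character word, which is why "dock it chance" collapses to "dockchance" in the output above, and that it skips one-character words. A possible alternative, sketched here under the assumption that the text is processed one line at a time, is to split each line into tokens and drop the short ones. The names `TokenClean`, `cleanLine` and `clean` are made up for this example.

```scala
object TokenClean {
  // Keep word characters, whitespace, '.' and '$'; drop everything else.
  private val Unwanted = """[^\w\s.$]"""

  def cleanLine(line: String): String =
    line.replaceAll(Unwanted, "")
      .split("""\s+""")
      .filter(_.nonEmpty)
      // Drop tokens made of only one or two letters; "." and "$25" are kept.
      .filterNot(_.matches("""\p{IsLetter}{1,2}"""))
      .mkString(" ")

  // Clean each line separately so line breaks are preserved.
  def clean(text: String): String =
    text.split("\n").map(cleanLine).mkString("\n")
}

// TokenClean.cleanLine("sound great altec speaker dock it! chance back base (xm3020) .")
// returns "sound great altec speaker dock chance back base xm3020 ."
```

A short word that has a dot attached (for example "it.") is left alone by this sketch, and runs of spaces are collapsed to one. In Spark, a function like `cleanLine` could presumably be passed to `map` over an RDD of lines, though that wiring is not shown here.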