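Removing punctuation in R

2015-03-18

I am new to R and working on a text analysis project. My problem is that I can't seem to get my data cleaned and prepped for analysis. My code is below, but my chat data is unchanged after I run it.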

library(tm)

chat.df <- chat1
dim(chat.df)
# [1] 0 1
myCorpus <- Corpus(VectorSource(chat.df$text))
# wrap base functions such as tolower in content_transformer() so the corpus structure is kept
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myStopwords <- c(stopwords("english"), "available", "via")
# keep "r" and "big" by dropping them from the stop list
myStopwords <- setdiff(myStopwords, c("r", "big"))

This is my chat data after running these scripts. I have also attached my file. I need help.

To.link.your.wife.s.email.address.to.a.digital.account.please.follow.these.steps.As.a.registered.user.of.Boston.Globe.Subscriber.Services.your.account.has.already.been.linked.To.confirm.your.account.information.is.correct.please.login.to.Subscriber.Services..httpsservicesbostonglobecomregistrationsubDefaultaspx...I.see.that.you.want.to.view.your.vacation.stops.I.have.two.stops.recorded.The.first.is.a.stop.on.April.3.with.a.resume.date.of.April.7.The.second.is.a.stop.date.of.May.9.with.a.resume.date.of.May.11.If.you.would.like.to.make.any.changes.to.these.vacation.stops.we.are.happy.to.help.you.If.you.are.having.trouble.receiving.today.s.paper.we.suggest.a.manual.download.This.can.be.done.by.going.to.the.Store..select.today.s.date.and.then.downloadltbr.gtltbr.gtIf.there.is.anything.else.we.can.assist.you.with.please.let.us.know.In.30.seconds.I.will.need.to.terminate.the.chat.due.to.no.response..Let.me.know.if.you.are.still.there.Ms.Piasecki.we.has.a.late.truck.in.this.morning.and.all.papers.should.be.delivered.until.9am.I.will.alert.the.distribution.center.manager.to.contact.you.regarding.this.issue.so.he.may.better.understand.how.his.local.carriers.can.improve.their.serviceltbr.gtltbr.gt.We.apologize.for.any.inconvenience.this.delay.may.causeltbr.gt.Let..us.know.if.there.is.anything.else.we.can.help.you.with.today..Our.subscriber.management.system.is.currently.down.for.maintenance.and.as.a.result.I.am.unable.to.make.any.changes.to.accounts.Please.contact.us.tomorrow.with.your.request.when.our.system.is.back.up.and.we.will.be.happy.to.assist.you.further.picture.of.a.head.Please.let.us.know.if.there.is.anything.we.can.help.you.withltbr.gt.We.re.sorry.to.hear.that.you.re.having.trouble.with.the.app..We.are.aware.of.the.issue.since.the.IOS.8.update..We.suggest.de.authorizing.from..the.app.and.then.re.authorizing.This.can.be.done.by.going.to.the.Settings.and.then.Swipe.to.de.authorizeltbr.gtltbr.gtOnce.de.authorized.please.authenticate.in.the.app.againltbr.gtltbr.gt..Cli
ck.on.Settings.on.the.iPhone.or.on.the.iPad.click.the.person.icon.in.the.top.right.corner.of.the.Appltbr.gt..Under.Account.click.on..preview.and.then.under.Registered.User.enter.your.BGcom.e.mail.address.and.password..ltbr.gtltbr.gtOnce.re.authorized.please.try.downloading.today.s.paper.again.This.can.be.done.by.going.to.the.Store..select.today.s.date.and.then.download.You.re.all.set.Is.there.anything.else.I.can.assist.you.with..Your.account.will.be.credited.during.your.time.away.If.you.would.prefer.to.donate.any.days.to.Newspapers.in.Education.please.let.us.know.3.Pickwick.Way.3..Go.to.Pressreader.and.it.will.list.the.authorized.devices.7.day.home.delivery.is.1099.a.week.8.o.clock.is.the.standard.delivery.time.on.weekends.102614.is.your.new.paid.to.date 

So you just want to remove the periods? – rawr 2015-03-18 21:29:41

And the numbers, if possible. – user3117087 2015-03-18 21:36:53

I haven't worked with it in a while, but I think it expects the data as a matrix first? – user2600629 2015-03-18 22:27:36

Answers

1

With base R you can easily clean up your string:

x <- tolower("Time.on.weekends.102614.is.your.New.paid.to")
# replace every digit and punctuation character with a space
gsub("[[:digit:][:punct:]]", " ", x)

[1] "time on weekends        is your new paid to"


# alternatively, drop the digits first, then turn punctuation into spaces
y <- gsub("[0-9]", "", "time.on.weekends.102614.is.your.new.paid.to")
gsub("[[:punct:]]", " ", y)

[1] "time on weekends  is your new paid to"
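
Both calls above replace characters one for one, so a stretch of removed digits and periods leaves a run of spaces behind. A minimal sketch of one way to tidy that up in base R follows; the helper name `clean_text` is made up for illustration and is not part of the answer.

```r
# Hypothetical helper (the name clean_text is an assumption, not from the thread).
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:][:punct:]]+", " ", x)  # each run of digits/punctuation -> one space
  x <- gsub("[[:space:]]+", " ", x)           # collapse any remaining runs of whitespace
  trimws(x)                                   # drop leading and trailing spaces
}

clean_text("Time.on.weekends.102614.is.your.New.paid.to")
# [1] "time on weekends is your new paid to"
```

Because `gsub` is vectorised, the same function should also work on a whole character column, for example a text column of a data frame.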

What about stop words? – 2015-03-18 22:41:44
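
One possible base R approach to the stop-word question, as a rough sketch: split each string into words, drop the ones found in a stop list, and paste the rest back together. The stop list and the function name `remove_stopwords` below are invented for illustration; with tm loaded, the asker's `myStopwords` vector could presumably be passed instead, or `tm_map(myCorpus, removeWords, myStopwords)` used on the corpus.

```r
# Hypothetical stop list for illustration only.
my_stop_list <- c("is", "your", "to", "on")

# Hypothetical helper: removes whole words that appear in stop_list.
remove_stopwords <- function(x, stop_list) {
  words <- strsplit(x, "[[:space:]]+")  # one character vector of words per input string
  kept  <- lapply(words, function(w) w[nzchar(w) & !(w %in% stop_list)])
  vapply(kept, paste, character(1), collapse = " ")
}

remove_stopwords("time on weekends is your new paid to", my_stop_list)
# [1] "time weekends new paid"
```

Matching whole words after splitting avoids clipping stop words out of the middle of longer words, which a plain `gsub` on the raw string could do.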

Thanks jbaums, I removed the &. My bad. – 2015-03-18 23:34:24

0

Why not remove the punctuation before creating the corpus? The stringr approach works better here than tm::removePunctuation because it leaves a space where each punctuation mark was.

You can remove the digits with another call.

library(stringr)
df <- "o.link.your.wife.s.email.address.to.a.digital.account.please.follow.these.steps.As.a.registered.user.of.Boston.Globe.Subscriber.Services.your.account.has.already.been."

text <- str_replace_all(df, pattern = "[[:punct:]]", " ") 

> text 
[1] "o link your wife s email address to a digital account please follow these steps As a registered user of Boston Globe Subscriber Services your account has already been " 
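
As a sketch of the digit-removal step, one option is to put both character classes in a single pattern and then squeeze the leftover whitespace with `str_squish()`. The sample string is a fragment of the asker's data; the variable names `chat_text` and `cleaned` are made up here, and a reasonably recent stringr (one that provides `str_squish()`) is assumed.

```r
library(stringr)

# Fragment taken from the chat data posted in the question.
chat_text <- "8.o.clock.is.the.standard.delivery.time.on.weekends.102614.is.your.new.paid.to.date"

# Replace each run of punctuation or digits with one space,
# then trim the ends and collapse repeated spaces.
cleaned <- str_squish(str_replace_all(chat_text, "[[:punct:][:digit:]]+", " "))
cleaned
```

Whether numbers such as prices or dates should really be discarded depends on the analysis, so this is best treated as a starting point.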

So I tried this, but I get an error – user3117087 2015-03-19 15:25:36

library(stringr) > chat2 <- read.delim(file.choose(), header = T) > chat.df <- chat2 > chat.df <- str_replace_all(df, pattern = "[[:punct:]]", " ") Error: String must be an atomic vector – user3117087 2015-03-19 15:25:57
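
A possible explanation, offered as a guess from what is posted: `dim(chat.df)` in the question reports 0 rows and 1 column, and the "cleaned" text is full of dots. That is what `read.delim(..., header = T)` would produce if the file's only line of text were consumed as a column name, since R replaces characters that are not valid in names with dots. Separately, the comment passes `df` to `str_replace_all()`; if no object called `df` existed in that session, R would find the built-in function `stats::df`, which is not an atomic vector. The sketch below recreates the situation with a throwaway file whose content is invented for illustration, using base R only.

```r
# Recreate the situation with a throwaway one-line file (hypothetical content).
tmp <- tempfile(fileext = ".txt")
writeLines("To link your wife's email address, please follow these steps.", tmp)

# header = TRUE consumes the only line as a column name and mangles it.
with_header <- read.delim(tmp, header = TRUE)
dim(with_header)     # 0 rows, 1 column
names(with_header)   # spaces and punctuation replaced by dots

# header = FALSE keeps the line as data in column V1.
chat <- read.delim(tmp, header = FALSE, stringsAsFactors = FALSE)

# Pass the character column, not the data frame and not `df`.
chat$text <- gsub("[[:punct:]]", " ", chat$V1)
chat$text
```

If the real file does have a header row followed by data, the cause would be something else, so checking `str(chat2)` after reading is a sensible first step.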