2014-03-05 23 views
2

Read lines, process them one by one, and write to a file in Python

I'm very new to Python, and I'm processing the following tweets:

@PrincessSuperC Hey Cici sweetheart! Just wanted to let u know I luv u! OH! and will the mixtape drop soon? FANTASY RIDE MAY 5TH!!!! 
@Msdebramaye I heard about that contest! Congrats girl!! 
UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER http://tinyurl.com/49955t3 
Do you Share More #jokes #quotes #musiC#photos or #news #articles on #Facebook or #Twitter? 
Good night #Twitter and #TheLegionoftheFallen. 5:45am cimes awfully early! 
I just finished a 2.66 mi run with a pace of 11'14"/mi with Nike+ GPS. #nikeplus #makeitcount 
Disappointing day. Attended a car boot sale to raise some funds for the sanctuary, made a total of 88p after the entry fee - sigh 
no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts. 
Just had some bloodwork done. My arm hurts 

And it should produce a feature vector like the following:

featureList = ['hey', 'cici', 'luv', 'mixtape', 'drop', 'soon', 'fantasy', 'ride', 'heard', 
'congrats', 'ncaa', 'franklin', 'wild', 'share', 'jokes', 'quotes', 'music', 'photos', 'news', 
'articles', 'facebook', 'twitter', 'night', 'twitter', 'thelegionofthefallen', 'cimes', 'awfully', 
'finished', 'mi', 'run', 'pace', 'gps', 'nikeplus', 'makeitcount', 'disappointing', 'day', 'attended', 
'car', 'boot', 'sale', 'raise', 'funds', 'sanctuary', 'total', 'entry', 'fee', 'sigh', 'taking', 
'irish', 'car', 'bombs', 'strange', 'australian', 'women', 'drink', 'head', 'hurts', 'bloodwork', 
'arm', 'hurts'] 

However, the current output I get is only

hey, cici, luv, mixtape, drop, soon, fantasy, ride 

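which comes only from the first tweet. It also keeps looping on that one tweet and never moves on to the next line. I tried using nextLine, but apparently that doesn't work in Python. My code is as follows: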

#import regex 
import re 
import csv 
import pprint 
import nltk.classify 

#start replaceTwoOrMore 
def replaceTwoOrMore(s): 
    #look for 2 or more repetitions of character 
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL) 
    return pattern.sub(r"\1\1", s) 
#end 

#start process_tweet 
def processTweet(tweet): 
    # process the tweets 

    #Convert to lower case 
    tweet = tweet.lower() 
    #Convert www.* or https?://* to URL 
    tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','URL',tweet) 
    #Convert @username to AT_USER 
    tweet = re.sub('@[^\s]+','AT_USER',tweet)  
    #Remove additional white spaces 
    tweet = re.sub('[\s]+', ' ', tweet) 
    #Replace #word with word 
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet) 
    #trim 
    tweet = tweet.strip('\'"') 
    return tweet 
#end 

#start getStopWordList 
def getStopWordList(stopWordListFileName): 
    #read the stopwords 
    stopWords = [] 
    stopWords.append('AT_USER') 
    stopWords.append('URL') 

    fp = open(stopWordListFileName, 'r') 
    line = fp.readline() 
    while line: 
     word = line.strip() 
     stopWords.append(word) 
     line = fp.readline() 
    fp.close() 
    return stopWords 
#end 

#start getfeatureVector 
def getFeatureVector(tweet): 
    featureVector = [] 
    #split tweet into words 
    words = tweet.split() 
    for w in words: 
     #replace two or more with two occurrences 
     w = replaceTwoOrMore(w) 
     #strip punctuation 
     w = w.strip('\'"?,.') 
     #check if the word stats with an alphabet 
     val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w) 
     #ignore if it is a stop word 
     if(w in stopWords or val is None): 
      continue 
     else: 
      featureVector.append(w.lower()) 
    return featureVector 
#end 

#Read the tweets one by one and process it 
fp = open('data/sampleTweets.txt', 'r') 
line = fp.readline() 

st = open('data/feature_list/stopwords.txt', 'r') 
stopWords = getStopWordList('data/feature_list/stopwords.txt') 

while line: 
    processedTweet = processTweet(line) 
    featureVector = getFeatureVector(processedTweet) 
    with open('data/niek_corpus_feature_vector.txt', 'w') as f: 
     f.write(', '.join(featureVector)) 
#end loop 
fp.close() 

UPDATE: After changing the loop as suggested below:

st = open('data/feature_list/stopwords.txt', 'r') 
stopWords = getStopWordList('data/feature_list/stopwords.txt') 

with open('data/sampleTweets.txt', 'r') as fp: 
    for line in fp: 
     processedTweet = processTweet(line) 
     featureVector = getFeatureVector(processedTweet) 
     with open('data/niek_corpus_feature_vector.txt', 'w') as f: 
      f.write(', '.join(featureVector)) 
fp.close() 

I get the following output, which contains only the words from the last tweet.

bloodwork, arm, hurts 

I'm still trying to figure it out.

+0

What is the relationship between the expected output and the input? It looks like you're just picking words at random. – aIKid

+0

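The output consists of the important keywords (the feature vector) produced by the getFeatureVector method. The problem here is that I can't seem to advance to the next line. It's not about how the words are chosen. – fuschia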

+0

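.readline() reads only a single line. If you want to read a line, process it, and then read and process the next one, you have to put the whole sequence inside a loop. – Stormvirux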
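The point of the comment above can be sketched as follows. This is a minimal illustration in Python 3, not the asker's code: an in-memory io.StringIO with made-up contents stands in for the tweets file, and the "processing" step is just a strip().

```python
import io

# Stand-in for open('data/sampleTweets.txt'); the contents are made up.
fp = io.StringIO("first tweet\nsecond tweet\nthird tweet\n")

seen = []
line = fp.readline()
while line:                      # readline() returns '' at end of file
    seen.append(line.strip())    # process the current line here
    line = fp.readline()         # advance; without this the loop never ends

print(seen)
```

The second readline() call at the bottom of the loop is the piece missing from the original while loop, which is why it kept reprocessing the first tweet.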

Answers

1

If you want to read one line at a time instead of using readlines(), use a loop as follows.

stopWords = getStopWordList('data/feature_list/stopwords.txt')
with open('data/sampleTweets.txt', 'r') as fp:
    for line in fp:
        processedTweet = processTweet(line)
        featureVector = getFeatureVector(processedTweet)
        # 'a' appends; 'w' would truncate the file on every iteration
        with open('data/niek_corpus_feature_vector.txt', 'a') as f:
            # the trailing newline keeps each tweet's words separate
            f.write(', '.join(featureVector) + '\n')
+0

When I try this, the output contains only the words from the last tweet, namely: **bloodwork, arm, hurts**. I'm still trying to figure out why. – fuschia

+0

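When you open the file with "with open('data/niek_corpus_feature_vector.txt', 'w') as f:", use 'a' (append) instead of 'w' (write). 'w' destroys the previous file contents, so it should be "with open('data/niek_corpus_feature_vector.txt', 'a') as f:". I will update the answer. – Stormvirux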
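The difference described in this comment can be reproduced in isolation. The sketch below is Python 3 and uses assumptions throughout: the vectors list is a hypothetical stand-in for what getFeatureVector returns per tweet, and the output files are written to a temporary directory rather than data/. It also shows an alternative to 'a', which is to open the output file once, outside the loop.

```python
import os
import tempfile

# Hypothetical per-tweet feature vectors, standing in for getFeatureVector output.
vectors = [['hey', 'cici'], ['heard', 'congrats'], ['bloodwork', 'arm', 'hurts']]

out_dir = tempfile.mkdtemp()
path_w = os.path.join(out_dir, 'reopened_w.txt')
path_once = os.path.join(out_dir, 'opened_once.txt')

# Reopening with 'w' inside the loop truncates the file on every pass.
for vec in vectors:
    with open(path_w, 'w') as f:
        f.write(', '.join(vec))

# Opening once, outside the loop, keeps everything that was written.
with open(path_once, 'w') as f:
    for vec in vectors:
        f.write(', '.join(vec) + '\n')

with open(path_w) as f:
    only_last = f.read()
with open(path_once) as f:
    everything = f.read()

print(only_last)
print(everything)
```

Opening the file once with 'w' also means a rerun starts from an empty file, whereas 'a' inside the loop keeps adding to whatever an earlier run left behind.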

1
line = fp.readline() 

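reads only one line from the file. Your while loop then processes that one line, and because line is never reassigned inside the loop, it never advances to the next one. You need to read every line in the file. Once you have read the whole file, process each line the way you already do.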

lines = fp.readlines() 

# Now process each line 

for line in lines:
    # Process the line as you do in your original code
    processedTweet = processTweet(line)
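To end up with a single featureList like the one in the question, the per-line results can be accumulated with extend(). The following is only a sketch in Python 3 under stated assumptions: toy_features is a hypothetical, much simplified stand-in for processTweet plus getFeatureVector, the stop word set is made up, and io.StringIO replaces the real tweets file.

```python
import io

def toy_features(line, stop_words):
    """Hypothetical stand-in for processTweet + getFeatureVector."""
    words = (w.strip('\'"?,.!').lower() for w in line.split())
    return [w for w in words if w.isalpha() and w not in stop_words]

stop_words = {'my', 'just', 'had', 'some'}   # made-up stop word list
fp = io.StringIO("Just had some bloodwork done. My arm hurts\nCongrats girl!!\n")

feature_list = []
for line in fp:                       # file objects yield one line per iteration
    feature_list.extend(toy_features(line, stop_words))

print(feature_list)
```

Iterating over the file object directly avoids loading every line into memory at once, which readlines() does.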

Python File readlines() Method

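The method readlines() reads until EOF using readline() and returns a list containing the lines. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.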

The following is the syntax for the readlines() method:

fileObject.readlines(sizehint);

Parameters: sizehint -- This is the number of bytes to be read from the file.

Return Value: This method returns a list containing the lines. 

Example

The following example shows the usage of the readlines() method.

#!/usr/bin/python 

# Open a file 
fo = open("foo.txt", "rw+")
print "Name of the file: ", fo.name

# Assuming file has following 5 lines 
# This is 1st line 
# This is 2nd line 
# This is 3rd line 
# This is 4th line 
# This is 5th line 

line = fo.readlines()
print "Read Line: %s" % (line)

line = fo.readlines(2)
print "Read Line: %s" % (line)

# Close opened file

fo.close() 

Running the above program produces the following result:

Name of the file: foo.txt
Read Line: ['This is 1st line\n', 'This is 2nd line\n',
      'This is 3rd line\n', 'This is 4th line\n', 
      'This is 5th line\n'] 
Read Line: [] 
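The quoted example is Python 2, and its second call prints an empty list only because the first call had already consumed the whole file. For comparison, here is a small Python 3 sketch of the hint argument on a fresh stream. It assumes an in-memory io.StringIO with made-up contents in place of foo.txt, where the hint is counted in characters.

```python
import io

# Stand-in for foo.txt; the contents are made up.
fo = io.StringIO("This is 1st line\nThis is 2nd line\nThis is 3rd line\n")

first_batch = fo.readlines(2)   # stops once the total read so far exceeds the hint
rest = fo.readlines()           # no hint: everything that remains
again = fo.readlines()          # already at EOF, so nothing is left

print(first_batch)
print(rest)
print(again)
```

Whole lines are always returned, so even a hint of 2 yields the complete first line rather than two characters.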
+0

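@Stormvirux I copied it from the original website to show it exactly. I tend to think that when quoting a website, one should reproduce it exactly rather than change the wording. – sabbahillel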