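I'm very new to Python. I'm working with the tweets below: I want to read them line by line, process each one into a list, and write the result to a file.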
@PrincessSuperC Hey Cici sweetheart! Just wanted to let u know I luv u! OH! and will the mixtape drop soon? FANTASY RIDE MAY 5TH!!!!
@Msdebramaye I heard about that contest! Congrats girl!!
UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER http://tinyurl.com/49955t3
Do you Share More #jokes #quotes #musiC#photos or #news #articles on #Facebook or #Twitter?
Good night #Twitter and #TheLegionoftheFallen. 5:45am cimes awfully early!
I just finished a 2.66 mi run with a pace of 11'14"/mi with Nike+ GPS. #nikeplus #makeitcount
Disappointing day. Attended a car boot sale to raise some funds for the sanctuary, made a total of 88p after the entry fee - sigh
no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts.
Just had some bloodwork done. My arm hurts
The output should be a feature vector like this:
featureList = ['hey', 'cici', 'luv', 'mixtape', 'drop', 'soon', 'fantasy', 'ride', 'heard',
'congrats', 'ncaa', 'franklin', 'wild', 'share', 'jokes', 'quotes', 'music', 'photos', 'news',
'articles', 'facebook', 'twitter', 'night', 'twitter', 'thelegionofthefallen', 'cimes', 'awfully',
'finished', 'mi', 'run', 'pace', 'gps', 'nikeplus', 'makeitcount', 'disappointing', 'day', 'attended',
'car', 'boot', 'sale', 'raise', 'funds', 'sanctuary', 'total', 'entry', 'fee', 'sigh', 'taking',
'irish', 'car', 'bombs', 'strange', 'australian', 'women', 'drink', 'head', 'hurts', 'bloodwork',
'arm', 'hurts']
However, the output I currently get is only
hey, cici, luv, mixtape, drop, soon, fantasy, ride
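which comes from the first tweet alone. The loop keeps running on that one tweet and never moves on to the next line. I tried using nextLine, but apparently that doesn't exist in Python. My code is as follows: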
#import regex
import re
import csv
import pprint
import nltk.classify
#start replaceTwoOrMore
def replaceTwoOrMore(s):
    #look for 2 or more repetitions of character
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)
#end
#start process_tweet
def processTweet(tweet):
    # process the tweets
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    #Remove additional white spaces
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end
#start getStopWordList
def getStopWordList(stopWordListFileName):
    #read the stopwords
    stopWords = []
    stopWords.append('AT_USER')
    stopWords.append('URL')
    fp = open(stopWordListFileName, 'r')
    line = fp.readline()
    while line:
        word = line.strip()
        stopWords.append(word)
        line = fp.readline()
    fp.close()
    return stopWords
#end
#start getfeatureVector
def getFeatureVector(tweet):
    featureVector = []
    #split tweet into words
    words = tweet.split()
    for w in words:
        #replace two or more with two occurrences
        w = replaceTwoOrMore(w)
        #strip punctuation
        w = w.strip('\'"?,.')
        #check if the word starts with a letter
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        #ignore if it is a stop word
        if(w in stopWords or val is None):
            continue
        else:
            featureVector.append(w.lower())
    return featureVector
#end
#Read the tweets one by one and process it
fp = open('data/sampleTweets.txt', 'r')
line = fp.readline()
st = open('data/feature_list/stopwords.txt', 'r')
stopWords = getStopWordList('data/feature_list/stopwords.txt')
while line:
    processedTweet = processTweet(line)
    featureVector = getFeatureVector(processedTweet)
    with open('data/niek_corpus_feature_vector.txt', 'w') as f:
        f.write(', '.join(featureVector))
#end loop
fp.close()
UPDATE: After changing the loop as suggested, the code looks like this:
st = open('data/feature_list/stopwords.txt', 'r')
stopWords = getStopWordList('data/feature_list/stopwords.txt')
with open('data/sampleTweets.txt', 'r') as fp:
    for line in fp:
        processedTweet = processTweet(line)
        featureVector = getFeatureVector(processedTweet)
        with open('data/niek_corpus_feature_vector.txt', 'w') as f:
            f.write(', '.join(featureVector))
fp.close()
Now I get the output below, which contains only the words from the last tweet.
bloodwork, arm, hurts
I'm still trying to figure it out.
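A likely cause of the last-tweet-only output, offered as a sketch rather than a tested fix for the code above: opening a file with mode 'w' truncates it, so reopening the output file on every iteration leaves only what the final iteration wrote. The `vectors` list and the temporary output path below are hypothetical stand-ins for the per-tweet results of getFeatureVector and for data/niek_corpus_feature_vector.txt.

```python
import os
import tempfile

# Hypothetical stand-ins for the per-tweet feature vectors from the question.
vectors = [['hey', 'cici'], ['heard', 'congrats'], ['bloodwork', 'arm', 'hurts']]

# Hypothetical output path in a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'feature_vector.txt')

# Pattern from the update: reopening with 'w' truncates the file each time,
# so only the last vector is left when the loop finishes.
for vec in vectors:
    with open(out_path, 'w') as f:
        f.write(', '.join(vec))
with open(out_path) as f:
    overwritten = f.read()

# Alternative: collect every vector first, then open the file once.
feature_list = []
for vec in vectors:
    feature_list.extend(vec)
with open(out_path, 'w') as f:
    f.write(', '.join(feature_list))
with open(out_path) as f:
    accumulated = f.read()

print(overwritten)
print(accumulated)
```

Opening the output file once before the loop and writing inside it, or opening it in append mode ('a'), should also keep every tweet's words; accumulating into one list simply matches the single featureList shown in the expected output.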
What is the relationship between the expected output and the input? It looks like you're just picking words at random. – aIKid
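The output is the set of important keywords (the feature vector) produced by the getFeatureVector method. The problem here is that I can't seem to move on to the next line. It's not about how the words are selected. – fuschia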
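.readline() only reads a single line. If you want to read a line, process it, then read and process the next one, you have to put the whole read-and-process step inside the loop. – Stormvirux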
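The point about .readline() can be illustrated with a small sketch. It uses an in-memory io.StringIO object as a hypothetical stand-in for data/sampleTweets.txt, and the variable names are illustrative only; the per-line processing is reduced to a strip() so the looping pattern stays visible.

```python
import io

# Hypothetical in-memory stand-in for data/sampleTweets.txt.
sample = "first tweet\nsecond tweet\nthird tweet\n"

# readline() inside the loop: each iteration has to fetch the next line itself.
fp = io.StringIO(sample)
seen_with_readline = []
line = fp.readline()
while line:
    seen_with_readline.append(line.strip())
    # Without this call `line` never changes, so the loop would keep
    # processing the first tweet forever.
    line = fp.readline()

# Iterating over the file object advances line by line automatically.
fp = io.StringIO(sample)
seen_with_for = [line.strip() for line in fp]

print(seen_with_readline)
print(seen_with_for)
```

Both loops visit all three lines. The `for line in fp` form is usually the simpler choice, since there is no separate readline() call to forget.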