2017-06-27

I'm following a short article called Mining Twitter Data with Python; I'm currently on Part 2, which covers text pre-processing of tweets from a JSON file. This is its example for tokenizing tweet text:

import re 
import json 

emoticons_str = r""" 
    (?: 
        [:=;] # Eyes 
        [oO\-]? # Nose (optional) 
        [D\)\]\(\]/\\OpP] # Mouth 
    )""" 
regex_str = [ 
    emoticons_str, 
    r'<[^>]+>', # HTML Tags 
    r'(?:@[\w_]+)', # @-mentions 
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags 
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers 
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and ' 
    r'(?:[\w_]+)', # other words 
    r'(?:\S)' # anything else 
] 

tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE) 
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE) 


def tokenize(s): 
    return tokens_re.findall(s) 


def preprocess(s, lowercase=False): 
    tokens = tokenize(s) 
    if lowercase: 
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens] 
    return tokens 

It works fine when you pass in a string directly, like this:

tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP' 
print(preprocess(tweet)) 

But as soon as I try to import the JSON file to tokenize all the tweet texts in it, an error occurs.

This is how it is supposed to work:

with open('tweets.json', 'r') as f: 
    for line in f: 
        tweet = json.loads(line) 
        tokens = preprocess(tweet['text']) 

This is the error it shows:

Traceback (most recent call last): 
  File "C:/Users/fmigg/PycharmProjects/untitled/Data Mining/tweetTextProcessing.py", line 43, in <module> 
    tweet = json.loads(line) 
  File "C:\Program Files\Anaconda3\lib\json\__init__.py", line 319, in loads 
    return _default_decoder.decode(s) 
  File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 339, in decode 
    obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 
  File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 357, in raw_decode 
    raise JSONDecodeError("Expecting value", s, err.value) from None 
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1) 

Finally, this is the tweets.json file (the number of tweets is rather large, so I'll put only a single tweet from the JSON file here to show its structure):

{"created_at":"Tue Jun 27 16:05:01 +0000 2017","id":879732307992739840,"id_str":"879732307992739840","text":"RT @PythonQnA: Python List Comprehension Vs. Map #python #list-comprehension #map-function https:\/\/t.co\/YtxeSt64pd","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":704974573985525760,"id_str":"704974573985525760","name":"UNIVERSAL TGSI","screen_name":"universaltgsi","location":"Magny-le-Hongre, France, SM","url":"http:\/\/www.tgsi.eu","description":"Find everything you want to know about business Technology by ONE TGSI","protected":false,"verified":false,"followers_count":424,"friends_count":343,"listed_count":273,"favourites_count":4250,"statuses_count":2958,"created_at":"Wed Mar 02 10:20:11 +0000 2016","utc_offset":7200,"time_zone":"Paris","geo_enabled":false,"lang":"fr","contributors_enabled":false,"is_translator":false,"profile_background_color":"1B95E0","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/705020861909225472\/psLvMIAP.jpg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/705020861909225472\/psLvMIAP.jpg","profile_background_tile":true,"profile_link_color":"0084B9","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/866410987880099840\/HT8fZKLO_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/866410987880099840\/HT8fZKLO_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/704974573985525760\/1495404137","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifica
tions":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Tue Jun 27 08:24:00 +0000 2017","id":879616290700263424,"id_str":"879616290700263424","text":"Python List Comprehension Vs. Map #python #list-comprehension #map-function https:\/\/t.co\/YtxeSt64pd","source":"\u003ca href=\"http:\/\/jarvis.ratankumar.org\/\" rel=\"nofollow\"\u003ePythonQnA\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":747460774998605825,"id_str":"747460774998605825","name":"PythonQnA","screen_name":"PythonQnA","location":"Bengaluru, India","url":null,"description":"I tweet Python questions from stackoverflow.","protected":false,"verified":false,"followers_count":632,"friends_count":64,"listed_count":277,"favourites_count":0,"statuses_count":85791,"created_at":"Mon Jun 27 16:05:10 +0000 2016","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"F5F8FA","profile_background_image_url":"","profile_background_image_url_https":"","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/747461193653092352\/Mz9NjeE__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/747461193653092352\/Mz9NjeE__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/747460774998605825\/1467044067","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":2,"favorite_count":1,"entities":{"hashtags":[{"text":"pyth
on","indices":[34,41]},{"text":"list","indices":[42,47]},{"text":"map","indices":[62,66]}],"urls":[{"url":"https:\/\/t.co\/YtxeSt64pd","expanded_url":"https:\/\/goo.gl\/OZxWIC","display_url":"goo.gl\/OZxWIC","indices":[76,99]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"python","indices":[49,56]},{"text":"list","indices":[57,62]},{"text":"map","indices":[77,81]}],"urls":[{"url":"https:\/\/t.co\/YtxeSt64pd","expanded_url":"https:\/\/goo.gl\/OZxWIC","display_url":"goo.gl\/OZxWIC","indices":[91,114]}],"user_mentions":[{"screen_name":"PythonQnA","name":"PythonQnA","id":747460774998605825,"id_str":"747460774998605825","indices":[3,13]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":true,"filter_level":"low","lang":"en","timestamp_ms":"1498579501518"} 

I'd like to know why this is happening. Thanks a lot, everyone!

P.S. This is the link to the corresponding article: Mining Twitter Data with Python (Part 2: Text Pre-processing)

UPDATE:

I tried the code with a JSON file containing a single simple JSON tweet, and then with two simple JSON tweets, and it worked. So it seems the problem only appears when I open the whole file with all the tweets in it.

If anyone needs the file, you can download or view it on my Microsoft OneDrive: https://1drv.ms/f/s!AjHPHWCBEuf7ux3uLmSVEaSCPWIE

+1

I suspect it's because of empty lines in the file. Wrap 'json.loads(line)' in a try/except and print the invalid lines. That should help you find the bad ones. – balki
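A minimal sketch of that suggestion: wrap the parse in try/except so bad lines are reported instead of crashing the loop (the function name `load_tweets_report_bad_lines` is my own, not from the article):

```python
import json

def load_tweets_report_bad_lines(path):
    """Parse one JSON object per line; print any line that fails to decode."""
    tweets = []
    with open(path, 'r', encoding='utf-8') as f:
        for lineno, line in enumerate(f, start=1):
            try:
                tweets.append(json.loads(line))
            except json.JSONDecodeError as err:
                # An empty line (or any non-JSON text) ends up here.
                print('Invalid line %d: %r (%s)' % (lineno, line, err))
    return tweets
```

Running this on the file above would print every blank line as invalid while still returning the tweets that do parse.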

+0

It could be the trailing newline at the end of each line. Try 'json.loads(line.strip())'. – anonymoose

+0

@balki Thanks! It worked using this example: https://stackoverflow.com/questions/4710067/deleting-a-specific-line-in-a-file-python –

Answers

1

As @balki said, it's because there is an empty line after each JSON object, in this pattern:

1 JSON Object 
2 empty line 
3 JSON Object 
4 empty line 

So I took the solution from the question Deleting a specific line in a file (python) and changed it to delete the empty lines, like this:

def erase_empty_lines(file_name): 
    file = open(file_name, 'r') 
    lines = file.readlines() 
    file.close() 

    file = open(file_name, 'w') 
    for line in lines: 
        if line != '\n': 
            file.write(line) 
    file.close() 
+0

If you don't want to modify the original file, you can do something like this: https://pastebin.com/p1vdsuds – balki
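One way to sketch that idea, assuming the goal is simply to skip blank lines while reading so the original file stays untouched (the function name `iter_tweets` is illustrative, not from the linked paste):

```python
import json

def iter_tweets(path):
    """Yield one parsed tweet per non-blank line, leaving the file as-is."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip empty / whitespace-only lines
                yield json.loads(line)
```

This avoids both the rewrite step and the need to hold every line in memory at once.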

0

Your JSON file might contain the entire JSON string on a single line. In that case, iterating over the lines of the file makes no sense. Instead, load the file's contents with tweets = json.load(f). Assuming the tweets are stored in a list, you can then iterate over them like this:

with open('tweets.json') as fp: 
    tweets = json.load(fp) 

for tweet in tweets: 
    tokens = preprocess(tweet['text'])