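Avoid duplicate entries in MongoDB using Java and JSON objects

I am developing an analysis program for Twitter data and am currently using MongoDB. I am trying to write a Java program that fetches tweets from the Twitter API and stores them in the database. Fetching the tweets already works well, but I run into a problem when I try to store them: because the Twitter API often returns the same tweets again, I need some kind of index in the database.

First, I connect to the database and get the collection for the search term, creating it if it does not exist yet: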
public void connectdb(String keyword)
{
    try {
        // on constructor load initialize MongoDB and load collection
        initMongoDB();
        items = db.getCollection(keyword);
        BasicDBObject index = new BasicDBObject("tweet_ID", 1);
        items.ensureIndex(index);
    } catch (MongoException ex) {
        System.out.println("MongoException :" + ex.getMessage());
    }
}
Then I fetch the tweets and put them into the database:
public void getTweetByQuery(boolean loadRecords, String keyword) {
    if (cb != null) {
        TwitterFactory tf = new TwitterFactory(cb.build());
        Twitter twitter = tf.getInstance();
        try {
            Query query = new Query(keyword);
            query.setCount(50);
            QueryResult result;
            result = twitter.search(query);
            System.out.println("Getting Tweets...");
            List<Status> tweets = result.getTweets();
            for (Status tweet : tweets) {
                BasicDBObject basicObj = new BasicDBObject();
                basicObj.put("user_name", tweet.getUser().getScreenName());
                basicObj.put("retweet_count", tweet.getRetweetCount());
                basicObj.put("tweet_followers_count", tweet.getUser().getFollowersCount());
                UserMentionEntity[] mentioned = tweet.getUserMentionEntities();
                basicObj.put("tweet_mentioned_count", mentioned.length);
                basicObj.put("tweet_ID", tweet.getId());
                basicObj.put("tweet_text", tweet.getText());
                if (mentioned.length > 0) {
                    // System.out.println("Mentioned length " + mentioned.length + " Mentioned: " + mentioned[0].getName());
                }
                try {
                    items.insert(basicObj);
                } catch (Exception e) {
                    System.out.println("MongoDB Connection Error : " + e.getMessage());
                    loadMenu();
                }
            }
            // Printing fetched records from DB.
            if (loadRecords) {
                getTweetsRecords();
            }
        } catch (TwitterException te) {
            System.out.println("te.getErrorCode() " + te.getErrorCode());
            System.out.println("te.getExceptionCode() " + te.getExceptionCode());
            System.out.println("te.getStatusCode() " + te.getStatusCode());
            if (te.getStatusCode() == 401) {
                System.out.println("Twitter Error : \nAuthentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect.\nEnsure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.");
            } else {
                System.out.println("Twitter Error : " + te.getMessage());
            }
            loadMenu();
        }
    } else {
        System.out.println("MongoDB is not Connected! Please check mongoDB instance running..");
    }
}
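As a side note on the insert loop above: within a single run, repeats can be skipped in the application before `insert` is ever called, by remembering which tweet IDs have already been handled. The following is only a minimal sketch of that idea. The class `SeenTweetFilter`, its method `isNew`, and the sample IDs are hypothetical and not part of the code above. It does nothing about tweets stored by an earlier run or by another process, so it is not a substitute for a constraint in the database itself.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of the original program): remembers the tweet
// IDs seen during this run so that repeats can be skipped before insert().
class SeenTweetFilter {
    private final Set<Long> seenIds = new HashSet<>();

    // Returns true the first time an ID is offered and false for every repeat.
    boolean isNew(long tweetId) {
        // Set.add returns true only if the element was not already present.
        return seenIds.add(tweetId);
    }

    public static void main(String[] args) {
        SeenTweetFilter filter = new SeenTweetFilter();
        long[] ids = {101L, 102L, 101L, 103L, 102L}; // made-up IDs for illustration
        int inserted = 0;
        for (long id : ids) {
            if (filter.isNew(id)) {
                inserted++; // the real loop would call items.insert(basicObj) here
            }
        }
        // prints "Would insert 3 of 5 tweets"
        System.out.println("Would insert " + inserted + " of " + ids.length + " tweets");
    }
}
```

In the loop above, this would amount to wrapping `items.insert(basicObj)` in `if (filter.isNew(tweet.getId())) { ... }`.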
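But as I mentioned before, the same tweets often come back, and they end up duplicated in the database. I think the tweet_ID field would be a good field to index, and it should be unique within the collection.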
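For illustration, here is a hedged sketch of what a unique index on `tweet_ID` could look like with the legacy `com.mongodb` driver API used above. The class `UniqueTweetStore` and the method `insertIfAbsent` are hypothetical names. The sketch assumes the write concern is acknowledged (the default when connecting through `MongoClient`), since otherwise the server does not report the duplicate-key error back to the driver. It needs the MongoDB Java driver on the classpath and a running server, so treat it as a starting point rather than a verified implementation.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoException;

// Hypothetical wrapper class; only the com.mongodb calls are driver API.
class UniqueTweetStore {
    // Error code the server reports when an insert violates a unique index.
    private static final int DUPLICATE_KEY = 11000;

    private final DBCollection items;

    UniqueTweetStore(DBCollection items) {
        this.items = items;
        // Ascending index on tweet_ID, with the "unique" option set so the
        // server rejects a second document carrying the same tweet_ID.
        DBObject keys = new BasicDBObject("tweet_ID", 1);
        DBObject options = new BasicDBObject("unique", true);
        items.createIndex(keys, options);
    }

    // Returns true if the tweet was stored, false if its tweet_ID already existed.
    boolean insertIfAbsent(DBObject tweet) {
        try {
            items.insert(tweet);
            return true;
        } catch (MongoException e) {
            if (e.getCode() == DUPLICATE_KEY) {
                return false; // duplicate tweet_ID: skip it quietly
            }
            throw e; // anything else is a real error
        }
    }
}
```

Building a unique index fails if the collection already contains duplicate `tweet_ID` values, so existing duplicates have to be removed first. The `dropDups` index option, which used to do this, is no longer available as of MongoDB 3.0.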
Or put ("dropDups", true) on the BasicDBObject you pass in. – evanchooly