Writing a large amount of data to Cassandra in a single query

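I have written a program that reads data from my Cassandra table and queries the Twitter API to get a user's followers and friends. I save all the IDs in a set, and once I have all the followers/friends I write them to Cassandra.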

The problem is that one of the users has 1M24 followers, and when I run this code the size of the set causes a write error in Cassandra.

def get_data(tweepy_function, author_id, author_username, session):
    if tweepy_function == "followers":
        followers = set()
        for follower_id in tweepy.Cursor(API.followers_ids, id=author_id, count=5000).items():
            if len(followers) % 5000 == 0 and len(followers) != 0:
                print("Collected followers: ", len(followers))
            followers.add(follower_id)
        # The whole set is written into a single collection column
        query = "INSERT INTO {0} (node_id, screen_name, centrality, follower_ids) VALUES ({1}, {2}, {3}, {4})"\
            .format("network", author_id, author_username, 0.0, followers)
        session.execute(query)
    if tweepy_function == "friends":
        friends = set()
        for friend_id in tweepy.Cursor(API.friends_ids, id=author_id, count=5000).items():
            if len(friends) % 5000 == 0 and len(friends) != 0:
                print("Collected friends: ", len(friends))
            friends.add(friend_id)
        query = "INSERT INTO {0} (node_id, screen_name, centrality, friend_ids) VALUES ({1}, {2}, {3}, {4})"\
            .format("network", author_id, author_username, 0.0, friends)
        session.execute(query)

As requested, here is my schema:

table = """CREATE TABLE IF NOT EXISTS 
        {0} (
         node_id bigint , 
         screen_name text, 
         last_tweets set<text>, 
         follower_ids set<bigint>, 
         friend_ids set<bigint>, 
         centrality float, 
         PRIMARY KEY (node_id)) 
         """.format(table_name)

Why am I getting a write error? How can I prevent it? Is this a good way to store the data in Cassandra?

Source

2017-03-05 mel

What is your schema? –

@AshrafulIslam Added it. – mel

You are using Cassandra's set collection type for follower_ids and friend_ids.

Limitations:

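The maximum size of an item in a collection is 64K or 2B, depending on the native protocol version.
Keep collections small to prevent delays during querying, because Cassandra reads a collection in its entirety. Collections are not paged internally; they are designed to store only a small amount of data.
Never insert more than 64K items into a collection. If you do, only 64K of them will be queryable, resulting in data loss.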
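To make the item limit concrete, here is a minimal Python sketch that checks whether a set of IDs would fit in one collection column. It assumes that "64K" means 65535 items and that "1M24" in the question means roughly 1.24 million followers; the name fits_in_collection is purely illustrative.

```python
# Assumption: the "64K" item limit quoted above is 65535 (2**16 - 1).
MAX_COLLECTION_ITEMS = 65535


def fits_in_collection(ids, limit=MAX_COLLECTION_ITEMS):
    """Return True if all ids could be kept in a single collection column."""
    return len(ids) <= limit


# Stand-in for the follower IDs of the problematic user
# (assumption: "1M24" means about 1.24 million).
followers = range(1_240_000)

print(fits_in_collection(followers))  # False
```

A follower list of that size is far beyond what a single set column is designed to hold, which is why a one-row-per-relation layout is suggested instead.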

You can use the following schema:

CREATE TABLE IF NOT EXISTS user (
    node_id bigint , 
    screen_name text, 
    last_tweets set<text>, 
    centrality float, 
    friend_follower_id bigint, 
    is_friend boolean, 
    is_follower boolean, 
    PRIMARY KEY ((node_id), friend_follower_id) 
);

Here friend_follower_id is either a friend ID or a follower ID. If it is a friend, set is_friend to true; if it is a follower, set is_follower to true.

Example:

If for node_id = 1 
    friend_ids = [10, 20, 30] 
    follower_ids = [11, 21, 31]

Then your insert queries would be:

INSERT INTO user(node_id , friend_follower_id , is_friend) VALUES (1, 10, true); 
INSERT INTO user(node_id , friend_follower_id , is_friend) VALUES (1, 20, true); 
INSERT INTO user(node_id , friend_follower_id , is_friend) VALUES (1, 30, true); 
INSERT INTO user(node_id , friend_follower_id , is_follower) VALUES (1, 11, true); 
INSERT INTO user(node_id , friend_follower_id , is_follower) VALUES (1, 21, true); 
INSERT INTO user(node_id , friend_follower_id , is_follower) VALUES (1, 31, true);
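The same inserts could be generated from Python roughly as follows. This is only a sketch: build_inserts and write_relations are hypothetical helper names, and session is assumed to be a session object from the DataStax Python driver, which accepts %s placeholders with a parameter tuple.

```python
INSERT_FRIEND = (
    "INSERT INTO user (node_id, friend_follower_id, is_friend) "
    "VALUES (%s, %s, %s)"
)
INSERT_FOLLOWER = (
    "INSERT INTO user (node_id, friend_follower_id, is_follower) "
    "VALUES (%s, %s, %s)"
)


def build_inserts(node_id, friend_ids, follower_ids):
    """Yield one (query, parameters) pair per friend or follower."""
    for friend_id in sorted(friend_ids):
        yield INSERT_FRIEND, (node_id, friend_id, True)
    for follower_id in sorted(follower_ids):
        yield INSERT_FOLLOWER, (node_id, follower_id, True)


def write_relations(session, node_id, friend_ids, follower_ids):
    """Write one small row per relation instead of one huge set."""
    # Assumption: `session` exposes execute(query, parameters),
    # as the DataStax Python driver's session does.
    for query, params in build_inserts(node_id, friend_ids, follower_ids):
        session.execute(query, params)
```

Because both kinds of insert share the primary key (node_id, friend_follower_id), an ID that is both a friend and a follower ends up as a single row with both flags set. For around a million rows, one synchronous execute call per row is likely to be slow; prepared statements or the driver's concurrent execution helpers may be worth a look.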

If you want to get all the friend IDs and follower IDs, query:

SELECT * FROM user WHERE node_id = 1;

You will get this:

 node_id | friend_follower_id | centrality | is_follower | is_friend | last_tweets | screen_name
---------+--------------------+------------+-------------+-----------+-------------+-------------
       1 |                 10 |       null |        null |      True |        null |        null
       1 |                 11 |       null |        True |      null |        null |        null
       1 |                 20 |       null |        null |      True |        null |        null
       1 |                 21 |       null |        True |      null |        null |        null
       1 |                 30 |       null |        null |      True |        null |        null
       1 |                 31 |       null |        True |      null |        null |        null
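If the application still needs the two ID sets, they can be rebuilt from the rows returned by the SELECT above. The sketch below assumes each row is a plain dict keyed by column name (the driver's default row type may differ), and split_relations is a hypothetical helper name.

```python
def split_relations(rows):
    """Split result rows into (friend_ids, follower_ids) sets."""
    friend_ids, follower_ids = set(), set()
    for row in rows:
        # A flag that was never written comes back as null (None in Python).
        if row["is_friend"]:
            friend_ids.add(row["friend_follower_id"])
        if row["is_follower"]:
            follower_ids.add(row["friend_follower_id"])
    return friend_ids, follower_ids


# Sample rows mirroring the query result shown above.
rows = [
    {"friend_follower_id": 10, "is_friend": True, "is_follower": None},
    {"friend_follower_id": 11, "is_friend": None, "is_follower": True},
    {"friend_follower_id": 20, "is_friend": True, "is_follower": None},
    {"friend_follower_id": 21, "is_friend": None, "is_follower": True},
    {"friend_follower_id": 30, "is_friend": True, "is_follower": None},
    {"friend_follower_id": 31, "is_friend": None, "is_follower": True},
]

friend_ids, follower_ids = split_relations(rows)
print(sorted(friend_ids))    # [10, 20, 30]
print(sorted(follower_ids))  # [11, 21, 31]
```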

Sources:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_collections_c.html
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html

Source

2017-03-06 04:34:16

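I don't quite understand the new schema you are proposing. In my follower_ids set I store the IDs of everyone who follows the author, and in my friend_ids set I do the same for the people the author follows. So in the end I have two sets of IDs. – mel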

@mel Details added to my answer. –

Thanks for the update. Is this the best way to store the collections? – mel
