2017-05-29 89 views
3

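I have a large number of files (> 1000) stored in an S3 bucket, and I would like to iterate over them (e.g. in a for loop) to extract data from them using boto3. How do I iterate over the files in an S3 bucket?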

However, I noticed that, according to http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects, the list_objects() method of the Client class lists at most 1,000 objects:

In [1]: import boto3 

In [2]: client = boto3.client('s3') 

In [11]: apks = client.list_objects(Bucket='iper-apks') 

In [16]: type(apks['Contents']) 
Out[16]: list 

In [17]: len(apks['Contents']) 
Out[17]: 1000 

However, I would like to list all of the objects, even if there are more than 1,000. How can I do this?

Answers

4

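As Kurt Peek notes, boto3 has a Paginator class, which lets you iterate over pages of S3 objects and can easily be used to provide an iterator over the items within those pages: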

import boto3 


def iterate_bucket_items(bucket): 
    """ 
    Generator that iterates over all objects in a given s3 bucket 

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2 
    for return data format 
    :param bucket: name of s3 bucket 
    :return: dict of metadata for an object 
    """ 


    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        # 'Contents' is absent from a page when no objects match (e.g. an empty bucket)
        for item in page.get('Contents', []):
            yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print(i)

which will output something like:

{u'ETag': '"a8a9ee11bd4766273ab4b54a0e97c589"', 
u'Key': '2017-06-01-10-17-57-EBDC490AD194E7BF', 
u'LastModified': datetime.datetime(2017, 6, 1, 10, 17, 58, tzinfo=tzutc()), 
u'Size': 242, 
u'StorageClass': 'STANDARD'} 
{u'ETag': '"03be0b66e34cbc4c037729691cd5efab"', 
u'Key': '2017-06-01-10-28-58-732EB022229AACF7', 
u'LastModified': datetime.datetime(2017, 6, 1, 10, 28, 59, tzinfo=tzutc()), 
u'Size': 238, 
u'StorageClass': 'STANDARD'} 
... 

Note that list_objects_v2 is recommended instead of list_objects: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

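You can also do this at a lower level by calling list_objects_v2() directly and passing the NextContinuationToken value from the response as ContinuationToken for as long as IsTruncated is true in the response.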
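The lower-level approach described above can be sketched roughly as follows. This is only an illustration: the function name iterate_bucket_items_manual is made up for this example, and it takes an already constructed client (e.g. from boto3.client('s3')) as an argument so that the loop itself stays independent of credentials.

```python
def iterate_bucket_items_manual(client, bucket):
    """
    Yield every object in a bucket by following continuation tokens by hand.

    `client` is assumed to be an S3 client such as boto3.client('s3');
    the function name is hypothetical and used only for this sketch.
    """
    kwargs = {'Bucket': bucket}
    while True:
        response = client.list_objects_v2(**kwargs)

        # 'Contents' is missing when a page holds no objects
        for item in response.get('Contents', []):
            yield item

        # Stop once S3 reports that the listing is complete
        if not response.get('IsTruncated'):
            break

        # Otherwise ask for the next page, starting where this one ended
        kwargs['ContinuationToken'] = response['NextContinuationToken']
```

With a real client this would be called as iterate_bucket_items_manual(boto3.client('s3'), 'my_bucket'). The paginator does the same bookkeeping for you, so this form is mainly useful when you need control over each individual request.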

1

I found that boto3 has a Paginator class to handle truncated results. The following worked for me:

paginator = client.get_paginator('list_objects') 
page_iterator = paginator.paginate(Bucket='iper-apks') 

after which I can use the page_iterator generator in a for loop.
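As a rough illustration of such a loop, the sketch below walks a page iterator and totals the object count and size across all pages. The helper name summarize_pages is made up for this example; it only assumes that each page is a dict shaped like a list_objects response, with an optional 'Contents' list whose entries carry a 'Size' field.

```python
def summarize_pages(page_iterator):
    """
    Return (object_count, total_bytes) across all pages.

    `page_iterator` is assumed to be what paginator.paginate(...) returns;
    the helper name is hypothetical and used only for this sketch.
    """
    object_count = 0
    total_bytes = 0
    for page in page_iterator:
        # A page without matching objects has no 'Contents' key
        contents = page.get('Contents', [])
        object_count += len(contents)
        total_bytes += sum(obj['Size'] for obj in contents)
    return object_count, total_bytes
```

Calling summarize_pages(page_iterator) with the iterator created above would then cover every object in the bucket, not just the first 1,000.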

+1

The documentation recommends using "list_objects_v2" for new development, see https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html. The paginator supports it as well, see http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Paginator.ListObjectsV2