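Split a file into no more than N chunks with a minimum chunk length

I have a file, and suppose I need to split it into at most N smaller files. The smallest chunk should be at least X bytes long, and all the files should have (almost) the same size.

So, for example, the string 'ABCDEFGHIJ' with N = 4 and X = 3 would return ['ABCD', 'EFG', 'HIJ'] because: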

3 chunks < 4 chunks 
4 chars > 3 chars

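I wrote a split function, but it sometimes produces an extra string, so I should probably pass in the value of x instead of calculating it there.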

def split(string, n): 
    x = len(string)//n 
    return [string[i:i+x] for i in range(0, len(string), x)]
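
To see where the extra string comes from, here is a small check of the function above. The sample inputs are only an illustration and are not from the original post: whenever len(string) is not a multiple of n, the leftover characters spill into one or more additional slices.

```python
def split(string, n):
    # Same logic as the function above: fixed chunk size x, stepping by x.
    x = len(string) // n
    return [string[i:i + x] for i in range(0, len(string), x)]

exact = split('ABCDEFGHI', 3)    # 9 characters, divisible by 3
uneven = split('ABCDEFGHIJ', 3)  # 10 characters, one left over

print(exact)   # ['ABC', 'DEF', 'GHI']
print(uneven)  # ['ABC', 'DEF', 'GHI', 'J']
```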

The real problem is how to calculate the number of chunks to cut the file into, given a minimum number of bytes per chunk.

def calculate(length, max_n, min_x): 
    n, x = ... 
    return n, x

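Is there a simple, well-known algorithm for this kind of thing?

Actually, the files don't need to differ by at most 1 byte, because I want to maximize the number of chunks.

Source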

2012-09-10 user1661233

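Your use of 'N' is inconsistent. Is it the number of characters or the number of files?

@gnibbler N is the number of files (chunks). X is the number of bytes (characters). – user1661233

But the title says N is the number of characters.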

def calculate(L, N, X): 
    n = min(L//X, N) 
    return n, L//n
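
As a quick check of the arithmetic, here is the same calculation applied to the question's example (L = 10, N = 4, X = 3) and two other inputs. The guard for L < X is my own addition, an assumption about how that case should be handled; without it n would be 0 and L // n would raise ZeroDivisionError.

```python
def calculate(L, N, X):
    # Assumption (not in the answer): reject inputs too short for one chunk.
    if L < X:
        raise ValueError("length is smaller than the minimum chunk size")
    n = min(L // X, N)   # as many chunks as allowed, each at least X long
    return n, L // n     # chunk count and base chunk length

print(calculate(10, 4, 3))    # (3, 3)
print(calculate(26, 4, 7))    # (3, 8)
print(calculate(100, 4, 3))   # (4, 25)
```

The second value is the base length; L % n of the chunks will be one element longer.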

Edit:

def spread(seq, N=None, X=1): 
    """Yield successive subsequences of seq having at least X elements. 

    If N is specified, the number of subsequences yielded will not exceed N. 

    The first L % n subsequences yielded (where L = len(seq) and n is the
    number of subsequences) will be longer by 1 than the remaining ones.

    >>> list(spread('abcdefghij', 4, 3)) 
    ['abcd', 'efg', 'hij'] 
    >>> list(spread('abcdefghijklmnopqrstuvwxyz', 4, 7)) 
    ['abcdefghi', 'jklmnopqr', 'stuvwxyz'] 

    seq any object supporting len(...) and slice-indexing 
    N  a positive integer (default: L) 
    X  a positive integer not greater than L (default: 1) 
    """ 

    # All error-checking code omitted 

    L = len(seq)  # length of seq 
    assert 0 < X <= L 

    if N is None: N = L 
    assert 0 < N 

    # A total of n subsequences will be yielded, the first r of which will 
    # have length x + 1, and the remaining ones will have length x. 

    # if we insist on using calculate()... 
    # n, x = calculate(L, N, X) 
    # r = L % n 

    # ...but this entails separate computations of L//n and L%n; may as well 
    # do both with a single divmod(L, n) 
    n = min(L//X, N) 
    x, r = divmod(L, n) 

    start = 0
    stride = x + 1  # stride will revert to x when i == r
    for i in range(n):
        if i == r:
            stride = x
        finish = start + stride
        yield seq[start:finish]
        start = finish
    assert start == L

Source

2012-09-11 13:33:22 kjo

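I think you're the first one to understand the problem. I like that this function finds the right 'N' (but not the right 'X'), so I only need to use the split function that @JohnGainesJr wrote. – user1661233

OK, I fixed a bug in the original implementation of 'calculate'. Maybe this version produces the 'x' you want. – kjo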

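What do you mean by "cut the file with the fewest bytes"? Either you haven't fully explained the problem, or there is no unique solution.

As your solution suggests, this is a division problem: if L is the total length, you can split it into n chunks for any n ≤ L. The remainder L % n (necessarily less than n) gives the number of chunks that will be one character longer than the others. For example, 10 % 3 = 1, so in your example one of the three chunks is longer. But you could just as well split 10 into 7 chunks (10 % 7 leaves remainder 3), three of which are longer (length 2 instead of 1). Or just 10 chunks of length 1, if you really want to "maximize the number of chunks", as you wrote.

More generally: for any minimum length m you specify, choose N = L // m and every chunk will be at least m long, with length L // N or L // N + 1 (exactly m throughout if L divides evenly by m). As I said, it's just a division problem.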
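
The arithmetic in this answer can be sketched as a small helper. The name chunk_lengths is hypothetical and does not appear in the thread; it only lists the chunk lengths that a split into n pieces would produce.

```python
def chunk_lengths(L, n):
    """Return the lengths of n nearly equal chunks covering L items."""
    base, remainder = divmod(L, n)
    # The first `remainder` chunks get one extra item.
    return [base + 1] * remainder + [base] * (n - remainder)

print(chunk_lengths(10, 3))    # [4, 3, 3]
print(chunk_lengths(10, 7))    # [2, 2, 2, 1, 1, 1, 1]
print(chunk_lengths(10, 10))   # ten chunks of length 1
```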

Source

2012-09-10 22:37:12 alexis

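Yes, it has no unique solution, because there is an open interval of choices. When that happens, I want the largest possible number of chunks. – user1661233

The largest possible number of chunks is always to split a string of length 'L' into 'L' chunks of length 1. – alexis

But I don't want the chunks to be that small. Say I want them to be at least 10 bytes long. So I will have to reduce the number of pieces. – user1661233

I don't know about simple or well-known, but this seems to do the trick. It returns N strings and hands the extra characters to the earlier strings in the set.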

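import itertools as it

s = 'abcdefhijklm'

def hunks(s, n):
    # size is the base chunk length; extra chunks get one more character
    size, extra = divmod(len(s), n)
    i = 0
    # yields 1 for the first `extra` chunks, then 0 forever
    extras = it.chain(it.repeat(1, extra), it.repeat(0))
    while i < len(s):
        e = next(extras)
        yield s[i:i + size + e]
        i += size + e

list(hunks(s, 4))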

Source

2012-09-10 22:38:22

But how can I set a minimum length for each piece? I want to reduce the number of pieces if they become too small. – user1661233
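
One possible way to add a minimum length is to cap the chunk count before splitting, borrowing the n = min(L // X, N) idea from the other answer. The function name hunks_min and its parameter names are hypothetical; this is only a sketch under that assumption, not something posted in the thread.

```python
def hunks_min(s, max_n, min_len):
    """Yield at most max_n nearly equal slices of s, each >= min_len long."""
    if max_n < 1 or not 0 < min_len <= len(s):
        raise ValueError("need max_n >= 1 and 1 <= min_len <= len(s)")
    # Use fewer chunks when max_n chunks would be shorter than min_len.
    n = min(len(s) // min_len, max_n)
    size, extra = divmod(len(s), n)
    start = 0
    for i in range(n):
        # The first `extra` chunks are one element longer.
        end = start + size + (1 if i < extra else 0)
        yield s[start:end]
        start = end

print(list(hunks_min('abcdefghijklm', 4, 3)))  # ['abcd', 'efg', 'hij', 'klm']
print(list(hunks_min('abcdefghijklm', 4, 5)))  # ['abcdefg', 'hijklm']
```

With a minimum of 5 the 13-character string can no longer fill four chunks, so the count drops to two.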
