Download a large file from blob storage and split it into 100 MB chunks

2017-06-06 · 45 views

I have created a 2GB file in blob storage, and I'm building a console application that downloads that file to the desktop. The requirement is to split it into 100MB chunks and append a number to each file name. I don't need to recombine the files afterwards; all I need are the chunks.

I currently have the code below, adapted from Azure download blob part, but I can't figure out how to stop the download once the file reaches 100MB and start a new one.

Any help would be appreciated.

Update: here is my code

CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
var blobClient = account.CreateCloudBlobClient();
var container = blobClient.GetContainerReference(containerName);
var file = uri;
var blob = container.GetBlockBlobReference(file);
// First fetch the size of the blob; we use it to create an empty file of the same size.
blob.FetchAttributes();
var blobSize = blob.Properties.Length;
long blockSize = 1 * 1024 * 1024; // 1 MB chunk
blockSize = Math.Min(blobSize, blockSize);
// Create an empty file of the blob's size.
using (FileStream fs = new FileStream(file, FileMode.Create))
{
    fs.SetLength(blobSize);
}
var blobRequestOptions = new BlobRequestOptions
{
    RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(5), 3),
    MaximumExecutionTime = TimeSpan.FromMinutes(60),
    ServerTimeout = TimeSpan.FromMinutes(60)
};
long currentPointer = 0;
long bytesRemaining = blobSize;
do
{
    var bytesToFetch = Math.Min(blockSize, bytesRemaining);
    using (MemoryStream ms = new MemoryStream())
    {
        // Download the next range (1 MB by default).
        blob.DownloadRangeToStream(ms, currentPointer, bytesToFetch, null, blobRequestOptions);
        var contents = ms.ToArray();
        using (var fs = new FileStream(file, FileMode.Open)) // Reopen the target file.
        {
            fs.Position = currentPointer; // Seek to where the previous chunk ended.
            fs.Write(contents, 0, contents.Length);
        }
        currentPointer += contents.Length; // Advance the write position.
        bytesRemaining -= contents.Length; // Update the bytes left to fetch.

        Console.WriteLine(fileName + dateTimeStamp + ".csv " + (currentPointer / 1024 / 1024) + "/" + (blob.Properties.Length / 1024 / 1024) + " MB downloaded...");
    }
}
while (bytesRemaining > 0);
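One way to get numbered 100MB parts out of the loop above is to compute the chunk boundaries up front and open a new output file per chunk. Below is a minimal sketch of the boundary computation; the ComputeChunks helper and the naming scheme are my own invention, not part of the Azure SDK. Each resulting (offset, length) pair can then be passed to DownloadRangeToStream, writing into that part's own FileStream instead of one big file:

```csharp
using System;
using System.Collections.Generic;

class ChunkPlanner
{
    // Hypothetical helper: split a blob of totalSize bytes into parts of at
    // most chunkSize bytes, returning (offset, length, fileName) per part.
    public static List<(long Offset, long Length, string FileName)> ComputeChunks(
        long totalSize, long chunkSize, string baseName, string extension)
    {
        var parts = new List<(long Offset, long Length, string FileName)>();
        int index = 1;
        for (long offset = 0; offset < totalSize; offset += chunkSize)
        {
            long length = Math.Min(chunkSize, totalSize - offset);
            parts.Add((offset, length, $"{baseName}_{index}{extension}"));
            index++;
        }
        return parts;
    }
}
```

For a 2GB blob with 100MB chunks this yields 21 parts (20 full chunks plus a 48MB remainder), and the existing DownloadRangeToStream loop can be reused per part without any reassembly step.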
Can you share your actual code? –

Added the code as requested. – Joseph

Did you ever solve this, or do you need further help? –

Answer

As I understand it, you can break your blob file into the pieces you expect (100MB each), then leverage CloudBlockBlob.DownloadRangeToStream to download each chunk of the file. Here is a code snippet you can refer to:

ParallelDownloadBlob

private static void ParallelDownloadBlob(Stream outPutStream, CloudBlockBlob blob, long startRange, long endRange)
{
    blob.FetchAttributes();
    int bufferLength = 1 * 1024 * 1024; // 1 MB chunk for download
    long blobRemainingLength = endRange - startRange;
    Queue<KeyValuePair<long, long>> queues = new Queue<KeyValuePair<long, long>>();
    long offset = startRange;
    while (blobRemainingLength > 0)
    {
        long chunkLength = Math.Min(bufferLength, blobRemainingLength);
        queues.Enqueue(new KeyValuePair<long, long>(offset, chunkLength));
        offset += chunkLength;
        blobRemainingLength -= chunkLength;
    }
    Parallel.ForEach(queues,
        new ParallelOptions()
        {
            MaxDegreeOfParallelism = 5
        }, (queue) =>
        {
            using (var ms = new MemoryStream())
            {
                blob.DownloadRangeToStream(ms, queue.Key, queue.Value);
                lock (outPutStream)
                {
                    outPutStream.Position = queue.Key - startRange;
                    var bytes = ms.ToArray();
                    outPutStream.Write(bytes, 0, bytes.Length);
                }
            }
        });
}

Program Main

var container = storageAccount.CreateCloudBlobClient().GetContainerReference(defaultContainerName);
var blob = container.GetBlockBlobReference("code.txt");
blob.FetchAttributes();
long blobTotalLength = blob.Properties.Length;
long chunkLength = 10 * 1024; // divide the blob into files of 10 KB each
for (long i = 0; i < blobTotalLength; i += chunkLength)
{
    long startRange = i;
    long endRange = (i + chunkLength) > blobTotalLength ? blobTotalLength : (i + chunkLength);

    using (var fs = new FileStream(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, $"resources\\code_[{startRange}]_[{endRange}].txt"), FileMode.Create))
    {
        Console.WriteLine($"\nParallelDownloadBlob from range [{startRange}] to [{endRange}] start...");
        Stopwatch sp = new Stopwatch();
        sp.Start();

        ParallelDownloadBlob(fs, blob, startRange, endRange);
        sp.Stop();
        Console.WriteLine($"download done, time cost: {sp.ElapsedMilliseconds / 1000.0}s");
    }
}

Result: (screenshots in original post)

UPDATE:

Based on your requirement, I suggest you download your blob into a single file, then leverage LumenWorks.Framework.IO to read your large file record by record, check the number of bytes you have read, and save the records into a new csv file once its size reaches 100MB. Here is a code snippet you can refer to:

using (CsvReader csv = new CsvReader(new StreamReader("data.csv"), true)) 
{ 
    int fieldCount = csv.FieldCount; 
    string[] headers = csv.GetFieldHeaders(); 
    while (csv.ReadNextRecord()) 
    { 
     for (int i = 0; i < fieldCount; i++) 
      Console.Write(string.Format("{0} = {1};", 
          headers[i], 
          csv[i] == null ? "MISSING" : csv[i])); 
     //TODO: 
     //1.Read the current record, check the total bytes you have read; 
     //2.Create a new csv file if the current total bytes up to 100MB, then save the current record to the current CSV file. 
    } 
} 

Also, you can refer to A Fast CSV Reader and CsvHelper for more details.

UPDATE2

Here is a code sample for breaking a large CSV file into smaller CSV files of a fixed byte size; I use CsvHelper 2.16.3 in the following snippet, which you can refer to:

string[] headers = new string[0]; 
using (var sr = new StreamReader(@"C:\Users\v-brucch\Desktop\BlobHourMetrics.csv")) //83.9KB 
{ 
    using (CsvHelper.CsvReader csvReader = new CsvHelper.CsvReader(sr, 
     new CsvHelper.Configuration.CsvConfiguration() 
     { 
      Delimiter = ",", 
      Encoding = Encoding.UTF8 
     })) 
    { 
     //check header 
     if (csvReader.ReadHeader()) 
     { 
      headers = csvReader.FieldHeaders; 
     } 

     TextWriter writer = null; 
     CsvWriter csvWriter = null; 
     long readBytesCount = 0; 
     long chunkSize = 30 * 1024; //divide CSV file into each CSV file with byte size up to 30KB 

     while (csvReader.Read()) 
     { 
      var curRecord = csvReader.CurrentRecord; 
      var curRecordByteCount = curRecord.Sum(r => Encoding.UTF8.GetByteCount(r)) + headers.Count() + 1; 
      readBytesCount += curRecordByteCount; 

      //check bytes you have read 
      if (writer == null || readBytesCount > chunkSize) 
      { 
       readBytesCount = curRecordByteCount + headers.Sum(h => Encoding.UTF8.GetByteCount(h)) + headers.Count() + 1; 
       if (writer != null) 
       { 
        writer.Flush(); 
        writer.Close(); 
       } 
       string fileName = $"BlobHourMetrics_{Guid.NewGuid()}.csv"; 
       writer = new StreamWriter(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, fileName), true); 
       csvWriter = new CsvWriter(writer); 
       csvWriter.Configuration.Encoding = Encoding.UTF8; 
       //output header field 
       foreach (var header in headers) 
       { 
        csvWriter.WriteField(header); 
       } 
       csvWriter.NextRecord(); 
      } 
      //output record field 
      foreach (var field in curRecord) 
      { 
       csvWriter.WriteField(field); 
      } 
      csvWriter.NextRecord(); 
     } 
     if (writer != null) 
     { 
      writer.Flush(); 
      writer.Close(); 
     } 
    } 
} 
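The rollover check above hinges on estimating each record's on-disk size. Here is a self-contained sketch of that estimate; the helper name is my own, and the count is approximate since it ignores any quoting or escaping CsvHelper may add around fields:

```csharp
using System;
using System.Linq;
using System.Text;

static class CsvSizeEstimator
{
    // Approximate on-disk byte size of one CSV record: UTF-8 bytes of each
    // field, plus one comma between fields, plus a CRLF line ending.
    public static long EstimateRecordBytes(string[] fields) =>
        fields.Sum(f => (long)Encoding.UTF8.GetByteCount(f))
        + Math.Max(fields.Length - 1, 0)   // commas
        + 2;                               // "\r\n"
}
```

Accumulating this per record and rolling over to a new file only between records keeps each record whole, which matters for the splitting behavior discussed in the comments below.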

Result: (screenshot in original post)

Tried this. It "breaks" the record on the last row of a file and continues it in the next file. That shouldn't happen. E.g. on the last row it will write the first three of seven columns, then continue writing the remaining four columns on the first row of the next file. – Joseph

As you described, splitting your blob file into fixed sizes is bound to break your records. Could you provide the structure of your blob records? –

There are seven columns in the table. It would be better if you had a complete code snippet using LumenWorks. :) – Joseph
