Resumable Conversion from CSV File with Large Size (> 50 MB) to Several Spreadsheets by Splitting File

Gists

Overview

This is a sample script which can achieve the resumable conversion from the large CSV-file to several spreadsheets by splitting the CSV file using Google Apps Script (GAS).

Description

Is there a situation that you want to convert a CSV file with the large size (> 50 MB) to Spreadsheet? When such large CSV file is converted to Spreadsheet, you will experience the error. The reason is the size and/or also it may be due to the total cells (> 2,000,000 cells) of CSV file. In this case, you will think that when the file is split, each file can be converted to Spreadsheet. But the blob which can be used at GAS is less than 50 MB (52,428,800 bytes). Ref1 So when you have a CSV file with more than 50 MB, it is not possible to split it. In this case, you can split such large file by files.get of Drive API. files.get of Drive API has a function of Partial download. Ref2 Using this, the file can be split. By the way, in my environment, when a CSV file with the size of 100 MB is used for this sample script, when the file is split by 10 MB, about 65 seconds was required to convert a chunk to Spreadsheet. In this case, when the CSV file is completely converted, it is considered that it will be over the limitation time (6 min) for executing GAS. So in this post, I would like to introduce a sample script which can achieve the resumable conversion from the large CSV-file to several spreadsheets.

I think that this method will apply to various scenes. If this is useful for your situation, I’m glad.

Sample script

In order to use this sample, please enable Drive API at Advanced Google Services and API console. And please modify obj in main() for your situation. A sample case is as follows.

It supposes as follows.

  • You want to convert the CSV file with the size of 100 MB to 10 spreadsheets.
  • Size of one chunk is 10 MB.
  • CSV file is processed by every 3.

In this sample case, each obj is as follows. Please input each obj at each run.

  1. var obj = {fileId: "#####", chunk: 10485760, files: 3, start: 0, fileName: "sample", number: 1}
    • {"nextStart": ### nextStart2 ###, "nextNumber": 4} is returned.
  2. var obj = {fileId: "#####", chunk: 10485760, files: 3, start: ### nextStart2 ###, fileName: "sample", number: 4}
    • {"nextStart": ### nextStart3 ###, "nextNumber": 7} is returned.
  3. var obj = {fileId: "#####", chunk: 10485760, files: 3, start: ### nextStart3 ###, fileName: "sample", number: 7}
    • {"nextStart": ### nextStart4 ###, "nextNumber": 10} is returned.
  4. var obj = {fileId: "#####", chunk: 10485760, files: 3, start: ### nextStart4 ###, fileName: "sample", number: 10}
    • null is returned.

By this flow, 10 spreadsheets are created from the CSV file with the size of 100 MB.

If null is used for files in obj, files is automatically calculated. But in this case, the limitation time for executing GAS may be over. Please be careful this.

In my environment, when I tried to convert a CSV file with the size of 75 Mb, 1,000,000 rows and 11 columns to 8 spreadsheets, 321 seconds are required.

function createSplitSpreadsheet(obj) {
  var accessToken = ScriptApp.getOAuthToken();
  var baseUrl = "https://www.googleapis.com/drive/v3/files/";

  // Retrieve file size.
  var url1 = baseUrl + obj.fileId + "?fields=size";
  var params1 = {
    method: "get",
    headers: {Authorization: "Bearer " + accessToken},
  };
  var fileSize = Number(JSON.parse(UrlFetchApp.fetch(url1, {headers: {Authorization: "Bearer " + accessToken}}).getContentText()).size);

  // Calculate number of output files.
  if (obj.files == null) {
    obj.number = 1;
    obj.start = 0;
  }
  var start = obj.start;
  var end = start + obj.chunk;
  var useFileSize = fileSize - start;
  f = Math.floor(useFileSize / obj.chunk);
  f = useFileSize % obj.chunk > 0 ? f + 1 : f;
  if (f < obj.files || obj.files == null) {
    obj.files = f;
  }

  // Split large file by chunk size (bytes).
  var url2 = baseUrl + obj.fileId + "?alt=media";
  var i;
  for (i = 0; i < obj.files; i++) {
    var params = {
      method: "get",
      headers: {
        Authorization: "Bearer " + accessToken,
        Range: "bytes=" + start + "-" + end,
      },
    };
    var res = UrlFetchApp.fetch(url2, params).getContentText();
    var e = res.lastIndexOf("\n");
    start += e + 1;
    end = start + obj.chunk;
    Drive.Files.insert(
      {mimeType: MimeType.GOOGLE_SHEETS, title: obj.fileName + (i + obj.number)},
      Utilities.newBlob(res.substr(0, e), MimeType.CSV)
    );
  }

  // Return next start value if there is a next chunk for the resume.
  if (start < fileSize) {
    return {nextStart: start, nextNumber: i + obj.number};
  } else {
    return null;
  }
}

// Please run this function.
function main() {
    var obj = {
        fileId: "#####", // File ID of the large CSV file.
        chunk: 10485760, // 10MB Please modify this for your situation.
        files: 3, // Please input the number of files you want to convert.
        start: 0,
        fileName: "sample",
        number: 1, // Counter of output files. Please input this as a next number.
    };
    var nextStart = createSplitSpreadsheet(obj);
    Logger.log(nextStart);
}

References

  1. Resumable Upload for Web Apps using Google Apps Script
  2. Partial download