Parsing HTML using Google Apps Script

Gists

This is a sample script for parsing HTML using Google Apps Script. When HTML data is converted to Google Document, the HTML data can be parsed and be converted to Google Document. In this case, the paragraphs, lists and tables are included. From this situation, I thought that this situation can be used for parsing HTML using Google Apps Script. So I could came up with this method.

In the Sheet API, the HTML data can be put to the Spreadsheet with the PasteDataRequest. But unfortunately, in this case, I couldn't distinguish between the body and tables.

The flow of this method is as follows. In this sample script, the tables from HTML are retrieved.

Flow

  1. Retrieve HTML data using UrlFetchApp.fetch().
  2. Create new Google Document by converting HTML data to Google Document using Drive API.
    • This is a temporal file.
  3. Retrieve all tables using Document service of Google Apps Script.
  4. Delete the temporal file.

Sample script

Before you run this script, please enable Drive API at Advanced Google Services.

function parseTablesFromHTML(url) {
  var html = UrlFetchApp.fetch(url);
  var docId = Drive.Files.insert(
    { title: "temporalDocument", mimeType: MimeType.GOOGLE_DOCS },
    html.getBlob()
  ).id;
  var tables = DocumentApp.openById(docId)
    .getBody()
    .getTables();
  var res = tables.map(function(table) {
    var values = [];
    for (var row = 0; row < table.getNumRows(); row++) {
      var temp = [];
      var cols = table.getRow(row);
      for (var col = 0; col < cols.getNumCells(); col++) {
        temp.push(cols.getCell(col).getText());
      }
      values.push(temp);
    }
    return values;
  });
  Drive.Files.remove(docId);
  return res;
}

// Please run this function.
function run() {
  var url = "###"; // <--- Please set URL that you want to retrieve table.
  var res = parseTablesFromHTML(url);
  Logger.log(res);
}

Result

As a test case, when you set https://gist.github.com/tanaikech/f52e391b68473cbf6d4ab16108dcfbbb to url and run the script, the following result can be retrieved.

[
  [
    ["head1_1", "head1_2", "head1_3\n"],
    ["value1_a1", "value1_b1", "value1_c1"],
    ["value1_a2", "value1_b2", "value1_c2"]
  ],
  [
    ["head2_1", "head2_2", "head2_3\n"],
    ["value2_a1", "value2_b1", "value2_c1"],
    ["value2_a2", "value2_b2", "value2_c2"]
  ]
]

Note

  • Using this method, all paragraphs and lists can be also retrieved.
  • This method can be also used with other languages.

References