This is a sample script for parsing HTML using Google Apps Script. When HTML data is converted to Google Document, the HTML data can be parsed and be converted to Google Document. In this case, the paragraphs, lists and tables are included. From this situation, I thought that this situation can be used for parsing HTML using Google Apps Script. So I could came up with this method.
In the Sheet API, the HTML data can be put to the Spreadsheet with the PasteDataRequest. But unfortunately, in this case, I couldn’t distinguish between the body and tables.
The flow of this method is as follows. In this sample script, the tables from HTML are retrieved.
Flow
- Retrieve HTML data using
UrlFetchApp.fetch()
. - Create new Google Document by converting HTML data to Google Document using Drive API.
- This is a temporal file.
- Retrieve all tables using Document service of Google Apps Script.
- Delete the temporal file.
Sample script
Before you run this script, please enable Drive API at Advanced Google Services.
function parseTablesFromHTML(url) {
var html = UrlFetchApp.fetch(url);
var docId = Drive.Files.insert(
{ title: "temporalDocument", mimeType: MimeType.GOOGLE_DOCS },
html.getBlob()
).id;
var tables = DocumentApp.openById(docId)
.getBody()
.getTables();
var res = tables.map(function(table) {
var values = [];
for (var row = 0; row < table.getNumRows(); row++) {
var temp = [];
var cols = table.getRow(row);
for (var col = 0; col < cols.getNumCells(); col++) {
temp.push(cols.getCell(col).getText());
}
values.push(temp);
}
return values;
});
Drive.Files.remove(docId);
return res;
}
// Please run this function.
function run() {
var url = "###"; // <--- Please set URL that you want to retrieve table.
var res = parseTablesFromHTML(url);
Logger.log(res);
}
Result
As a test case, when you set https://gist.github.com/tanaikech/f52e391b68473cbf6d4ab16108dcfbbb
to url
and run the script, the following result can be retrieved.
[
[
["head1_1", "head1_2", "head1_3\n"],
["value1_a1", "value1_b1", "value1_c1"],
["value1_a2", "value1_b2", "value1_c2"]
],
[
["head2_1", "head2_2", "head2_3\n"],
["value2_a1", "value2_b1", "value2_c1"],
["value2_a2", "value2_b2", "value2_c2"]
]
]
Note
- Using this method, all paragraphs and lists can be also retrieved.
- This method can be also used with other languages.