Semantic Search using Corpus of Gemini API with Google Apps Script

Description

Currently, v1beta of the Gemini API supports corpora. When corpora are used, the stored values can be retrieved with semantic search. At this stage, up to 5 corpora can be created in a single project, and each corpus can hold up to 10,000 documents and 1,000,000 chunks. In this report, I would like to introduce a method for achieving semantic search using corpora with Google Apps Script.

Usage

In order to use corpora, the access token must include the scope https://www.googleapis.com/auth/generative-language.retriever.
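As a rough illustration of the scope requirement, the following sketch checks a manifest (appsscript.json) written as a plain object. When scopes are declared explicitly in the manifest, the scopes detected automatically are no longer added, so the scope used by UrlFetchApp is assumed to be needed as well. The helper name findMissingScopes and the sample manifest values are hypothetical and are not part of the original scripts.

```javascript
// Scopes that the scripts in this report are assumed to need in appsscript.json.
const REQUIRED_SCOPES = [
  // Required for the corpora endpoints (stated in the text above).
  "https://www.googleapis.com/auth/generative-language.retriever",
  // Required for UrlFetchApp.fetch when scopes are declared explicitly.
  "https://www.googleapis.com/auth/script.external_request",
];

// Hypothetical helper: returns the required scopes that a manifest object lacks.
function findMissingScopes(manifest) {
  const declared = new Set(manifest.oauthScopes || []);
  return REQUIRED_SCOPES.filter((scope) => !declared.has(scope));
}

// Sample manifest content, written as an object only for illustration.
const sampleManifest = {
  timeZone: "Asia/Tokyo",
  runtimeVersion: "V8",
  oauthScopes: ["https://www.googleapis.com/auth/script.external_request"],
};

// Logs the scopes that still have to be added to "oauthScopes".
console.log(findMissingScopes(sampleManifest));
```

If the returned array is empty, the manifest already declares both scopes.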

1. Create a Google Apps Script project

Please create a standalone Google Apps Script project. The script can also be used as a container-bound script.

Then, please open the script editor of the Google Apps Script project.

2. Linking Google Cloud Platform Project to Google Apps Script Project for New IDE

The procedure for this is described in my repository.

Also, please enable the Generative Language API in the API console.

3. Create corpus

In order to create a new corpus, please run the following Google Apps Script. The official document is here.

Please copy and paste the following script.

function sample1() {
  const payload = { displayName: "sample name" }; // Please set your sample name.
  const url = `https://generativelanguage.googleapis.com/v1beta/corpora`;
  const res = UrlFetchApp.fetch(url, {
    headers: { authorization: "Bearer " + ScriptApp.getOAuthToken() },
    payload: JSON.stringify(payload),
    contentType: "application/json",
  });
  console.log(res.getContentText());
}

When this script is run, the following result is returned.

{
  "name": "corpora/#####",
  "displayName": "sample name",
  "createTime": "2024-01-30T00:00:00.000000Z",
  "updateTime": "2024-01-30T00:00:00.000000Z"
}
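The scripts in this report print the response as it is. As a minimal sketch of how a failed request could be detected, the helper below converts a status code and a response body into a parsed object or an Error. The name parseApiResponse is hypothetical, and the error body shape { error: { message, status } } is an assumption based on the usual Google API error format.

```javascript
// Hypothetical helper: returns the parsed body for a 2xx response,
// otherwise throws an Error that carries the message from the API.
function parseApiResponse(statusCode, bodyText) {
  let body;
  try {
    body = JSON.parse(bodyText);
  } catch (e) {
    throw new Error(`HTTP ${statusCode}: response is not JSON: ${bodyText}`);
  }
  if (statusCode >= 200 && statusCode < 300) {
    return body;
  }
  // Assumption: errors look like { error: { code, message, status } }.
  const { message = "Unknown error", status = "UNKNOWN" } =
    (body && body.error) || {};
  throw new Error(`HTTP ${statusCode} (${status}): ${message}`);
}

// In Apps Script, this could be combined with "muteHttpExceptions" as follows
// (shown as comments because UrlFetchApp exists only in Apps Script):
// const res = UrlFetchApp.fetch(url, { ...params, muteHttpExceptions: true });
// const obj = parseApiResponse(res.getResponseCode(), res.getContentText());
```

With muteHttpExceptions set to true, UrlFetchApp.fetch returns the error response instead of throwing, so the status code and the message can be inspected in one place.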

4. Create a document in the corpus

In order to create a new document in the created corpus, please run the following Google Apps Script. The official document is here.

Please copy and paste the following script.

function sample2() {
  const parent = "corpora/###"; // This is from the returned value of "sample1".
  const payload = { displayName: "sample document name" }; // Please set your sample document name.

  const url = `https://generativelanguage.googleapis.com/v1beta/${parent}/documents`;
  const res = UrlFetchApp.fetch(url, {
    headers: { authorization: "Bearer " + ScriptApp.getOAuthToken() },
    payload: JSON.stringify(payload),
    contentType: "application/json",
  });
  console.log(res.getContentText());
}

When this script is run, the following result is returned.

{
  "name": "corpora/###/documents/###",
  "displayName": "sample document name",
  "updateTime": "2024-01-30T00:01:00.000000Z",
  "createTime": "2024-01-30T00:01:00.000000Z"
}

5. Put chunks into the document

In order to put chunks into the created document, please run the following Google Apps Script. The official document is here.

Please copy and paste the following script.

function sample3() {
  const documentName = "corpora/###/documents/###"; // This is from the returned value of "sample2".

  // These are sample values.
  // Ref: https://medium.com/google-cloud/categorization-using-gemini-pro-api-with-google-apps-script-804df0101161
  const categories = [
    {
      category: "fruit",
      description:
        "Nature's candy! Seeds' sweet ride to spread, bursting with colors, sugars, and vitamins. Fuel for us, future for plants. Deliciously vital!",
    },
    {
      category: "vegetable",
      description:
        "Not just leaves! Veggies sprout from roots, stems, flowers, and even bulbs. Packed with vitamins, minerals, and fiber galore, they fuel our bodies and keep us wanting more.",
    },
    {
      category: "vehicle",
      description:
        "Metal chariots or whirring steeds, gliding on land, skimming seas, piercing clouds. Carrying souls near and far, vehicles weave paths for dreams and scars.",
    },
    {
      category: "human",
      description:
        "Walking contradictions, minds aflame, built for laughter, prone to shame. Woven from stardust, shaped by clay, seeking answers, paving the way.",
    },
    {
      category: "animal",
      description:
        "Sentient dance beneath the sun, from buzzing flies to whales that run. Flesh and feather, scale and claw, weaving instincts in nature's law. ",
    },
    {
      category: "other",
      description: "Except for fruit, vegetable, vehicle, human, and animal",
    },
  ];

  const requests = categories.map(({ category, description }) => ({
    parent: documentName,
    chunk: {
      data: { stringValue: description },
      customMetadata: [{ key: "category", stringValue: category }],
    },
  }));
  const payload = { requests };
  const url = `https://generativelanguage.googleapis.com/v1beta/${documentName}/chunks:batchCreate`;
  const res = UrlFetchApp.fetch(url, {
    headers: { authorization: "Bearer " + ScriptApp.getOAuthToken() },
    payload: JSON.stringify(payload),
    contentType: "application/json",
  });
  console.log(res.getContentText());
}
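The sample above sends all 6 chunks in a single request. For larger data, the requests may need to be split over several calls. The following is a minimal sketch of that splitting, assuming an upper limit of 100 chunks per chunks:batchCreate request. Please confirm the actual limit in the official document. The names MAX_CHUNKS_PER_BATCH and splitIntoBatches are hypothetical.

```javascript
// Assumed upper limit of chunks per "chunks:batchCreate" request.
const MAX_CHUNKS_PER_BATCH = 100;

// Hypothetical helper: splits the array of requests into arrays of at most "size" items.
function splitIntoBatches(requests, size = MAX_CHUNKS_PER_BATCH) {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < requests.length; i += size) {
    batches.push(requests.slice(i, i + size));
  }
  return batches;
}

// Dummy requests only for illustration. Each item has the same shape as in "sample3".
const dummyRequests = Array.from({ length: 250 }, (_, i) => ({
  parent: "corpora/###/documents/###",
  chunk: { data: { stringValue: `sample text ${i}` } },
}));

// Each batch would be sent as the payload { requests: batch }.
console.log(splitIntoBatches(dummyRequests).map((batch) => batch.length));
```

In "sample3", the single fetch would then be replaced by a loop over the batches, each with its own payload.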

6. Semantic search from a document in a corpus

In order to run a semantic search against a document in a corpus, please run the following Google Apps Script. The official document is here.

Please copy and paste the following script.

function sample4() {
  // These are sample search texts.
  // Ref: https://medium.com/google-cloud/categorization-using-gemini-pro-api-with-google-apps-script-804df0101161
  const searchTexts = ["penguin", "sushi", "orange", "spinach", "toyota lexus"];

  const documentName = "corpora/###/documents/###"; // This is from the returned value of "sample2".

  const res = searchTexts.map((searchText) => {
    const payload = { query: searchText, resultsCount: 5 };
    const url = `https://generativelanguage.googleapis.com/v1beta/${documentName}:query`;
    const response = UrlFetchApp.fetch(url, {
      headers: { authorization: "Bearer " + ScriptApp.getOAuthToken() },
      payload: JSON.stringify(payload),
      contentType: "application/json",
    });
    const obj = JSON.parse(response.getContentText());
    // "relevantChunks" may be absent when nothing matches, so check it before reading the length.
    if (obj.relevantChunks && obj.relevantChunks.length > 0) {
      // The 1st chunk is used as the best match.
      return {
        searchText,
        category: obj.relevantChunks[0].chunk.customMetadata[0].stringValue,
      };
    }
    return "No response.";
  });
  console.log(res);
}

When this script is run, the following result is returned.

[
  { "searchText": "penguin", "category": "animal" },
  { "searchText": "sushi", "category": "other" },
  { "searchText": "orange", "category": "other" },
  { "searchText": "spinach", "category": "vegetable" },
  { "searchText": "toyota lexus", "category": "vehicle" }
]

In the case of my previous report “Categorization using Gemini Pro API with Google Apps Script”, the following result was obtained.

[
  { "searchText": "penguin", "category": "animal" },
  { "searchText": "sushi", "category": "other" },
  { "searchText": "orange", "category": "fruit" },
  { "searchText": "spinach", "category": "vegetable" },
  { "searchText": "toyota lexus", "category": "vehicle" }
]

I guess that the difference for "orange" between { "searchText": "orange", "category": "other" } in this report and { "searchText": "orange", "category": "fruit" } in the previous report is due to the description texts stored in the chunks. In the case of { "searchText": "orange", "category": "other" }, the 1st and 2nd search results were other and fruit, respectively.
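In order to inspect more than the 1st result, the response can be listed with the relevance of each chunk. The following is a minimal sketch under the assumption that each item of relevantChunks has the properties chunk and chunkRelevanceScore. The name rankCategories is hypothetical, and the scores in the mock response are made-up values, not actual results.

```javascript
// Hypothetical helper: lists the category and the relevance score of each
// returned chunk, in the order in which the API returned them.
function rankCategories(queryResponse) {
  return (queryResponse.relevantChunks || []).map(
    ({ chunk, chunkRelevanceScore }) => {
      const meta = (chunk.customMetadata || []).find(
        ({ key }) => key === "category"
      );
      return {
        category: meta ? meta.stringValue : null,
        score: chunkRelevanceScore,
      };
    }
  );
}

// Mock response only for illustration. The scores are made-up values.
const mockResponse = {
  relevantChunks: [
    {
      chunkRelevanceScore: 0.61,
      chunk: { customMetadata: [{ key: "category", stringValue: "other" }] },
    },
    {
      chunkRelevanceScore: 0.59,
      chunk: { customMetadata: [{ key: "category", stringValue: "fruit" }] },
    },
  ],
};

console.log(rankCategories(mockResponse));
```

When the scores of the 1st and 2nd results are close, as in this mock, the category of the 1st result alone might not be a reliable answer.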

In this sample, the texts are searched within a single document. The texts can also be searched across a whole corpus.
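As a minimal sketch, the request for both cases can be built by changing only the resource name in the URL: "corpora/###/documents/###" for a document and "corpora/###" for a corpus. The names BASE_URL and buildQueryRequest and the resource name "corpora/sample-corpus" are hypothetical. The access token is passed as an argument here, and in Apps Script it would come from ScriptApp.getOAuthToken().

```javascript
const BASE_URL = "https://generativelanguage.googleapis.com/v1beta";

// Hypothetical helper: builds the URL and the UrlFetchApp-style parameters
// for a ":query" request against a corpus or a document.
function buildQueryRequest(name, query, resultsCount, accessToken) {
  return {
    url: `${BASE_URL}/${name}:query`,
    params: {
      method: "post",
      headers: { authorization: "Bearer " + accessToken },
      payload: JSON.stringify({ query, resultsCount }),
      contentType: "application/json",
    },
  };
}

// Search across a whole corpus instead of a single document.
const corpusRequest = buildQueryRequest("corpora/sample-corpus", "orange", 5, "dummy-token");
console.log(corpusRequest.url);

// In Apps Script, the request could be sent as follows (not run here):
// const res = UrlFetchApp.fetch(corpusRequest.url, corpusRequest.params);
```

The rest of "sample4" is assumed to work unchanged, because the response is read through the same relevantChunks property.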

I hope that a future update will allow corpora to be used with text generation.
