25 Mar 2024, 10:54

PythonTips

python / Gemini / AI / LLM

Abstract

The Gemini API unlocks potential for diverse applications but requires consistent output formatting. This report proposes a method using question phrasing and API calls to craft a bespoke output, enabling seamless integration with user applications. Examples include data categorization and obtaining multiple response options.

Introduction

With the release of Gemini as an API on Vertex AI and Google AI Studio, a world of possibilities has opened up. The Gemini API significantly expands the potential of various scripting languages and paves the way for diverse applications. However, using the Gemini API smoothly requires consistent output formatting, which can be tricky because the format of a response depends on the specific question asked.

To address this challenge, this report explores a method for achieving consistent output formatting. The findings indicate that output formats can be controlled by:

The question itself: By carefully crafting the question, users can influence the structure and presentation of the response.
Function calling: Function declarations passed to the API provide additional control over the format of the output. My separate report on function calling may be useful for understanding it.

This report introduces the following two applications of this method.

Categorizing data.
Obtaining multiple identified responses.

By strategically combining question phrasing and function calls, users can obtain outputs that seamlessly integrate into their applications.

Preparation

To test the following scripts, please create an API key at https://makersuite.google.com/app/apikey and enable the Generative Language API in the API console. This API key is used in the sample scripts. The official documentation is also available for reference.

The following scripts are written in Python, so please install the Google AI Python SDK and the Python Client for the Generative Language API.

1. Categorization

This section introduces sample scripts for the first application of controlling output formats: categorizing words with the Gemini API. The script takes each entry in words and assigns it one of the values in categories. This kind of categorization can be applied to various situations, such as flexible labeling of emails and management of documents, images, and videos.

1.1 By only question

In this sample, the output format is controlled only by the question. This is the fundamental approach. The question is "Select the related category of '{word}' from the following categories of {c}. Return the selected category as a word with lowercase letter." The script sends one request per word in a loop.

import google.generativeai as genai

# Please set your API key
apiKey = "###"

# Please set words you want to categorize.
words = ["apple", "sushi", "orange", "spinach", "toyota lexus"]

# Please set your categories.
categories = ["fruit", "vegetable", "vehicle", "human", "animal", "other"]

genai.configure(api_key=apiKey)
model = genai.GenerativeModel("gemini-pro")
c = ", ".join(categories)
for word in words:
    question = f"Select the related category of '{word}' from the following categories of {c}. Return the selected category as a word with lowercase letter."
    response = model.generate_content(question)
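    res = "Return no value."
    # Guard against an empty candidate list or a candidate without parts.
    if response.candidates and response.candidates[0].content.parts:
        res = response.candidates[0].content.parts[0].text.strip()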
    print(f"{word} ---> {res}")

When this script is run, the following result is obtained. In this case, only the single word of the selected category is returned for each input.

apple ---> fruit
sushi ---> other
orange ---> fruit
spinach ---> vegetable
toyota lexus ---> vehicle

When I tested this script, there were cases where multiple categories were returned in a single response.
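Because a question-only prompt can occasionally return more than one category, it may help to validate the reply before using it. The following is a minimal sketch, not part of the original script: the helper name normalize_category and the choice to fall back to "other" on an ambiguous reply are assumptions made for illustration. It only handles single-word category names.

```python
import re


def normalize_category(raw_text, categories, fallback="other"):
    """Map a free-text model reply onto exactly one allowed category.

    Hypothetical helper: returns `fallback` when the reply names no
    category or more than one category.
    """
    # Lowercase the reply and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    # Keep the allowed categories that appear as whole tokens.
    matches = [c for c in categories if c in tokens]
    # Accept the reply only when it is unambiguous.
    if len(matches) == 1:
        return matches[0]
    return fallback


categories = ["fruit", "vegetable", "vehicle", "human", "animal", "other"]
print(normalize_category("Fruit.", categories))
print(normalize_category("fruit, vegetable", categories))
```

In the loop of the script above, res could be passed through this helper before printing, so that an ambiguous reply never reaches downstream code unchecked.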

1.2 By function calling

In this sample, the output format is controlled by the question and function calling. This is the new approach. The question is as simple as "Select the category of {word}." Each category has a description and is declared as a function whose name is the category name. When this script is run, the Gemini API selects one of the functions, so the category can be read from the name of the selected function. Because the function name is always returned in a fixed format, the output format stays consistent. The script sends one request per word in a loop.

import google.generativeai as genai
import google.ai.generativelanguage as glm

# Please set your API key
apiKey = "###"

# Please set words you want to categorize.
words = ["apple", "sushi", "orange", "spinach", "toyota lexus"]

# Please set your categories including the descriptions.
categories = [
    {
        "category": "fruit",
        "description": "Sweet or savory, nature's lure for animals: packages hiding plant babies for a tasty journey, spreading life with every bite.",
    },
    {
        "category": "vegetable",
        "description": "Plant's colorful bounty, not sweet treats, but packed with vitamins. Below ground or bursting bright, they fuel growth and meals, not the seeds themselves, but vital for their survival. ",
    },
    {
        "category": "vehicle",
        "description": "Wheeled contraption, powered beast, or sky serpent - carries people and things across land, water, or air. Used for travel, transport, even exploration. Makes journeys faster, farther, sometimes even fun!",
    },
    {
        "category": "human",
        "description": "Two-legged tool builders, masters of complex languages, driven by emotions and stories. They reshape landscapes, explore vast unknowns, and yearn for meaning amidst fleeting lives.",
    },
    {
        "category": "animal",
        "description": "Living, breathing machines with diverse shapes and sizes. They move, eat, grow, and reproduce, driven by instincts and senses. From soaring flyers to slithering dwellers, they weave an intricate web of life, shaping ecosystems and captivating imaginations.",
    },
    {
        "category": "other",
        "description": "Except for fruit, vegetable, vehicle, human, and animal",
    },
]

function_declarations = [
    glm.FunctionDeclaration(name=e["category"], description=e["description"])
    for e in categories
]
genai.configure(api_key=apiKey)
tool = glm.Tool(function_declarations=function_declarations)
model = genai.GenerativeModel("gemini-pro", tools=[tool])
for word in words:
    question = f"Select the category of {word}."
    response = model.generate_content(question)
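    # The name of the function selected by the model is the category name.
    # It may be empty when the model replies with text instead of a function call.
    res = response.candidates[0].content.parts[0].function_call.name or "Return no value."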
    print(f"{word} ---> {res}")

When this script is run, the following result is obtained. In this case, the category is read from the selected function name, so only the category word is returned.

apple ---> fruit
sushi ---> other
orange ---> fruit
spinach ---> vegetable
toyota lexus ---> vehicle

When I tested this script, there were cases where the category selected for sushi was wrong. The description of other may need to be revised.
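Since the selected function name is the only value the script relies on, one option is to read it defensively and accept it only when it is a known category. The following is a sketch under stated assumptions: selected_category and fake_response are hypothetical names, and the stub objects merely mimic the attribute path used in the script above (candidates[0].content.parts[0].function_call.name) rather than the real SDK classes.

```python
from types import SimpleNamespace


def selected_category(response, allowed, fallback="other"):
    """Return the first function-call name that is an allowed category.

    Hypothetical helper: it walks the same attribute path as the script
    above but tolerates missing candidates, parts, or function calls.
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return fallback
    content = getattr(candidates[0], "content", None)
    parts = getattr(content, "parts", None) or []
    for part in parts:
        call = getattr(part, "function_call", None)
        name = getattr(call, "name", "") if call is not None else ""
        if name in allowed:
            return name
    return fallback


def fake_response(name):
    """Build a stand-in object with the same attribute layout (for illustration only)."""
    part = SimpleNamespace(function_call=SimpleNamespace(name=name))
    content = SimpleNamespace(parts=[part])
    return SimpleNamespace(candidates=[SimpleNamespace(content=content)])


allowed = ["fruit", "vegetable", "vehicle", "human", "animal", "other"]
print(selected_category(fake_response("fruit"), allowed))
print(selected_category(fake_response(""), allowed))
```

Falling back to "other" is a design choice for this sketch; raising an exception instead may be preferable when a silent default would hide a bad reply.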

2. Multiple responses

This section shows the second application of controlling output formats: obtaining multiple identified responses with the Gemini API. The sample scripts take multiple words as input and retrieve a description for each. This method retrieves multiple identified responses in a single API call, which reduces the processing cost.

2.1 By only question

In this sample, the output format is controlled only by the question. This is the fundamental approach. The question is "Explain the following words within 50 words for each word. Return them as a JSON object that the key and the value are the word and the explanation, respectively." All words are sent in one API call. The second sentence of the question controls the output format: because each explanation is keyed by its word, the multiple responses can be identified.

import google.generativeai as genai
import re
import json

# Please set your API key
apiKey = "###"

# Please set words you want to explain.
words = ["fruit", "vegetable", "vehicle", "human", "animal", "other"]

question = "Explain the following words within 50 words for each word. Return them as a JSON object that the key and the value are the word and the explanation, respectively."

c = ", ".join(words)
genai.configure(api_key=apiKey)
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(f"{question} {c}")
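# The reply may wrap the JSON in Markdown, so extract the outermost {...} span before parsing.
text = response.candidates[0].content.parts[0].text
obj = json.loads(re.search(r"\{[\s\S]*\}", text).group(0))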
for word in words:
    print(f"{word} ---> {obj.get(word, '')}")

When this script is run, the value of response.candidates[0].content.parts[0].text is as follows. The JSON data is returned in Markdown format, so the script extracts the object with a regular expression before parsing it.

{
  "fruit": "A sweet and fleshy part of a plant that contains seeds.",
  "vegetable": "A plant or part of a plant that is used as food, typically as part of a meal.",
  "vehicle": "A machine that is used to transport people or goods.",
  "human": "A member of the species Homo sapiens, characterized by intelligence, culture, and the ability to walk upright.",
  "animal": "A multicellular organism that typically has a nervous system, an internal skeleton, and the ability to move about.",
  "other": "Anything that is not included in the previous categories."
}

When this JSON data is parsed, the following result is obtained.

fruit ---> A sweet and fleshy part of a plant that contains seeds.
vegetable ---> A plant or part of a plant that is used as food, typically as part of a meal.
vehicle ---> A machine that is used to transport people or goods.
human ---> A member of the species Homo sapiens, characterized by intelligence, culture, and the ability to walk upright.
animal ---> A multicellular organism that typically has a nervous system, an internal skeleton, and the ability to move about.
other ---> Anything that is not included in the previous categories.
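The extraction step above fails with an exception when the reply contains no JSON object or contains malformed JSON. The following is a minimal sketch of a more tolerant version; extract_json_object is a hypothetical helper name, and returning an empty dict on failure is an assumption made for illustration. The sample reply is made up to show a Markdown-wrapped object.

```python
import json
import re


def extract_json_object(text):
    """Return the JSON object embedded in `text`, or {} if none can be parsed."""
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{[\s\S]*\}", text)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    # Only a JSON object (dict) is useful for the key lookup that follows.
    return parsed if isinstance(parsed, dict) else {}


# A made-up reply wrapped in a Markdown code fence.
fence = "`" * 3
reply = fence + 'json\n{"fruit": "A sweet part of a plant."}\n' + fence
obj = extract_json_object(reply)
for word in ["fruit", "vehicle"]:
    print(f"{word} ---> {obj.get(word, '')}")
```

With this helper, a missing key and an unparsable reply both end up as an empty string in the loop, so the caller may want to log the raw text when the returned dict is empty.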

2.2 By function calling

In this sample, the output format is controlled by the question and function calling. This is the new approach. The question is as simple as "Explain the following words within 50 words for each word. Return them as an object using function calling." A function is declared whose properties use the words as keys, each with the description "Explanation of {w} within 50 words." When this script is run, the Gemini API puts the generated explanation of each word into the value of the corresponding property, and the explanations can be obtained from the arguments of the function call in the response. All words are sent in one API call, and the multiple responses can be identified by their keys.

import google.generativeai as genai
import google.ai.generativelanguage as glm

# Please set your API key
apiKey = "###"

# Please set words you want to explain.
words = ["fruit", "vegetable", "vehicle", "human", "animal", "other"]

question = "Explain the following words within 50 words for each word. Return them as an object using function calling."

genai.configure(api_key=apiKey)
tool = glm.Tool(
    function_declarations=[
        glm.FunctionDeclaration(
            name="results",
            description="Put all answers in each property in an object.",
            parameters=glm.Schema(
                type=glm.Type.OBJECT,
                properties={w: glm.Schema(type=glm.Type.STRING, description=f"Explanation of {w} within 50 words.") for w in words},
                required=words,
            ),
        )
    ]
)
model = genai.GenerativeModel("gemini-pro", tools=[tool])
messages = [{"role": "user", "parts": [question, *words]}]
response = model.generate_content(messages)
obj = response.candidates[0].content.parts[0].function_call.args
for w in words:
    print(f"{w} ---> {obj[w]}")

When this script is run, the following result is obtained.

fruit ---> Fruit is a sweet and fleshy part of a plant that is used as food.
vegetable ---> Vegetable is an edible plant or plant part.
vehicle ---> Vehicle is a machine that is used to transport people or goods.
human ---> Human is a member of the species Homo sapiens, a highly intelligent, social primate.
animal ---> Animal is a multicellular organism that is able to move, breathe, reproduce, and sense its environment.
other ---> Other is something that is different or unique from the rest.

Retrieving multiple answers in a single API call reduces the processing cost. Compared with iterating through the words individually, the proposed methods show an estimated 80% reduction in processing cost.
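For longer word lists, a single call may not be practical, and the words could be sent in batches instead. The sketch below is an illustration only: chunk_words and call_reduction are hypothetical helpers, and call_reduction counts API calls, which is not the same as the measured processing cost estimated above.

```python
import math


def chunk_words(words, batch_size):
    """Split `words` into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [words[i:i + batch_size] for i in range(0, len(words), batch_size)]


def call_reduction(n_words, batch_size):
    """Fraction of API calls saved compared with one call per word."""
    if n_words == 0:
        return 0.0
    batched_calls = math.ceil(n_words / batch_size)
    return 1 - batched_calls / n_words


words = ["fruit", "vegetable", "vehicle", "human", "animal", "other"]
for batch in chunk_words(words, 4):
    # Each batch would be sent in one API call, as in the scripts above.
    print(batch)
print(call_reduction(len(words), 4))
```

Smaller batches trade some of the saving for shorter prompts and shorter replies, which may matter when the model has output length limits.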

This sample uses text. The same method can also be used to create descriptions for images at a low processing cost, which should be useful for many users. However, at the current stage, function calling cannot be used with models/gemini-pro-vision. When function calling is attempted with models/gemini-pro-vision, an error like "Function calling is not enabled for models/gemini-pro-vision" occurs. For this reason, the following sample script uses the approach of section “2.1 By only question”.

import google.generativeai as genai
import re
import json
import glob
import os
from PIL import Image

# Please set your API key
apiKey = "###"

# Prepare images.
image_dir = "###" # Please set the path of a directory including the image files.

images_list = glob.glob(os.path.join(image_dir, "*.jpg"))
obj = []
for p in images_list:
    obj.append({"filename": os.path.split(p)[1], "image": Image.open(p)})
obj = sorted(obj, key=lambda x: x["filename"])
images = [o["image"] for o in obj]
keys = [o["filename"] for o in obj]

filenames = ",".join(keys)
question = f"Explain the images of {filenames} within 50 words for each image. Images of {filenames} are following images in order. Return them as a JSON object that the key and the value are the filename and the explanation, respectively."
# Put a text label before each image so the model can match filenames to images.
imagess = []
for key, image in zip(keys, images):
    imagess.extend([f"Next image is {key}.", image])
genai.configure(api_key=apiKey)
model = genai.GenerativeModel("gemini-pro-vision")
response = model.generate_content([f"{question} Order of filenames is {filenames}.", *imagess])
print(response.candidates[0].content.parts[0].text)
obj = json.loads(re.search(r"\{[\s\S]*\}", response.candidates[0].content.parts[0].text).group(0))
for word in keys:
    print(f"{word} ---> {obj.get(word, '')}")

When the sample images (generated by Gemini, https://gemini.google.com/app) are used, the following result is obtained.

image1.jpg ---> A painting of three dogs running in the snow. There is one husky, one alaskan malamute, and one kleiner munsterlander.
image2.jpg ---> A painting of four cats on a table.
image3.jpg ---> A photograph of three vintage motorcycles parked on a grassy field.
image4.jpg ---> A painting of two cats playing around two parked cars.
image5.jpg ---> A photograph of two red apples and two oranges.

This result was obtained with one API call, and each description corresponds to the correct image. In this way, the processing cost can be reduced.
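The image script depends on two things: each image is preceded by a text label naming its file, and every filename comes back as a key in the JSON. The following is a minimal sketch of both steps. build_labeled_parts and missing_keys are hypothetical helper names, and plain strings stand in for the PIL images so that the example runs without any image files.

```python
def build_labeled_parts(prompt, filenames, images):
    """Return [prompt, label1, image1, label2, image2, ...] for one request."""
    if len(filenames) != len(images):
        raise ValueError("filenames and images must have the same length")
    parts = [prompt]
    for name, image in zip(filenames, images):
        # The label tells the model which filename the next image belongs to.
        parts.append(f"Next image is {name}.")
        parts.append(image)
    return parts


def missing_keys(expected, returned):
    """List the expected filenames that are absent from the parsed JSON object."""
    return [key for key in expected if key not in returned]


# Placeholder strings are used here instead of PIL.Image objects.
filenames = ["image1.jpg", "image2.jpg"]
images = ["<image 1>", "<image 2>"]
parts = build_labeled_parts("Explain each image.", filenames, images)
print(parts)
print(missing_keys(filenames, {"image1.jpg": "A painting of three dogs."}))
```

Checking for missing keys after parsing makes it easier to notice when the model skipped an image or renamed a key, in which case only the missing files need to be sent again.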

Summary

This report demonstrates that the output formats of the Gemini API can be controlled through question phrasing and function calling, and that this enables applications such as categorizing data and obtaining multiple identified responses in one API call. These findings suggest that the Gemini API has the potential to significantly impact the industry and pave the way for innovative breakthroughs.

Note

These sample scripts omit comprehensive error handling in order to keep the samples simple. To use them in your own environment, please add robust error handling.
The above illustration was created by Gemini from the abstract.

Crafting Bespoke Output Formats with Gemini API