Recursive Knowledge Crystallization: A Framework for Persistent Autonomous Agent Self-Evolution

fig1a

Abstract

In the development of autonomous agents using Large Language Models (LLMs), restrictions such as context window limits and session fragmentation pose significant barriers to the long-term accumulation of knowledge. This study proposes a “self-evolving framework” where an agent continuously records and refines its operational guidelines and technical knowledge—referred to as its SKILL—directly onto a local filesystem in a universally readable format (Markdown). By conducting experiments across two distinct environments featuring opaque constraints and complex legacy server rules using Google’s Antigravity and Gemini CLI, we demonstrate the efficacy of this framework. Our findings reveal that the agent effectively evolves its SKILL through iterative cycles of trial and error, ultimately saturating its learning. Furthermore, by transferring this evolved SKILL to a completely clean environment, we verify that the agent can successfully implement complete, flawless client applications in a single attempt (zero-shot generation). This methodology not only circumvents the limitations of short-term memory dependency but also pioneers a new paradigm for cross-environment knowledge portability and automated system analysis.

The following infographic summarizes this article.

fig1d

1. Introduction

One of the primary challenges in the deployment of autonomous AI agents is the persistence of learning. Traditionally, when an agent’s execution environment is reset or its context window is cleared, it is forced to relearn task specifications from scratch.

In the realm of autonomous agents, enabling long-term memory and self-correction has been a focal point of recent research. Notable frameworks such as Reflexion (Shinn et al., 2023) endow agents with dynamic memory to self-reflect and refine reasoning through trial and error. Similarly, Voyager (Wang et al., 2023) introduces an open-ended embodied agent in Minecraft that utilizes an ever-growing executable code skill library for lifelong learning. Another significant approach is MemGPT (Packer et al., 2023), which treats LLMs as operating systems, employing hierarchical memory management to bypass context window limitations.

However, these existing methodologies predominantly rely on in-memory contexts, specialized vector databases, or internal virtual memory abstractions. This makes the acquired knowledge difficult to decouple from the specific agent instance and challenging for human developers to read, audit, or directly manage.

In contrast, our proposed framework introduces a paradigm shift: the continuous self-rewriting of an agent’s “SKILL” residing entirely on a local filesystem (SKILL.md). By utilizing standard development toolchains (Antigravity and Gemini CLI), the agent investigates unknown environments and automatically maintains its own operational manual. This approach offers a distinct advantage—the physical persistence of knowledge in a universally readable Markdown format. It enables not only continuous learning but also “Zero-Shot Knowledge Transfer.” We empirically prove that once the SKILL evolves and saturates in one environment (Antigravity), it can be loaded into a completely clean environment (Gemini CLI) to synthesize complex code perfectly on the first attempt without any trial and error.
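
Concretely, the framework reduces to a small amount of glue: before every cycle, the persisted SKILL.md is read from the local filesystem and prepended to the task prompt. The following is a minimal sketch of that idea in Node.js; the file names and prompt wording are assumptions for illustration, not the internals of Antigravity or the Gemini CLI.

```javascript
// skill-loader.js -- minimal sketch of the framework's core input (assumed file
// names and prompt wording; this is not the Antigravity or Gemini CLI internals).
const fs = require("fs");

function buildPrompt(taskInstruction, skillPath = "SKILL.md") {
  // Start from an empty knowledge base if the SKILL file does not exist yet.
  const skill = fs.existsSync(skillPath) ? fs.readFileSync(skillPath, "utf8") : "";
  return [
    "Follow the accumulated knowledge below and update SKILL.md with anything new you learn.",
    "---- SKILL.md ----",
    skill,
    "---- TASK ----",
    taskInstruction,
  ].join("\n");
}

// The same fixed instruction is reused in every cycle of Experiment 1.
console.log(buildPrompt("Write client.js so that it satisfies all server rules and retrieves the flag."));
```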

2. Experimental Procedure

To demonstrate the continuous evolution of the SKILL and its application, we designed two distinct experiments. In both cases, Antigravity was utilized for the iterative evolution of the SKILL, followed by a demonstration using the Gemini CLI for zero-shot script completion.

2.1. Experiment 1: Blind Incremental Evolution Against the “Silent Bureaucrat” Server

  • Server Characteristics: The server enforces a highly restrictive, black-box validation process consisting of 20 sequential rules. Crucially, it possesses a hidden tolerance limit (5 errors per session). Once this limit is reached, it returns an absolute block (Limit reached.) and becomes completely silent—it provides no further hints, and neither logs nor returns the Session ID, enforcing “Absolute Silence.”
  • Client Goal: The agent must write a Node.js script (client.js) that satisfies all 20 validation rules in a single POST request to /api/submit and retrieves the hidden flag: "FLAG{INCREMENTAL_EVOLUTION_COMPLETE}" (a skeleton of this request is sketched after this list). The agent is strictly prohibited from changing its Session ID within a single cycle to bypass the block.
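
For concreteness, the skeleton below sketches the shape of the required request: a single POST to /api/submit that carries one fixed Session ID for the entire cycle. Apart from the endpoint and the flag, everything here (the port, the X-Session-Id header, the payload fields) is a hypothetical placeholder, because the 20 real rules are hidden and only emerge during the evolution cycles (see Section 3.3).

```javascript
// client.js -- skeleton sketch only. The endpoint and the flag come from the goal above;
// the port, the X-Session-Id header, and the payload fields are hypothetical placeholders.
const crypto = require("crypto");

const SESSION_ID = crypto.randomUUID(); // must stay fixed for the entire cycle

async function submit() {
  const res = await fetch("http://localhost:3000/api/submit", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Session-Id": SESSION_ID, // assumed transport for the Session ID
      // headers demanded by the hidden rules are added here as they are discovered
    },
    body: JSON.stringify({
      // payload fields demanded by the hidden rules go here
    }),
  });
  console.log(res.status, await res.json()); // FLAG{INCREMENTAL_EVOLUTION_COMPLETE} on success
}

submit().catch(console.error);
```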

Workflow

  1. Start the Server.
  2. Launch the Antigravity agent.
  3. Execute the prompt to develop the Client script.
  4. The AI agent updates the SKILL based on failures or newly discovered rules.
  5. The Client script is reset to its initial blank state (Tabula Rasa) while SKILL.md is left untouched; a sketch of this reset step follows the workflow. Keeping the exact same prompt, steps 3 and 4 are repeated. When the agent hits the “Limit reached” block, it is forced to terminate the cycle, but its SKILL persists. We performed 10 cycles, until the evolution of the SKILL saturated and success was achieved.
  6. The fully evolved SKILL is transferred to a clean environment. Using the Gemini CLI, we confirm that the Client script is completed perfectly in a single prompt (zero-shot).
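
The step that is easiest to get wrong is step 5: the client must be wiped while the SKILL survives. The sketch below automates only that bookkeeping (file names as used in this article; the agent itself is still driven interactively from Antigravity). Logging the SKILL.md size per cycle corresponds to the “Skill Size (Bytes)” column in Table 1.

```javascript
// reset-cycle.js -- sketch of the "Tabula Rasa" step between cycles (assumed file names).
// client.js is emptied, SKILL.md is deliberately left untouched, and its size is logged
// so that knowledge accumulation can be tracked (cf. the Skill Size column in Table 1).
const fs = require("fs");

function startNewCycle(cycle) {
  fs.writeFileSync("client.js", ""); // reset the client to a blank state
  const skillBytes = fs.existsSync("SKILL.md") ? fs.statSync("SKILL.md").size : 0;
  console.log(`Cycle ${cycle}: client.js reset, SKILL.md carried over (${skillBytes} bytes)`);
}

startNewCycle(6); // e.g., before re-running the identical prompt in Antigravity
```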

fig1b


2.2. Experiment 2: Chaos Server Constraints & Architectural Pattern Acquisition

  • Server Characteristics: This server mimics a deeply flawed legacy API environment (the “Chaos Server”). It contains implicit rules, semantic riddles (e.g., requiring the reversed string of ‘nimda’ or the Base64 encoding of the word ‘alpha’; both are decoded in the sketch after this list), and state-dependent limitations. If the agent makes too many requests within a timeframe (a cumulative limit of 8), a “Session Block” occurs, which can only be bypassed by dynamically generating unique identifiers.
  • Client Goal: The agent must develop a comprehensive, object-oriented SDK (client-app.js) capable of interacting with 5 distinct endpoints (connection check, user creation, secure data retrieval, config update, and system audit) and performing a stress test, overcoming all chaos rules programmatically.
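
Once decoded, the semantic riddles collapse into one-liners. The sketch below shows the two riddles named above; the mapping of the Base64 value to the X-Audit-Signature header is taken from the evolved SKILL described in Section 3.3 and is specific to this server.

```javascript
// Decoding the two riddles named above (plain Node.js, no extra dependencies).
const reversedAdmin = "nimda".split("").reverse().join("");     // -> "admin"
const auditSignature = Buffer.from("alpha").toString("base64"); // -> "YWxwaGE="

console.log(reversedAdmin, auditSignature);
// Per Section 3.3, the Base64 value is what the evolved SKILL sends as X-Audit-Signature.
```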

Workflow

  1. Start the Server.
  2. Launch the Antigravity agent.
  3. Execute the prompt to develop specific features of the Client script.
  4. The AI agent updates the SKILL, documenting both the discovered constraints and higher-level architectural implementation guidelines.
  5. The Client script is not reset. Instead, the prompt is progressively changed in each cycle to command the development of new features, building upon the existing script. This process continues until the planned evolution of the application is complete (5 cycles).
  6. The fully evolved SKILL is transferred to a clean environment. Using the Gemini CLI, the client-app.js is reset to a blank state, and a single, comprehensive prompt is given to build the entire SDK. We confirm that the complete application is generated flawlessly on the first attempt (zero-shot).

fig1c


3. Results and Discussions

3.1. Experiment 1 Results & Discussion

Table 1: Experiment 1 Data Summary

| Cycle | Success | Mod. Count | Interaction Count | Skill Size (Bytes) | State |
| --- | --- | --- | --- | --- | --- |
| 1 | False | 5 | 5 | 2666 | Exploration (Fail) |
| 2 | False | 6 | 6 | 2963 | Exploration (Fail) |
| 3 | False | 3 | 4 | 2963 | Exploration (Fail) |
| 4 | False | 4 | 5 | 3474 | Exploration (Fail) |
| 5 | True | 3 | 3 | 3654 | Breakthrough |
| 6 | True | 1 | 1 | 3654 | Convergence |
| 7 | True | 2 | 2 | 3654 | Stable |
| 8 | True | 2 | 2 | 3728 | Stable |
| 9 | True | 1 | 1 | 3728 | Full Convergence |
| 10 | True | 1 | 1 | 3728 | Full Convergence |

fig2a

Figure 1: The inverse relationship between trial-and-error counts and knowledge accumulation (SKILL size) in Experiment 1

Zero-Shot Verification: Utilizing the completely evolved SKILL from cycle 10, the Gemini CLI successfully generated the functional client code on the very first attempt.

fig2b

Figure 2: Zero-shot script completion in Gemini CLI using the evolved SKILL from Experiment 1.

Discussion:

In Experiment 1, the agent operated under a severe “blind” constraint. During cycles 1 through 4, the agent repeatedly hit the hidden error limit and was forcefully blocked. In traditional LLM agent execution, resetting the script here would mean starting from zero. However, because the agent accurately recorded the fragments of rules it discovered into SKILL.md before “dying” in each cycle, the subsequent generations inherited this knowledge. Cycle 5 represents the critical threshold where the accumulated knowledge was sufficient to bypass all 20 rules without hitting the error limit. The convergence to a modification count of 1 in later cycles, and the successful zero-shot execution in Gemini CLI, definitively prove that the agent effectively offloaded its working memory to the local filesystem, rendering the task independent of the model’s immediate context window.

3.2. Experiment 2 Results & Discussion

Table 2: Experiment 2 Data Summary

| Cycle | Success | Mod. Count | Interaction Count | Skill Size (Bytes) | Learning Content |
| --- | --- | --- | --- | --- | --- |
| 1 | True | 1 | 2 | 2013 | Identity, Content-Type identification |
| 2 | True | 3 | 4 | 2407 | Secure/Admin constraints (Riddles) |
| 3 | True | 2 | 1 | 2634 | Audit (Base64 signature) |
| 4 | True | 1 | 0 | 2634 | Avoid Session Block |
| 5 | True | 2 | 0 | 2957 | Refactoring & Guidelines |

fig3a

Figure 3: The learning curve for Experiment 2

Zero-Shot Verification: Utilizing the evolved SKILL from cycle 5, the Gemini CLI successfully built the entire, complex SDK from a blank file in a single pass without prior context.

fig3b

Figure 4: Zero-shot script completion in Gemini CLI using the evolved SKILL from Experiment 2.

Discussion:

While Experiment 1 showcased constraint discovery, Experiment 2 highlighted the agent’s capability for architectural self-organization. The server presented not just rigid rules, but semantic puzzles (e.g., reversing strings, encoding to Base64). The agent successfully decoded these and recorded the solutions. More importantly, when faced with the “Session Block” error in Cycle 4, the agent realized that static headers were insufficient. By Cycle 5, the agent did not just fix the error; it abstracted the solution into a “Centralized Header Factory” pattern. This transition from reactive error-fixing to proactive software architecture design is clearly reflected in the drop of interaction counts to 0 in later cycles. The zero-shot success in Gemini CLI demonstrates that the agent can extract best practices from messy legacy systems and output clean, robust SDKs instantly.
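
A minimal reconstruction of that “Centralized Header Factory” pattern is sketched below. It is assembled from the SKILL’s description (identity, dynamic session persistence, content negotiation, endpoint specialization) rather than copied from the agent’s client-app.js, and the concrete header names and endpoint paths are assumptions.

```javascript
// chaos-client.js -- sketch of the "Centralized Header Factory" pattern described
// in the evolved SKILL (reconstruction; header names and endpoint paths are assumptions).
const crypto = require("crypto");

class ChaosClient {
  constructor(baseUrl) {
    this.baseUrl = baseUrl;
    this.sessionId = crypto.randomUUID(); // dynamically generated identifier, used to avoid the Session Block
  }

  // Single factory: identity, session persistence, and content negotiation live in one place.
  buildHeaders(extra = {}) {
    return {
      "Content-Type": "application/json",
      "X-Session-Id": this.sessionId,
      ...extra, // endpoint-specific headers (e.g., an audit signature) are layered on top
    };
  }

  async request(path, { method = "GET", body, headers } = {}) {
    const res = await fetch(this.baseUrl + path, {
      method,
      headers: this.buildHeaders(headers),
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    return res.json();
  }
}

// Usage: every endpoint method delegates to request() and never builds headers itself, e.g.
// new ChaosClient("http://localhost:3000").request("/api/audit", {
//   headers: { "X-Audit-Signature": Buffer.from("alpha").toString("base64") },
// });
```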

3.3. Analysis of SKILL.md Evolution Before and After

Analyzing the physical changes in the SKILL.md files provides deep insight into the agent’s cognitive process. For the full, unredacted text of the SKILL.md files and client scripts before and after evolution for both experiments, please refer to Appendix A and Appendix B.

Experiment 1 (Silent Bureaucrat)

  • Before: The “Learned Server Specifications” section was entirely empty, simply stating - Status: Unknown.
  • After: The file grew to include a meticulously numbered list of 16 highly specific rules. Crucially, these rules were not just copied error messages. For example, the agent codified the requirements into executable logic: "- Rule 11: The header X-Anti-Robot must be set to Math.floor(timestamp / 1000)." and "- Rule 15: The header X-Final-Signature must be set to the SHA256 hash of the request_id string." This demonstrates the agent’s ability to translate abstract failures into actionable programmatic directives (a literal translation of these two rules into code is sketched below).
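
Translated literally, those two quoted rules become a few lines of Node.js. In the sketch below, the header names are the ones quoted from the SKILL; the source of request_id, the millisecond timestamp, and the hex digest encoding are assumptions.

```javascript
// Rules 11 and 15 from the evolved SKILL.md, translated literally into Node.js.
// (The request_id source, the millisecond timestamp, and the hex digest are assumptions.)
const crypto = require("crypto");

const requestId = crypto.randomUUID(); // assumed source of the request_id string
const timestamp = Date.now();          // milliseconds since the epoch

const headers = {
  "X-Anti-Robot": String(Math.floor(timestamp / 1000)),                              // Rule 11
  "X-Final-Signature": crypto.createHash("sha256").update(requestId).digest("hex"),  // Rule 15
};

console.log(headers);
```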

Experiment 2 (Chaos Server)

  • Before: The sections for “Known Constraints & Error Codes” and “Method Implementation Guidelines” were empty placeholders.
  • After: The “Constraints” section contained the decoded solutions to the server’s riddles (e.g., X-Audit-Signature as Base64 of ‘alpha’). The most profound evolution occurred in the “Guidelines” section. The agent autonomously authored a comprehensive design document advocating for a “Centralized Header Factory”, detailing how it should handle identity, dynamic session persistence, content negotiation, and endpoint specialization. The agent elevated its learning from “How to bypass an error” to “How to architect a resilient application.”

3.4. Comprehensive Discussion on Significance and Novelty

The synthesis of these two experiments reveals a highly potent framework for AI-driven software engineering. The traditional dependency on a single continuous context window is fundamentally fragile; when the session ends, the learning dies (Catastrophic forgetting).

By enforcing the physical writing of knowledge to a Markdown file (SKILL.md), we achieved persistent meta-learning. The significance of using Markdown over specialized vector databases is twofold:

  1. Human Readability and Auditability: Developers can read, review, and manually correct the SKILL.md file, fostering a true Human-AI collaborative environment.
  2. Cross-Platform Portability (Zero-Shot Knowledge Transfer): As demonstrated by successfully moving the evolved SKILL from Antigravity to Gemini CLI, knowledge is no longer locked into a specific agent instance. An agent can spend days investigating a system, and the resulting SKILL can be instantly deployed to a fleet of worker agents (see the sketch after this list), granting them immediate “senior-level” expertise without any required training or trial-and-error phases.
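
Operationally, this transfer is nothing more than a file copy. The sketch below (with hypothetical workspace paths) shows how one evolved SKILL.md could be fanned out to a fleet of fresh worker workspaces.

```javascript
// deploy-skill.js -- zero-shot knowledge transfer as a plain file copy
// (the worker workspace paths are hypothetical).
const fs = require("fs");
const path = require("path");

const workers = ["./worker-1", "./worker-2", "./worker-3"];

for (const dir of workers) {
  fs.mkdirSync(dir, { recursive: true });
  fs.copyFileSync("SKILL.md", path.join(dir, "SKILL.md")); // the "experience" is just a file
}
```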

4. Specific Applications Utilizing This Approach

Based on our findings, this framework can be highly impactful in the following real-world scenarios:

  1. Automated Reverse Engineering & SDK Generation: When integrating with undocumented or chaotic legacy API systems, an agent can be deployed to blindly probe the endpoints. It will autonomously decode the hidden rules, generating both a human-readable API specification (the SKILL) and robust client SDKs.
  2. Organizational Tacit Knowledge Documentation: In many organizations, “tribal knowledge” dictates how to deploy or test specific systems (e.g., “always wait 5 seconds before hitting this endpoint”). By having an agent execute tasks and fail, it can discover and formalize this tacit knowledge into universally accessible Markdown files, acting as an automated technical writer.
  3. Robust E2E Automated Test Suites: UI and API specifications change frequently, breaking tests. An agent equipped with this framework can dynamically update testing protocols in its SKILL as it encounters new failures, acting as a self-healing testing mechanism that minimizes maintenance overhead.
  4. AI Onboarding and Scaling: A mature SKILL file cultivated by an exploratory agent can be instantly copied to the local workspaces of newly deployed AI agents (or human engineers). This instantly transfers “experience,” allowing for rapid scaling of development resources.

5. Summary

  • Physical Knowledge Persistence: By enforcing the continuous rewriting of a universally readable Markdown SKILL file, agents effectively overcome the limitations and volatility of in-memory LLM context windows.
  • Mastery of Opaque Constraints: Agents proved highly capable of discovering and permanently adapting to hidden server limits, cryptographic requirements, and semantic riddles through systematic trial and error.
  • Architectural Abstraction: Beyond merely fixing localized errors, the agents demonstrated the capacity to synthesize overarching design patterns (e.g., Centralized Header Factories) and document them as future guidelines.
  • Zero-Shot Knowledge Transfer: A SKILL evolved in one execution environment (Antigravity) can be seamlessly imported into a completely clean, distinct environment (Gemini CLI) to generate flawless, complex applications on the very first prompt.
  • Human-AI Collaborative Paradigm: Because the agent’s memory resides in a standard file system, human developers maintain full visibility and control, able to read, audit, and augment the agent’s logic at any time.

6. Appendix

The sample scripts, SKILL.md, and prompts can be seen at https://gist.github.com/tanaikech/966f83cc438b6077b05b9843be09e930.
