Memento-Skills: A framework for AI agents to rewrite their own executable skills without retraining the base LLM

This article was generated by AI and cites original sources.

On April 8, 2026, VentureBeat reported on Memento-Skills, a new research framework that targets a recurring deployment bottleneck for autonomous AI agents: how to let an agent adapt to new environments and tasks without retraining the underlying large language model (LLM). Instead of updating model weights, the system builds an evolving external memory made of executable skill artifacts that the agent can update over time. The reported work frames skill updates as an iterative control loop—querying a specialized skill router, executing the selected skill, reflecting on outcomes, and then rewriting the stored skill artifacts when executions fail.

Why self-evolving agents hit a wall after deployment

The source describes a core limitation of “frozen” LLMs: once deployed, their parameters remain fixed, restricting performance to knowledge encoded during training and whatever fits in the model’s immediate context window. For autonomous agents, this creates pressure to build adaptation mechanisms that can respond to changes in tasks, tools, or environment behavior without repeatedly retraining the base model.

VentureBeat notes that many existing approaches to agent adaptation rely on manually designed skills to handle new tasks. Some automatic skill-learning methods exist, but the source characterizes them as producing text-only guides that effectively amount to prompt optimization, or as logging single-task trajectories that don’t transfer well across different tasks. In other words, the system may learn “what to say” for one scenario, but it may not produce reusable, executable capabilities that generalize.

The article also highlights a technical mismatch in many retrieval-augmented generation (RAG) systems. Standard retrieval often uses semantic similarity—for example, dense embeddings or other similarity-based routers. But the source argues that similarity does not guarantee behavioral utility when skills are represented as executable artifacts like markdown documents or code snippets. VentureBeat gives an example of a failure mode: a standard RAG system might retrieve a “password reset” script when a query is about “refund processing” simply because the enterprise documents share terminology.
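The retrieval trap the source describes can be reproduced with a toy bag-of-words similarity ranker. Everything below is an illustrative sketch, not the framework's code: the skill names, descriptions, and query are invented to show how shared enterprise terminology ("customer", "account", "portal") can outrank actual intent ("refund").

```python
from collections import Counter
import math

def bag(text: str) -> Counter:
    """Bag-of-words vector for a short text."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical skill descriptions that share enterprise terminology.
skills = {
    "password_reset": "reset user account credentials via the customer account portal",
    "refund_processing": "issue a payment refund through the billing system",
}

query = "customer cannot access account portal after refund request"
ranked = sorted(skills, key=lambda s: cosine(bag(query), bag(skills[s])), reverse=True)
print(ranked[0])  # → 'password_reset': surface word overlap beats task intent
```

A similarity score here says nothing about whether executing the retrieved script would actually resolve the refund task, which is exactly the gap the source says behavioral routing is meant to close.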

In this framing, the key technical challenge is not only learning new skills, but selecting the right executable skill based on how it will perform, rather than selecting the most semantically similar artifact.

Memento-Skills: external memory made of structured, executable skill artifacts

To address these issues, the researchers behind Memento-Skills describe the system as “a generalist, continually-learnable LLM agent system that functions as an agent-designing agent” (quoted from the source). The design replaces passive conversation logging with a persistent, evolving library of skills that functions as external memory.

According to the source, skills are stored as structured markdown files, and each reusable skill artifact contains three core elements:

(1) Declarative specifications describing what the skill is and how it should be used.

(2) Specialized instructions and prompts that guide the language model’s reasoning.

(3) Executable code and helper scripts that the agent runs to solve the task.
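The three elements above can be pictured as a small container that renders to a structured markdown file. This is a minimal sketch: the field names, section headings, and `to_markdown` layout are assumptions for illustration, not the framework's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SkillArtifact:
    """Illustrative container for the three reported elements of a skill
    (names and layout here are assumptions, not the real schema)."""
    specification: str   # (1) declarative description: what the skill is, when to use it
    instructions: str    # (2) prompts that guide the language model's reasoning
    code: str            # (3) executable helper code the agent runs

    def to_markdown(self) -> str:
        """Render as a structured markdown file, as the source says skills are stored."""
        return (
            f"# Specification\n{self.specification}\n\n"
            f"# Instructions\n{self.instructions}\n\n"
            f"# Code\n{self.code}\n"
        )

skill = SkillArtifact(
    specification="Search the web and summarize the top result.",
    instructions="Prefer primary sources; cite the URL in the answer.",
    code="def run(query): ...",
)
```

Storing skills in a human-readable format like markdown also means the agent's rewrites remain inspectable, which matters for the governance concerns discussed later in the article.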

This matters because it reframes “learning” as updating executable tools and instructions that can be reused across tasks, rather than only changing what the base LLM says in a given prompt.

The source further explains that Memento-Skills implements continual learning through a mechanism called “Read-Write Reflective Learning”. Instead of updating memory by simply appending logs, the system treats memory updates as active policy iteration. When facing a new task, the agent queries a specialized skill router to retrieve the most behaviorally relevant skill (not merely the most semantically similar one) and then executes it.

After execution, the system closes the learning loop through reflection. If the execution fails, an orchestrator evaluates the execution trace and rewrites the skill artifacts, patching code or prompts for the specific failure mode. Where needed, it can create an entirely new skill.
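One iteration of that read-execute-reflect-write loop can be sketched as follows. The callables stand in for the router, executor, and orchestrator; all interfaces are hypothetical, and new-skill creation is omitted for brevity.

```python
def reflective_step(task, select, execute, evaluate, rewrite, library):
    """One iteration of the read-execute-reflect-write loop reported in the
    article. All callables are hypothetical stand-ins for the framework's
    router, executor, and orchestrator."""
    name = select(task, library)               # read: behaviorally relevant retrieval
    ok, trace = execute(library[name], task)   # execute the stored artifact
    if not ok:
        diagnosis = evaluate(trace)            # reflect: inspect the failure trace
        library[name] = rewrite(library[name], diagnosis)  # write: patch the artifact
    return ok

# Toy run simulating a failed execution that triggers a rewrite.
library = {"search": "v1"}
ok = reflective_step(
    task="find docs",
    select=lambda t, lib: "search",
    execute=lambda skill, t: (False, "timeout"),      # simulated failure
    evaluate=lambda trace: "failed: " + trace,
    rewrite=lambda skill, diag: skill + "-patched",
    library=library,
)
```

The point of the sketch is the write step: unlike append-only logging, a failed execution mutates the stored artifact in place, so the next retrieval of the same skill behaves differently.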

For the router update, the source describes a one-step offline reinforcement learning process that learns from execution feedback rather than text overlap. The rationale given in the source is that “The true value of a skill lies in how it contributes to the overall agentic workflow and downstream execution,” with reinforcement learning positioned as a way to evaluate and select skills based on long-term utility.
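A loose reading of "learns from execution feedback rather than text overlap" is a bandit-style value estimate per (task type, skill) pair, updated one step toward observed downstream reward. This is an assumption-laden toy, not the paper's method, which the article does not specify at this level of detail.

```python
from collections import defaultdict

class FeedbackRouter:
    """Toy router that scores skills by observed execution reward rather
    than text similarity. A loose illustration of one-step updates from
    offline feedback; the real algorithm is not detailed in the source."""

    def __init__(self, lr: float = 0.2):
        self.q = defaultdict(float)  # estimated utility of (task_type, skill)
        self.lr = lr

    def update(self, task_type: str, skill: str, reward: float) -> None:
        # One-step move toward the observed downstream reward.
        key = (task_type, skill)
        self.q[key] += self.lr * (reward - self.q[key])

    def select(self, task_type: str, skills: list) -> str:
        # Pick the skill with the highest estimated long-term utility.
        return max(skills, key=lambda s: self.q[(task_type, s)])

# Offline feedback: refund tasks succeeded with one skill, failed with the other.
router = FeedbackRouter()
for _ in range(5):
    router.update("refund", "refund_processing", reward=1.0)
    router.update("refund", "password_reset", reward=0.0)
best = router.select("refund", ["password_reset", "refund_processing"])
```

Under this scoring, the refund skill wins on refund tasks even if a similarity metric would have preferred the password-reset script, which matches the source's stated rationale of selecting by downstream utility.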

Guardrails for production: unit-test gates and regression prevention

Because the system can rewrite executable skill artifacts, the source describes safety controls aimed at preventing regressions in production environments. Memento-Skills uses an automatic unit-test gate: before saving changes to the global library, the system generates a synthetic test case, executes it through the updated skill, and checks results.
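The gate described above reduces to a simple commit protocol: generate a synthetic test, run it through the updated skill, and only write to the global library on success. The sketch below assumes all three interfaces; none of the names come from the framework.

```python
def gated_save(updated_skill, generate_test, run, library, name):
    """Sketch of the reported unit-test gate: a rewritten skill is committed
    to the global library only if a generated synthetic test passes.
    All interfaces are assumptions for illustration."""
    case = generate_test(updated_skill)   # synthesize a test case
    if run(updated_skill, case):          # execute it through the updated skill
        library[name] = updated_skill     # commit to the global library
        return True
    return False                          # reject: the old version stays in place

# Toy commit: the updated version passes its synthetic test and is saved.
library = {"refund": "v1"}
saved = gated_save(
    updated_skill="v2",
    generate_test=lambda skill: {"input": "order-123"},  # hypothetical case
    run=lambda skill, case: skill == "v2",               # pretend v2 passes
    library=library,
    name="refund",
)
```

The design choice worth noting is that rejection is silent and safe: a failed gate leaves the previously working version untouched, which is what makes the self-modification loop bounded rather than open-ended.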

This is a notable technical detail. It suggests that the framework’s self-modification loop is designed to be bounded by automated evaluation, rather than allowing unrestricted updates. The source later returns to this theme in enterprise guidance, saying that while Memento-Skills includes foundational safety rails like unit-test gates, broader governance and security would likely be needed for wider adoption.

On evaluation, the researchers also argue for structured self-improvement. VentureBeat quotes Jun Wang emphasizing the need for “a well-designed evaluation or judge system” to assess performance and provide consistent guidance, and for a “guided form of self-development” where feedback steers the agent toward better designs rather than unconstrained self-modification.

Benchmarks and reported gains: GAIA, HLE, and skill library growth

The source reports that Memento-Skills was evaluated on two benchmarks: General AI Assistants (GAIA) and Humanity’s Last Exam (HLE). GAIA is described as requiring complex multi-step reasoning, multi-modality handling, web browsing, and tool use. HLE is described as an expert-level benchmark spanning eight diverse academic subjects, including mathematics and biology.

VentureBeat reports that the entire system used Gemini-3.1-Flash as the underlying frozen language model. It also states that Memento-Skills was compared against a “Read-Write baseline” that retrieves skills and collects feedback but lacks the self-evolving features.

On GAIA, the source reports that Memento-Skills improved test set accuracy by 13.7 percentage points, reaching 66.0% compared to 52.3% for the static baseline. On HLE, where domain structure supports cross-task skill reuse, the system reportedly more than doubled baseline performance, rising from 17.9% to 38.7%.

The article also provides a retrieval-oriented result tied to the specialized router. It reports that Memento-Skills boosts end-to-end task success rates to 80% compared to 50% for standard BM25 retrieval—framed as avoiding the “classic retrieval trap” of selecting an irrelevant skill due to semantic similarity.

Finally, the source ties performance to skill growth dynamics. Both benchmark experiments reportedly began with just five atomic seed skills (such as basic web search and terminal operations). On GAIA, the agent expanded the seed group into a compact library of 41 skills. On HLE, the system scaled to 235 distinct skills.

While these reported results are specific to the experimental setup, they suggest that the combination of executable skill artifacts, behaviorally relevant routing, and test-gated, feedback-driven skill rewriting can change the performance profile of agent systems without updating the base LLM's parameters.

Enterprise deployment: workflows over isolated tasks, plus open questions

VentureBeat frames enterprise fit as a domain-alignment problem. The source emphasizes a tradeoff between agents handling isolated tasks versus structured workflows. Jun Wang is quoted discussing skill transfer: when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction, limiting cross-task transfer. When tasks share substantial structure, previously acquired skills can be reused, making learning more efficient and reducing the need for additional interaction.

Based on that reasoning, the source reports Wang’s view that workflows are likely the most appropriate setting because they provide a structured environment for composing, evaluating, and improving skills.

At the same time, the article includes cautions about where the framework may not fit yet. Wang cautions against over-deployment in areas not suited for the framework, noting that “Physical agents remain largely unexplored in this context” and that tasks with longer horizons may require more advanced approaches, such as multi-agent LLM systems for coordination, planning, and sustained execution over extended sequences of decisions.

For the industry, the implication (hedged by the source’s own framing) is that self-improvement may become more practical if teams can map their use cases to environments where skills can be composed and evaluated—especially when successful execution feedback can drive controlled rewrites.

Source: VentureBeat