SSO - Allen Institute for AI

In-Context Policy Improvement

Like other continual learning methods*, SSO uses in-context "memories" with information about the task and environment to improve the LLM actor's policy. The memories that SSO generates are instructions for achieving subgoals we call skills. Unlike previous work, SSO continuously evaluates generated memories, creates memories that define modular subgoals, and facilitates memory retrieval.

* e.g. Voyager, ExpeL, and CLIN agents

Skill Set Optimization

Each iteration of SSO includes:

Rolling out a single trajectory with the LLM actor and current skill set
Constructing new skills
Refining executed skills
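
The three steps above can be sketched as a single function. This is only an illustrative skeleton, not the authors' implementation: the name `sso_iteration` and the `rollout`, `construct`, and `refine` callables are hypothetical stand-ins for the LLM actor rollout, skill construction, and skill refinement.

```python
from typing import Callable, List, TypeVar

Skill = TypeVar("Skill")
Trajectory = TypeVar("Trajectory")


def sso_iteration(
    skills: List[Skill],
    rollout: Callable[[List[Skill]], Trajectory],
    construct: Callable[[Trajectory], List[Skill]],
    refine: Callable[[List[Skill], Trajectory], List[Skill]],
) -> List[Skill]:
    """One SSO-style iteration; every callable is a hypothetical placeholder."""
    # 1. Roll out a single trajectory with the actor and the current skill set.
    trajectory = rollout(skills)
    # 2. Construct new skills from that trajectory and add them to the set.
    candidates = skills + construct(trajectory)
    # 3. Refine the skill set using what was observed in the trajectory.
    return refine(candidates, trajectory)


# Toy usage: plain strings stand in for skills, a dict for the trajectory.
skills = ["go to kitchen"]
skills = sso_iteration(
    skills,
    rollout=lambda s: {"used": list(s), "reward": 1.0},
    construct=lambda traj: ["activate stove"],
    refine=lambda s, traj: [k for k in s if traj["reward"] > 0],
)
print(skills)
```

Passing the steps in as callables keeps the loop itself independent of any particular actor or environment.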

To construct new skills, we extract potential subtrajectories, score them by discounted reward, similarity, and length, sample an updated skill set using beam search, and generate subgoals and instructions for each new skill. We refine the constructed skill set by filtering out skills that did not result in high rewards when used in previous trajectories. Then, when providing skills in-context, we retrieve only the most relevant skills based on the cosine similarity between each skill's initial state and the current environment state.
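
A minimal sketch of three of the pieces described above: the discounted reward of a subtrajectory, reward-based filtering, and cosine-similarity retrieval. Everything here is an assumption made for illustration: the dictionary layout, the `gamma`, `min_mean_reward`, and `k` parameters, the choice to keep skills that have never been executed, and the use of plain float vectors as state embeddings. Beam search and the similarity and length terms of the score are not covered.

```python
import math
from typing import Dict, List, Sequence


def discounted_reward(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum of rewards, each discounted by gamma raised to its step index."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_skills(skills: List[Dict], min_mean_reward: float) -> List[Dict]:
    """Drop skills whose mean observed reward is below the threshold.

    Assumption: skills with no recorded executions are kept.
    """
    kept = []
    for skill in skills:
        rewards = skill["rewards"]
        if not rewards or sum(rewards) / len(rewards) >= min_mean_reward:
            kept.append(skill)
    return kept


def retrieve_skills(skills: List[Dict], state: Sequence[float], k: int = 2) -> List[Dict]:
    """Return the k skills whose initial-state embedding is closest to `state`."""
    ranked = sorted(
        skills,
        key=lambda s: cosine_similarity(s["initial_state"], state),
        reverse=True,
    )
    return ranked[:k]


# Toy data: two-dimensional vectors stand in for real state embeddings.
skill_set = [
    {"name": "go_to_kitchen", "initial_state": [1.0, 0.0], "rewards": [0.8, 1.0]},
    {"name": "heat_substance", "initial_state": [0.0, 1.0], "rewards": [0.9]},
    {"name": "wander", "initial_state": [0.7, 0.7], "rewards": [0.0, 0.1]},
]
kept = filter_skills(skill_set, min_mean_reward=0.5)
top = retrieve_skills(kept, state=[0.1, 0.9], k=1)
print([s["name"] for s in kept], [s["name"] for s in top])
```

In this toy run the low-reward skill is filtered out first, and retrieval then returns the remaining skill whose initial state points in nearly the same direction as the current state.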

Skill Lifecycle

Each row of this plot shows all of the skills created in the corresponding iteration and when they were executed. On both ScienceWorld and NetHack, SSO prunes most new skills after a few iterations. As the LLM actor improves at the task, learning new skills and refining old ones, it relies increasingly on more recent skills.

State-of-the-art Results

SSO outperforms the previous state of the art in ScienceWorld by 35% in task adaptation and 14% in task transfer. Learned and reinforced skills such as those listed below provide knowledge of subgoals that are transferable across tasks.

Subgoal: You move to the kitchen
Instructions:
Go to the hallway
Go to the kitchen

Subgoal: The stove is turned on. on the stove is: a substance called liquid [substance]
Instructions:
focus on the thermometer
focus on the substance you want to heat
move the focused substance to the stove
activate the stove
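
As a rough illustration of how a skill like those above could be stored and turned into in-context text, the sketch below pairs a subgoal with its instructions. The `Skill` class, the `render_skill` function, and the output layout are hypothetical; the format SSO actually uses in its prompts is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Skill:
    subgoal: str              # target state the skill is meant to reach
    instructions: List[str]   # ordered steps for reaching the subgoal


def render_skill(skill: Skill) -> str:
    """Format one skill as plain text with numbered instructions."""
    steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(skill.instructions, start=1)
    )
    return f"Subgoal: {skill.subgoal}\nInstructions:\n{steps}"


kitchen = Skill(
    subgoal="You move to the kitchen",
    instructions=["Go to the hallway", "Go to the kitchen"],
)
print(render_skill(kitchen))
```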

Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills

Continual learning for LLM actors via discovering and reinforcing in-context skills
