@article{yin2023lumos,
title={{Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs}},
author={Yin, Da and Brahman, Faeze and Ravichander, Abhilasha and Chandu, Khyathi and Chang, Kai-Wei and Choi, Yejin and Lin, Bill Yuchen},
journal={arXiv preprint arXiv:2311.05657},
year={2023}
}
Instead of using the Self-Instruct method, we use LLMs to convert ground-truth intermediate reasoning steps into high-quality annotations that align with our proposed formulations.
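Below is a minimal sketch of this conversion idea, assuming an OpenAI-style chat API; the prompt wording and the helper name `convert_to_annotations` are illustrative, not the exact prompts we use (see GitHub for those).

```python
# Sketch: an LLM rewrites ground-truth intermediate reasoning steps into
# planning-style annotations (high-level subgoals) and grounding-style
# annotations (executable actions). Prompt text and names are illustrative.
from openai import OpenAI

client = OpenAI()

CONVERSION_PROMPT = """Rewrite the ground-truth reasoning steps of the task below into
(1) a numbered list of high-level subgoals and (2) one executable action per subgoal.

Task: {task}
Ground-truth reasoning steps:
{steps}
"""

def convert_to_annotations(task: str, gold_steps: list[str], model: str = "gpt-4") -> str:
    """Ask an LLM to align gold reasoning steps with the subgoal/action formulation."""
    prompt = CONVERSION_PROMPT.format(task=task, steps="\n".join(gold_steps))
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```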
Finally, we generate ~40K annotations to train the Lumos planning and grounding modules, one of the largest resources for language agent fine-tuning. The annotation sources cover web, complex QA, and math task types. See our final annotation data in the Hugging Face Dataset and the prompt details on GitHub.
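For reference, the released annotations can be loaded with the `datasets` library; the dataset ID below is a placeholder, so substitute the actual ID from the Hugging Face Dataset link above.

```python
# Load the released Lumos annotations (placeholder dataset ID -- replace with the
# real ID from the Hugging Face Dataset link).
from datasets import load_dataset

annotations = load_dataset("<lumos-annotations-dataset-id>", split="train")
print(len(annotations))   # ~40K annotations in total across modules and task types
print(annotations[0])     # one task / subgoal-action annotation record
```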
We first evaluate Lumos on complex QA, web, and math tasks.
We find that Lumos outperforms GPT-4/3.5-based agents on complex QA and web tasks. In particular, Lumos outperforms GPT-4 by 5.1 points in step success rate on Mind2Web and the GPT-3.5-turbo-based ReAct agent by 5.1 points in LLM accuracy. Lumos also achieves better performance than 2-4x larger language agents on math tasks.
We compare the Lumos formulation with other baseline formulations for training open-source agents: Chain-of-Thought Training and Integrated Agent Training. Lumos performs the best among these formulations on three different complex interactive tasks.
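To make the contrast concrete, here is an illustrative sketch (not our official inference code) of the modular formulation: a planning model proposes the next high-level subgoal and a separate grounding model turns it into an executable action, whereas an integrated agent would produce both with a single model. The checkpoint names and prompt templates are placeholders.

```python
# Sketch of modular planning -> grounding inference with two separate models.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model, tokenizer, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    # Strip the prompt tokens and return only the newly generated text.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

plan_tok = AutoTokenizer.from_pretrained("<planning-module-checkpoint>")      # placeholder
plan_lm = AutoModelForCausalLM.from_pretrained("<planning-module-checkpoint>")
ground_tok = AutoTokenizer.from_pretrained("<grounding-module-checkpoint>")   # placeholder
ground_lm = AutoModelForCausalLM.from_pretrained("<grounding-module-checkpoint>")

task = "Which year did the director of Titanic win an Oscar?"
subgoal = generate(plan_lm, plan_tok, f"Task: {task}\nNext subgoal:")
action = generate(ground_lm, ground_tok, f"Task: {task}\nSubgoal: {subgoal}\nAction:")
```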
We first evaluate Lumos trained with the unified annotations composed of the task-specific ones. We then test Lumos on an unseen complex interactive task, WebShop.
We find that after unified training, Lumos achieves slightly higher performance on web and complex QA tasks. We also observe that Lumos brings a 5-10 reward improvement over domain-specific agents on WebShop, and outperforms larger agents with 13B and 30B parameters.
We also conduct a deeper analysis of annotation quality and the choice of annotation formats.
We find that, when controlling for the same training annotation size, our annotations still yield better performance than those produced by the Self-Instruct method, even after the latter pass rigorous execution sanity checks. We also find that having the planning module generate high-level subgoals is a superior choice to generating a very long sequence of low-level subgoals.
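The contrast between the two format choices is roughly the following; the example is made up for illustration and is not taken from the released annotations.

```python
# Illustrative comparison of the two annotation format choices discussed above.
high_level_subgoals = [
    "Subgoal 1: Find who directed Titanic.",
    "Subgoal 2: Find the year that director won an Oscar.",
]
low_level_subgoals = [
    "Step 1: Open a search engine.",
    "Step 2: Type the query 'Titanic director'.",
    "Step 3: Read the result snippet.",
    "Step 4: Record the director's name.",
    "Step 5: Type the query '<director> Oscar year'.",
    # ... many more fine-grained steps, yielding much longer training sequences
]
```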