Topic | Advancements in Embodied AI: Integrating Large Language Models and Open-Vocabulary Manipu…
1. Preface
A quick look at recent progress in embodied AI built on large language models (LLMs).
2. paper: Embodied Task Planning with Large Language Models (arXiv 2023)
2.1 basic info
- task: embodied task planning
- model: the TaPA (Task Planning Agent) framework is proposed.
- main idea: aligns large language models (LLMs) with visual perception models to generate executable plans in physical environments.
2.2 main contributions
- Multimodal Dataset Construction
- a dataset containing triplets of indoor scenes, instructions, and corresponding action plans
- Grounded Plan Tuning
- Fine-tuning pre-trained LLMs for grounded planning that respects the physical constraints of the scene.
- Extending Open-Vocabulary Object Detection
- Extending detection to multi-view RGB images, which is crucial for capturing the full scene context (see the sketch below).
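Conceptually, the multi-view extension reduces to fusing per-view detections into one scene-level object list for the planner. Below is a minimal Python sketch of that fusion step, under my own assumptions; the `detect` function is a hypothetical stand-in for whatever open-vocabulary detector is used, not an API from the paper.

```python
# Minimal sketch: fuse open-vocabulary detections from multi-view RGB images
# into a single scene-level object list that can be handed to the LLM planner.
from collections import defaultdict

def detect(image):
    """Hypothetical detector: returns [(label, confidence), ...] for one view."""
    raise NotImplementedError("plug in a real open-vocabulary detector here")

def scene_object_list(multi_view_images, min_conf=0.5):
    # Keep the best confidence seen for each label across all views, then
    # threshold, so an object visible in any single view is still retained.
    best = defaultdict(float)
    for image in multi_view_images:
        for label, conf in detect(image):
            best[label] = max(best[label], conf)
    return sorted(label for label, conf in best.items() if conf >= min_conf)
```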
2.3 main idea
The TaPA framework integrates LLMs with visual information from open-vocabulary object detectors. It processes human instructions and available object lists to generate feasible action plans for navigation and manipulation tasks.
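As an illustration of that interface, here is a minimal sketch of the planning step. The prompt wording and the `query_llm` helper are illustrative assumptions of mine, not the paper's exact prompt or code: the detected object list grounds the instruction, and the LLM's reply is parsed into executable steps.

```python
# Minimal sketch of a TaPA-style planning step: compose the scene object list
# and the human instruction into a prompt, then parse the reply into steps.
PROMPT_TEMPLATE = (
    "You are a household robot. The room contains: {objects}.\n"
    "Instruction: {instruction}\n"
    "Respond with a numbered list of actions using only objects in the room."
)

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your model or API of choice."""
    raise NotImplementedError

def generate_plan(objects: list[str], instruction: str) -> list[str]:
    prompt = PROMPT_TEMPLATE.format(objects=", ".join(objects),
                                    instruction=instruction)
    reply = query_llm(prompt)
    # Keep only numbered lines such as "1. walk to the kitchen".
    return [line.split(".", 1)[1].strip()
            for line in reply.splitlines()
            if line.strip() and line.strip()[0].isdigit() and "." in line]
```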
2.4 results
3. paper: Large Language Models as Generalizable Policies for Embodied Tasks (arXiv 2023)
3.1 basic info
- task: visual embodied tasks
- model: Large Language model Reinforcement Learning Policy (LLaRP)
- main idea: integrates pre-trained LLMs with egocentric visual observations to directly output actions in the environment.
3.2 main contributions
- LLaRP Framework
- A new framework that combines LLMs with reinforcement learning for embodied AI tasks.
- Generalization Capabilities
- Demonstrated robustness to paraphrased instructions and ability to generalize to novel tasks.
- Language Rearrangement Benchmark
- Introduction of a new benchmark comprising 150,000 training tasks and 1,000 test tasks for language-conditioned rearrangement.
3.3 main idea
- a pre-trained, frozen LLM processes the text instruction together with egocentric visual observations;
- only a small set of newly added modules (a visual observation encoder and an action output head) is trained through reinforcement learning;
- the frozen LLM together with these trained modules then generalizes to novel tasks (a minimal sketch follows this list).
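The core setup can be sketched in a few lines of PyTorch. This is a toy illustration under assumptions of my own (the stand-in transformer, layer sizes, and action count are arbitrary), not the paper's implementation; a real LLM would be loaded from, e.g., HuggingFace.

```python
# Minimal PyTorch sketch of the LLaRP idea: the LLM weights stay frozen, and
# only a small visual encoder and an action head are updated by RL.
import torch
import torch.nn as nn

class LLaRPPolicy(nn.Module):
    def __init__(self, llm: nn.Module, d_model: int, num_actions: int):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():      # freeze the pre-trained LLM
            p.requires_grad = False
        # Trainable adapter: maps egocentric visual features into LLM space.
        self.obs_encoder = nn.Linear(512, d_model)
        # Trainable action head: decodes hidden states into environment actions.
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, text_embeds, visual_feats):
        # Concatenate instruction tokens with encoded observations, run the
        # frozen LLM, and predict an action from the final hidden state.
        obs_tokens = self.obs_encoder(visual_feats)
        hidden = self.llm(torch.cat([text_embeds, obs_tokens], dim=1))
        return self.action_head(hidden[:, -1])

# Toy usage with a stand-in "LLM" (a real one would come from a model hub):
llm = nn.TransformerEncoder(nn.TransformerEncoderLayer(64, 4, batch_first=True), 2)
policy = LLaRPPolicy(llm, d_model=64, num_actions=32)
logits = policy(torch.randn(1, 8, 64), torch.randn(1, 4, 512))
```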
4. Other papers
- GOAT: GO to Any Thing
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
- Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions
(If I get the chance, I'll write up each of these papers in more detail later.)
Copyright notice:
Author: lichengxin
Link: https://www.techfm.club/p/90216.html
Source: TechFM
The article is copyrighted by its author; please do not repost without permission.