Topic | Advancements in Embodied AI: Integrating Large Language Models and Open-Vocabulary Manipu…

1. Preface

A brief overview of recent progress in embodied AI based on LLMs.

2. paper:Embodied Task Planning with Large Language Models (arxiv23)

2.1 basic info

  • task: embodied task planning
  • model: the proposed TaPA (Task Planning Agent) framework.
  • main idea: aligns large language models (LLMs) with visual perception models to generate executable plans in physical environments.

2.2 main contribution

  1. Multimodal Dataset Construction
  • A dataset containing triplets of visual scenes, instructions, and corresponding action plans.
  2. Grounded Plan Tuning
  • Fine-tuning pre-trained LLMs for grounded planning, considering the physical constraints of the scene.
  3. Extending Open-Vocabulary Object Detection
  • Enhanced detection for multi-view RGB images, crucial for understanding scene context.

2.3 main idea

The TaPA framework integrates LLMs with visual information from open-vocabulary object detectors. It processes human instructions and available object lists to generate feasible action plans for navigation and manipulation tasks.
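The pipeline above can be sketched in a few lines: merge the object labels detected across multiple views, then ground the instruction in that object list before asking the LLM for a plan. This is a minimal illustrative sketch; the function names and prompt wording are my own assumptions, not from the paper's code.

```python
# Hypothetical sketch of a TaPA-style planning pipeline.
# merge_detections / build_prompt are illustrative names, not the paper's API.

def merge_detections(per_view_detections):
    """Union of object labels detected across multi-view RGB images,
    preserving first-seen order."""
    seen = []
    for labels in per_view_detections:
        for label in labels:
            if label not in seen:
                seen.append(label)
    return seen

def build_prompt(instruction, object_list):
    """Ground the instruction in the objects actually present in the scene,
    so the LLM only plans with existing objects."""
    return (
        f"Objects in the scene: {', '.join(object_list)}.\n"
        f"Instruction: {instruction}\n"
        "Generate a step-by-step executable plan using only these objects."
    )

# Example: detections from three camera views of a kitchen scene.
views = [["mug", "table"], ["fridge", "mug"], ["counter", "fridge"]]
objects = merge_detections(views)
prompt = build_prompt("Put the mug in the fridge", objects)
```

The resulting prompt would then be sent to the (fine-tuned) LLM, whose output is parsed into navigation and manipulation actions.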

2.4 results

3. paper: Large Language Models as Generalizable Policies for Embodied Tasks (arxiv23)

3.1 basic info

  • task: visual embodied tasks
  • model: Large Language model Reinforcement Learning Policy (LLaRP)
  • main idea: integrates pre-trained LLMs with egocentric visual observations to directly output actions in the environment.

3.2 main contribution

  1. LLaRP Framework
  • A new framework that combines LLMs with reinforcement learning for embodied AI tasks.
  2. Generalization Capabilities
  • Demonstrated robustness to paraphrased instructions and the ability to generalize to novel tasks.
  3. Language Rearrangement Benchmark
  • Introduction of a new benchmark comprising 150,000 training tasks and 1,000 test tasks for language-conditioned rearrangement.

3.3 main idea

(Figure: LLaRP framework overview; original image omitted.)
  • A frozen pre-trained LLM processes text instructions and egocentric visual observations;
  • some modules (highlighted in red in the figure) are trained through reinforcement learning;
  • the frozen LLM together with these trained modules can then generalize to novel tasks.
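The split between frozen and trained components can be sketched as follows. This is a toy illustration of the idea only; the class, module names, and scalar "weights" are hypothetical stand-ins, not LLaRP's actual architecture.

```python
# Illustrative sketch of an LLaRP-style policy: a frozen LLM backbone plus
# small trainable modules (visual adapter and action head) updated by RL.
# All names and values here are hypothetical.

class FrozenLLMPolicy:
    def __init__(self, llm_forward):
        self.llm_forward = llm_forward       # frozen: never updated during RL
        self.adapter = {"w": 0.1}            # trainable: maps visual features to tokens
        self.action_head = {"w": 0.1}        # trainable: maps hidden state to actions

    def trainable_parameters(self):
        # Only the adapter and action head receive RL gradient updates;
        # the LLM backbone is excluded.
        return [self.adapter, self.action_head]

    def act(self, visual_feature, instruction_embedding):
        token = self.adapter["w"] * visual_feature
        hidden = self.llm_forward(instruction_embedding + token)
        return self.action_head["w"] * hidden

frozen_llm = lambda x: 2.0 * x  # stand-in for the frozen LLM forward pass
policy = FrozenLLMPolicy(frozen_llm)
action = policy.act(visual_feature=1.0, instruction_embedding=0.5)
```

The design point is that generalization comes largely from the frozen LLM's pretrained representations, while the lightweight trained modules adapt it to the embodied action space.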

4. Other papers

  • GOAT: GO to Any Thing
  • CLIP-Fields Weakly Supervised Semantic Fields
  • Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions

(I may write more detailed notes on each paper later, when I get the chance.)

Copyright notice:
Author: lichengxin
Link: https://www.techfm.club/p/90216.html
Source: TechFM
The copyright belongs to the author. Please do not reproduce without permission.
