Topic | Advancements in Embodied AI: Integrating Large Language Models and Open-Vocabulary Manipu…

1. Preface

A brief overview of recent progress in embodied AI based on LLMs.

2. paper:Embodied Task Planning with Large Language Models (arxiv23)

2.1 basic info

  • task: embodied task planning
  • model: the proposed TaPA (Task Planning Agent) framework.
  • main idea: aligns large language models (LLMs) with visual perception models to generate executable plans in physical environments.

2.2 main contribution

  1. Multimodal Dataset Construction
  • A dataset containing triplets of visual scenes, instructions, and corresponding action plans.
  2. Grounded Plan Tuning
  • Fine-tuning pre-trained LLMs for grounded planning, considering the physical constraints of the scene.
  3. Extending Open-Vocabulary Object Detection
  • Enhanced detection for multi-view RGB images, crucial for understanding scene context.

2.3 main idea

The TaPA framework integrates LLMs with visual information from open-vocabulary object detectors. It processes human instructions and available object lists to generate feasible action plans for navigation and manipulation tasks.
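The pipeline above can be sketched in a few lines: merge the object labels detected across multiple views, then ground the instruction in that object list before asking the LLM for a plan. This is a minimal illustrative sketch; the function names and prompt wording are my own assumptions, not from the paper's code.

```python
# Hypothetical sketch of a TaPA-style planning pipeline.
# merge_detections / build_prompt are illustrative names, not the paper's API.

def merge_detections(per_view_detections):
    """Union of object labels detected across multi-view RGB images,
    preserving first-seen order."""
    seen = []
    for labels in per_view_detections:
        for label in labels:
            if label not in seen:
                seen.append(label)
    return seen

def build_prompt(instruction, object_list):
    """Ground the instruction in the objects actually present in the scene,
    so the LLM only plans with existing objects."""
    return (
        f"Objects in the scene: {', '.join(object_list)}.\n"
        f"Instruction: {instruction}\n"
        "Generate a step-by-step executable plan using only these objects."
    )

# Example: detections from three camera views of a kitchen scene.
views = [["mug", "table"], ["fridge", "mug"], ["counter", "fridge"]]
objects = merge_detections(views)
prompt = build_prompt("Put the mug in the fridge", objects)
```

The resulting prompt would then be sent to the (fine-tuned) LLM, whose output is parsed into navigation and manipulation actions.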

2.4 results

3. paper: Large Language Models as Generalizable Policies for Embodied Tasks (arxiv23)

3.1 basic info

  • task: visual embodied tasks
  • model: Large Language model Reinforcement Learning Policy (LLaRP)
  • main idea: integrates pre-trained LLMs with egocentric visual observations to directly output actions in the environment.

3.2 main contribution

  1. LLaRP Framework
  • A new framework that combines LLMs with reinforcement learning for embodied AI tasks.
  2. Generalization Capabilities
  • Demonstrated robustness to paraphrased instructions and the ability to generalize to novel tasks.
  3. Language Rearrangement Benchmark
  • Introduction of a new benchmark comprising 150,000 training tasks and 1,000 test tasks for language-conditioned rearrangement.

3.3 main idea

(Figure: LLaRP framework overview; original image omitted.)
  • A frozen pre-trained LLM processes text instructions and egocentric visual observations;
  • some modules (highlighted in red in the figure) are trained through reinforcement learning;
  • the frozen LLM together with these trained modules can then generalize to novel tasks.
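The split between frozen and trained components can be sketched as follows. This is a toy illustration of the idea only; the class, module names, and scalar "weights" are hypothetical stand-ins, not LLaRP's actual architecture.

```python
# Illustrative sketch of an LLaRP-style policy: a frozen LLM backbone plus
# small trainable modules (visual adapter and action head) updated by RL.
# All names and values here are hypothetical.

class FrozenLLMPolicy:
    def __init__(self, llm_forward):
        self.llm_forward = llm_forward       # frozen: never updated during RL
        self.adapter = {"w": 0.1}            # trainable: maps visual features to tokens
        self.action_head = {"w": 0.1}        # trainable: maps hidden state to actions

    def trainable_parameters(self):
        # Only the adapter and action head receive RL gradient updates;
        # the LLM backbone is excluded.
        return [self.adapter, self.action_head]

    def act(self, visual_feature, instruction_embedding):
        token = self.adapter["w"] * visual_feature
        hidden = self.llm_forward(instruction_embedding + token)
        return self.action_head["w"] * hidden

frozen_llm = lambda x: 2.0 * x  # stand-in for the frozen LLM forward pass
policy = FrozenLLMPolicy(frozen_llm)
action = policy.act(visual_feature=1.0, instruction_embedding=0.5)
```

The design point is that generalization comes largely from the frozen LLM's pretrained representations, while the lightweight trained modules adapt it to the embodied action space.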

4. Other papers

  • GOAT: GO to Any Thing
  • CLIP-Fields Weakly Supervised Semantic Fields
  • Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions

(I may write more detailed notes on each paper later, when I get the chance.)

Copyright notice:
Author: lichengxin
Link: https://www.techfm.club/p/90216.html
Source: TechFM
The copyright belongs to the author. Please do not reproduce without permission.
