

Reinforcement Learning from Diverse Human Preferences

Wanqi Xue, Bo An, Shuicheng Yan, Zhongwen Xu

IJCAI 2024 Conference

August 2024

Keywords: Reinforcement Learning, Human Preferences, Human Feedback, Rewards

Abstract:

The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels over behavior trajectories. However, existing methods for preference-based RL are limited by their need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model, forcing its latent space to stay close to a prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMControl and Meta-world and shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
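The two mechanisms the abstract names can be illustrated with a minimal sketch: a KL penalty that keeps the reward model's latent distribution near a standard-normal prior, and an ensemble whose members are weighted by a simple confidence score. The exact forms used in the paper are not given here, so both the KL formula choice and the inverse-disagreement confidence weighting below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over dimensions.

    One common way to impose a 'stay close to the prior' constraint
    on a latent space (assumed form; the paper's constraint may differ).
    """
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def confidence_weighted_reward(member_preds):
    """Combine reward predictions from an ensemble of reward models.

    member_preds: array of shape (n_models, n_samples).
    Each model is weighted inversely by its mean squared disagreement
    with the ensemble mean -- one plausible notion of 'confidence'
    (hypothetical; the paper's ensembling scheme may differ).
    """
    preds = np.asarray(member_preds, dtype=float)
    ensemble_mean = preds.mean(axis=0)
    disagreement = ((preds - ensemble_mean) ** 2).mean(axis=1) + 1e-8
    weights = (1.0 / disagreement) / np.sum(1.0 / disagreement)
    return weights @ preds
```

In training, the KL term would be added to the preference (e.g. Bradley-Terry) loss with a coefficient controlling how strongly the latent space is pulled toward the prior, while the confidence-weighted ensemble output would replace a single model's reward when labeling transitions for the RL agent.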

