Reinforcement Learning Lecture 1: Introduction

https://www.davidsilver.uk/teaching/

Teaching - David Silver

https://www.youtube.com/playlist?list=PLpRS2w0xWHTcTZyyX8LMmtbcMXpd3s4TU

Basic Theory of Reinforcement Learning (강화학습의 기초 이론)

These notes are based on David Silver's lectures and the 팡요랩 (Pang-Yo Lab) lectures.

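Terminology

Definition of Reward

: a scalar feedback signal

The agent's job is to maximize cumulative reward.

Reinforcement learning is based on the Reward Hypothesis: every goal can be described as the maximization of expected cumulative reward.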

Examples of rewards

- Flying a helicopter

  + (positive reward): following the desired trajectory without crashing

  - (negative reward): crashing

Sequential Decision Making

Goal: select actions that maximize total future reward.

Agent and Environment

At each time step t, the agent

1. executes action At,

2. receives observation Ot,

3. receives reward Rt.

The environment

1. receives action At,

2. emits observation Ot+1,

3. emits reward Rt+1.
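
The interaction described above can be sketched as a simple loop. This is only a minimal illustration: `CoinFlipEnv`, `AlwaysHeadsAgent`, and `run` are hypothetical names made up for this example, and the coin-guessing task is an assumed toy problem, not something from the lectures.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a biased coin."""

    def __init__(self, p_heads=0.7, seed=0):
        self.p_heads = p_heads
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment receives A_t, then emits O_{t+1} and R_{t+1}.
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else -1.0
        return outcome, reward


class AlwaysHeadsAgent:
    """Hypothetical agent that ignores its inputs and always guesses heads (1)."""

    def act(self, observation, reward):
        return 1


def run(env, agent, num_steps):
    observation, reward = None, 0.0  # nothing has been observed before step 1
    total_reward = 0.0
    for _ in range(num_steps):
        action = agent.act(observation, reward)  # agent executes A_t
        observation, reward = env.step(action)   # env emits O_{t+1}, R_{t+1}
        total_reward += reward                   # cumulative reward
    return total_reward


total = run(CoinFlipEnv(), AlwaysHeadsAgent(), num_steps=1000)
print(total)
```

The agent here never learns anything; the point is only the order of events in each step and the fact that the quantity being accumulated is the reward.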

History and State

Ht = O1, R1, A1, ... , At-1, Ot, Rt

That is, the history is the sequence of all observable variables up to time t.

State is the information used to determine what happens next.

Formally, state is a function of the history:

St = f(Ht)
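
To make St = f(Ht) concrete, here is a small sketch with two possible choices of f. The tuple layout of the history and both function names are assumptions made for this example only.

```python
# Each history entry is (observation, reward, action taken afterwards).
# The most recent entry has no action yet, so its action slot is None.
history = [
    (0, 0.0, 1),
    (1, 1.0, 0),
    (1, -1.0, 1),
    (0, 1.0, None),
]


def last_observation_state(history):
    """f(H_t) = O_t: keep only the most recent observation."""
    return history[-1][0]


def last_k_observations_state(history, k=3):
    """f(H_t) = (O_{t-k+1}, ..., O_t): keep the last k observations."""
    return tuple(obs for obs, _, _ in history[-k:])


print(last_observation_state(history))        # state built from O_t only
print(last_k_observations_state(history, 3))  # state built from 3 observations
```

Both are valid states in the sense of the definition, since each is some function of the history. Whether either one is a good state depends on the problem.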

Information State

: An information state (a.k.a. Markov state) contains all useful information from the history.

The future is independent of the past given the present.

What does it mean for a state to be Markov?

*A state St is Markov if and only if

>>> P[St+1 | St] = P[St+1 | S1, ..., St]

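In other words, when working out the probability of St+1, all states before St can be ignored.

This is because St already captures everything relevant that accumulated in the past. Once St is known, the history can be thrown away.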
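
The equality above can be checked numerically on a tiny example. The 2-state chain below (its transition table `P` and start distribution `start`) is a made-up assumption, and it is Markov by construction, so this sketch only confirms that conditioning on the earlier state S1 changes nothing.

```python
from itertools import product

# Hypothetical 2-state chain: P[s][s2] = P[S_{t+1} = s2 | S_t = s].
P = [[0.9, 0.1],
     [0.4, 0.6]]
start = [0.5, 0.5]  # assumed distribution of S_1


def joint(s1, s2, s3):
    """P[S_1 = s1, S_2 = s2, S_3 = s3]."""
    return start[s1] * P[s1][s2] * P[s2][s3]


def cond_on_present(s3, s2):
    """P[S_3 = s3 | S_2 = s2], summing out S_1."""
    num = sum(joint(s1, s2, s3) for s1 in (0, 1))
    den = sum(joint(s1, s2, x) for s1 in (0, 1) for x in (0, 1))
    return num / den


def cond_on_history(s3, s2, s1):
    """P[S_3 = s3 | S_1 = s1, S_2 = s2]."""
    num = joint(s1, s2, s3)
    den = sum(joint(s1, s2, x) for x in (0, 1))
    return num / den


# True if both conditionals agree for every combination of states.
markov_holds = all(
    abs(cond_on_history(s3, s2, s1) - cond_on_present(s3, s2)) < 1e-12
    for s1, s2, s3 in product((0, 1), repeat=3)
)
print(markov_holds)
```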

Fully Observable Environments

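Full observability: the agent directly observes the environment state.

*The opposite case

Partial observability: the agent observes the environment only indirectly.

*Components of an agent

Policy: determines the agent's behavior

- a mapping from state to action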

- Deterministic policy : a = π(s)

- Stochastic policy : π(a|s) = P[At = a | St = s]
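
The two kinds of policy can be sketched as follows. The states, actions, and probabilities are hypothetical values chosen for illustration.

```python
import random


def deterministic_policy(state):
    """a = pi(s): a made-up rule, 'right' in even states and 'left' otherwise."""
    return "right" if state % 2 == 0 else "left"


# pi(a|s): for each state, a probability for every action (hypothetical numbers).
stochastic_policy = {
    0: {"left": 0.2, "right": 0.8},
    1: {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    """Draw an action a with probability pi(a|s)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
action = sample_action(stochastic_policy, 0, rng)
print(deterministic_policy(2), action)
```

A deterministic policy always returns the same action for a given state, while a stochastic policy may return different actions on repeated calls with the same state.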

Value Function: a prediction (expected value) of future reward

- used to evaluate the goodness/badness of states

- Vπ(s) = Eπ[R(t+1) + γR(t+2) + γ^2 R(t+3) + ... | St = s]
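
The quantity inside the expectation is the discounted sum of rewards, where γ is the discount factor. A minimal sketch of computing it for one finite reward sequence, and of approximating the expectation by averaging over sampled sequences, is below. The averaging step and the reward numbers are assumptions added for illustration; the notes above only give the definition.

```python
def discounted_return(rewards, gamma):
    """R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # fold from the last reward back to the first
    return g


def estimate_value(sampled_reward_sequences, gamma):
    """Approximate the expectation by averaging the returns of sampled sequences."""
    returns = [discounted_return(rs, gamma) for rs in sampled_reward_sequences]
    return sum(returns) / len(returns)


# Hypothetical reward sequences, all observed after starting from the same state s.
samples = [[1.0, 0.0, 2.0], [0.0, 0.0, 0.0]]
print(discounted_return(samples[0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2
print(estimate_value(samples, gamma=0.5))
```

With γ close to 0 the value is dominated by the immediate reward, and with γ close to 1 later rewards count almost as much as early ones.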

Model: predicts what the environment will do next

- P predicts the next state.

- R predicts the next (immediate) reward.
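
For a small problem, P and R can be written down as lookup tables. The room-heating states, actions, and numbers below are invented for this sketch and are not taken from the lectures.

```python
# P[(s, a)] maps each possible next state s' to P[S_{t+1} = s' | S_t = s, A_t = a].
P = {
    ("cold", "heat"): {"warm": 0.9, "cold": 0.1},
    ("cold", "wait"): {"cold": 1.0},
    ("warm", "heat"): {"warm": 1.0},
    ("warm", "wait"): {"warm": 0.7, "cold": 0.3},
}

# R[(s, a)] is the expected immediate reward E[R_{t+1} | S_t = s, A_t = a].
R = {
    ("cold", "heat"): -1.0,
    ("cold", "wait"): -2.0,
    ("warm", "heat"): -1.0,
    ("warm", "wait"): 0.0,
}


def predict_next_state(state, action):
    """Return the most likely next state under the model."""
    dist = P[(state, action)]
    return max(dist, key=dist.get)


def predict_reward(state, action):
    """Return the model's expected immediate reward."""
    return R[(state, action)]


print(predict_next_state("cold", "heat"), predict_reward("cold", "heat"))
```

An agent that has such tables can reason about the consequences of an action without actually taking it.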

Categorizing RL agents (1)

- Value Based : Value function (the policy is implicit)

- Policy Based : Policy (no value function)

- Actor Critic : Policy, Value function

Categorizing RL agents (2)

- Model Free : Policy and/or Value function, no Model

- Model Based : Policy and/or Value function, Model

Learning and Planning

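Depending on whether a model of the environment is available, there are two fundamental problems.

1. Reinforcement Learning: the environment is initially unknown, so the agent has to interact with it to improve its policy.

2. Planning: the environment is known, i.e. the agent has a model and knows how rewards are obtained, so it can improve its policy by computing with the model.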

Exploration and Exploitation

Exploration: finding more information about the environment

Exploitation: using the information already known to maximize reward
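
One common way to balance the two is epsilon-greedy action selection. The notes above do not name a specific method, so this is just one illustrative choice; the function name and the reward estimates are made up for the example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index from a list of estimated rewards.

    With probability epsilon, explore by picking a random action.
    Otherwise, exploit by picking the action with the highest estimate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                       # explore
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit


rng = random.Random(0)
estimates = [0.1, 0.5, 0.3]  # hypothetical current reward estimates per action
print(epsilon_greedy(estimates, epsilon=0.1, rng=rng))
```

With epsilon = 0 the agent only exploits and may never discover a better action. With epsilon = 1 it only explores and never uses what it has learned.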

Prediction and Control

Prediction: evaluate the future given a policy (= learn the value function).

Control: optimize the future (= find the best policy).
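
The difference can be seen in a one-step toy problem with a single state and two actions. The expected rewards and the function names `evaluate` and `improve` are assumptions for this sketch. Real problems have many states and delayed rewards, so this only shows the shape of the two questions.

```python
# Hypothetical expected immediate reward of each action in the single state.
expected_reward = {"left": 1.0, "right": 3.0}

# A given policy to evaluate: choose each action half of the time.
uniform_policy = {"left": 0.5, "right": 0.5}


def evaluate(policy, expected_reward):
    """Prediction: how much reward does this given policy collect on average?"""
    return sum(prob * expected_reward[a] for a, prob in policy.items())


def improve(expected_reward):
    """Control: return the policy that collects the most reward."""
    best = max(expected_reward, key=expected_reward.get)
    return {a: (1.0 if a == best else 0.0) for a in expected_reward}


best_policy = improve(expected_reward)
print(evaluate(uniform_policy, expected_reward))  # value of the given policy
print(evaluate(best_policy, expected_reward))     # value of the best policy
```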
