
# ELEN E6885 Midterm Exam

## ELEN E6885: Introduction to Reinforcement Learning

### Problem 1 (20 Points, 2 Points each)

True or False. No explanation is needed.

1. Reinforcement learning uses the formal framework of Markov decision processes (MDP) to define the interaction between a learning agent and its environment in terms of states, actions and rewards.

2. MDP instances with small discount factors tend to emphasize near-term rewards.

3. If the only difference between two MDPs is the value of discount factor γ, then they must have the same optimal policy.

4. Both ϵ-greedy policy and Softmax policy will balance between exploration and exploitation.

5. Every finite MDP with bounded rewards and discount factor γ ∈ [0, 1) has a unique optimal policy.

6. In practice, both policy iteration and value iteration are widely used, and it is not clear which, if either, is better in general.

7. Generalized policy iteration (GPI) is a term used to refer to the general idea of letting policy-evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

8. Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state–action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy.

9. In a partially observable Markov decision process (POMDP), the agent may be unable to identify its current state. So the best choice is to maintain a probability distribution over the states and actions, and then update this probability distribution based on its real-time observations.

10. Many off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another.
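The importance-sampling idea in item 10 can be sketched numerically. The toy distributions and rewards below are made up for illustration and are not part of the exam: samples are drawn from a behavior policy b, and each return is reweighted by the ratio π(a)/b(a) to estimate the expected reward under the target policy π.

```python
import random

random.seed(0)

# Behavior policy b (what we sample from) and target policy pi
# (what we want the expectation under). Both are assumptions.
b = {"left": 0.5, "right": 0.5}
pi = {"left": 0.9, "right": 0.1}
reward = {"left": 1.0, "right": 0.0}

n = 100_000
total = 0.0
for _ in range(n):
    a = random.choice(["left", "right"])      # sample action from b
    total += (pi[a] / b[a]) * reward[a]       # reweight by importance ratio
estimate = total / n

# True expected reward under pi: 0.9 * 1.0 + 0.1 * 0.0 = 0.9
print(estimate)
```

The estimate is unbiased, but its variance grows with the mismatch between π and b, which is why importance-sampling ratios are a central practical concern in off-policy methods.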

### Problem 2 (20 Points, 5 Points each)

1. Consider a 100×100 grid world domain where the agent starts each episode in the bottom-left corner, and the goal is to reach the top-right corner in the least number of steps. To learn an optimal policy for this problem, you decide on a reward formulation in which the agent receives a reward of +1 on reaching the goal state and 0 for all other transitions. Suppose you try two variants of this reward formulation: (P1), where you use discounted returns with γ ∈ (0, 1), and (P2), where no discounting is used. As a consequence, a good policy can be learned in (P1) but no learning occurs in (P2). Why?

2. Suppose the reinforcement learning player were greedy. Might it always learn to play better, or worse, than a nongreedy player?

3. Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

4. Given a stationary policy, is it possible that if the agent is in the same state at two different time steps, it can choose two different actions? If yes, please provide an example.
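The discounting effect asked about in item 1 can be checked with a few lines of arithmetic. The snippet below is a sketch; the discount factor and the episode lengths are illustrative assumptions (the shortest path in a 100×100 grid takes 198 moves), not values given in the problem.

```python
gamma = 0.99  # assumed discount factor for (P1)

def discounted_return(T, gamma):
    """Return of an episode that reaches the goal at step T:
    all rewards are 0 except the final +1, so G = gamma**(T-1)."""
    return gamma ** (T - 1)

path_lengths = (198, 300, 1000)

# (P1): shorter paths earn strictly higher discounted returns,
# so value estimates can rank policies by path length.
returns_p1 = [discounted_return(T, gamma) for T in path_lengths]

# (P2): every successful episode returns exactly 1 regardless of T,
# so all goal-reaching policies look identical and there is no
# gradient toward shorter paths.
returns_p2 = [1.0 for T in path_lengths]
```

Under discounting the return decays geometrically in the episode length, which is exactly the signal (P2) lacks.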

### Problem 4 (30 Points)

Consider an undiscounted MDP (γ = 1) with two non-terminal states A, B and a terminal state C. The transition function and reward function of the MDP are unknown. However, we have observed the following two episodes:

A, a1, +1, A, a1, +1, A, a2, +3, C,

A, a2, +1, B, a3, +2, C,

where a1, a2, a3 are actions, and the number after each action is an immediate reward. For example, A, a1, 3, A means that the agent took action a1 from state A, received an immediate reward 3 and ended up in state A.

1. [9 Pts] Using a learning rate of α = 0.1, and assuming initial state values of 0, what updates to V(A) does the on-line TD(0) method make after the first episode?
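The on-line TD(0) updates for the first episode can be traced with a short script. This is a sketch, assuming the episode's first reward is +1 (matching the signs printed on the other rewards) and applying the update V(s) ← V(s) + α[r + γV(s') − V(s)] after each transition:

```python
# On-line TD(0) over episode 1: A, a1, +1, A, a1, +1, A, a2, +3, C
alpha, gamma = 0.1, 1.0
V = {"A": 0.0, "B": 0.0, "C": 0.0}  # initial state values

# (state, immediate reward, next state) for each transition
episode = [("A", 1.0, "A"), ("A", 1.0, "A"), ("A", 3.0, "C")]

updates = []
for s, r, s_next in episode:
    # TD(0) update applied on-line, i.e. immediately after the transition
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    updates.append(V["A"])

# V(A) moves 0 -> 0.1 -> 0.2 -> 0.48 over the three transitions.
print(updates)
```

Note that because updates are on-line, the second transition already sees the updated V(A) = 0.1 in both its target and its current estimate, so they cancel and the step adds exactly α·r = 0.1.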