
## Midterm Exam

## ELEN E6885: Introduction to Reinforcement Learning


**Problem 1** (**20 Points, 2 Points each**)

True or False. No explanation is needed.

1. Reinforcement learning uses the formal framework of Markov decision processes (MDP) to define the interaction between a learning agent and its environment in terms of states, actions and rewards.

2. MDP instances with small discount factors tend to emphasize near-term rewards.

3. If the only difference between two MDPs is the value of discount factor *γ*, then they must have the same optimal policy.

4. Both *ϵ*-greedy policy and Softmax policy will balance between exploration and exploitation.

5. Every finite MDP with bounded rewards and discount factor *γ* ∈ [0, 1) has a unique optimal policy.

6. In practice, both policy iteration and value iteration are widely used, and it is not clear which, if either, is better in general.

7. Generalized policy iteration (GPI) is a term used to refer to the general idea of letting policy-evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

8. Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state–action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy.

9. In a partially observable Markov decision process (POMDP), it is likely that an agent cannot identify its current state. So the best choice is to maintain a probability distribution over the states and actions, and then update this probability distribution based on its real-time observations.

10. Many off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another.
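Statement 10 describes importance sampling. A minimal sketch of ordinary importance sampling for a one-step problem follows; the target policy `pi`, behavior policy `b`, and reward table here are invented purely for illustration:

```python
import random

random.seed(0)

# Hypothetical target policy pi and behavior policy b over two actions.
pi = {"left": 0.9, "right": 0.1}      # we want E_pi[R] ...
b = {"left": 0.5, "right": 0.5}       # ... but samples come from b
reward = {"left": 1.0, "right": 0.0}  # assumed deterministic rewards

n = 100_000
total = 0.0
for _ in range(n):
    a = random.choices(list(b), weights=list(b.values()))[0]  # sample from b
    rho = pi[a] / b[a]            # importance-sampling ratio pi(a) / b(a)
    total += rho * reward[a]      # reweight the sampled return
estimate = total / n              # approximates E_pi[R] = 0.9
```

Each sample drawn under *b* is reweighted by the ratio of the two policies' probabilities, which makes the average an unbiased estimate of the expectation under *π*.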

**Problem 2** (**20 Points, 5 Points each**)

Short-answer questions.

1. Consider a 100×100 grid world domain where the agent starts each episode in the bottom-left corner, and the goal is to reach the top-right corner in the least number of steps. To learn an optimal policy for this problem, you decide on a reward formulation in which the agent receives a reward of +1 on reaching the goal state and 0 for all other transitions. Suppose you try two variants of this reward formulation: (P1), where you use discounted returns with *γ* ∈ (0, 1), and (P2), where no discounting is used. As a consequence, a good policy can be learned in (P1), but no learning happens in (P2). Why?

2. Suppose the reinforcement learning player were greedy. Might it learn to play better, or worse, than a nongreedy player?

3. Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

4. Given a stationary policy, is it possible that if the agent is in the same state at two different time steps, it can choose two different actions? If yes, please provide an example.
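For question 1, the role of the discount factor can be checked numerically. The shortest goal-reaching path in a 100×100 grid takes 99 + 99 = 198 steps, and the return for reaching the +1 goal on step *T* is *γ*^(*T*−1). The value of `gamma` below is an arbitrary illustrative choice, not part of the problem statement:

```python
gamma = 0.99  # any value in (0, 1); 0.99 is an arbitrary illustrative choice

# (P1): discounted return for reaching the goal on step T is gamma**(T - 1),
# so a shorter episode earns a strictly larger return.
shortest = gamma ** (198 - 1)     # shortest path: 99 right + 99 up
meandering = gamma ** (1000 - 1)  # a much longer goal-reaching path
assert shortest > meandering      # returns rank policies by speed

# (P2): with gamma = 1, every policy that eventually reaches the goal
# obtains return exactly 1, so returns cannot distinguish fast from
# slow policies and there is no signal to learn from.
undiscounted_return = 1.0
```

Under discounting the return gradient rewards reaching the goal sooner, which is exactly the signal (P2) destroys.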

**Problem 4** (**30 Points**)

Consider an *undiscounted* MDP (*γ* = 1) with two non-terminal states *A*, *B* and a terminal state *C*. The transition function and reward function of the MDP are unknown. However, we have observed the following two episodes:

*A*, *a*_{1}, −1, *A*, *a*_{1}, +1, *A*, *a*_{2}, +3, *C*,

*A*, *a*_{2}, +1, *B*, *a*_{3}, +2, *C*,

where *a*_{1}, *a*_{2}, *a*_{3} are actions, and the number after each action is an immediate reward. For example, *A*, *a*_{1}, −3, *A* means that the agent took action *a*_{1} from state *A*, received an immediate reward of −3, and ended up in state *A*.
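For concreteness, the two observed episodes can be encoded as (state, action, reward, next-state) tuples, which is one convenient representation (the encoding itself is not prescribed by the problem):

```python
# Each transition: (state, action, reward, next_state), read off the episodes.
episode_1 = [("A", "a1", -1, "A"), ("A", "a1", +1, "A"), ("A", "a2", +3, "C")]
episode_2 = [("A", "a2", +1, "B"), ("B", "a3", +2, "C")]
```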

1. [9 Pts] Using a learning rate of *α* = 0.1, and assuming initial state values of 0, what updates to *V*(*A*) does the *on-line* TD(0) method make after the first episode?

3. [6 Pts] Based on your results in the previous question, solve the Bellman equation to find the state-value functions *V*_{π}(*A*) and *V*_{π}(*B*). (Assume *V*_{π}(*C*) = 0, as state *C* is the terminal state.)

4. [5 Pts] What value function would batch TD(0) find, i.e., if TD(0) were applied repeatedly to these two episodes?
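The two TD(0) variants asked about in questions 1 and 4 can be sketched directly from the episode data. The update rule is standard tabular TD(0); the batch variant below accumulates TD errors over all observed transitions with the values held fixed, then applies them once per sweep:

```python
alpha, gamma = 0.1, 1.0

# Question 1: on-line TD(0) over the first episode, V initialized to 0.
V = {"A": 0.0, "B": 0.0, "C": 0.0}
episode_1 = [("A", -1, "A"), ("A", +1, "A"), ("A", +3, "C")]
online_values = []
for s, r, s_next in episode_1:
    V[s] += alpha * (r + gamma * V[s_next] - V[s])  # update after each step
    online_values.append(V["A"])
# V(A) after each step: -0.1, then 0.0, then 0.3

# Question 4: batch TD(0) -- sweep the transitions of BOTH episodes,
# summing TD errors with V held fixed, until the values stop changing.
V = {"A": 0.0, "B": 0.0, "C": 0.0}
transitions = [("A", -1, "A"), ("A", +1, "A"), ("A", +3, "C"),
               ("A", +1, "B"), ("B", +2, "C")]
for _ in range(2000):
    delta = {s: 0.0 for s in V}
    for s, r, s_next in transitions:
        delta[s] += r + gamma * V[s_next] - V[s]  # TD error, V fixed
    for s in delta:
        V[s] += alpha * delta[s]  # apply summed increments once per sweep
# Converges to the certainty-equivalence values V(A) = 3, V(B) = 2.
```

The batch fixed point is the value function of the maximum-likelihood model implied by the five observed transitions, which is what batch TD(0) is known to converge to.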

