
Question

Consider an undiscounted MDP having three states (1, 2, 3), with rewards -1, -2, and 0 respectively. State 3 is a terminal (sink) state. In states 1 and 2 there are two possible actions: a and b. The transition model is as follows: in state 1, action a moves the agent to state 2 with probability 0.8 and makes the agent stay put with probability 0.2; in state 2, action a moves the agent to state 1 with probability 0.8 and makes the agent stay put with probability 0.2; in either state 1 or state 2, action b moves the agent to state 3 with probability 0.1 and makes the agent stay put with probability 0.9. For this MDP, how many possible value functions are there? How many possible policies? Assuming that V_0(s) of each state is just its immediate reward, show one iteration of value iteration; specifically, compute V_1(1), V_1(2), and V_1(3).

Explanation / Answer

Because the agent pays a cost for every time step it spends in state 1 or state 2, it wants to reach state 3 as soon as possible. The drawback is that the only action that reaches state 3 (action b) succeeds with low probability, so the agent should try to minimize the cost it incurs while waiting to reach the terminal state. This suggests that the agent should definitely take action b in state 1; in state 2, it might be better to take action a to get to state 1 (which is the better place to wait for admission to state 3) instead of aiming directly for state 3. The decision in state 2 involves a numerical tradeoff, which the alternating steps below resolve.
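Before those steps, note that the question also asks for a single sweep of value iteration starting from V_0 equal to the immediate rewards. A minimal sketch of that one Bellman backup is given below; the dictionary encoding of the transition model and the function name are my own illustration, not part of the original answer.

```python
# A minimal sketch of one value-iteration sweep for this MDP, assuming the
# undiscounted update V1(s) = R(s) + max_a sum_s' T(s, a, s') * V0(s').
REWARD = {1: -1.0, 2: -2.0, 3: 0.0}
TRANS = {                       # TRANS[s][action] = {next_state: probability}
    1: {'a': {2: 0.8, 1: 0.2}, 'b': {3: 0.1, 1: 0.9}},
    2: {'a': {1: 0.8, 2: 0.2}, 'b': {3: 0.1, 2: 0.9}},
}

def bellman_sweep(V):
    """One backup over the non-terminal states; state 3 keeps value 0."""
    V_new = {3: 0.0}
    for s, actions in TRANS.items():
        V_new[s] = REWARD[s] + max(
            sum(p * V[s2] for s2, p in outcomes.items())
            for outcomes in actions.values()
        )
    return V_new

V0 = dict(REWARD)               # V_0 is the immediate reward, as the question states
V1 = bellman_sweep(V0)
print(V1)                       # V1(1) = -1.9, V1(2) = -3.2, V1(3) = 0 (up to float rounding)
```

Under this update, V_1(1) = -1 + max(-1.8, -0.9) = -1.9, V_1(2) = -2 + max(-1.2, -1.8) = -3.2, and V_1(3) = 0 because state 3 is terminal.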

The alternating value-determination and policy-update steps are as follows:

• Initialization: U ← ⟨-1, -2, 0⟩, P ← ⟨b, b⟩.

• Value determination: u1 = -1 + 0.1 u3 + 0.9 u1, u2 = -2 + 0.1 u3 + 0.9 u2, u3 = 0. Solving these gives u1 = -10 and u2 = -20.
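Substituting u3 = 0, these equations reduce to 0.1 u1 = -1 and 0.1 u2 = -2. A quick check with a linear solver (a sketch; the array layout and variable names are mine):

```python
import numpy as np

# Value determination for the initial policy <b, b>, with u3 = 0 substituted:
#   u1 = -1 + 0.9*u1  ->  0.1*u1 = -1
#   u2 = -2 + 0.9*u2  ->  0.1*u2 = -2
A = np.array([[0.1, 0.0],
              [0.0, 0.1]])
b = np.array([-1.0, -2.0])
u1, u2 = np.linalg.solve(A, b)
print(u1, u2)   # -10.0 -20.0
```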

• Policy update: In state 1, Σ_j T(1, a, j) u_j = 0.8×(-20) + 0.2×(-10) = -18, while Σ_j T(1, b, j) u_j = 0.1×0 + 0.9×(-10) = -9, so action b is still preferred for state 1. In state 2, Σ_j T(2, a, j) u_j = 0.8×(-10) + 0.2×(-20) = -12, while Σ_j T(2, b, j) u_j = 0.1×0 + 0.9×(-20) = -18, so the preferable action for state 2 is now a. The policy changes to ⟨b, a⟩, so we proceed with another round.
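The same comparison can be expressed as a small loop over the expected values Σ_j T(s, a, j) u_j; the dictionary encoding below is my own, repeated so the snippet stands alone:

```python
# Policy-update step under the values u1 = -10, u2 = -20, u3 = 0.
TRANS = {                       # TRANS[s][action] = {next_state: probability}
    1: {'a': {2: 0.8, 1: 0.2}, 'b': {3: 0.1, 1: 0.9}},
    2: {'a': {1: 0.8, 2: 0.2}, 'b': {3: 0.1, 2: 0.9}},
}
U = {1: -10.0, 2: -20.0, 3: 0.0}

for s, actions in TRANS.items():
    expected = {a: sum(p * U[s2] for s2, p in outcomes.items())
                for a, outcomes in actions.items()}
    best = max(expected, key=expected.get)
    print(s, expected, '-> choose', best)
# state 1: {'a': -18.0, 'b': -9.0} -> choose b
# state 2: {'a': -12.0, 'b': -18.0} -> choose a
```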

• Value determination (for the new policy ⟨b, a⟩): u1 = -1 + 0.1 u3 + 0.9 u1, u2 = -2 + 0.8 u1 + 0.2 u2, u3 = 0. Solving these gives u1 = -10 and u2 = -12.5.
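Again substituting u3 = 0 gives a 2×2 linear system; a quick check in the same illustrative style as before:

```python
import numpy as np

# Value determination for the updated policy <b, a>, with u3 = 0 substituted:
#   u1 = -1 + 0.9*u1            ->   0.1*u1           = -1
#   u2 = -2 + 0.8*u1 + 0.2*u2   ->  -0.8*u1 + 0.8*u2  = -2
A = np.array([[ 0.1, 0.0],
              [-0.8, 0.8]])
b = np.array([-1.0, -2.0])
u1, u2 = np.linalg.solve(A, b)
print(u1, u2)   # -10.0 -12.5
```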

• Policy update: In state 1, Σ_j T(1, a, j) u_j = 0.8×(-12.5) + 0.2×(-10) = -12, while Σ_j T(1, b, j) u_j = 0.1×0 + 0.9×(-10) = -9, so action b is still preferred for state 1. In state 2, Σ_j T(2, a, j) u_j = 0.8×(-10) + 0.2×(-12.5) = -10.5, while Σ_j T(2, b, j) u_j = 0.1×0 + 0.9×(-12.5) = -11.25, so action a is still preferred for state 2. The policy is unchanged from the previous iteration, so we terminate with policy ⟨b, a⟩ and values u1 = -10, u2 = -12.5.
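Putting the two alternating steps together, a compact policy-iteration loop for this MDP could look like the sketch below; the matrix encoding, variable names, and loop structure are my own, but it should reproduce the fixed point ⟨b, a⟩ with u1 = -10 and u2 = -12.5.

```python
import numpy as np

R = np.array([-1.0, -2.0])      # rewards of the non-terminal states 1, 2
T = {                           # T[action][i] = transition row of state i+1 over states 1, 2, 3
    'a': np.array([[0.2, 0.8, 0.0],
                   [0.8, 0.2, 0.0]]),
    'b': np.array([[0.9, 0.0, 0.1],
                   [0.0, 0.9, 0.1]]),
}

policy = ['b', 'b']             # start from the initial policy <b, b>
while True:
    # Value determination: solve u = R + P_pi u over states 1 and 2 (u3 = 0).
    P = np.array([T[policy[i]][i, :2] for i in range(2)])
    u = np.linalg.solve(np.eye(2) - P, R)
    # Policy update: pick the action with the highest expected value.
    full_u = np.append(u, 0.0)
    new_policy = [max('ab', key=lambda a: T[a][i] @ full_u) for i in range(2)]
    if new_policy == policy:    # unchanged policy: converged
        break
    policy = new_policy

print(policy, u)                # ['b', 'a'] [-10.  -12.5]
```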