Proof for proposition 5 presented in paper–

 

Proposition 5: A solution with total reward within ε of the optimal expected reward is attained after solving a bounded EMTDP with a non-linear solver O( log 1/ ε ) times.

 

Proof:  Let  and  be the upper and lower bound on the optimal expected reward maintained by the binary search method.  Thus the number of iterations needed to guarantee that these bounds are within ε are ε >= (-) = (()/). This implies that we achieve this precision after k =  iterations of the binary search method, in each we have to solve a bounded EMTDP problem.  This yields the O(log (1/ ε)) complexity when referring to how the tolerance affects the run-time of the same problem (i.e. - maintained constant).

 

Lemma 1:  If in a given state  all communication of messages of the same length have equal cost and the communication cost is a shared cost, then changing the order of communication (e.g., whether agent A communicates as in figure 2-b vs agent B communicates as in Figure 4-a) has no impact on the expected reward of the optimal EMTDP policy obtained, regardless of the number of actions of agent A vs agent B in state .

 

Proof:  Diagrams as given in the paper.

 

We prove the result by showing that the optimal policy when agent A communicates first is a lower bound on the optimal policy when agent B communicates first, and vice versa.

Suppose that the flow Xi yields the optimal policy when agent A communicates first (as in figure 2-b). We now construct a feasible policy for the case when agent B communicates first (as in Figure 4-a) that has the same expected reward, which proves half of the result as the optimal policy for figure 4-a will have a reward at least this big. Define  and ;in essence, if in Figure 2-b, has the same flow Xi, then set the flow of in Figure 4-a to Xi.

If we establish the above equivalences, then the following three observations can be made. First, the expected number of times any action is executed after communication and after no communication is the same in Figure 2-b and Figure 4-a. Thus, there is no impact on future states. Second, the cost of communication in the two cases becomes identical. In particular, in Figure 2-b, given a constant cost k of communication, the total cost of communication is: k * /(1-). After setting of , the communication cost in Figure 4-a is k * /(1-)= k * /(1-).

Third, given that non-linear constraints in Figure 2-b are satisfied, we show that non-linear constraints in Figure 4-a are also automatically satisfied. In particular, the non-linear constraints in Figure 2-b are derived from the probability constraint that for any two states and , where i,j {1…m}, and any action  where k{1…n}:

 

 

                 (for i,j {1…m})

       for k1,k2  {1…n},

 

      

                 (as i,j {1…m})

       for ,{1…n},

 

These non-linear equations imply that the constraints on states in Figure 4-a are automatically satisfied. Thus by setting the quantities in Figure 4-a equal to the quantities in Figure 2-b, we have created a feasible policy for Figure 4-a which has the same expected reward as the optimal policy of Figure 2-b, thus the optimal policy for Figure 4-a is better. The proof that the optimal policy when B communicates first is a lower bound on the optimal policy when A communicates first is analogous.

 

Lemma 2: If in a given state  all communication of messages of the same length have equal cost and the communication cost is a shared cost, then the hierarchical transformation 4-b produces the same reward as the sequential transformation 2-b.

 

Proof:                                                             

2(b) EMTDP: Sequential ( same fig as given in paper )

 

a1b1

 

4(b) EMTDP: Hierarchy

 
 

 

 

 

 


We prove the result by showing that the optimal policy when agent A communicates first is a lower bound on the optimal policy when agent A acts and then decides whether to communicate; and vice versa.

 

Suppose that the flow Xi yields the optimal policy when agent A communicates first ( as in figure 2-b ). We now construct a feasible policy for the case where agent A decides the action and then decides the communication action (as in figure 4-b) that has the same expected reward, which proves half of the result as the optimal policy for figure 4-b will have a reward at least this big. Define  = , that is if in Figure 2-b  has the flow Xi, then set the flow of  in Figure 4-b to Xi. In a similar way, set.

 

If we establish the above equivalences, then the following three observations can be made. First, the expected number of times any action  is executed after communication and after no communication is the same in Figure 2-b and Figure 4-b. Thus, there is no impact on future states. Second, the cost of communication in the two cases becomes identical. In particular, in figure 2-b, given a constant cost k of communication, the total cost of communication is: k * /(1-). After setting of =, the communication cost in Figure 4-a is k * /(1-) = k * /(1-).

 

Third, given that non-linear constraints in Figure 2-b are satisfied, we can show that non-linear constraints in Figure 4-b are also automatically satisfied. In particular, the non-linear constraints in Figure 2-b are derived from the probability constraint that for any two states and , where i,j {i…..m}, and any action  where k  {1….n}:

 

                      

                  =

                  for , { 1…..n },

= =

                 

                 

                 

 

These non-linear equations imply that the constraints on states in figure 4-b are automatically satisfied. Thus by setting quantities in figure 4-b equal to the quantities in figure 2-b, we have created a feasible policy for figure 4-b which has the same expected reward as the optimal policy of figure 2-b, thus the optimal policy for figure 4-b is better. The proof, that the optimal policy when agent A acts and then decides whether to communicate is a lower bound on the optimal policy when agent A communicates first is analogous.

 

Lemma 3: If in a given state  all communication of messages of the same length have equal cost and the communication cost is a shared cost, then agent A choosing both agents actions and then communicating to the other agent as in Figure 4-c has no impact on the expected reward of the optimal EMTDP policy obtained (Figure 2-b), regardless of the number of actions of agent A vs agent B in state.

Proof: In the sequential EMTDP the policy involved agent A choosing its action and then either communicating its action or not. In the EMTDP with extra communication, agent A chooses its action and also agent B’s action and then communicates both of their actions. However communicating agent A’s action doesn’t have any impact on agent B and hence the problem can as well be treated as agent A communicating B’s action. This would ensure that the communication cost is same irrespective of the sequential or the extra communication case.

To ensure that the extra communication case gives at least the reward given by the sequential case, all the stages where agent A communicates, in the extra communication case it communicates. This would ensure that the communication cost remains the same. In the case where agent A doesn’t communicate in sequential case, the agent A doesn’t communicate anything in the extra communication case too. The agent B would therefore be in an equivalent situation in both cases and hence the same non-linear constraints apply.

The proof for the sequential case ensuring at least the reward given by the extra communication case is analogous.

 

Lemma 4: If in a given state all communication of messages of same length have equal cost and the communication cost is a shared cost, then the simultaneous transformation case as in Figure 4(d) has the same expected reward as the reward obtained from the optimal EMTDP policy(Figure 2-b), regardless of the number of actions of agent A vs agent B in state .

Proof: We assume a two agent case to present our proof. The new diagrams for the two agent case would be as follows:

 

Figure 2(b)                                                                                                      Figure 4(d)

 

In order to prove the original assertion we need to prove the following two cases:

 

The value of the policy generated by the new simultaneous transformation cannot be lower than the original sequential transformation.

 

Proof: Suppose the simultaneous transformation can at best give a lower value. We now show that we can get a value equal to the original sequential transformation, thus reaching a contradiction. Given an optimal policy from the sequential transformation, we can derive an optimal policy for the simultaneous transformation by deriving values for. In particular, we set equivalence from sequential to simultaneous as follows:

 

                      

                      

                      

                      

                              

                      

                      

 

In the simultaneous case, there is also . To determine these, we note that:

                      

                      

                      

                      

                      

 

where P11 and P12 are the probabilities for agent 1’s actions, and P21 and P22 are the probabilities for agent 2’s actions.

Now, we set:

                      

                      

                      

                      

                      

 

The above transformations show that starting from the sequential case, we can set the policy for the simultaneous case to be identical to the sequential case.

We also show below that the non-linear constraints for the simultaneous case as shown in Figure 4-d are satisfied-

 

Consider, the first equation given in the figure 4-d.

 

 

The same kind of equivalence can be shown for the other two non-linear constraints present in Figure 4-d.

 

The value of the policy generated by the sequential transformation cannot be lower than the new simultaneous transformation.

 

Suppose the sequential transformation can at best give a lower value. We now show that we can get a value equal to the new simultaneous transformation, thus reaching a contradiction.

 

We derive the values for the sequential transformation given an optimal simultaneous transformation in the following manner:

 

                      

                      

                      

                      

                      

                      

                      

                      

 

We can also show that  using  similar to the way the vice versa case was shown above.

 

Hence, we proved that simultaneous transformation is equivalent to the sequential transformation.