Proof for proposition
5 presented in paper
Proposition 5: A
solution with total reward within ε of the optimal expected reward is
attained after solving a bounded EMTDP with a non-linear solver O( log 1/ ε
) times.
Proof: Let
and
be the upper and lower
bound on the optimal expected reward maintained by the binary search
method. Thus the number of iterations
needed to guarantee that these bounds are within ε are ε >= (
-
) = ((
)/
). This implies that we achieve this precision after k =
iterations of the
binary search method, in each we have to solve a bounded EMTDP problem. This yields the O(log (1/ ε)) complexity
when referring to how the tolerance affects the run-time of the same problem
(i.e.
-
maintained constant).
Lemma 1: If in a given state
all communication of
messages of the same length have equal cost and the communication cost is a
shared cost, then changing the order of communication (e.g., whether agent A
communicates as in figure 2-b vs agent B communicates as in Figure 4-a) has no
impact on the expected reward of the optimal EMTDP policy obtained, regardless
of the number of actions of agent A vs agent B in state
.
Proof: Diagrams as given in the paper.
We prove the result by showing that the optimal policy when agent A communicates first is a lower bound on the optimal policy when agent B communicates first, and vice versa.
Suppose that the flow Xi yields the optimal policy when
agent A communicates first (as in figure 2-b). We now construct a feasible
policy for the case when agent B communicates first (as in Figure 4-a) that has
the same expected reward, which proves half of the result as the optimal policy
for figure 4-a will have a reward at least this big. Define
and
;in essence, if in Figure 2-b,
has the same flow Xi, then set the flow of
in Figure 4-a to Xi.
If we establish the above equivalences, then the following
three observations can be made. First, the expected number of times any action
is executed after communication and after no communication is
the same in Figure 2-b and Figure 4-a. Thus, there is no impact on future
states. Second, the cost of communication in the two cases becomes identical.
In particular, in Figure 2-b, given a constant cost k of communication, the
total cost of communication is: k *
/(1-
). After setting of
, the communication cost in Figure 4-a is k *
/(1-
)= k *
/(1-
).
Third, given that non-linear constraints in Figure 2-b are
satisfied, we show that non-linear constraints in Figure 4-a are also
automatically satisfied. In particular, the non-linear constraints in Figure
2-b are derived from the probability constraint that for any two states
and
, where i,j
{1
m}, and any action
where k
{1
n}:
(for i,j
{1
m})
for k1,k2
{1
n},
![]()
![]()
(as i,j
{1
m})
for
,![]()
{1
n}, ![]()
These non-linear equations imply that the constraints on
states
in Figure 4-a are automatically satisfied. Thus by setting
the quantities in Figure 4-a equal to the quantities in Figure 2-b, we have
created a feasible policy for Figure 4-a which has the same expected reward as
the optimal policy of Figure 2-b, thus the optimal policy for Figure 4-a is
better. The proof that the optimal policy when B communicates first is a lower
bound on the optimal policy when A communicates first is analogous.
Lemma 2: If in a
given state
all communication of
messages of the same length have equal cost and the communication cost is a
shared cost, then the hierarchical transformation 4-b produces the same reward
as the sequential transformation 2-b.
Proof:
2(b) EMTDP:
Sequential ( same fig as given in paper ) a1b1


4(b) EMTDP: Hierarchy
We prove the result by showing that the optimal policy when agent A communicates first is a lower bound on the optimal policy when agent A acts and then decides whether to communicate; and vice versa.
Suppose that the flow Xi yields the
optimal policy when agent A communicates first ( as in figure 2-b ). We now
construct a feasible policy for the case where agent A decides the action and
then decides the communication action (as in figure 4-b) that has the same
expected reward, which proves half of the result as the optimal policy for
figure 4-b will have a reward at least this big. Define
=
, that is if in Figure 2-b
has the flow Xi, then
set the flow of
in Figure 4-b to Xi. In
a similar way, set
.
If we establish the above
equivalences, then the following three observations can be made. First, the
expected number of times any action
is executed after
communication and after no communication is the same in Figure 2-b and Figure
4-b. Thus, there is no impact on future states. Second, the cost of
communication in the two cases becomes identical. In particular, in figure 2-b,
given a constant cost k of communication, the total cost of communication is: k
*
/(1-
). After setting of
=
, the communication cost in Figure 4-a is k *
/(1-
) = k *
/(1-
).
Third, given that non-linear
constraints in Figure 2-b are satisfied, we can show that non-linear
constraints in Figure 4-b are also automatically satisfied. In particular, the
non-linear constraints in Figure 2-b are derived from the probability
constraint that for any two states
and
, where i,j
{i
..m}, and any action
where k
{1
.n}:
![]()
= ![]()
for
,
{ 1
..n },
=
= ![]()
![]()
![]()
![]()
These non-linear equations imply
that the constraints on states
in figure 4-b are automatically satisfied. Thus by setting
quantities in figure 4-b equal to the quantities in figure 2-b, we have created
a feasible policy for figure 4-b which has the same expected reward as the
optimal policy of figure 2-b, thus the optimal policy for figure 4-b is better.
The proof, that the optimal policy when agent A acts and then decides whether
to communicate is a lower bound on the optimal policy when agent A communicates
first is analogous.
Lemma 3: If in a given state
all communication of
messages of the same length have equal cost and the communication cost is a
shared cost, then agent A choosing both agents actions and then communicating
to the other agent as in Figure 4-c has no impact on the expected reward of the
optimal EMTDP policy obtained (Figure 2-b), regardless of the number of actions
of agent A vs agent B in state
.
Proof: In the sequential EMTDP the policy involved agent A choosing its action and then either communicating its action or not. In the EMTDP with extra communication, agent A chooses its action and also agent Bs action and then communicates both of their actions. However communicating agent As action doesnt have any impact on agent B and hence the problem can as well be treated as agent A communicating Bs action. This would ensure that the communication cost is same irrespective of the sequential or the extra communication case.
To ensure that the extra
communication case gives at least the reward given by the sequential case, all
the stages where agent A communicates
, in the extra communication case it communicates
. This would ensure that the communication cost remains the
same. In the case where agent A doesnt communicate in sequential case, the
agent A doesnt communicate anything in the extra communication case too. The
agent B would therefore be in an equivalent situation in both cases and hence
the same non-linear constraints apply.
The proof for the sequential case
ensuring at least the reward given by the extra communication case is
analogous.
Lemma 4: If in a given state
all communication of messages of same length have equal cost
and the communication cost is a shared cost, then the simultaneous
transformation case as in Figure 4(d) has the same expected reward as the
reward obtained from the optimal EMTDP policy(Figure 2-b), regardless of the
number of actions of agent A vs agent B in state
.
Proof: We assume a two agent case to present our proof. The new diagrams for the two agent case would be as follows:


Figure 2(b) Figure
4(d)
In order to prove the original assertion we need to prove the following two cases:
The value of the policy generated by the
new simultaneous transformation cannot be lower than the original sequential
transformation.
Proof: Suppose the simultaneous transformation can at best give a
lower value. We now show that we can get a value equal to the original
sequential transformation, thus reaching a contradiction. Given an optimal
policy from the sequential transformation, we can derive an optimal policy for
the simultaneous transformation by deriving values for
. In particular, we set equivalence from sequential to
simultaneous as follows:
![]()
![]()
![]()
![]()
![]()
![]()
In the simultaneous case, there is
also
. To determine these, we note that:
![]()
![]()
![]()
![]()
where P11 and P12 are the probabilities for agent 1s actions, and P21 and P22 are the probabilities for agent 2s actions.
Now, we set:
![]()
![]()
![]()
![]()
The above transformations show that starting from the sequential case, we can set the policy for the simultaneous case to be identical to the sequential case.
We also show below that the non-linear constraints for the simultaneous case as shown in Figure 4-d are satisfied-
Consider
, the first equation given in the figure 4-d.
![]()
![]()
![]()
![]()
The same kind of equivalence can be shown for the other two non-linear constraints present in Figure 4-d.
The value of the policy generated by the sequential
transformation cannot be lower than the new simultaneous transformation.
Suppose the sequential transformation can at best give a lower value. We now show that we can get a value equal to the new simultaneous transformation, thus reaching a contradiction.
We derive the
values for the sequential transformation given an optimal
simultaneous transformation in the following manner:
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
We can also show that
using
similar to the way the
vice versa case was shown above.
Hence, we proved that simultaneous
transformation is equivalent to the sequential transformation.