COM-MTDP Example

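Much of the following is an excerpt from the COM-MTDP JAIR article, providing additional details about the helicopter domain, the COM-MTDP model of its states and observability, and the implemented policies.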


Helicopter Domain

Consider two helicopters that must fly across enemy territory to their destination, as illustrated in Figure 1. The first, piloted by agent Transport, is a transport vehicle with limited firepower. The second, piloted by agent Escort, is an escort vehicle with significant firepower. Somewhere along their path is an enemy radar unit, but its location is unknown (a priori) to the agents. Escort is capable of destroying the radar unit upon encountering it. Transport is not, but it can escape detection by the radar unit by traveling at a very low altitude (nap-of-the-earth flight), though at a lower speed than at its typical, higher altitude. In this scenario, Escort does not worry about detection, given its superior firepower; therefore, it flies at a fast speed at its typical altitude.

Figure 1: Illustration of helicopter team scenario.

The two agents form a top-level joint commitment, G_D, to reach their destination. There is no incentive for the agents to communicate the achievement of this goal, since they will both eventually reach their destination with certainty. However, in the service of their top-level goal, G_D, the two agents also adopt a joint commitment, G_R, of destroying the radar unit. We consider here the problem facing Escort with respect to communicating the achievement of goal G_R. If Escort communicates the achievement of G_R, then Transport knows that it is safe to fly at its normal altitude (thus reaching the destination sooner). If Escort does not communicate the achievement of G_R, there is still some chance that Transport will observe the event anyway. If Transport does not observe the achievement of G_R, then it must fly nap-of-the-earth the whole distance, and the team receives a lower reward because of the later arrival. Therefore, Escort must weigh the increase in expected reward against the cost of communication.


COM-MTDP Model of States

In the COM-MTDP model of this scenario (presented in Figures 2, 3, and 4), the world state is the position (along a straight line between origin and destination) of Transport, Escort, and the enemy radar. The enemy radar is at a randomly selected position somewhere between the agents' initial position and their destination. Transport has no possible communication actions, but it can choose between two domain-level actions: flying nap-of-the-earth and flying at its normal speed and altitude. Escort has two domain-level actions: flying at its normal speed and destroying the radar. Escort also has the option of communicating the special message, σ_GR, indicating that the radar has been destroyed. In the tables of Figures 2, 3, and 4, the "·" symbol represents a wild-card (or "don't care") entry.

α = {Escort (E), Transport (T)}
S = Ξ_E × Ξ_T × Ξ_R
    Position of Escort:    Ξ_E = {0, 1, ..., 8, 9, Destination}
    Position of Transport: Ξ_T = {0, 0.5, ..., 9, 9.5, Destination, Destroyed}
    Position of Radar:     Ξ_R = {1, 2, ..., 8, Destroyed}
A_α = A_E × A_T = {fly, destroy, wait} × {fly-NOE, fly-normal, wait}
Σ_α = Σ_E × Σ_T = {clear (σ_GR), null} × {null}
R_A(⟨ξ_E, ξ_T, ξ_R⟩, a) =

    ξ_E          ξ_T                       a    R_A
    0,...,9      0,...,9.5,Destroyed       ·    0
    0,...,9      Destination               ·    r_T
    Destination  0,...,9.5,Destroyed       ·    r_E
    Destination  Destination               ·    r_E + r_T

R_Σ(s, ⟨null, null⟩) = 0
R_Σ(s, ⟨σ_GR, null⟩) = -r_Σ, with r_Σ ∈ [0,1]
Figure 2: COM-MTDP model of states, actions, and rewards for helicopter scenario.
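
The state space and reward functions of Figure 2 can be written out directly. The following is a minimal Python sketch, not the authors' implementation: the function names are hypothetical, the default values for r_E, r_T, and the communication cost are placeholders chosen only for illustration, and the joint message is reduced to Escort's component because Transport can only send null.

```python
from itertools import product

DEST = "Destination"
DESTROYED = "Destroyed"

# Position sets as listed in Figure 2.
ESCORT_POSITIONS = list(range(10)) + [DEST]
TRANSPORT_POSITIONS = [i / 2 for i in range(20)] + [DEST, DESTROYED]
RADAR_POSITIONS = list(range(1, 9)) + [DESTROYED]

# World states are triples (escort position, transport position, radar position).
STATES = list(product(ESCORT_POSITIONS, TRANSPORT_POSITIONS, RADAR_POSITIONS))


def domain_reward(state, r_escort=1.0, r_transport=1.0):
    """Domain-level reward table of Figure 2; it ignores the joint action.

    r_escort and r_transport stand in for r_E and r_T; the defaults are
    illustrative assumptions, not values taken from the article.
    """
    escort_pos, transport_pos, _radar_pos = state
    reward = 0.0
    if escort_pos == DEST:
        reward += r_escort
    if transport_pos == DEST:
        reward += r_transport
    return reward


def communication_reward(escort_message, r_sigma=0.1):
    """Communication reward: -r_sigma for the 'clear' message, 0 for null.

    r_sigma stands in for the communication cost in [0, 1]; the default
    is an illustrative assumption.
    """
    return -r_sigma if escort_message == "clear" else 0.0
```

Under these definitions the model has 11 × 22 × 9 = 2178 world states, and the team collects reward only for helicopters that are already at the destination.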

Figure 3: COM-MTDP model of transition probabilities for helicopter scenario (excludes zero probability rows).


COM-MTDP Model of Observability

Figure 4: COM-MTDP model of observability for helicopter scenario. These tables exclude both zero probability rows and input feature columns from which O is independent. For example, both agents' observation functions are independent of the transport's selected action, so neither table includes an a_T column.

If Escort arrives at the radar, then it observes its presence with certainty and can destroy it to achieve G_R. The likelihood of Transport's observing the radar's destruction is a function of its distance from the radar. We can vary this function's observability parameter (λ in Figure 4) within the range [0,1] to generate distinct domain configurations (0 means that Transport will never observe the radar's destruction; 1 means Transport will always observe it). If the observability is 1, then the agents achieve mutual belief of the achievement of G_R as soon as it occurs (following the argument presented in Section 4.1). However, for any observability less than 1, there is a chance that the agents will not achieve mutual belief simply by common observation. The helicopters receive a fixed reward for each time step spent at their destination. Thus, for a fixed time horizon, the earlier the helicopters arrive, the greater the team's reward. Since flying nap-of-the-earth is slower than normal speed, Transport will switch to its normal flying as soon as it either observes that G_R has been achieved or Escort sends the message, σ_GR. Sending the message is not free, so we impose a variable communication cost (r_Σ in Figure 2), also within the range [0,1].
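
To make the link between arrival time and reward concrete, here is a small Python sketch. It rests on assumptions that are not stated in the text: nap-of-the-earth flight covers 0.5 position units per time step and normal flight covers 1 unit (read off the position grids in Figure 2), the destination lies at position 10, and the horizon and per-step reward are arbitrary illustrative values. All function names are hypothetical.

```python
def steps_to_destination(start, switch_step, dest=10.0,
                         noe_speed=0.5, normal_speed=1.0):
    """Count the time steps Transport needs to reach the destination.

    Transport flies nap-of-the-earth until switch_step, then at normal
    speed. switch_step=None means it never learns that the radar is gone.
    The speeds and destination position are assumptions for illustration.
    """
    position, step = start, 0
    while position < dest:
        still_noe = switch_step is None or step < switch_step
        position += noe_speed if still_noe else normal_speed
        step += 1
    return step


def transport_reward(arrival_step, horizon, r_transport=1.0):
    """Fixed reward for each remaining time step spent at the destination."""
    return r_transport * max(0, horizon - arrival_step)
```

With a hypothetical horizon of 25 steps, a transport that switches to normal flight after 4 steps arrives at step 12 and earns 13 units of reward, whereas one that never switches arrives at step 20 and earns only 5. That difference is what Escort weighs against the communication cost.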


Implemented Policies

We constructed COM-MTDP models of this scenario for each combination of observability and communication cost within the range [0,1] at 0.1 increments. For each combination, we applied the Jennings and STEAM policies, as well as a completely silent policy. For this domain, the policy, π_αΣ^J, dictates that Escort always communicate σ_GR upon destroying the radar. For STEAM, we vary the τ and C_c parameters with the observability and communication cost parameters, respectively. We used two different settings (low and medium) for the cost of miscoordination, C_mt. Following the published STEAM algorithm [Tambe, 1997], Escort sends message σ_GR if and only if STEAM's inequality, τ·C_mt > C_c, holds. Thus, the two different settings, low and medium, for C_mt generate two distinct communication policies; the high setting is strictly dominated by the other two settings in this domain.
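
The STEAM decision rule and the parameter sweep described above can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the text says only that the STEAM parameters vary with observability and communication cost, so the mapping of the first parameter to one minus the observability is a guess made here, the communication-cost parameter is set equal to the domain's communication cost, and the miscoordination cost of 0.5 used in the note below is a placeholder rather than the article's low or medium setting.

```python
def steam_communicates(tau, c_mt, c_c):
    """STEAM's inequality: communicate iff tau * C_mt > C_c."""
    return tau * c_mt > c_c


# Observability and communication cost in [0, 1] at 0.1 increments.
GRID = [i / 10 for i in range(11)]


def steam_policy_table(c_mt):
    """Map each (observability, communication cost) pair to STEAM's decision.

    Assumption: tau = 1 - observability, i.e. the chance that Transport
    misses the radar's destruction. This mapping is hypothetical.
    """
    table = {}
    for observability in GRID:
        for comm_cost in GRID:
            tau = 1.0 - observability
            table[(observability, comm_cost)] = steam_communicates(
                tau, c_mt, comm_cost)
    return table


def jennings_communicates():
    """The always-communicate policy: send the message unconditionally."""
    return True


def silent_communicates():
    """The completely silent policy: never send the message."""
    return False
```

For example, `steam_policy_table(0.5)` covers all 121 configurations; under the assumed mapping, Escort communicates at observability 0.0 and cost 0.4 but stays silent once the cost reaches 0.5.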

We also constructed and evaluated locally and globally optimal policies. Under domain configurations with high observability, the globally optimal policy has the escort wait an additional time step after destroying the radar and then communicate only if the transport continues flying nap-of-the-earth. The escort cannot directly observe which method of flight the transport has chosen, but it can measure the change in the transport's position (since it maintains a history of its past observations) and thus infer the method of flight with complete accuracy. In a sense, the escort following the globally optimal policy is performing plan recognition to analyze the transport's possible beliefs. It is particularly noteworthy that our domain specification does not explicitly encode this recognition capability. In fact, our algorithm for finding the globally optimal policy does not even make any of the assumptions made by our locally optimal policy (i.e., a single agent is deciding whether to communicate or not, regarding a single message, at a single point in time); rather, our general-purpose search algorithm traverses the policy space and "discovers" this possible means of inference on its own. We expect that such COM-MTDP analysis can provide an automatic method for discovering novel communication policies of this type in other domains, even those modeling real-world problems.
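
The inference step described above, recovering the transport's flight method from the change in its observed position, can be sketched in a few lines of Python. The threshold of 1 position unit per step is an assumption that mirrors the "<1" and "≥1" entries in Figure 5; the function names and the history representation are hypothetical, and the rule is hand-written here, whereas the article's search algorithm finds such behavior on its own.

```python
def infer_flight_mode(prev_position, curr_position, normal_step=1.0):
    """Infer Transport's flight method from its change in position.

    Assumption: normal flight advances at least normal_step units per
    time step, and nap-of-the-earth flight advances less.
    """
    delta = curr_position - prev_position
    return "fly-normal" if delta >= normal_step else "fly-NOE"


def delayed_communicate(transport_history):
    """Decide whether Escort sends the message one step after the kill.

    transport_history holds Escort's observations of Transport's position,
    ending with the observation made one step after the radar's destruction.
    Escort communicates only if Transport is still flying nap-of-the-earth.
    """
    if len(transport_history) < 2:
        return False  # not enough history yet to measure a change
    mode = infer_flight_mode(transport_history[-2], transport_history[-1])
    return mode == "fly-NOE"
```

If the transport moved from 3.0 to 4.0, it has evidently seen the radar's destruction and no message is needed; if it moved only from 2.5 to 3.0, the escort pays the communication cost.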

Figure 5: The candidate policies, as they vary across domains and situations.
Policy columns of the figure: Jennings, STEAM (C_mt = low), Locally Optimal, Globally Optimal. Their entries did not survive in this text version; the rows below list the domain configurations and situations that the figure distinguishes.

Obs. (λ)  Comm. Cost (r_Σ)  Transport (Ξ_T)  ΔΞ_T
0.0       0.0
0.0       0.1
0.0       0.2
0.0       0.3
0.0       0.4
0.0       0.5               <4.0
0.0       0.5               ≥4.0
0.0       0.6               <3.0
0.0       0.6               ≥3.0
0.0       0.7               <2.0
0.0       0.7               ≥2.0
0.0       0.8               <1.0
0.0       0.8               ≥1.0
0.0       0.9
0.0       1.0
0.1       0.0
0.1       0.1                                <1
0.1       0.1                                ≥1
0.1       0.2                                <1
0.1       0.2                                ≥1
0.1       0.3                                <1
0.1       0.3                                ≥1
0.1       0.4                                <1
0.1       0.4                                ≥1
0.1       0.5               <4.0             <1
0.1       0.5               <4.0             ≥1
0.1       0.5               ≥4.0             <1
0.1       0.5               ≥4.0             ≥1
0.1       0.6               <3.0
0.1       0.6               ≥3.0
0.1       0.7               <2.0
0.1       0.7               ≥2.0
0.1       0.8               <1.0
0.1       0.8               ≥1.0
0.1       0.9
0.1       1.0
0.2       0.0
0.2       0.1                                <1
0.2       0.1                                ≥1
0.2       0.2                                <1
0.2       0.2                                ≥1
0.2       0.3                                <1
0.2       0.3                                ≥1
0.2       0.4                                <1
0.2       0.4                                ≥1
0.2       0.5               <4.0             <1
0.2       0.5               <4.0             ≥1
0.2       0.5               ≥4.0             <1
0.2       0.5               ≥4.0             ≥1
0.2       0.6               <3.0
0.2       0.6               ≥3.0
0.2       0.7               <2.0
0.2       0.7               ≥2.0
0.2       0.8               <1.0
0.2       0.8               ≥1.0
0.2       0.9
0.2       1.0
0.3       0.0
0.3       0.1                                <1
0.3       0.1                                ≥1
0.3       0.2                                <1
0.3       0.2                                ≥1
0.3       0.3                                <1
0.3       0.3                                ≥1
0.3       0.4                                <1
0.3       0.4                                ≥1
0.3       0.5               <4.0             <1
0.3       0.5               <4.0             ≥1
0.3       0.5               ≥4.0             <1
0.3       0.5               ≥4.0             ≥1
0.3       0.6               <3.0
0.3       0.6               ≥3.0
0.3       0.7               <2.0
0.3       0.7               ≥2.0
0.3       0.8
0.3       0.9
0.3       1.0
0.4       0.0
0.4       0.1                                <1
0.4       0.1                                ≥1
0.4       0.2                                <1
0.4       0.2                                ≥1
0.4       0.3                                <1
0.4       0.3                                ≥1
0.4       0.4                                <1
0.4       0.4                                ≥1
0.4       0.5               <4.0             <1
0.4       0.5               <4.0             ≥1
0.4       0.5               ≥4.0             <1
0.4       0.5               ≥4.0             ≥1
0.4       0.6               <3.0
0.4       0.6               ≥3.0
0.4       0.7               ≤1.0
0.4       0.7               >1.0
0.4       0.8
0.4       0.9
0.4       1.0
0.5       0.0
0.5       0.1                                <1
0.5       0.1                                ≥1
0.5       0.2                                <1
0.5       0.2                                ≥1
0.5       0.3                                <1
0.5       0.3                                ≥1
0.5       0.4                                <1
0.5       0.4                                ≥1
0.5       0.5               <4.0
0.5       0.5               ≥4.0
0.5       0.6               ≤2.0
0.5       0.6               >2.0
0.5       0.7
0.5       0.8
0.5       0.9
0.5       1.0
0.6       0.0
0.6       0.1
0.6       0.2
0.6       0.3
0.6       0.4
0.6       0.5               <4.0
0.6       0.5               ≥4.0
0.6       0.6
0.6       0.7
0.6       0.8
0.6       0.9
0.6       1.0
0.7       0.0
0.7       0.1
0.7       0.2
0.7       0.3
0.7       0.4
0.7       0.5               <1.0
0.7       0.5               ≥1.0
0.7       0.6
0.7       0.7
0.7       0.8
0.7       0.9
0.7       1.0
0.8       0.0
0.8       0.1
0.8       0.2
0.8       0.3
0.8       0.4
0.8       0.5
0.8       0.6
0.8       0.7
0.8       0.8
0.8       0.9
0.8       1.0
0.9       0.0
0.9       0.1
0.9       0.2               ≤3.0
0.9       0.2               >3.0
0.9       0.3
0.9       0.4
0.9       0.5
0.9       0.6
0.9       0.7
0.9       0.8
0.9       0.9
0.9       1.0
1.0       0.0
1.0       0.1
1.0       0.2
1.0       0.3
1.0       0.4
1.0       0.5
1.0       0.6
1.0       0.7
1.0       0.8
1.0       0.9
1.0       1.0

References

[Tambe, 1997]
Tambe, M. 1997. Towards flexible teamwork. Journal of Artificial Intelligence Research, 7, 83-124.




File translated from TEX by TTH, version 3.11.
On 27 Jun 2002, 14:29.