Things Linux: A Framework for Distributed Agent Based Model Development

Note: This is a longish article, originally published a year ago in Linux Journal. It includes a brief description of agent-based modeling, and then it introduces ABM++, a framework for developing distributed ABMs on Linux clusters.

--Doug
__________________

Introduction

What is agent-based modeling, otherwise known as ABM? What is distributed computing? Why do it?

These are all easily answered questions, right? Surprisingly, not really, even after all the years that this technology has been in use. There are still a whole lot of software developers out there today implementing applications using purely procedural methodologies and languages like C, FORTRAN, PASCAL, Visual Basic, etc. Worse, there are plenty of people using languages that are capable of good object oriented implementations, like C++ and Java, but they are using the language to only implement procedural designs. These people, for whatever reason, have simply not become aware of the advantages provided by ABM software methodologies.

We won't go into those advantages here because we do not wish to initiate one of those ABM software design Holy Wars (although, we might suggest that procedural software engineering approaches are artifacts left over from the Dark Ages, with current practitioners strongly resembling critters last seen roaming the Earth during the late Jurassic period).
But we certainly don't want to start any flame wars.

Rather, we will simply suggest that ABM methodologies represent current best practices for a large range of current-day software design challenges and then proceed on to the meat of this article, which discusses distributed agent-based modeling.

What is ABM Methodology?

What is agent-based modeling? The answer is incredibly simple, yet I invariably find myself frustrated at the inability of some hard-core died-in-the-wool procedural developers to grasp the concept. Here's the whole kernel of ABM methodology in a nutshell:

An agent-based model design is one in which analogs of those real-world entities that are to be modeled are represented as software agents, or objects, at a level of detail and resolution necessary to address the questions the model is required to answer. The agents interact with each other as the model runs, producing dynamic information about the system being simulated.

Some ABM practitioners might quibble with the distinctions between "software agent" and "software object"; if so I encourage them to present their quibbles in the comments section following this article. These quibbles, however, are not important to the thrust of this article, which is how to do develop agent based models in a distributed computing environment.

Now, let's illustrate what this means in the context of a real ABM implementation. The example we will use here is EpiSims, a large-scale distributed epidemiological simulation. EpiSims was designed and implemented by my group at Los Alamos National Laboratory during the mid 1990s. Its purpose is to simulate the spread of infectious disease in large urban populations, and to allow the analyst to evaluate the impact of various proposed intervention strategies to slow or halt the spread of disease. EpiSims is still in use at various locations around the United States.
So, what are the software agents in EpiSims? They are

Person agents,

Location agents, places where people go throughout the course of a day and come into contact with each other. There are various subclasses of location agents: homes, schools, workplaces, hospitals, shopping centers, mass transit, etc., and

A disease agent.

That's it. Those are the agents. Infected person agents carry an instance of a disease agent along with them that defines how disease is transmitted to other person agents when contact occurs at a location, as the simulated people conduct their daily activities. The disease agent also controls the progression of the disease in its Person-agent host.

Person agents are characterized by a small set of demographic information such as age, gender, family structure, ethnicity, and household income which have been identified as necessary for conducting effective disease intervention analysis.

Distributed ABMs

Why go to the trouble to develop a distributed agent based model? The short answer is that there are many problems that are too big to be solved in a serial computing environment. Again using EpiSims as an example: the city of Chicago has a population of approximately 8.6 million people, and over 2 million households. That's a lot of software agents. The memory requirement represented by this ABM design is too large for most single-CPU environments. The computational requirement likewise cannot be supplied by a single CPU.

The question then becomes: how best to develop an ABM design that can take advantage of the hundreds-to-thousands of processors and large amounts of distributed memory that today's Linux clusters provide?

It turns out that there are a some rules of thumb to consider when developing a distributed design. Here are a few of them:

Distribute your ABM as evenly as possible. You don't want underutilized nor over-utilized compute nodes in your distributed runs. Your run will ultimately be limited to the performance of the slowest node in your run configuration.

Distribute your ABM in a way so as to minimize message passing requirements between compute nodes. Message bandwidth is slow compared to on-board memory bandwidth.

Simple synchronization methods are the easiest to implement, but they typically do not scale well. Start simple, add complexity later as needed for performance.

How to Get Started With a Distributed ABM Implementation

The following sections of this article describe ABM++, an open source (GPL) software framework that allows the developer to implement agent based models in C++ for deployment on distributed memory Linux clusters. ABM++ can be downloaded from http://parrot-farm.net/ABM++/. The framework provides the necessary functionality to allow applications to run on distributed architectures. A C++ message passing API is provided which provides the ability to send MPI messages between distributed objects. The framework also provides an interface that allows objects to be serialized into message buffers, allowing them to be moved between distributed compute nodes. A synchronization method is provided, and both time-stepped and distributed discrete event time update mechanisms are provided.

ABM++ is completely flexible with respect to how the developer chooses to design his C++ representations of agents. All of the functionality necessary for distributed computation is provided by the framework; the developer provides the C++ agent implementations for his application. Distributed computing functionality is provided by the framework to the application via simple inheritance and containment. The source code includes a simple working example in the Applications directory to illustrate how the framework is used.

Background

ABM++ is a re-engineered version of the distributed computing toolkit that was developed at Los Alamos National Laboratory during the period of 1990 – 2005. EpiSims, TRANSIMS, and MobiCom are three large-scale distributed agent based models developed at Los Alamos that utilized the distributed toolkit. The toolkit was re-engineered and re-implemented in 2009 to make it more modular and extensible.

Tools for Managing Distributed Agents

In a distributed ABM, the agents are distributed across the compute nodes in the cluster. In most applications some agents might be statically distributed: they never migrate between distributed compute nodes after the initial distribution. Other agents are dynamically distributed: they can migrate between compute nodes as the simulation runs. As an example of this, consider a design for a social network model that simulates interactions between all individual persons in a city the size of Chicago, which has a population of 8.6 million people. The agents in this design are

individual people, and

locations in the city (households, workplaces, schools, shops, hospitals, etc.). For a Chicago model there might be 200,000 non-household locations to be simulated in addition to the ~2,000,000 households.

In this example design, location agents will be statically distributed and person agents will be dynamically distributed. At simulation initialization time all 2,200,000+ locations will be distributed across the cluster compute nodes. Then, as the simulation runs people will migrate between CPUs as their simulated daily activities cause them to move from location to location during the course of a simulated day. Person agents will come into contact with other person agents at locations as a result of the social network mobility patterns, and whatever person agent interactions are of interest can then be simulated.

The ABM++ framework provides specialized container classes that can be populated with information about what CPU all statically distributed agents reside on. This facilitates the ability to send messages between distributed agents. An example distributed ABM is included in the source distribution which demonstrates the used of these specialized container classes.

Time Update Tools

Two methods of time update are provided by the framework: time step and discrete event updates. The provided example framework code includes an example which uses time step updates.

Synchronization Tools

Interprocess communications between compute nodes in a distributed ABM often impose synchronization requirements. In our example social network simulation, person agents migrate between compute nodes as the person agent randomly moves between locations. It is important that both compute nodes be at the same simulated time at the time of each move, otherwise causality errors would be introduced into the simulation if a person agent arrived at a compute node with a different notion of the current simulation time than the node he had just departed.

The current version of the framework provides one synchronization method that uses a master/worker design. This type of synchronization has the advantage of being simple, but it has the disadvantage of not scaling well to thousands of compute nodes. A future version of the framework will include a second synchronization method that utilizes a random pair-wise compute node method that scales well to large cluster configurations.

C++ API to the Message Passing Interface (MPI)

A library is included with ABM++ called MPIToolbox that provides a C++ API to MPI. This API provides an interface to MPI that includes methods for serializing agents into message buffers, allowing agents to be sent via the MPIToolbox between compute nodes in the cluster. The advantage of the MPIToolbox is that it handles the lower-level interfaces to MPI transparently, easing the task of implementing message passing code.

An Example

The Applications directory contains a full working implementation of a simple social network ABM. The agents in the simulation consist of Location agents and Person agents, as described above. The Location agent software objects are statically distributed, and the Person agents migrate between them.

It is intended that the code in the Applications directory be used as a stub, or starting point for actual distributed ABM implementations. In particular, the DSim.[Ch], DistribControl.[Ch], and ABMEvents.[Ch] files are intended to be modified to meet specific application requirements.

Example Problem statement, to be implemented as a distributed ABM

Create 100 locations of the type described above on each CPU of each available compute node. Create 1000 people at each location. Randomly send all people to other locations every 15 minutes. Assume it takes 15 minutes for a person to reach another location from his current location.

DSim.C

This file contains the int main() routine for our example simulation application.
It is in main() that several framework globals are instantiated. It is also here that the simulation end time is set and the simulation is started. One of the globals created is DCtl, instantiated as shown below starting at line 60 of DSim.C :

   //
   // Create an instance of a class object that will perform
   // distributed run control
   //
   DCtl = new DistribControl();
One instance of the DistribControl class is created on each compute core.

DistribControl.C

This file contains the method definitions of the DistribControl class. The DCtl object is responsible for understanding the distributed topology of the ABM and controlling the simulation run. A brief description of the public methods of the DistribControl class follows. See DistribControl.C for the details of implementation.
DistribControl::Init

Create Distributed Object containers on each compute core. The Distributed Object containers are used to dereference compute core ids where distributed location objects reside.

Create a simulation Controller object for each compute core. The simulation Controller object causes time updates to occur.

Create an instance of the ABMEvents::TimeStep class. The Controller object uses this TimeStep class to perform rescheduling time step updates.

Create some locations on each compute core.

Broadcast the location ids that reside on this core to all the other workers in this distributed run.

Synchronize with the master so that none of the workers leave DistribControl::Init until they are all done initializing.

DistribControl::Run
This method just causes the rescheduling ABMEvents::TimeStep event to execute each time interval.

DistribControl::Shutdown
This method coordinates with all other compute cores to accomplish a synchronized shutdown of the distributed run.

DistribControl::HandleIO
This method is only invoked on the MPI master core. It uses an MPI Toolbox method to sit in spinlock listening for incoming MPI messages.

DistribControl::HandleEvent
This method runs on the master and all worker compute cores. It is a specialized method that is inherited from the TmessageRecipient class in the MPI Toolbox. It handles all incoming messages based on the type of message being received. For example, if the message ID was of type kReceivePerson, the DistribControl::HandleEvent knows that the message buffer contains a serialized representation of a person-agent that is arriving at this compute core, and that the DistribControl::ReceivePerson method is to be called, passing the message body as a parameter from which a new instance of the Person class will be instantiated.

DistribControl::ReceivePerson
A new instance of type Person is instantiated from the message bugger passed as an argument. The person-agent is then inserted into it's destination location.

DistribControl::SendPerson
This is the method that is called when a person-agent needs to move to a new location. If the destination location is one that is local to this compute core, Location::ReceivePerson is called for that location. If the destination location resides on another compute core, the person-agent is serialized into an MPIToolbox message, and the message is sent to its destination compute core.

DistribControl::Synchronize
This method is called as a result of DCtl having received a message from the simulation master process that synchronization is required. This method will sit in spinlock until it receives another message from the master process to continue simulating.

DistribControl::AddLocationData
This method is called after DCtl receives a message of type kSendLocationInfo. The message body contains the compute core id and location ids for all distributed locations that reside on other compute cores. This information is used by DCtl to dereference the compute core id for a distributed location.

SocialActivity.C

This file contains just two methods.

SocialActivity::CreateLocations
This method is called by DistribControl::Init to create 100 locations on this compute core.

SocialActivity::MovePeople
This is the method that is called by the Controller object each time step. It sends all of the person-agents residing in Locations on the compute core to other randomly-selected Locations. Note that this method invokes Location::SendPerson, which in turn invokes DCtl->SendPerson, because only the DistribControl object is aware of the distribution of Location objects on the cluster compute cores.

ABMEvents.C

TimeStep::Eval
The TimeStep class is used to define what simulation activities are to occur each time step. The time step interval is defined at line 42, and the time step functionality is defined in the TimeStep::Eval function which begins at line 45:

Call the DistribControl Synchronize method to ensure that all compute cores are synchronized to the same simulation time step.

Call the SocialActivity::MovePeople() method to randomly move person agents between locations.

Reschedule the next time step event.

Location.[Ch]

The Location class is a statically distributed agent in the simulation. It has methods to send and re receive person-agents, and a container to hold them in. Each location has a unique ID, defined at line 27 of Location.h.

Person.[Ch]

The example person agent is defined by the Person class. Person agents are dynamically distributed objects. A Person object has a unique id, and Encode and Decode methods for serializing the person agent data into MPIToolbox message buffers. For this simple example the Person Id is the only data that is serialized at message passing time.

The ABM++ MPI Appliance

As a way to get users up and running with the ABM++ framework described above we have created a virtual machine configured to provide users with an adequate environment for development of C++ MPI applications and hopefully extension of the ABM++ framework itself. Please see the ABM++ User's Guide for instructions on how to use the appliance.

Thursday, April 12, 2012

A Framework for Distributed Agent Based Model Development

1 comment: