Quick Start#

If you have not installed MARLlib yet, please refer to Installation before running.

Configuration Overview#

[Figure: configuration overview — prepare all the configuration files to start your MARL journey]

To start your MARL journey with MARLlib, you need to prepare the configuration files that customize the whole learning pipeline. There are four configuration files that must be set correctly for your training run:

  • scenario: specify your environment/task settings

  • algorithm: finetune your algorithm hyperparameters

  • model: customize the model architecture

  • ray/rllib: change the basic training settings

Scenario Configuration#

MARLlib provides ten environments for you to conduct your experiments. You can follow the instructions in the Environments section to install them and change the corresponding configuration to customize the chosen task.

Algorithm Hyper-parameter#

After setting up the environment, you need to visit the MARL algorithms’ hyper-parameter directory. Each algorithm has different hyper-parameters that you can finetune. Most algorithms are sensitive to the environment settings, so you need to provide a set of hyper-parameters that fits the current MARL task.

We provide a commonly used hyper-parameter directory, a test-only hyper-parameter directory, and finetuned hyper-parameter sets for the three most-used MARL environments: SMAC, MPE, and MAMuJoCo.
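As a sketch of picking a hyper-parameter source (assuming the environment name, e.g. "smac", doubles as the key for its finetuned set, while "test" and "common" select the test-only and commonly used sets):

from marllib import marl

# use the finetuned hyper-parameter set for SMAC (assumed key: "smac")
mappo = marl.algos.mappo(hyperparam_source="smac")
# or fall back to the test-only / commonly used sets
# mappo = marl.algos.mappo(hyperparam_source="test")
# mappo = marl.algos.mappo(hyperparam_source="common")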

Model Architecture#

Observation space varies across environments. MARLlib automatically constructs the agent model to fit the diverse input shapes, including observation, global state, action mask, and additional information (e.g., minimap).

However, you can still customize your model in the model’s config. The supported architecture changes include the core architecture (MLP, GRU, or LSTM) and the encoder layer dimensions, as sketched below.
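A minimal sketch, reusing the env and mappo objects from the training example later on this page:

# GRU core with a two-layer encoder of 128 and 256 units
model = marl.build_model(env, mappo, {"core_arch": "gru", "encode_layer": "128-256"})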

Ray/RLlib Running Options#

Ray/RLlib provides a flexible multi-processing scheduling mechanism for MARLlib. You can modify the ray configuration file to adjust sampling speed (worker number, CPU number), training speed (GPU acceleration), running mode (local or distributed), parameter sharing strategy (all, group, individual), and stop condition (iteration, reward, timestep).
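Most of these options can also be passed per run through the fit call; a minimal sketch, assuming the argument names listed in the tables later on this page and reusing env and model from the training example below:

# adjust sampling, training resources, running mode, sharing, and stop condition
mappo.fit(
    env, model,
    num_workers=5,          # sampling: number of rollout workers (CPU)
    num_gpus=1,             # training: GPU acceleration
    local_mode=False,       # running mode: True runs locally for debugging
    share_policy="group",   # parameter sharing: "all" / "group" / "individual"
    stop={"episode_reward_mean": 2000, "timesteps_total": 10000000},  # stop condition
)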

How to Customize#

To modify the configuration settings in MARLlib, it is important to first understand the underlying configuration system.

Levels of configuration#

There are three levels of configuration, listed here in order of priority from low to high:

  • File-based configuration, which includes all the default *.yaml files.

  • API-based customized configuration, which allows users to specify their own preferences, such as {"core_arch": "mlp", "encode_layer": "128-256"}.

  • Command line arguments, such as python xxx.py --ray_args.local_mode --env_args.difficulty=6 --algo_args.num_sgd_iter=6.

If a parameter is set at multiple levels, the higher level configuration will take precedence over the lower level configuration.
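For instance (a sketch reusing env and mappo from the training example below; the default values are assumed to come from the model’s *.yaml file), the API-based dictionary overrides the file-based defaults, and a command line flag such as --algo_args.num_sgd_iter=6 would in turn override the corresponding API or file value:

# file-based defaults for core_arch / encode_layer are overridden by this dictionary
model = marl.build_model(env, mappo, {"core_arch": "mlp", "encode_layer": "128-256"})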

Compatibility across different levels#

It is important to ensure that hyper-parameter choices are compatible across these levels. For example, the Multiple Particle Environments (MPE) support both discrete and continuous actions. To enable the continuous action space, simply set the continuous_actions parameter in mpe.yaml to True. When using the API-based approach or command line arguments, pay attention to the corresponding setting, e.g., marl.make_env(xxxx, continuous_actions=True), where the argument name must match the one in mpe.yaml exactly.
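A consistent continuous-action setup might look like the following sketch, where the keyword argument name matches the entry in mpe.yaml:

from marllib import marl

# mpe.yaml sets continuous_actions: True; the API call must use the exact same name
env = marl.make_env(environment_name="mpe", map_name="simple_spread", continuous_actions=True)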

Training#

from marllib import marl
# prepare env
env = marl.make_env(environment_name="mpe", map_name="simple_spread")
# initialize algorithm with appointed hyper-parameters
mappo = marl.algos.mappo(hyperparam_source="mpe")
# build agent model based on env + algorithms + user preference
model = marl.build_model(env, mappo, {"core_arch": "mlp", "encode_layer": "128-256"})
# start training
mappo.fit(env, model, stop={"timesteps_total": 1000000}, checkpoint_freq=100, share_policy="group")

prepare the environment#

| task mode | api example |
|---|---|
| cooperative | marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True) |
| collaborative | marl.make_env(environment_name="mpe", map_name="simple_spread") |
| competitive | marl.make_env(environment_name="mpe", map_name="simple_adversary") |
| mixed | marl.make_env(environment_name="mpe", map_name="simple_crypto") |

Most of the popular environments in MARL research are supported by MARLlib:

| Env Name | Learning Mode | Observability | Action Space | Observations |
|---|---|---|---|---|
| LBF | cooperative + collaborative | Both | Discrete | 1D |
| RWARE | cooperative | Partial | Discrete | 1D |
| MPE | cooperative + collaborative + mixed | Both | Both | 1D |
| SMAC | cooperative | Partial | Discrete | 1D |
| MetaDrive | collaborative | Partial | Continuous | 1D |
| MAgent | collaborative + mixed | Partial | Discrete | 2D |
| Pommerman | collaborative + competitive + mixed | Both | Discrete | 2D |
| MAMuJoCo | cooperative | Partial | Continuous | 1D |
| Google Research Football | collaborative + mixed | Full | Discrete | 2D |
| Hanabi | cooperative | Partial | Discrete | 1D |

Each environment has a readme file, which serves as the instructions for the task, including env settings, installation, and important notes.

initialize the algorithm#

| running target | api example |
|---|---|
| train & finetune | marl.algos.mappo(hyperparam_source=$ENV) |
| develop & debug | marl.algos.mappo(hyperparam_source="test") |
| 3rd party env | marl.algos.mappo(hyperparam_source="common") |

Here is a chart describing the characteristics of each algorithm:

| algorithm | support task mode | discrete action | continuous action | policy type |
|---|---|---|---|---|
| IQL: multi-agent version of D(R)QN | all four | ✔️ | | off-policy |
| IPG | all four | ✔️ | ✔️ | on-policy |
| IA2C: multi-agent version of A2C | all four | ✔️ | ✔️ | on-policy |
| IDDPG: multi-agent version of DDPG | all four | | ✔️ | off-policy |
| ITRPO: multi-agent version of TRPO | all four | ✔️ | ✔️ | on-policy |
| IPPO: multi-agent version of PPO | all four | ✔️ | ✔️ | on-policy |
| COMA: MAA2C with Counterfactual Multi-Agent Policy Gradients | all four | ✔️ | | on-policy |
| MADDPG: DDPG agent with a centralized Q | all four | | ✔️ | off-policy |
| MAA2C: A2C agent with a centralized critic | all four | ✔️ | ✔️ | on-policy |
| MATRPO: TRPO agent with a centralized critic | all four | ✔️ | ✔️ | on-policy |
| MAPPO: PPO agent with a centralized critic | all four | ✔️ | ✔️ | on-policy |
| HATRPO: sequentially updating critic of MATRPO agents | cooperative | ✔️ | ✔️ | on-policy |
| HAPPO: sequentially updating critic of MAPPO agents | cooperative | ✔️ | ✔️ | on-policy |
| VDN: mixing Q with value decomposition network | cooperative | ✔️ | | off-policy |
| QMIX: mixing Q with monotonic factorization | cooperative | ✔️ | | off-policy |
| FACMAC: mixing a bunch of DDPG agents | cooperative | | ✔️ | off-policy |
| VDA2C: mixing a bunch of A2C agents’ critics | cooperative | ✔️ | ✔️ | on-policy |
| VDPPO: mixing a bunch of PPO agents’ critics | cooperative | ✔️ | ✔️ | on-policy |

*all four: cooperative, collaborative, competitive, mixed
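As a sketch of matching an algorithm to a task (the SMAC map name "3m" is illustrative, and QMIX is assumed to be exposed as marl.algos.qmix, analogous to marl.algos.mappo): a cooperative, discrete-action environment such as SMAC pairs naturally with a cooperative, discrete-action, off-policy algorithm like QMIX.

from marllib import marl

# SMAC is cooperative with discrete actions, so a joint Q-learning method such as QMIX fits
env = marl.make_env(environment_name="smac", map_name="3m")
qmix = marl.algos.qmix(hyperparam_source="smac")
model = marl.build_model(env, qmix, {"core_arch": "gru"})
qmix.fit(env, model, stop={"timesteps_total": 1000000}, share_policy="all")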

construct the agent model#

| model arch | api example |
|---|---|
| MLP | marl.build_model(env, algo, {"core_arch": "mlp"}) |
| GRU | marl.build_model(env, algo, {"core_arch": "gru"}) |
| LSTM | marl.build_model(env, algo, {"core_arch": "lstm"}) |
| encoder arch | marl.build_model(env, algo, {"core_arch": "gru", "encode_layer": "128-256"}) |

kick off the training algo.fit#

| setting | api example |
|---|---|
| train | algo.fit(env, model) |
| debug | algo.fit(env, model, local_mode=True) |
| stop condition | algo.fit(env, model, stop={'episode_reward_mean': 2000, 'timesteps_total': 10000000}) |
| policy sharing | algo.fit(env, model, share_policy='all') # or 'group' / 'individual' |
| save model | algo.fit(env, model, checkpoint_freq=100, checkpoint_end=True) |
| GPU accelerate | algo.fit(env, model, local_mode=False, num_gpus=1) |
| CPU accelerate | algo.fit(env, model, local_mode=False, num_workers=5) |

policy inference algo.render#

| setting | api example |
|---|---|
| render | algo.render(env, model, local_mode=True, restore_path='path_to_model') |

By default, all the models will be saved at /home/username/ray_results/experiment_name/checkpoint_xxxx
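A complete inference sketch (restore_path is a placeholder; point it to one of the saved checkpoint directories mentioned above):

from marllib import marl

# rebuild the same env, algorithm, and model, then restore the checkpoint and render
env = marl.make_env(environment_name="mpe", map_name="simple_spread")
mappo = marl.algos.mappo(hyperparam_source="mpe")
model = marl.build_model(env, mappo, {"core_arch": "mlp", "encode_layer": "128-256"})
mappo.render(env, model, local_mode=True, restore_path='path_to_model')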

Logging & Saving#

MARLlib uses the default logger provided by Ray, ray.tune.CLIReporter. You can change where logs are saved in the ray configuration file.