Getting started

Erlang 19.x from 'jessie-backports' on Debian Jessie

On Debian Jessie hosts, the role will configure an APT preference for backported Erlang 19.x packages from Debian Stretch. They provide better Elliptic Curve Cryptography (ECC) support and allow deactivation of TLS client-initiated protocol renegotiation, which mitigates potential DoS attacks.

Encrypted client connections

The role will check if the debops.pki and debops.dhparam Ansible roles configured their environment on a host, and will automatically enable or disable support for encrypted AMQP connections. Plaintext connections will be available if encryption is disabled.

RabbitMQ clustering

By default the debops.rabbitmq_server role configures the RabbitMQ service in standalone mode, without external access through the firewall. To allow for clustering, you need to define the IP addresses and/or CIDR subnets which will be allowed to connect to the epmd (Erlang Port Mapper Daemon) and einc (Erlang Inter-Node Communication) TCP ports. To do that, set the variable below in the Ansible inventory:

---
# Allow for cluster communication
rabbitmq_server__cluster_allow: [ '192.0.2.0/24' ]

After that, re-run the role to apply changes to the firewall configuration.

At the moment the role does not create clusters automatically. To create a cluster manually using three hosts (host1, host2, host3), with host1 being the main cluster node, log in to the other hosts (host2 and host3) and, using the root account, run the commands:

rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@host1
rabbitmqctl start_app

You can check the RabbitMQ cluster status by running the command:

rabbitmqctl cluster_status

See the RabbitMQ Clustering Guide for more details.
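The manual join steps above can be wrapped in a small shell function, to be run as root on each node that should join the cluster (host2 and host3 in the example). This is only a sketch: the function name join_rabbitmq_cluster and the RABBITMQCTL override variable are hypothetical and are not provided by the role.

```shell
# Hypothetical helper, not part of the role: join the local node to an
# existing cluster. Run as root on the joining node (not on host1).
# RABBITMQCTL can be overridden, for example with 'echo rabbitmqctl',
# to print the commands instead of running them.
join_rabbitmq_cluster() {
    seed_node="${1:-}"    # for example: rabbit@host1
    if [ -z "$seed_node" ]; then
        echo "usage: join_rabbitmq_cluster rabbit@<host>" >&2
        return 2
    fi
    ctl="${RABBITMQCTL:-rabbitmqctl}"
    # Stop the RabbitMQ application; the Erlang VM keeps running
    $ctl stop_app || return 1
    # Join the cluster through the seed node, then start the application
    $ctl join_cluster "$seed_node" || return 1
    $ctl start_app || return 1
    # Show the resulting cluster membership
    $ctl cluster_status
}
```

After defining the function, calling join_rabbitmq_cluster rabbit@host1 on host2 and then on host3 performs the same steps as the commands listed above. Setting RABBITMQCTL='echo rabbitmqctl' first gives a dry run that only prints the commands.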

Rolling restart and cluster bootstrap

Starting with RabbitMQ 4.2, the broker uses Khepri (Raft) for metadata storage by default (in 4.0 and 4.1 it is available as an opt-in feature flag). As a result, restarting a majority of cluster nodes simultaneously causes the nodes to deadlock at boot with a timeout_waiting_for_leader error. To prevent this, the service playbook uses serial: 1 together with any_errors_fatal: true and max_fail_percentage: 0, and runs a post-task health check: rabbitmqctl await_startup, then rabbitmqctl cluster_status, then an assertion that the current node is listed in running_nodes. Nodes are restarted one at a time and the play stops on the first failure.

On top of that, the Restart rabbitmq-server handler first calls rabbitmqctl stop_app (to avoid duplicate_node_name races with EPMD), restarts the systemd unit and waits for rabbitmqctl await_startup to return. The handler also carries throttle: 1 as a second line of defense in case the role is used outside of the service playbook.
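The sequence performed by the handler corresponds roughly to the shell commands below. This is an illustrative sketch, not the handler's actual implementation: the function name restart_rabbitmq_node and the RUN prefix variable are hypothetical, and the rabbitmq-server unit name, the 120 second timeout and the choice to ignore a failing stop_app are assumptions.

```shell
# Hypothetical sketch of the restart sequence described above.
# RUN can be set to 'echo' to print the commands instead of running them.
restart_rabbitmq_node() {
    run="${RUN:-}"
    # Stop the RabbitMQ application before touching the systemd unit.
    # Assumption: a failure here (for example, the node is already
    # stopped) is ignored in this sketch.
    $run rabbitmqctl stop_app || true
    # Restart the service (unit name is an assumption)
    $run systemctl restart rabbitmq-server || return 1
    # Block until the node has finished booting (timeout is an assumption)
    $run rabbitmqctl await_startup --timeout 120
}
```

Run as root on a single node at a time; the one-node-at-a-time ordering itself is enforced by the playbook and the handler settings described above, not by this function.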

Both invocation modes are supported out of the box:

  • Running the playbook against the whole group at once:

    debops run service/rabbitmq_server
    

    serial: 1 forces sequential processing, so nodes are configured and restarted one after another even if the inventory targets the whole cluster.

  • Running the playbook per host via --limit (useful because the role does not form the cluster automatically, so each node may need a manual rabbitmqctl join_cluster in between):

    debops run service/rabbitmq_server --limit host1
    debops run service/rabbitmq_server --limit host2
    debops run service/rabbitmq_server --limit host3
    

The post-task assertion only checks that the current node itself rejoined the cluster, so it works in both scenarios; the availability of the other peers is ensured by the sequential execution model.
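The per-host invocation mode described above can be scripted so that the run stops at the first host that fails. The wrapper below is a minimal sketch: the function name run_rabbitmq_per_host and the DEBOPS override variable are hypothetical, and only the debops run command shown earlier is used.

```shell
# Hypothetical wrapper: apply the service playbook to one host at a time
# and stop at the first failure. DEBOPS can be overridden, for example
# with 'echo debops', to print the commands instead of running them.
run_rabbitmq_per_host() {
    debops_cmd="${DEBOPS:-debops}"
    for host in "$@"; do
        $debops_cmd run service/rabbitmq_server --limit "$host" || {
            echo "run failed on ${host}, stopping" >&2
            return 1
        }
    done
}
```

For example, run_rabbitmq_per_host host1 host2 host3 processes the three hosts in order. Any manual rabbitmqctl join_cluster step between hosts is not handled by this sketch and would still need to be done separately.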

Inter-node communication is not encrypted

Erlang supports encrypting communication between nodes (processes on the same or other hosts) using TLS, which RabbitMQ can use to secure traffic between hosts. However, one downside is that when inter-node traffic is encrypted, Erlang uses dynamic random ports for communication, which might interfere with the host's firewall. Therefore, by default the debops.rabbitmq_server role does not configure encrypted inter-node communication. You should consider alternative means of securing the traffic between hosts, for example a separate VLAN or a VPN connection.

Example inventory

To configure RabbitMQ on a host, it should be added to the [debops_service_rabbitmq_server] Ansible inventory group:

[debops_service_rabbitmq_server]
hostname

Example playbook

If you are using this role without DebOps, here's an example Ansible playbook that uses the debops.rabbitmq_server role:

---

- name: Manage RabbitMQ service
  collections: [ 'debops.debops' ]
  hosts: [ 'debops_service_rabbitmq_server' ]
  become: True
  # RabbitMQ 4.x with Khepri (Raft) metadata store deadlocks with
  # 'timeout_waiting_for_leader' when a majority of cluster nodes restart
  # in parallel. The three play-level options below force strictly
  # sequential, one-node-at-a-time execution and abort on the first
  # failure so that the cluster stays healthy.
  # DO NOT REMOVE without reading the "Rolling restart and cluster
  # bootstrap" section in
  # docs/ansible/roles/rabbitmq_server/getting-started.rst
  serial: 1
  max_fail_percentage: 0
  any_errors_fatal: true

  environment: '{{ inventory__environment | d({})
                   | combine(inventory__group_environment | d({}))
                   | combine(inventory__host_environment  | d({})) }}'

  pre_tasks:

    - name: Prepare rabbitmq_server environment
      ansible.builtin.import_role:
        name: 'rabbitmq_server'
        tasks_from: 'main_env'
      tags: [ 'role::rabbitmq_server', 'role::secret', 'role::rabbitmq_server:config' ]

  roles:

    - role: secret
      tags: [ 'role::secret', 'role::rabbitmq_server', 'role::rabbitmq_server:config' ]
      secret__directories:
        - '{{ rabbitmq_server__secret__directories }}'

    - role: etc_services
      tags: [ 'role::etc_services', 'skip::etc_services' ]
      etc_services__dependent_list:
        - '{{ rabbitmq_server__etc_services__dependent_list }}'

    - role: ferm
      tags: [ 'role::ferm', 'skip::ferm' ]
      ferm__dependent_rules:
        - '{{ rabbitmq_server__ferm__dependent_rules }}'

    - role: rabbitmq_server
      tags: [ 'role::rabbitmq_server', 'skip::rabbitmq_server' ]

  post_tasks:

    - name: Wait for RabbitMQ node to become available
      ansible.builtin.command:
        cmd: 'rabbitmqctl -q await_startup --timeout 120'
      changed_when: false
      check_mode: false

    - name: Get RabbitMQ cluster status
      ansible.builtin.command:
        cmd: 'rabbitmqctl -q --formatter json cluster_status'
      register: rabbitmq_server__register_cluster_status
      changed_when: false
      check_mode: false

    - name: Assert this node rejoined the cluster
      ansible.builtin.assert:
        that:
          - _my_short in _running_short
        fail_msg: |
          Node rabbit@{{ ansible_hostname }} did not rejoin the cluster
          cleanly. running_nodes={{ _running }}.
      vars:
        _running: "{{ (rabbitmq_server__register_cluster_status.stdout
                     | from_json).running_nodes | default([]) }}"
        _running_short: "{{ _running
                          | map('regex_replace', '^rabbit@', '')
                          | map('regex_replace', '\\..*$', '')
                          | list }}"
        _my_short: "{{ ansible_hostname | regex_replace('\\..*$', '') }}"