Rebuilding the Ansible jumphost under fire


  • Wed 12 February 2020
  • misc

You only really appreciate just how much you've come to rely on a particular tool in your bag of tricks when you're suddenly deprived of it.

Today, I was trying to (manually - strike one) add awscli to my (production - strike two) Ansible jumphost. Something went off the rails and I was left with a broken Python YAML parser.

If you know anything about Ansible, you know this is a really bad situation. Basically you can't get anything done at all.

No problem - just rebuild it, right? Well... how do I build my Ansible jumphost? That's right... with Ansible. But I had blown away the staging environment (from which I build production) in the interests of saving a bit of RAM on the VM host. And the rebuild has to be run from somewhere that's whitelisted for talking to the APIs that create and destroy VMs.

I ended up coming in over a VPN, which put me inside the whitelist cone, and rebuilt staging and prod that way. From my laptop.

I count myself lucky here that there weren't any dependencies missing from my laptop that I really needed in order to make stuff work properly. After all, between pip and pkgsrc, I install 24 packages on the Ansible jumphosts. A fair amount of it is stuff like dnspython and pyOpenSSL that various playbooks depend on. And the Ansible jumphosts are simple compared to some stuff I deploy.
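Rather than relying on luck, a quick check like the one below could confirm that a machine has the Python modules the playbooks need before you try a rebuild from it. This is only a minimal sketch, not what I actually ran: the function name and the module list are made up for illustration, and the import names are assumptions (dnspython is normally imported as dns, pyOpenSSL as OpenSSL, and PyYAML as yaml). It really imports each module instead of just looking for it on disk, so a half-broken install like my YAML parser should show up too.

```python
import importlib

# Hypothetical list for illustration. The import names are assumptions:
# PyYAML -> "yaml", dnspython -> "dns", pyOpenSSL -> "OpenSSL".
REQUIRED_MODULES = ["yaml", "dns", "OpenSSL"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            # Actually import the module so a broken install is caught,
            # not just an absent one.
            importlib.import_module(name)
        except Exception:
            missing.append(name)
    return missing
```

Calling missing_modules(REQUIRED_MODULES) on the laptop before kicking off the rebuild would return the list of anything absent or broken, and an empty list if everything imports cleanly.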

The lessons here are twofold: (a) don't test in prod (and your main Ansible jumphost is absolutely "prod"), and (b) keep a spare jumphost environment around that you can rebuild your main jumphost environment from if things go sideways.