Last night's fight with Ansible

  • Thu 07 January 2016
  • misc

Domain-specific languages for system configuration automation are nothing new (to name a few, cfengine's roots date to 1993, Puppet to 2005, and Chef to 2009). A relative newcomer called Ansible (2012) has been gaining traction with folks in the networking community, in no small part because it is agentless and prefers to run over an ssh connection, thus making it relatively easy to automate management of switches and routers as well as Unix hosts.

For a while now a task has been on my never-ending to-do list to get some experience with Ansible. I generally let other folks be the trendy first adopters and prefer to not goof around with technology that is half-baked without a seriously good reason. Ansible seems to be at the point of being ready for general consumption now, even running with the downrev version that lives in SmartOS' pkgin repository.

I'm not going to bother joining the chorus of people talking about how great Ansible is and why you should use it (trust me, you should). Instead I'm going to talk about a corner case that bit me, since Google couldn't seem to find someone with a similar tale of woe.

Because reasons, I store configuration bits in Subversion, though the failure mode documented herein lies equally in wait for all the hipsters who enjoy forking on every commit with Git. What this means is that I have configuration fragments such as SMF contracts, Postfix configuration file tweaks, etc. On a target machine I'm going to want to check out a copy of the repository so that Ansible can subsequently copy the files into place. Luckily, there's a module for that (one for git users too, though users of more-obscure version control systems may find themselves hosed on this point).
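For concreteness, the checkout step in a playbook looks something like this (the repository URL and destination path here are made up for illustration):

```yaml
# Hypothetical repo and paths; the svn+ssh URL relies on the
# forwarded agent key discussed below.
- name: check out configuration fragments
  subversion:
    repo: svn+ssh://svn@svn.example.com/repos/config
    dest: /opt/config
```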

My Subversion setup runs over svn+ssh:// to a common account with a forced command that sets the committer's name based on the key used (which should probably be the subject of another blog post). So just like talking to Github, it runs over ssh. It's 2016 now, right? Is anyone seriously still using passwords for their ssh sessions when they could be using keys instead?

The key I'm using is forwarded in from my laptop to the jumphost where I'm building the Ansible sandbox, and thence onward to the target machine that I'm configuring (so that I can talk to the repo). Since I am doing development of my Ansible playbooks, I'm doing a lot of blowing away and regenerating VMs, so I followed the standard guidance in the documentation for not paying close attention to host keys on these ephemeral devices, that is to arrange my ansible.cfg file like so:

transport = ssh
host_key_checking = False

ssh_args = -o ForwardAgent=yes

Everything seemed to work fine except that SSH_AUTH_SOCK was not being set on the far end, a situation I confirmed by changing the remote task to printenv. There's a lot of stuff on the Internet about this mode of failure, but it all involves sudo not properly passing through the authentication socket environment variable. We weren't doing that; we were sshing in with keys directly as root (it's the SmartOS way, and after some reflection is not nearly the security issue you might think it is at first blush).

The ssh-to-the-target-host command (confirmed by running ansible-playbook -vvvv) was:

ssh -C -tt -vvv -o ForwardAgent=yes -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10

So I ran it manually. SSH_AUTH_SOCK was propagated. Must be something with Ansible, not ssh itself, or maybe some kind of weird thing with non-interactive sessions. Hmmm.

I got it to work once from inside ansible-playbook. Then it broke again when I blew away the VM and retested. That's when the light bulb started to come on. I deleted ~/.ssh/known_hosts on the Ansible jumphost and tested again with a fresh VM. Eureka. Successful checkout.

So what's the behavior? I'm thinking something, likely Ansible itself and not OpenSSH, is noticing that there is a key mismatch with the destination, and although it's happy to let the connection go through given that StrictHostKeyChecking=no, it's not going to pass the SSH_AUTH_SOCK environment variable through to the forked ssh process unless it's a lot more sure that you're not being MITMed. A noble goal, but behavior that left me, the novice user playing in an intentionally high-flux sandbox environment, completely scratching my head.

If you delete the ~/.ssh/known_hosts file or otherwise make sure that a conflicting key isn't there, then the first connection with StrictHostKeyChecking=no still stashes the key, without prompting you.
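Deleting the whole known_hosts file is the sledgehammer; ssh-keygen -R removes only the stale entry for one host. A small sketch, run against a scratch file (the hostname is made up, and -f keeps it away from the real ~/.ssh/known_hosts):

```shell
# Build a throwaway known_hosts with one entry, then surgically
# remove that entry the way you would for a regenerated VM.
tmpdir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -f "$tmpdir/key"
printf 'vm1.example.com %s\n' "$(cat "$tmpdir/key.pub")" > "$tmpdir/known_hosts"
ssh-keygen -R vm1.example.com -f "$tmpdir/known_hosts"
```

Against the real file it's simply ssh-keygen -R hostname. Another option for an intentionally high-flux sandbox is adding -o UserKnownHostsFile=/dev/null to ssh_args so a conflicting key never gets stashed in the first place.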