In a little over two years of using Ansible heavily, I’ve been burned a couple of times by deployment-from-svn-and-git mysteriously breaking.

Of course, it’s never svn or git that breaks; it’s ssh authentication agent forwarding, and it’s probably exacerbated, or at least made weirder, by my insistence on persisting ssh keys across instantiations of hosts.

Review time - ssh uses a TOFU (trust on first use) security model. If you’ve ever used ssh, you’re probably familiar with the “The authenticity of host ‘mynode.example.com (192.0.2.1)’ can’t be established … Are you sure you want to continue connecting (yes/no)?” message. You look at the key fingerprint (hopefully? Maybe?) and decide whether you’re going to trust the host, and when you answer in the affirmative, the public key for that host gets added to ~/.ssh/known_hosts so that you aren’t prompted for acknowledgement in the future. In other words, trust is established on first use, rather than via some external validator as in PKI.
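That fingerprint check is worth doing out-of-band when you can. A sketch of what that looks like, assuming you can fetch the host’s public key file through some trusted channel (serial console, your provisioning system, an image bake) - the hostname and path here are placeholders:

```shell
# Fingerprint of a public key obtained through a trusted channel
# (path is illustrative):
pubkey=/etc/ssh/ssh_host_ed25519_key.pub
[ -r "$pubkey" ] && ssh-keygen -lf "$pubkey"

# Fingerprint of the key the live host is actually presenting
# (placeholder hostname; tolerate failure if it's unreachable):
ssh-keyscan -t ed25519 mynode.example.com 2>/dev/null | ssh-keygen -lf - 2>/dev/null || true

# If the two fingerprints match, answering "yes" at the TOFU prompt is safe.
```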

That model is a pain in the butt, though, when host keys change - as they do frequently if you nuke and reprovision on a regular basis in alignment with the “VMs and servers are cattle, not pets” philosophy. Who wants to delete the stale entry and go through the TOFU dance every time host keys change, just because a VM got reprovisioned? Not me!

So I decided to persist my host keys on the Ansible server, meaning that as part of the “personality” that is pushed into the host with an Ansible role, it gets its old keys back.
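The restore half of that is just a couple of copy tasks. A minimal sketch - the /etc/ssh paths are the stock OpenSSH locations, but the role layout, the hostkeys/ source directory, and the handler name here are placeholders, not necessarily how my role is actually arranged:

```yaml
# roles/ssh_hostkey_restore/tasks/main.yml (sketch; layout is illustrative)
- copy: src=hostkeys/{{ inventory_hostname }}/{{ item }} dest=/etc/ssh/{{ item }} owner=root group=root mode=0600
  with_items:
    - ssh_host_rsa_key
    - ssh_host_ed25519_key
  notify: restart sshd

- copy: src=hostkeys/{{ inventory_hostname }}/{{ item }}.pub dest=/etc/ssh/{{ item }}.pub owner=root group=root mode=0644
  with_items:
    - ssh_host_rsa_key
    - ssh_host_ed25519_key
  notify: restart sshd
```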

Ansible runs over ssh, and will auto-trust keys on first use (“host_key_checking = False” in the [defaults] section of ansible.cfg is fairly standard). I have “-o ForwardAgent=yes” in the [ssh_connection] block of the same file.
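For reference, those two settings live in ansible.cfg like so (the ForwardAgent option rides along in ssh_args):

```ini
# ansible.cfg (relevant excerpts)
[defaults]
host_key_checking = False

[ssh_connection]
ssh_args = -o ForwardAgent=yes
```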

Between the two of them, that means the ssh agent should always be enabled and forwarding my credentials, right?

Wrong.

It turns out that there are three conditions that might hold for the host keys on a particular ssh session spawned by Ansible:

  1. We are trusting them because it is our first use and Ansible essentially answered “yes” to the question. Note that this automatically adds them to the ~/.ssh/known_hosts file for future reference.
  2. We are trusting them because they are already in ~/.ssh/known_hosts, just as if we had been sshing interactively.
  3. We are trusting them because of the host_key_checking = False override, despite the fact that there is a conflicting key cached in ~/.ssh/known_hosts.
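If you want to know which case you’re in before a play runs, ssh-keygen can query known_hosts directly. A sketch (the hostname is a placeholder):

```shell
# ssh-keygen -F reports whether a host already has a cached entry.
if ssh-keygen -F mynode.example.com > /dev/null 2>&1; then
    echo "cached entry exists: case 2 (or case 3, if the host was rebuilt)"
else
    echo "no entry yet: case 1, first use"
fi

# To distinguish case 2 from case 3, compare the cached key against the
# key the host presents right now (tolerate failure if unreachable):
ssh-keygen -F mynode.example.com 2>/dev/null || true           # cached key
ssh-keyscan -t ed25519 mynode.example.com 2>/dev/null || true  # live key
```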

In all three cases, ssh works well enough to run most plays. But in the third case, it silently doesn’t forward your ssh credentials, out of an abundance of caution. You’ll get an inscrutable warning out of subversion, saying that the checkout failed and asking you to change some parameters in a file that may or may not be referenced when calling subversion via Ansible.

It’s probably obvious at this point that the genesis of the problem was that I was unintentionally caching the “temporary” ssh host keys (attached to the brand-new VM) in my Ansible jumphost’s ~/.ssh/known_hosts… as part of the exercise of copying in the persistent ssh keys!

If you’re concerned that you might get bitten by this, a simple trick is to put a check in for a more obvious failure before your subversion/git/whatever task that requires forwarded credentials.

This:

```yaml
- shell: ssh-add -l
  register: creds

- debug: msg="{{ creds.stdout }}"
```

will print out your credentials if they were properly forwarded, or fail with a completely obvious message about not being able to talk to ssh-agent.

The way I elected to solve the problem more generally was to blow away ~/.ssh/known_hosts after restoring the old host keys. This is not something you want to do if you run Ansible from your laptop, obviously, but part of the reason that I run Ansible on a jumphost in the datacenter is the aforementioned TOFU model - I more or less trust the five feet of Ethernet cable between the jumphost and the target, inside our cabinet at Equinix. It’s certainly safer than trusting the Internet between my laptop and a new VM.

```yaml
- local_action: file path=~/.ssh/known_hosts state=absent

- subversion: repo=svn+ssh://dns@svn.example.com/namedb dest=/var/named
```

I’m debating moving local_action to the end of the ssh_hostkey_restore role, but it’s likely to bite some unsuspecting person there.

Anyone have a nice simple recipe for removing or updating all information about a single host in .ssh/known_hosts (including naked v4 and v6 addresses, and different key types)? I kind of feel like this might be something that ought to be handled by ssh-keyscan(1) out of the box but isn’t. That might involve less collateral damage than simply blowing away .ssh/known_hosts.
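For what it’s worth, the closest out-of-the-box answer I’ve found is ssh-keygen -R, which purges every entry for one host per invocation - all key types, hashed entries included - and leaves a .old backup of the file. Covering the hostname plus the naked addresses just means looping; the host and addresses below are placeholders:

```shell
#!/bin/sh
# Purge every known_hosts entry for a single host: its name plus any
# naked v4/v6 addresses it might be cached under.
KNOWN_HOSTS="${KNOWN_HOSTS:-$HOME/.ssh/known_hosts}"

for h in mynode.example.com 192.0.2.1 2001:db8::1; do
    # -R removes all key types for the given host; -f picks the file.
    # Tolerate a missing file or absent entry.
    ssh-keygen -R "$h" -f "$KNOWN_HOSTS" 2>/dev/null || true
done

# Optionally re-learn the new keys right away instead of waiting for TOFU:
# ssh-keyscan mynode.example.com >> "$KNOWN_HOSTS"
```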