Custom service to service authentication using IAM/KMS

Similar to abusing IAM and STS, we can also abuse IAM and KMS to let Amazon do our service-to-service authentication for us. Unlike STS, though, KMS is almost perfect for this use case.

Let’s recap a bit from the STS post, though. What I’m aiming for is service to service authentication with the following specs:

  1. Has no chicken and egg trust problem. Re-use AWS to provide the chicken, we’ll use it to lay the eggs.
  2. Can be used only from one service to another service. The service receiving the token shouldn’t be able to reuse the token to impersonate the sender.
  3. Has token rotation, and specifically lifetime validity constraints.
  4. Can have future lifetime validity constraints, so that it can be stored with enqueued work.
  5. Can have scoped tokens, also to support enqueued work.

KMS at first glance looks like a relatively boring HSM as a service, providing encryption/decryption and random data generation. It becomes really interesting when you look at its implementation of AES-GCM with AAD (additional authenticated data), which KMS calls encryption context: it’s possible to use the encryption context along with IAM policy to restrict access to encryption or decryption requests. Let’s see an example:

>>> import boto3
>>> import datetime
>>>
>>> now = datetime.datetime.utcnow()
>>> not_after = now + datetime.timedelta(minutes=60)
>>> now = now.strftime("%Y%m%dT%H%M%SZ")
>>> not_after = not_after.strftime("%Y%m%dT%H%M%SZ")
>>>
>>> kms = boto3.client('kms')
>>> token = kms.encrypt(KeyId='alias/authnz-testing', Plaintext='testdata', EncryptionContext={'from': 'servicea-development-iad', 'to': 'serviceb-development-iad', 'not_before': now, 'not_after': not_after})
>>> token
{u'KeyId': u'arn:aws:kms:us-east-1:12345:key/abcdefgh-1234-5678-9abcd-ee72ac95ae8c', 'ResponseMetadata': {'HTTPStatusCode': 200, 'RequestId': '3a48f2ad-072d-11e5-88fb-17df9ce1a01a'}, u'CiphertextBlob': '\n \x999\x9e$yO\x92\x1dg\xbbZ^S\x84\xdaI\xbf\x14@\x81\x8a\x1c\xf2\xf8Z\x05\xed\xed\xb2\x8d)T\x12\x8f\x01\x01\x01\x02\x00x\x999\x9e$yO\x92\x1dg\xbbZ^S\x84\xdaI\xbf\x14@\x81\x8a\x1c\xf2\xf8Z\x05\xed\xed\xb2\x8d)T\x00\x00\x00f0d\x06\t*\x86H\x86\xf7\r\x01\x07\x06\xa0W0U\x02\x01\x000P\x06\t*\x86H\x86\xf7\r\x01\x07\x010\x1e\x06\t`\x86H\x01e\x03\x04\x01.0\x11\x04\x0c\xd3\x96\x0c\x91\x83\xd2l!\xfb\xa6\xc2\x90\x02\x01\x10\x80#\x97Z\xd1\xbb\xb4_\x12\xea\x1a\xed\x85\x0e\x9b1\xfa0j\xca1(\xc7\xc3\x8czT\xd4\x8fk\x08\x00\xa8\xcd\xe5\x82\xb3'}
>>> kms.decrypt(CiphertextBlob=token['CiphertextBlob'], EncryptionContext={'from': 'servicea-development-iad', 'to': 'serviceb-development-iad', 'not_before': now, 'not_after': not_after})
{u'Plaintext': 'testdata', u'KeyId': u'arn:aws:kms:us-east-1:12345:key/abcdefgh-1234-5678-9abcd-ee72ac95ae8c', 'ResponseMetadata': {'HTTPStatusCode': 200, 'RequestId': '6450392b-072d-11e5-87df-5345698b39e1'}}

You may see where I’m going here. Like the previous STS post, I’m doing from and to mappings so that I can use IAM policy to limit a token from a service to a service so that it can’t be re-used by the ‘to’ service to authenticate to other services as the ‘from’ service. Something new I’ve added, though, is not_before and not_after, which is a time period the auth token is valid for. Unlike the STS solution, this allows us to enqueue work with a token that’s valid during the period the work is expected to be done.
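As a rough sketch of how that context could be assembled in one place, here is a small helper. The function name, its parameters and the `start` argument are assumptions made for illustration; the only things taken from the example above are the four context keys and the timestamp format. Passing a `start` in the future gives a token for work that is enqueued now and performed later.

```python
import datetime

# Same timestamp format as the example above. The trailing "Z" is a literal,
# so the times fed in must already be UTC.
TIME_FORMAT = "%Y%m%dT%H%M%SZ"


def build_auth_context(from_service, to_service, lifetime_minutes=60, start=None):
    """Return an encryption context dict with a from/to pair and a validity window.

    `start` is a naive UTC datetime; it defaults to the current UTC time.
    (Hypothetical helper, not part of boto3 or KMS.)
    """
    if start is None:
        start = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
    not_after = start + datetime.timedelta(minutes=lifetime_minutes)
    return {
        'from': from_service,
        'to': to_service,
        'not_before': start.strftime(TIME_FORMAT),
        'not_after': not_after.strftime(TIME_FORMAT),
    }
```

The returned dict can be passed as `EncryptionContext` to `kms.encrypt`, and the same four values have to reach the receiving service so it can pass them to `kms.decrypt`.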

So, this is the context we’re working with. Using either KMS key policy or KMS grants, we can limit which principals can encrypt or decrypt using the key. Most importantly, we can use the encryption context to control this. Let’s make some grants:

$ salt-call boto_kms.create_grant 'alias/authnz-testing' grantee_principal='arn:aws:iam::12345:user/servicea-development-iad' operations='["Encrypt"]' constraints='{"EncryptionContextSubset":{"from":"servicea-development-iad"}}' > /dev/null
$ salt-call boto_kms.create_grant 'alias/authnz-testing' grantee_principal='arn:aws:iam::12345:user/servicea-development-iad' operations='["Decrypt"]' constraints='{"EncryptionContextSubset":{"to":"servicea-development-iad"}}' > /dev/null
$ salt-call boto_kms.create_grant 'alias/authnz-testing' grantee_principal='arn:aws:iam::12345:user/serviceb-development-iad' operations='["Decrypt"]' constraints='{"EncryptionContextSubset":{"to":"serviceb-development-iad"}}' > /dev/null
$ salt-call boto_kms.create_grant 'alias/authnz-testing' grantee_principal='arn:aws:iam::12345:user/serviceb-development-iad' operations='["Encrypt"]' constraints='{"EncryptionContextSubset":{"from":"serviceb-development-iad"}}' > /dev/null
$ salt-call boto_kms.list_grants 'alias/authnz-testing'
local:
    ----------
    grants:
        |_
          ----------
          Constraints:
              ----------
              EncryptionContextSubset:
                  ----------
                  from:
                      servicea-development-iad
          GrantId:
              WZ9Y6I7S05pR0LjYzEXKhzVX0JWzapkxPjl3KiXH8BrMI1d4D5pecZ51FnOe11g56
          GranteePrincipal:
              arn:aws:iam::12345:user/servicea-development-iad
          IssuingAccount:
              arn:aws:iam::12345:root
          Operations:
              - Encrypt
        |_
          ----------
          Constraints:
              ----------
              EncryptionContextSubset:
                  ----------
                  to:
                      servicea-development-iad
          GrantId:
              EFm4L4FCsnM5ba23dmdC05Stw1oojsYVjONDkCwJpegmHdJ0gRF8jQd9NZmdXYXfA
          GranteePrincipal:
              arn:aws:iam::12345:user/servicea-development-iad
          IssuingAccount:
              arn:aws:iam::12345:root
          Operations:
              - Decrypt
        |_
          ----------
          Constraints:
              ----------
              EncryptionContextSubset:
                  ----------
                  from:
                      serviceb-development-iad
          GrantId:
              JoT9F5h19KqpunXfo89CnDB1PI1ig4ApuOYwsP20Pc6GFOBX1lWlx72oAh600aYXN
          GranteePrincipal:
              arn:aws:iam::12345:user/serviceb-development-iad
          IssuingAccount:
              arn:aws:iam::12345:root
          Operations:
              - Encrypt
        |_
          ----------
          Constraints:
              ----------
              EncryptionContextSubset:
                  ----------
                  to:
                      serviceb-development-iad
          GrantId:
              8hDVrUmkgcZxIJ8h2WHtgSU7sy3HcSm5dQg3u0uWKBpBcbPGUL27rkGmjTUcvn9JD
          GranteePrincipal:
              arn:aws:iam::12345:user/serviceb-development-iad
          IssuingAccount:
              arn:aws:iam::12345:root
          Operations:
              - Decrypt

As a quick aside: anything that we’re doing here through grants we can also do through key policy. However, key policies are limited in size and can’t easily be dynamically updated. Though we can try to limit the size of the policies by using IAM policy variables, the variable we’d need to use for this (aws:userid) doesn’t work because it includes the instance-id along with the role and we can’t target the ‘to’ service that way. We’ll need grants for each service and we can create and revoke grants at will, which is why I’ve chosen them.

I have two grants per service. One grant that allows the service to decrypt anything that’s sent to it (to) and another to encrypt anything that it’s going to send (from). The important bits are the GranteePrincipal, Operations and Constraints attributes. We allow the GranteePrincipal to perform the Operations listed, as long as the encryption context contains at least the key/value listed in the Constraints. We specify ‘at least the key/value listed’ by using EncryptionContextSubset in the constraints, rather than EncryptionContextEquals.
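To make the two-grants-per-service pattern concrete, here is a sketch that builds the arguments for boto3’s `create_grant` call, plus a deliberately simplified local model of how the two constraint types differ. The helper names are made up for this illustration, the model functions are not how KMS is implemented, and the code assumes you pass a key ARN (or key ID), since as far as I know `CreateGrant` does not accept an alias.

```python
def grant_requests(key_arn, account_id, service):
    """Return kwargs for the two create_grant calls one service needs.

    Hypothetical helper: one grant lets the service encrypt tokens it sends
    ('from'), the other lets it decrypt tokens sent to it ('to').
    """
    principal = 'arn:aws:iam::{0}:user/{1}'.format(account_id, service)
    return [
        {
            'KeyId': key_arn,
            'GranteePrincipal': principal,
            'Operations': ['Encrypt'],
            'Constraints': {'EncryptionContextSubset': {'from': service}},
        },
        {
            'KeyId': key_arn,
            'GranteePrincipal': principal,
            'Operations': ['Decrypt'],
            'Constraints': {'EncryptionContextSubset': {'to': service}},
        },
    ]


def subset_allows(constraint, context):
    """Simplified model of EncryptionContextSubset: every constrained pair
    must be present in the request context; extra pairs are fine."""
    return all(context.get(k) == v for k, v in constraint.items())


def equals_allows(constraint, context):
    """Simplified model of EncryptionContextEquals: the request context must
    match the constraint exactly, with no extra pairs."""
    return constraint == context
```

Each dict can be applied with `boto3.client('kms').create_grant(**kwargs)`. Under this model, a context carrying `from`, `to`, `not_before` and `not_after` satisfies a subset constraint on `from` alone, but would be rejected by an equals constraint, which is why the subset form fits here.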

One thing ignored in the grants is not_before and not_after. A nice property of encryption context is that however data is encrypted is also how it must be decrypted. So, for instance, this doesn’t work:

>>> key = kms.encrypt(KeyId='alias/authnz-testing', Plaintext='testdata', EncryptionContext={'from': 'servicea-development-iad', 'to': 'serviceb-development-iad', 'not_before': now, 'not_after': not_after})
>>> key
{u'KeyId': u'arn:aws:kms:us-east-1:12345:key/abcdefgh-1234-5678-9abcd-ee72ac95ae8c', 'ResponseMetadata': {'HTTPStatusCode': 200, 'RequestId': '3a48f2ad-072d-11e5-88fb-17df9ce1a01a'}, u'CiphertextBlob': '\n \x999\x9e$yO\x92\x1dg\xbbZ^S\x84\xdaI\xbf\x14@\x81\x8a\x1c\xf2\xf8Z\x05\xed\xed\xb2\x8d)T\x12\x8f\x01\x01\x01\x02\x00x\x999\x9e$yO\x92\x1dg\xbbZ^S\x84\xdaI\xbf\x14@\x81\x8a\x1c\xf2\xf8Z\x05\xed\xed\xb2\x8d)T\x00\x00\x00f0d\x06\t*\x86H\x86\xf7\r\x01\x07\x06\xa0W0U\x02\x01\x000P\x06\t*\x86H\x86\xf7\r\x01\x07\x010\x1e\x06\t`\x86H\x01e\x03\x04\x01.0\x11\x04\x0c\xd3\x96\x0c\x91\x83\xd2l!\xfb\xa6\xc2\x90\x02\x01\x10\x80#\x97Z\xd1\xbb\xb4_\x12\xea\x1a\xed\x85\x0e\x9b1\xfa0j\xca1(\xc7\xc3\x8czT\xd4\x8fk\x08\x00\xa8\xcd\xe5\x82\xb3'}
>>> kms.decrypt(CiphertextBlob=key['CiphertextBlob'], EncryptionContext={'from': 'servicea-development-iad', 'to': 'serviceb-development-iad'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rlane/Envs/boto3/lib/python2.7/site-packages/botocore/client.py", line 249, in _api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidCiphertextException) when calling the Decrypt operation: None

The decryption request fails because the not_before and not_after keys/values are missing (or incorrect). Based on that, we can check the time validity in our application logic. Of course, if KMS could handle this for us, it would be nice, but at this time it’s not possible to use IAM policy variables in grants, only in key policy.

A downside to needing to pass the encryption context in for the decryption request is that it’s necessary to know all of the information to pass in. This means when we make a request from servicea to serviceb, we need to pass this information along with it. This is another reason I don’t bother with IAM policy for not_before and not_after; it’s necessary to pass these along with the request anyway.
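The application-side time check can be isolated into a small function. This is a minimal sketch, assuming the timestamp format used in the examples above; the function name and the injectable `now` argument are my own additions for illustration. The check is only meaningful because the same two strings are then passed to KMS as encryption context: if a caller alters them to widen the window, the decrypt request fails.

```python
import datetime

# Timestamp format from the examples above; values are assumed to be UTC.
TIME_FORMAT = "%Y%m%dT%H%M%SZ"


def is_within_validity(not_before, not_after, now=None):
    """Return True if `now` falls inside [not_before, not_after].

    `not_before` and `not_after` are strings as sent with the request;
    `now` is a naive UTC datetime and defaults to the current UTC time.
    (Hypothetical helper for illustration.)
    """
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
    start = datetime.datetime.strptime(not_before, TIME_FORMAT)
    end = datetime.datetime.strptime(not_after, TIME_FORMAT)
    # A chained comparison avoids operator-precedence mistakes with `not`.
    return start <= now <= end
```

A token minted for future work simply returns False until its `not_before` time has passed.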

Let’s look at some code for doing KMS authentication. Here’s the server side (using flask):

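import base64
import datetime
from functools import wraps

import boto3
from flask import abort, request

# `app` is the Flask application object, assumed to be created elsewhere.
kms = boto3.client('kms')


class TokenDecryptError(Exception):
    pass


def get_key_arn():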
    # You should cache this.
    key = kms.describe_key(
        KeyId='alias/{0}'.format(app.config['MASTER_KEY_ID'])
    )
    return key['KeyMetadata']['Arn']


def decrypt_token(token, _from, not_before, not_after):
    time_format = "%Y%m%dT%H%M%SZ"
    now = datetime.datetime.utcnow()
    _not_before = datetime.datetime.strptime(not_before, time_format)
    _not_after = datetime.datetime.strptime(not_after, time_format)
    # Ensure the token is within the validity window.
    if not (_not_before <= now <= _not_after):
        raise TokenDecryptError('Authentication error.')
    try:
        token = base64.b64decode(token)
        data = kms.decrypt(
            CiphertextBlob=token,
            EncryptionContext={
                # This token is sent to us.
                'to': app.config['IAM_ROLE'],
                # From another service.
                'from': _from,
                # It's valid from this time.
                'not_before': not_before,
                # And valid to this time.
                'not_after': not_after
            }
        )
        # Decrypt doesn't take KeyId as an argument. We need to verify the correct
        # key was used to do the decryption.
        # Annoyingly, the KeyId from the data is actually an arn.
        key_arn = data['KeyId']
        if key_arn != get_key_arn():
            raise TokenDecryptError('Authentication error.')
        plaintext = data['Plaintext']
    # We don't care which exception is thrown. If anything fails, we fail.
    except Exception:
        raise TokenDecryptError('Authentication error.')
    return plaintext


def require_auth(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        try:
            authz_subset = keymanager.decrypt_token(
                request.headers['X-Auth-Token'],
                request.headers['X-Auth-From'],
                request.headers['X-Auth-Not-Before'],
                request.headers['X-Auth-Not-After']
            )
            if key_has_privilege(authz_subset, f.__name__):
                return f(*args, **kwargs)
            else:
                return abort(401)
        except TokenDecryptError:
            return abort(401)
        # Paranoia
        return abort(401)
    return decorated

And here’s the client code (using requests):

import datetime
import boto3
import base64
import requests

now = datetime.datetime.utcnow()
not_after = now + datetime.timedelta(minutes=60)
now = now.strftime("%Y%m%dT%H%M%SZ")
not_after = not_after.strftime("%Y%m%dT%H%M%SZ")
auth_context = {
    'from': 'servicea-development-iad',
    'to': 'serviceb-development-iad',
    'not_before': now,
    'not_after': not_after
}
kms = boto3.client('kms')
token = kms.encrypt(
    KeyId='alias/authnz-testing',
    Plaintext='{"Actions":"GetMyUser"}',
    EncryptionContext=auth_context
)['CiphertextBlob']
token = base64.b64encode(token)
headers = {
    'X-Auth-Token': token,
    'X-Auth-From': auth_context['from'],
    'X-Auth-Not-Before': auth_context['not_before'],
    'X-Auth-Not-After': auth_context['not_after']
}
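# requests needs an absolute URL; this host is a placeholder for serviceb.
response = requests.get('https://serviceb.example.com/myuser', headers=headers)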

Notice that there’s something extra fun we’re doing here: we’re limiting the authorization scope of the authentication token from the client side. Even if this token gets stolen, it’s only allowed to perform the actions specified in the token, which is also encrypted, so the attacker wouldn’t know which actions it’s allowed to perform with the token. Doing this is likely a good idea for asynchronous calls enqueued for the future.
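The server code calls a `key_has_privilege` function that isn’t shown. One possible shape for it, assuming the decrypted payload is JSON with an `Actions` entry as in the client example, is sketched below. How a view function’s name maps to an action name is not specified in this post, so the sketch simply compares the requested action against the listed ones, and it fails closed on anything it can’t parse.

```python
import json


def key_has_privilege(authz_payload, action):
    """Return True if `action` is listed in the token's decrypted payload.

    Assumes a payload like '{"Actions":"GetMyUser"}', where "Actions" is a
    single action name or a list of names. This is a guess at the payload
    format for illustration; anything malformed is treated as no privilege.
    """
    try:
        scope = json.loads(authz_payload)
    except (TypeError, ValueError):
        return False
    if not isinstance(scope, dict):
        return False
    actions = scope.get('Actions', [])
    if isinstance(actions, str):
        actions = [actions]
    if not isinstance(actions, list):
        return False
    return action in actions
```

With the client payload above, `key_has_privilege('{"Actions":"GetMyUser"}', 'GetMyUser')` is true and any other action is refused.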

Of course, we have to consider both KMS rate limiting and latency. In general we should use full-privilege tokens that last long enough to ensure we never hit rate limiting. We should also try to avoid the encrypt/decrypt latency that comes with calls to KMS. We’ll need some caching for this to work at any reasonable scale.

I won’t go into lengthy detail here, since there are numerous ways of handling this, but I’ll give a few ideas:

  1. Slightly change our model. Rather than calling encrypt and decrypt, we could create data keys and pass the data keys along with data encrypted using the data key (which would be used as our token). We can then cache the data key in a central location (like DynamoDB, etcd, zookeeper, etc.). The data key itself would be encrypted with the encryption context described above. All clients and all targeted services could keep a decrypted version of the data key in-memory for the validity period. Assuming 50 clients, 50 servers and a 1 hour data key validity, it’s 100 KMS decryption requests per hour and only 1 encryption request per hour. Using this strategy it’s also possible to handle the encryption/decryption requests out of band of the applications to avoid the latency hit. Assuming a large number of service-to-service mappings, it could be complex to manage this out of band, though.
  2. Only use KMS auth to get a session, then use the session for all further calls. This is the lazy way, since it isn’t much work, but it also won’t provide quite the win, either. We have to take the latency hit for each initial auth, we’ll need to have an encryption request per hour per client, and will need a decryption request for every initial request from each client. Another downside here is that unless we cache the authz payload somewhere centrally, we’ll lose the ability to scope the tokens.
  3. Fingerprint the token on initial decryption and store it in a centralized cache (like memcache or redis) along with its authz payload and validity data. Subsequent requests would check the data in the cache, allowing it to avoid a decryption request. Similar to #2, it’s not quite as effective as #1, since we need an encryption request per client each hour and would also need a decryption request for each initial request from each client, every hour. It also requires a central cache for each service accepting requests.
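Here is a rough sketch of the first idea, with two substitutions made so it runs on the standard library alone. First, a random 32-byte value stands in for the plaintext data key; in practice that would be the `Plaintext` returned by `kms.generate_data_key`, with the `CiphertextBlob` shared centrally and decrypted once per service. Second, the payload is signed with an HMAC rather than encrypted, so unlike the scheme described above the payload is readable by anyone holding the token. The function names and token layout are assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json


def sign_payload(data_key, payload):
    """Return a token: base64(JSON payload) + '.' + base64(HMAC-SHA256)."""
    body = json.dumps(payload, sort_keys=True).encode('utf-8')
    mac = hmac.new(data_key, body, hashlib.sha256).digest()
    return '{0}.{1}'.format(
        base64.b64encode(body).decode('ascii'),
        base64.b64encode(mac).decode('ascii'),
    )


def verify_payload(data_key, token):
    """Return the payload if the token was signed with data_key, else None."""
    try:
        body_b64, mac_b64 = token.split('.')
        body = base64.b64decode(body_b64)
        mac = base64.b64decode(mac_b64)
    except (ValueError, TypeError):
        return None
    expected = hmac.new(data_key, body, hashlib.sha256).digest()
    # Constant-time comparison, so timing doesn't leak how much of the MAC matched.
    if not hmac.compare_digest(mac, expected):
        return None
    return json.loads(body.decode('utf-8'))
```

Once both sides hold the same data key in memory, signing and verifying are local operations, which is where the reduction to 100 decryption requests per hour for 50 clients and 50 servers comes from: each party decrypts the data key once per validity period.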

In all of the above solutions, if the caching fails, we fall back to encryption/decryption requests, which puts us at risk of rate limit failures, but could still allow authentication to continue working.
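The fingerprint idea (#3) with that fallback behaviour could look something like the following. It is a sketch only: an in-process dict stands in for memcache or redis, `decrypt` is whatever callable performs the real KMS request, and the class and function names are invented for this example. The fingerprint covers the claimed context as well as the token, so a cached token presented with a different `from` or validity window still goes through KMS.

```python
import calendar
import hashlib
import json
import time

# Timestamp format used for not_before/not_after in this post (UTC).
TIME_FORMAT = "%Y%m%dT%H%M%SZ"


def fingerprint(token, context):
    """Hash the raw token bytes together with the claimed encryption context."""
    digest = hashlib.sha256()
    digest.update(token)
    digest.update(json.dumps(context, sort_keys=True).encode('utf-8'))
    return digest.hexdigest()


class TokenCache(object):
    """Cache of decrypted authz payloads, keyed by token fingerprint."""

    def __init__(self, decrypt, clock=time.time):
        self._decrypt = decrypt  # callable(token, context) -> payload
        self._clock = clock      # injectable for testing
        self._entries = {}       # stand-in for a central cache

    def payload_for(self, token, context):
        key = fingerprint(token, context)
        entry = self._entries.get(key)
        if entry is not None and self._clock() <= entry[1]:
            return entry[0]
        # Miss or expired entry: fall back to a real decryption request.
        payload = self._decrypt(token, context)
        expires_at = calendar.timegm(
            time.strptime(context['not_after'], TIME_FORMAT))
        self._entries[key] = (payload, expires_at)
        return payload
```

This only saves decryption requests; the validity window still needs to be checked on every request before the payload is trusted.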

Like the STS solution, this is a proof of concept. It’s mostly an idea of how you could re-use AWS’s services to avoid having to do the initial trust step in your bootstrap process. I’m sure it has some holes I haven’t thought about, and I’d love to get your feedback!
