It’s been a while since I’ve written a technical post, so I thought, maybe I should write about what I’ve been working on for the past couple of months…
I’ve been building a test and development infrastructure for the Wikimedia Foundation using OpenStack and a number of other technologies. I’m not done yet, so I won’t get into any gory technical details (I promise I will later!). I will, however, give an overview of the architecture I’m aiming for.
[toc title="Table of Contents"]
We want a test and development infrastructure for a number of reasons:
- We’d like to replace the Tesla infrastructure I built for the Usability Initiative project
- We’d like to have an infrastructure where we can let volunteer developers and staff work collaboratively, and easily build infrastructures without relying on the limited resources of the (overworked) operations team
- We’d like to be able to run realistic tests of our operational infrastructure so that we can be better prepared for outages, and so that we have an environment to safely train new operations staff
- We’d like to have an infrastructure we can use to vet operations volunteers before we allow them access to the operational infrastructure
- We’d like to have an infrastructure where developers can easily have root
With the above goals in mind, the infrastructure needs to handle most things automatically. We (operations) don’t want to have to manage user accounts. We don’t want to have to create virtual machines for people. We don’t want to have to manage DNS entries, or IP addresses. We do want an infrastructure that is close to our production environment, but is flexible enough to let developers add infrastructure without our help. We do want an infrastructure that lets developers prepare their projects for running inside of the production cluster by using the exact same processes as the operations team.
I’m creating an infrastructure to handle this, and here’s the basic architecture:
- OpenStack as the virtualization technology
  - Four nodes: 1 controller node and 3 compute nodes (should be able to run roughly 100–120 VMs)
  - Handles VM creation and scheduling on compute nodes
  - Handles IP address allocation
  - Has EC2 and OpenStack APIs
  - Gets user account information from LDAP
  - Stores IP/VM information in MySQL
- PowerDNS with an LDAP backend for DNS entries
  - Currently using “strict” mode for this
  - Each VM gets a DNS entry based on the name of the VM, and a CNAME record based on the “instance id” provided when the VM is created
  - Can handle multiple DNS domains
- Puppet with an LDAP backend for puppet nodes
  - Node entries stored in LDAP so that users can easily select server types when creating VMs
  - Puppet manifests, files, and templates stored in an SVN or git repository
  - Everyone with an account will be able to modify the puppet configuration, but changes will need to be merged into the main branch by an operations team member
  - Ideally, branches can be merged into the production cluster’s puppet repository as well
- MediaWiki as the virtual machine manager
  - Manages VM creation/deletion/modification, DNS, Puppet, user accounts, sudo access, user groups, user SSH keys, and OpenStack projects
  - Uses the OpenStackManager extension and the LdapAuthentication extension
  - Progress on this is going well: I basically just need to add localization and more error checking for it to be usable (if you’d like to help with this, please do!)
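Since PowerDNS and Puppet both read from the same directory, a single LDAP entry per VM can carry its DNS data and its puppet classification. A hypothetical sketch of such an entry, assuming the common dnsdomain2 and puppetClient schemas; the DIT layout, names, and addresses are all invented for illustration:

```ldif
# Hypothetical entry for a VM named "testwiki1".
# aRecord/associatedDomain feed PowerDNS; puppetClass/puppetVar feed
# Puppet's LDAP node terminus. The second associatedDomain plays the
# role of the instance-id CNAME mentioned above.
dn: dc=testwiki1,ou=hosts,dc=wikimedia,dc=org
objectClass: dcObject
objectClass: dNSDomain
objectClass: domainRelatedObject
objectClass: puppetClient
dc: testwiki1
aRecord: 10.4.0.23
associatedDomain: testwiki1.example.wmflabs
associatedDomain: i-00000042.example.wmflabs
puppetClass: base
puppetClass: appserver
puppetVar: realm=testlabs
```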
## User account and project management
From an operations perspective, the big thing for this infrastructure is that it is low overhead for us. We’d like to empower our community without overburdening ourselves. A big part of this is not having to deal with user accounts, authentication, and authorization. Here’s how I plan on solving this problem:
User accounts are created and maintained by MediaWiki, using the LdapAuthentication and OpenStackManager extensions. Wiki admins will be able to create accounts for other people. When an account is created, the user will automatically get an account on the VMs and in OpenStack, since the account is created in LDAP, and everything in the infrastructure, excluding project wikis, will use LDAP for user accounts. New users will also be added to a shared default group, which will give them non-root access to the default project. Root access in the default project will be granted on an as-needed basis by the operations team. In this environment, users will be able to participate in any default-group shared projects. They will not be able to create VMs.
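Because everything consumes LDAP, creating one directory entry provisions the user everywhere at once. A sketch of what such an entry might look like, assuming the standard posixAccount and OpenSSH-LPK schemas; all values are invented:

```ldif
# Hypothetical user entry: the same record serves MediaWiki
# authentication, posix login on the VMs, and OpenStack.
dn: uid=jdoe,ou=people,dc=wikimedia,dc=org
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: ldapPublicKey
uid: jdoe
cn: Jane Doe
sn: Doe
uidNumber: 2023
gidNumber: 1000
homeDirectory: /home/jdoe
sshPublicKey: ssh-rsa AAAAB3example jdoe@laptop
```

Here gidNumber 1000 stands in for the shared default group that grants non-root access to the default project.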
OpenStack has a concept of “projects”. When a user is added to a project, they have the ability to create, delete, and manage VMs in that project. The default project will be maintained by the operations team. It will be a clone of the production environment. Access to this OpenStack project will be limited. Instead, we’d like real projects (like Usability Initiative, or Pending Changes, or OWA) to have OpenStack projects created specific to the real project. In this project they can create VMs, and maintain separate infrastructure from the default project.
OWA is a good example of a project that would need its own VMs. OWA needs LVS, Squid, Apache, MySQL, and a few other things. It runs differently from our current infrastructure, and could require changes that might break other things. Here, the initial configuration can be done entirely within a separate set of VMs; once the configuration stabilizes, it can be moved into the default project.
Project management will be handled via the OpenStackManager extension. Anyone in a project will be able to add/remove users to/from the project. Each OpenStack project is also a MediaWiki namespace, and a posix group on all VMs. Though everyone will have access to the default project VMs, only people added to other projects will be allowed to access the VMs in those projects. Users in non-default projects will also have sudo rights automatically granted to them on those VMs.
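Since each OpenStack project doubles as a posix group on the VMs, the automatic sudo grant can be as small as a one-line policy per project. A sketch, assuming a project (and thus group) named “owa”; the file path and policy are illustrative, not the actual configuration:

```
# Hypothetical /etc/sudoers.d/project-owa, deployed only to VMs in
# the "owa" project: members of the project's posix group get sudo
# on those VMs and nowhere else.
%owa ALL=(ALL) ALL
```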
Access to VMs will be limited to SSH key authentication. Users will be able to manage their own SSH keys via the OpenStackManager extension. These keys will be managed in LDAP, and VMs will sync the keys to the user’s authorized_keys file.
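The key-sync step can be sketched in a few lines of shell. This is a toy sketch, not the actual sync job: on a real VM the input would come from an LDAP query (something like `ldapsearch -x -LLL "(uid=jdoe)" sshPublicKey`, a hypothetical invocation), but here a canned sample stands in so the parsing is self-contained:

```shell
#!/bin/sh
# Sketch: turn ldapsearch-style output for a user's sshPublicKey
# attributes into authorized_keys lines.

extract_keys() {
    # Keep only sshPublicKey values, stripping the attribute name.
    sed -n 's/^sshPublicKey: //p'
}

# Canned sample standing in for real ldapsearch output.
sample='dn: uid=jdoe,ou=people,dc=wikimedia,dc=org
sshPublicKey: ssh-rsa AAAAB3example1 jdoe@laptop
sshPublicKey: ssh-rsa AAAAB3example2 jdoe@desktop'

printf '%s\n' "$sample" | extract_keys
```

In practice, a cron job or login hook on each VM would write this output to the user’s authorized_keys file.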
## VM, DNS, and Puppet management
The other part of not overburdening the operations team is for VM, DNS, and Puppet management to be mostly hands off. Currently, with the Tesla architecture, I need to create VMs from scratch, configure them, add user accounts and groups, and add users to sudoers. I also need to assign an IP address and add the VMs to DNS. Once a team believes their project is production ready, if there are architecture changes, I need to add those changes to puppet, and recreate the architecture in our production cluster. This is very time consuming. Ideally, most of this can be handled by the developer, and that’s what I’m aiming for with this infrastructure.
When a new project is formed, an operations team member will create an OpenStack project via the OpenStackManager extension, and will add a project member to it. After doing so, that user can add other users to the project, and can create their custom architecture. They’ll be able to go to an interface to create their VMs. When creating the VMs, they’ll be able to name them, give them a size (CPUs, memory, disk space), and manage puppet classes and variables. The puppet configuration will allow them to create VMs that are already pre-configured as specific types of VMs. For instance, if you want a VM that is configured like our Application servers, you’ll simply need to add the “appserver” class.
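To give a feel for how a server type like that could be defined, here is a hypothetical sketch of an “appserver” class; the real manifests live in the operations puppet repository, and the packages and structure shown are illustrative only:

```puppet
# Hypothetical sketch, not the production manifest: a class that
# makes a fresh VM look like one of our application servers.
class appserver {
    include base    # common packages, accounts, monitoring (assumed)

    package { ['apache2', 'php5', 'php5-mysql']:
        ensure => present,
    }

    service { 'apache2':
        ensure  => running,
        require => Package['apache2'],
    }
}
```

A developer picking “appserver” in the VM-creation interface would get this class applied on first boot, with no manual configuration.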
Once the VM is created, the OpenStackManager extension will add the DNS information and Puppet information to LDAP. When the VM is finished building, it’ll automatically sync user SSH keys from LDAP, configure itself using puppet, and will be available for SSH login.
Everything we do in the production cluster now occurs through puppet, and we’d like developers to do the same thing on their VMs. Though the OpenStackManager extension will only allow selection of configured classes and variables, that list will be defined in the puppet configuration, which will be managed through SVN or git. Developers can create puppet manifests, files, and templates, and can add them to the repository. The operations team will maintain the main branch, and will merge changes in. When it is time to move a project to the production cluster, we should be able to merge that puppet configuration into our production puppet repository, allowing developers to be part of the process from beginning to end.
We have a pretty annoying problem on Tesla right now. Most projects are sharing the Prototype VM. Some projects use the trunk version of MediaWiki, others use the deployment branch. Most projects share the same extensions directory. This causes problems where projects often break each other. Also, the Prototype VM isn’t configured like the cluster, and as such, code deployed from this environment may run into unexpected issues. A goal of this new infrastructure will be to solve this problem.
Ideally, most projects won’t need to create their own infrastructure for testing. Most projects are just creating MediaWiki extensions that should run perfectly in our existing infrastructure. What developers really need in this situation is a wiki with a full copy of the extensions directory. They need to be able to create a wiki with a choice of either using the deployment branch, or trunk. They need to be able to limit access to their wikis to people in their project, if it isn’t a shared project. They should be able to create these wikis without root access. They need this wiki to run in an environment that is configured as closely to the production cluster as possible.
My plan for this is a script in the default project, run via sudo, that will check out the code from SVN, create the wiki’s database, and add the wiki to the Apache configuration. It’ll set up the file permissions automatically for a shared wiki or a project wiki. Access to these wikis will be controlled via OpenStack projects, which are also posix groups. This also gives each project some flexibility: if they later decide they do need a VM for testing, they’ll be able to create one.
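As a rough illustration of one step such a script might perform, here is a sketch that generates a per-wiki Apache vhost. The domain, paths, and function name are placeholders, not the real configuration:

```shell
#!/bin/sh
# Sketch of the vhost-generation step of a hypothetical wiki-creation
# script. Domain and paths are invented for illustration.

make_vhost() {
    wiki="$1"
    cat <<EOF
<VirtualHost *:80>
    ServerName ${wiki}.example.wmflabs
    DocumentRoot /srv/wikis/${wiki}
    # Write access is restricted to the wiki's project (posix) group
    # via filesystem permissions set elsewhere in the script.
</VirtualHost>
EOF
}

make_vhost testwiki
```

The real script would also create the database and run the SVN checkout before writing this configuration and reloading Apache.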
## Want to help? Have an idea to make this better?
I’m still fairly early in the process. I’ve built a lot of the infrastructure and am mostly done with the OpenStackManager extension, but nothing has launched yet, so things can still be easily changed. If you want to help, or have ideas on how this could be done better, I’d love to hear from you; the same goes for the operations portion. I’m testing and developing this in the Tesla environment right now, so I can give out access fairly liberally. Even excluding the virtualization part, we’ll be building a clone of our production environment mostly from scratch, which is a great learning experience. It isn’t often you get to build something like this from the ground up, so if you are interested, let me know!
I’m hoping to have something that is at least ready for basic use by mid- or late January. Full implementation of my above plan will likely take a few months, though. With help, I can probably get it fully ready much sooner.