Apache Ambari is an open-source project that configures and manages Hadoop clusters. The product I work on configures and manages Hadoop clusters... in the cloud (ooooh). Now that Ambari has matured enough to be stable and have a fairly usable API, we've decided to discontinue the part of our product that overlaps with Ambari and instead use and contribute to that project. Our product isn't going anywhere; just the ~50% of our codebase that does pretty much exactly what Ambari does will be replaced with Ambari, and the resources we would have spent building and maintaining our own code will go into improving Ambari. Given that Ambari's primary audience is Java developers, their main client-library effort is written in Groovy, another JVM-backed language that works seamlessly with Java. They also have a Python client, but it's fairly immature, incomplete, and buggy. Efforts to contribute to it proved onerous, mostly due to concerns about breaking backwards compatibility, so we decided that I would create a new client that we'd release to the public. And so I'm here to announce our Python Ambari client libraries, aptly named python-ambariclient.
There were a few things I wanted out of the client that I felt we weren't able to accomplish easily with the existing one:
- An easy-to-intuit, consistent interface that mimics the API structure.
- Native support for polling the long-running background operations that are common when working with Ambari.
- Easy extensibility as the Ambari API adds new features.
- A minimal number of actual HTTP requests.
- An ORM-style interface that feels natural coming from projects like SQLAlchemy and libcloud.
To accomplish all those goals, I felt a vaguely promises-style API would suit it best. That lets us delay firing off HTTP requests until you actually need the response data to proceed, and it allows the method-chaining style reminiscent of JavaScript projects like jQuery. I was able to accomplish both, and I think it turned out pretty well. It's a good example of what I've always wanted in an API client. So, let's dive into some of the design decisions.
Delegation and Collections and Models, oh my
The main API client is just an entry point that delegates all of the actual logic to a set of collection objects, each of which represents a collection of resources on the Ambari server. For those who are used to REST APIs, this might already make sense, but here are some examples to show what I mean:
# get all of the users in the system
users = ambari.users
for user in users:
    print(user.user_name)

# get all of the clusters in the system
clusters = ambari.clusters
for cluster in clusters:
    print(cluster.identifier)
The collections are iterable objects that contain a list of model objects, each representing a resource on the server. There are some helper methods on the collections to do bulk operations, such as:
# delete all users (this will likely fail or break everything if it doesn't)
ambari.users.delete()
# update all users with a new password (bad idea, but hey)
ambari.users.update(password='new-password')
If you want to get a specific model out of a collection, that's easily accomplished by passing a single parameter into the accessor for the collection.
# get the admin user
admin_user = ambari.users('admin')
# get a specific cluster
cluster = ambari.clusters(cluster_name)
# get a specific host
host = ambari.hosts(host_name)
Additionally, you can get a subset of a collection by passing in multiple arguments.
# get a subset of all hosts
hosts = ambari.hosts([hostname1, hostname2, hostname3])
So far, this is just the basic entry-point collections. In Ambari, there's a large hierarchy of related resources and sub-resources. Users have privileges, clusters have hosts, services have components, etc. To handle that, each model object can have a set of related collections for the objects it contains. So, for example:
# get all hosts on a specific cluster
ambari.clusters(cluster_name).hosts

# get a specific host on that cluster
host = ambari.clusters(cluster_name).hosts(host_name)
Some of the hierarchies are very deep. These are the deepest examples I can find so far:
# get a repo for a specific OS for a specific version of a specific stack
ambari.stacks(stack_name).versions(stack_version).operating_systems(os_type).repositories(repo_id)
# get a component for a specific service for a specific version of a specific stack
ambari.stacks(stack_name).versions(stack_version).services(service_name).components(component_name)
Obviously those are outliers; in general use you only need to go one or two levels deep for most things, but it's good to know the pattern holds even for deep hierarchies.
When you get to the individual model objects, they behave much like a normal ORM. They have CRUD methods like create, update, and delete, and they use attribute-based accessors for the fields returned by the API for that resource. For example:
cluster = ambari.clusters(cluster_name)
print(cluster.cluster_id)
print(cluster.health_report)
There's no fancy data validation or type coercion like in SQLAlchemy, just a list of field names defining which attributes are available, but really that's all I think an API client needs. The server will do more robust validation, and I didn't see any places where automatic coercion made sense. By automatic coercion I mean things like converting datetime fields into datetime objects. I'm not doing that, and it's possible that decision turns out to be shortsighted, but I'm guessing the simplicity of the current design will win out.
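As a rough illustration of that design, a fields-driven model can be little more than a tuple of names plus a `__getattr__`. This is a hypothetical sketch of the pattern, not the client's actual code:

```python
# Hypothetical sketch of a fields-driven model: a tuple of field names
# plus a __getattr__, with no validation or type coercion.
class Model(object):
    fields = ('cluster_id', 'cluster_name', 'health_report')

    def __init__(self, data):
        self._data = data  # raw field values as returned by the API

    def __getattr__(self, name):
        # only names declared in `fields` act as API attributes
        if name in self.fields:
            return self._data.get(name)
        raise AttributeError(name)

cluster = Model({'cluster_name': 'testcluster', 'health_report': {}})
print(cluster.cluster_name)  # testcluster
```

Anything the server returns for a declared field passes through untouched, which is what keeps the model layer so thin.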
Wait for it...
Because the client is a promises-style API, it doesn't necessarily populate objects when you expect. For the most part, if it can't accomplish what you're requesting without populating the object with data from the server, it will do so automatically. Many operations are also fairly asynchronous, and what you as a user really care about is that it's safe to operate on a resource. To accomplish that, there is a method called wait() on each object. Calling wait() will do whatever is required for that model or collection to be in a "ready" state for you to act on it. Whether that's simply requesting data from the server, waiting for a background operation to complete, or waiting for a host to finish registering itself with the Ambari server, the method is the same: .wait():
# wait for a recently-added host to be available in Ambari
ambari.hosts(host_name).wait()
# wait for a bootstrap call to finish and all hosts to be available
ambari.bootstrap.create(hosts=[hostname1, hostname2], **other_params).wait()
I have a request
In the Ambari API, if your POST or PUT command triggers a background operation, a 'request' object is returned in the response body. It will look something like this:
{
  "href" : "http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/requests/1",
  "Requests" : {
    "id" : 1,
    "status" : "InProgress"
  }
}
If any API call returns this information, the Ambari client will automatically recognize it and store it away. Then, if you call .wait() on the object, the client will poll the Ambari API until that request has completed. At some point it will start throwing exceptions if the request doesn't complete successfully, but that logic hasn't been built yet.
# install all registered components on a host and wait until that's done
ambari.clusters(cluster_name).hosts(host_name).components.install().wait()
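The polling behind that wait() can be pictured with a sketch like the following, where `get_request` stands in for the GET against the stored request href. This is an illustrative assumption about the mechanism, not the client's actual internals:

```python
import time

def wait_for_request(get_request, interval=5, timeout=3600):
    """Poll the request resource until its status leaves 'InProgress'."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # get_request() returns a parsed response body like the one above
        status = get_request()['Requests']['status']
        if status != 'InProgress':
            return status  # e.g. 'COMPLETED' or 'FAILED'
        time.sleep(interval)
    raise RuntimeError('timed out waiting for request to complete')
```

A real implementation would also surface failed statuses as exceptions, which is exactly the behavior mentioned above as planned but not yet built.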
And to be consistent and obey the principle of least surprise, you can chain off wait() calls to do further actions, so this also works:
# install and start all registered components on a host and wait until it's done
ambari.clusters(cluster_name).hosts(host_name).components.install().wait().start().wait()
It's generally not a great idea to have one huge method chain like that, but it's possible. It would read better written as:
components = ambari.clusters(cluster_name).hosts(host_name).components
components.install().wait()
components.start().wait()
Wait, that's it?
I wanted it to be extremely easy to add new model classes to the client, because that was one of my biggest complaints with the existing client. So most of the common logic is built into two base classes, called QueryableModel and DependentModel. Now defining a model class is as simple as defining a few pieces of metadata, for example:
class Cluster(base.QueryableModel):
    path = 'clusters'
    data_key = 'Clusters'
    primary_key = 'cluster_name'
    fields = ('cluster_id', 'cluster_name', 'health_report', 'provisioning_state',
              'total_hosts', 'version', 'desired_configs',
              'desired_service_config_versions')
    relationships = {
        'hosts': ClusterHost,
        'requests': Request,
        'services': Service,
        'configurations': Configuration,
        'workflows': Workflow,
    }
- 'path' is the piece of the URL that should be appended to access this model. i.e. /api/v1/clusters
- 'data_key' defines which part of the returned data structure contains the data for this particular model. The Ambari API returns the main model's data in a subordinate structure because it also returns a lot of related objects.
- 'primary_key' is the field that is used to generate the URLs to a specific resource. i.e. /api/v1/clusters/cluster_name
- 'fields' is a list of field names that should be returned in the model's data.
- 'relationships' is a mapping of accessor names to the model classes of related collections. i.e. ambari.clusters(cluster_name).hosts == collection of ClusterHost models
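To make that metadata concrete, here's a rough, hypothetical sketch of how 'path', 'primary_key', and 'relationships' could drive URL building and related-collection accessors. The class names and structure are illustrative assumptions, not the actual base classes:

```python
# Illustrative stand-in for a collection bound to a parent resource's URL
class Collection(object):
    def __init__(self, model_class, parent_url):
        self.model_class = model_class
        self.parent_url = parent_url

class QueryableModel(object):
    path = ''
    primary_key = ''
    relationships = {}

    def __init__(self, parent_url='/api/v1', data=None):
        self._parent_url = parent_url
        self._data = data or {}

    @property
    def url(self):
        # <parent>/<path>/<primary_key value>, e.g. /api/v1/clusters/testcluster
        return '%s/%s/%s' % (self._parent_url, self.path,
                             self._data[self.primary_key])

    def __getattr__(self, name):
        # accessors named in `relationships` return a collection rooted
        # at this model's URL
        if name in self.relationships:
            return Collection(self.relationships[name], parent_url=self.url)
        raise AttributeError(name)

class Host(QueryableModel):
    path = 'hosts'
    primary_key = 'host_name'

class Cluster(QueryableModel):
    path = 'clusters'
    primary_key = 'cluster_name'
    relationships = {'hosts': Host}

cluster = Cluster(data={'cluster_name': 'testcluster'})
print(cluster.url)  # /api/v1/clusters/testcluster
```

With this shape, adding a new resource type really is just declaring a class with a few attributes, which was the goal stated above.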
Some objects are not represented by actual URLs on the server and are only returned as related objects to other models. These are called DependentModels in my client. Here's a pretty simple one:
class BlueprintHostGroup(base.DependentModel):
    fields = ('name', 'configurations', 'components')
    primary_key = 'name'

class Blueprint(base.QueryableModel):
    path = 'blueprints'
    data_key = 'Blueprints'
    primary_key = 'blueprint_name'
    fields = ('blueprint_name', 'stack_name', 'stack_version')
    relationships = {
        'host_groups': BlueprintHostGroup,
    }
When you get a specific blueprint, it returns something like this:
{
  "href" : "http://c6401.ambari.apache.org:8080/api/v1/blueprints/blueprint-multinode-default",
  "configurations" : [
    {
      "nagios-env" : {
        "properties" : {
          "nagios_contact" : "greg.hill@rackspace.com"
        }
      }
    }
  ],
  "host_groups" : [
    {
      "name" : "namenode",
      "configurations" : [ ],
      "components" : [
        {
          "name" : "NAMENODE"
        }
      ],
      "cardinality" : "1"
    }
  ],
  "Blueprints" : {
    "blueprint_name" : "blueprint-multinode-default",
    "stack_name" : "HDP",
    "stack_version" : "2.1"
  }
}
As you can see, the 'Blueprints' key is the 'data_key', so that structure has the data related to the blueprint itself. The 'host_groups' and 'configurations' structures are related objects that don't have URLs associated with them. For those, we can define DependentModel classes to automatically expand them into usable objects. So, now this works:
for host_group in ambari.blueprints(blueprint_name).host_groups:
    print(host_group.name)
    for component in host_group.components:
        print(component['name'])
I tried to make things act consistently even where the API itself isn't consistent. It should be noted that objects backed by URLs are also returned in related collections like this, and the client will automatically use that data to prepopulate the related collections to avoid extra HTTP requests. For example, here is a very trimmed-down cluster response:
{
  "href" : "http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster",
  "Clusters" : {
  },
  "requests" : [
    {
      "href" : "http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/requests/1",
      "Requests" : {
        "cluster_name" : "testcluster",
        "id" : 1
      }
    }
  ],
  "services" : [
    {
      "href" : "http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/services/GANGLIA",
      "ServiceInfo" : {
        "cluster_name" : "testcluster",
        "service_name" : "GANGLIA"
      }
    }
  ]
}
As you can see, both the 'requests' and 'services' related collections were returned here. So, if you were then to do:
for service in ambari.clusters(cluster_name).services:
    print(service.service_name)
It would only have to do the single GET request to populate the cluster object, then use the data returned there to populate the service objects. There is a caveat here. When getting collections, the Ambari API generally returns only a minimal subset of information, usually just the primary_key and possibly the primary_key of its parent (in this case, service_name and cluster_name). If you want to access any other field on that object, another GET call is needed to populate the remaining fields. The client does this for you automatically:
for service in ambari.clusters(cluster_name).services:
    print(service.maintenance_state)
'maintenance_state' was not among the fields returned by the original call, so it will do a separate GET request for http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/services/GANGLIA to populate that information and then return it.
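That lazy inflation can be sketched along these lines, where `fetch` stands in for the extra GET to the resource's href. The class is an illustrative assumption, not the client's real model code:

```python
# Hypothetical sketch of lazy field population: serve known fields from
# the partial data, and fetch the full resource once on a cache miss.
class LazyModel(object):
    fields = ('service_name', 'cluster_name', 'maintenance_state')

    def __init__(self, partial, fetch):
        self._data = dict(partial)  # minimal data from the collection response
        self._fetch = fetch         # callable doing the GET for the full resource
        self._loaded = False

    def __getattr__(self, name):
        if name not in self.fields:
            raise AttributeError(name)
        if name not in self._data and not self._loaded:
            # field missing from the partial data: fetch the full resource once
            self._data.update(self._fetch())
            self._loaded = True
        return self._data.get(name)
```

The key property is that iterating over a collection and reading only the prepopulated keys costs no extra requests; touching anything else costs exactly one more GET per model.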
Smoothing out the rough edges
The Ambari API is mostly consistent, but there are some warts left over from old designs and one-off pieces. The bootstrap API and the configurations are the worst offenders in this regard. All efforts were made to make those areas behave like the rest as much as possible. I didn't want the user to have to know, for example, that bootstrap requests aren't the same as every other asynchronous task, or that even when a bootstrap finishes, the hosts are not visible to Ambari until their agents have booted up and registered themselves. So, I overloaded the wait() method on those objects so that it just does the needful.
# wait until these hosts are in a ready state
ambari.hosts([hostname1, hostname2]).wait()
Similarly, adding a host to a cluster normally involves manually assigning all of the components, but an upcoming Ambari feature will make it so you simply have to pass in a blueprint and host_group and it will do that for you automatically. I pre-emptively smoothed this out in the client so you can do this now; it just involves a few more API requests made automatically on your behalf. Wherever things are inconsistent on the API server, my client makes them consistent for the user.
# add a new host to an existing host_group definition
ambari.clusters(cluster_name).hosts.create(host_name, blueprint=blueprint_name, host_group=host_group_name)
When the server-side is updated to include support for this, I can simply pass the information along and let it sort it out. There are a few other cases where warts in the API were smoothed over, but for the most part the idioms in the client matched up with the API server pretty well.
Where do we go from here?
There was one feature that I really wanted to have that I wasn't able to wrap my head around sufficiently to implement in a clean, intuitive way. That is the ability to act on collections of collections. Wouldn't it be awesome if this worked?
# restart all components on all hosts on all clusters
ambari.clusters.hosts.components.restart().wait()
The .wait() would get a list of clusters, then get a list of hosts per cluster in parallel, then get a list of components for each host in parallel, then call the restart API method for each of those, gobble up all the request objects, and wait until all of them completed before returning. This should be possible, but it will require a bit more thought into how to implement it sanely, and there wasn't enough bang for the buck for our use-cases to justify spending the time right now. But maybe I'll get back to it later.
What's it to me?
I realize Ambari is a niche product, and most of this post will be gobbledygook to most of you, but I think the general principles behind the client's design apply well to any REST-based API client. I hope people find them useful and maybe lift a few of them for their own projects. Most of all, I think this is probably the best client library I've ever written, and it embodies pretty much everything I've wanted in a client library in the past. We plan on rewriting the client library for our own API in a similar fashion and releasing that to the public in the near future*.
* Usual disclaimer about forward-looking statements and all that. I make no guarantee that this will actually happen.