Tuesday, November 18, 2014

Introducing python-ambariclient

Apache Ambari is an open-source project that configures and manages Hadoop clusters.  The product I work on at work configures and manages Hadoop clusters... in the cloud (ooooh).  Now that Ambari has matured enough to be stable and have a fairly usable API, we've decided to discontinue the part of our product that overlaps with Ambari and instead use, and contribute to, that project.  Our product isn't going anywhere; just the ~50% of our codebase that does pretty much exactly what Ambari does will be replaced with Ambari, and the resources we would have spent building and maintaining our own code will go into improving Ambari.  Given that Ambari's primary audience is Java developers, their main client library effort is written in Groovy, another JVM-backed language that works seamlessly with Java.  They also have a Python client, but it's fairly immature, incomplete, and buggy.  Efforts to contribute to it proved onerous, mostly due to concerns about breaking backwards compatibility, so we decided that I would create a new client that we'd release to the public.  And so I'm here to announce our Python Ambari client library, aptly named python-ambariclient.

There were a few things I wanted out of the client that I felt we weren't able to accomplish easily with the existing one:
  1. An easy-to-intuit, consistent interface that mimicked the API structure.
  2. Native support for polling the long-running background operations that are common when working with Ambari.
  3. Easy addition of new types of objects as the Ambari API adds new features.
  4. A minimal number of actual HTTP requests executed.
  5. An ORM-style interface that felt natural to use coming from projects like SQLAlchemy and libcloud.
To accomplish all those goals, I felt a vaguely promises-style API would suit it best.  This would allow us to delay firing off HTTP requests until you actually need the response data to proceed, and I wanted the method-chaining style reminiscent of JavaScript projects like jQuery.  I was able to accomplish both, and I think it turned out pretty well.  It's a good example of what I've always wanted in an API client.  So, let's dive into some of the design decisions.

Delegation and Collections and Models, oh my

The main API client is just an entry point that delegates all of the actual logic to a set of collection objects, each of which represents a collection of resources on the Ambari server.  For those who are used to REST APIs, this might already make sense, but here are some examples to show what I mean:
# get all of the users in the system
users = ambari.users
for user in users:
    print user.user_name
# get all of the clusters in the system
clusters = ambari.clusters
for cluster in clusters:
    print cluster.identifier
The collections are iterable objects that contain a list of model objects, each representing a resource on the server.  There are some helper methods on the collections to do bulk operations, such as:
# delete all users (this will likely fail or break everything if it doesn't)
ambari.users.delete()
# update all users with a new password (bad idea, but hey)
ambari.users.update(password='new-password')
If you want to get a specific model out of a collection, that's easily accomplished by passing a single parameter into the accessor for the collection.
# get the admin user
admin_user = ambari.users('admin')
# get a specific cluster
cluster = ambari.clusters(cluster_name)
# get a specific host
host = ambari.hosts(host_name)
Additionally, you can get a subset of a collection by passing in a list of identifiers.
# get a subset of all hosts
hosts = ambari.hosts([hostname1, hostname2, hostname3])
So far, this is just the basic entry-point collections.  In Ambari, there's a large hierarchy of related resources and sub-resources: users have privileges, clusters have hosts, services have components, etc.  To handle that, each model object can have a set of related collections for the objects it contains.  So, for example:
# get all hosts on a specific cluster
hosts = ambari.clusters(cluster_name).hosts
# get a specific host on that cluster
host = ambari.clusters(cluster_name).hosts(host_name)
Some of the hierarchies are very deep.  These are the deepest examples I can find so far:
# get a repo for a specific OS for a specific version of a specific stack
ambari.stacks(stack_name).versions(stack_version).operating_systems(os_type).repositories(repo_id)
# get a component for a specific service for a specific version of a specific stack
ambari.stacks(stack_name).versions(stack_version).services(service_name).components(component_name)
Obviously those are outliers; in general use you only need to go one or two levels deep for most things, but it's good to know the pattern holds even for deep hierarchies.

When you get to the individual model objects, they behave much like a normal ORM.  They have CRUD methods like create, update, delete, and they use attribute-based accessors for the fields returned by the API for that resource.  For example:
cluster = ambari.clusters(cluster_name)
print cluster.cluster_id
print cluster.health_report
There's no fancy data validation or type coercion like in SQLAlchemy, just a list of field names that defines which attributes are available, but really that's all I think is necessary in an API client.  The server will do more robust validation, and I didn't see any places where automatic coercion made sense.  By automatic coercion I mean automatically converting datetime fields into datetime objects, or things of that nature.  I'm not doing that, and it's possible that decision will turn out to be shortsighted, but I'm guessing the simplicity of the current design will win out.
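To make that concrete, here's a minimal sketch of the idea, with invented names rather than the client's real internals: attribute access is gated by the fields tuple, and anything not listed there raises like a normal missing attribute.
# a minimal sketch with invented names, not the client's actual code
class FieldsSketch(object):
    fields = ('cluster_id', 'health_report')

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # only names listed in 'fields' are treated as API attributes;
        # the real models also lazily fetch missing fields from the
        # server (more on that below)
        if name in self.fields:
            return self._data.get(name)
        raise AttributeError(name)

sketch = FieldsSketch({'cluster_id': 2, 'health_report': {}})
print sketch.cluster_id  # 2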

Wait for it...

Because the client is a promises-style API, it doesn't necessarily populate the objects when you expect.  For the most part, if it can't accomplish what you're requesting without populating the object with data from the server, it will do that automatically for you.  Many operations are also fairly asynchronous, and what you as a user really care about is that you are safe to operate on a resource.  To accomplish that, there is a method called wait() on each object.  Calling wait() will do whatever is required for that model or collection to be in a "ready" state for you to act on it.  Whether that's simply requesting data from the server, waiting for a background operation to complete, or waiting for a host to finish registering itself with the Ambari server, the method is the same: .wait():
# wait for a recently-added host to be available in Ambari
ambari.hosts(host_name).wait()
# wait for a bootstrap call to finish and all hosts to be available
ambari.bootstrap.create(hosts=[hostname1, hostname2], **other_params).wait()

I have a request

In the Ambari API, if your POST or PUT command triggers a background operation, a 'request' object is returned in the response body.  It will look something like this:
{
  "href" : "http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/requests/1",
  "Requests" : {
    "id" : 1,
    "status" : "InProgress"
  }
}
If any API call returns this information, the Ambari client will automatically recognize that and store that information away.  Then, if you call .wait() on the object, it will poll the Ambari API until that request has completed.  At some point, it will start throwing exceptions if the request doesn't complete successfully, but that logic hasn't been built in yet.
# install all registered components on a host and wait until that's done
ambari.clusters(cluster_name).hosts(host_name).components.install().wait()
And to be consistent and obey the principle of least surprise, you can chain off wait() calls to do further actions, so this also works:
# install and start all registered components on a host and wait until it's done
ambari.clusters(cluster_name).hosts(host_name).components.install().wait().start().wait()
It's not generally a great idea to have a huge method chain like that, but it's possible.  It would be better written as:
components = ambari.clusters(cluster_name).hosts(host_name).components
components.install().wait()
components.start().wait()
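Under the hood, the request-waiting part of wait() boils down to a polling loop.  Here's a rough sketch of the idea; the helper names and polling interval are my own inventions, and the real method has to handle more cases (collections, host registration, and eventually failures):
import time

# a rough sketch of request polling, with invented helper names
def wait_for_request(client, request_href, poll_interval=5):
    while True:
        response = client.get(request_href)
        # 'InProgress' is the in-flight status shown in the response above
        if response['Requests']['status'] != 'InProgress':
            # this is where failed requests should eventually raise
            return response
        time.sleep(poll_interval)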

Wait, that's it?

I wanted it to be extremely easy to add new model classes to the client, because that was one of my biggest complaints with the existing client.  So most of the common logic is built into two base classes, called QueryableModel and DependentModel.  Now defining a model class is as simple as defining a few pieces of metadata, for example:
class Cluster(base.QueryableModel):
    path = 'clusters'
    data_key = 'Clusters'
    primary_key = 'cluster_name'
    fields = ('cluster_id', 'cluster_name', 'health_report', 'provisioning_state',
              'total_hosts', 'version', 'desired_configs',
              'desired_service_config_versions')
    relationships = {
        'hosts': ClusterHost,
        'requests': Request,
        'services': Service,
        'configurations': Configuration,
        'workflows': Workflow,
    }
  1. 'path' is the piece of the URL that should be appended to access this model.  i.e. /api/v1/clusters
  2. 'data_key' defines which part of the returned data structure contains the data for this particular model.  The Ambari API returns the main model's data in a subordinate structure because it also returns a lot of related objects.
  3. 'primary_key' is the field that is used to generate the URLs to a specific resource.  i.e. /api/v1/clusters/cluster_name
  4. 'fields' is a list of field names that are available as attributes on the model.
  5. 'relationships' is a mapping of accessor names to model classes, used to build related collection objects.  i.e. ambari.clusters(cluster_name).hosts == a collection of ClusterHost models
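As a rough illustration of why that metadata is enough, here's a hedged sketch of how 'path' and 'primary_key' might combine to produce resource URLs.  The class and attribute names here are invented; the real QueryableModel does quite a bit more.
# a simplified sketch with invented names, not the real base class
class SketchModel(object):
    path = None         # URL segment for the collection
    primary_key = None  # field used to address a single resource

    def __init__(self, parent_url, data):
        self.parent_url = parent_url
        self.data = data

    @property
    def url(self):
        return '%s/%s/%s' % (self.parent_url, self.path,
                             self.data[self.primary_key])

class SketchCluster(SketchModel):
    path = 'clusters'
    primary_key = 'cluster_name'

cluster = SketchCluster('/api/v1', {'cluster_name': 'testcluster'})
print cluster.url  # /api/v1/clusters/testcluster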
Some objects are not represented by actual URLs on the server and are only returned as related objects to other models.  These are called DependentModels in my client.  Here's a pretty simple one:
class BlueprintHostGroup(base.DependentModel):
    fields = ('name', 'configurations', 'components')
    primary_key = 'name'

class Blueprint(base.QueryableModel):
    path = 'blueprints'
    data_key = 'Blueprints'
    primary_key = 'blueprint_name'
    fields = ('blueprint_name', 'stack_name', 'stack_version')
    relationships = {
        'host_groups': BlueprintHostGroup,
    }
When you get a specific blueprint, it returns something like this:
{
  "href" : "http://c6401.ambari.apache.org:8080/api/v1/blueprints/blueprint-multinode-default",
  "configurations" : [
    {
      "nagios-env" : {
        "properties" : {
          "nagios_contact" : "greg.hill@rackspace.com"
        }
      }
    }
  ],
  "host_groups" : [
    {
      "name" : "namenode",
      "configurations" : [ ],
      "components" : [
        {
          "name" : "NAMENODE"
        }
      ],
      "cardinality" : "1"
    }
  ],
  "Blueprints" : {
    "blueprint_name" : "blueprint-multinode-default",
    "stack_name" : "HDP",
    "stack_version" : "2.1"
  }
}
As you can see, the 'Blueprints' key is the 'data_key', so that structure has the data related to the blueprint itself.  The 'host_groups' and 'configurations' structures are related objects that don't have URLs associated with them.  For those, we can define DependentModel classes to automatically expand them into usable objects.  So, now this works:
for host_group in ambari.blueprints(blueprint_name).host_groups:
    print host_group.name
    for component in host_group.components:
        print component['name']
I tried to make things act consistently even where they weren't consistent in the API.  It should be noted that objects that are backed by URLs are also returned in related collections like this, and the client will automatically use that data to prepopulate the related collections to avoid more HTTP requests.  For example, here is a very trimmed down cluster response:
{
  "href" : "http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster",
  "Clusters" : {
  },
  "requests" : [
    {
      "href" : "http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/requests/1",
      "Requests" : {
        "cluster_name" : "testcluster",
        "id" : 1
      }
    }
  ],
  "services" : [
    {
      "href" : "http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/services/GANGLIA",
      "ServiceInfo" : {
        "cluster_name" : "testcluster",
        "service_name" : "GANGLIA"
      }
    }
  ]
}
As you can see, both the 'requests' and 'services' related collections were returned here.  So, if you were then to do:
for service in ambari.clusters(cluster_name).services:
    print service.service_name
It would only have to do the single GET request to populate the cluster object, then use the data returned there to populate the service objects.  There is a caveat here.  When getting collections, the Ambari API generally only returns a minimal subset of information, usually just the primary_key and possibly the primary_key of its parent (in this case, service_name and cluster_name).  If you want to access any other fields on that object, the client will have to do another GET call to populate the remaining fields.  It does this for you automatically:
for service in ambari.clusters(cluster_name).services:
    print service.maintenance_state
'maintenance_state' was not among the fields returned by the original call, so the client will do a separate GET request to http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/services/GANGLIA to populate that information and then return it.

Smoothing out the rough edges

The Ambari API is mostly consistent, but there are some warts left over from old designs or one-off pieces.  The bootstrap API and configurations are the worst offenders in this regard.  All efforts were made to make those areas behave like the others as much as possible.  I didn't want the user to have to know, for example, that bootstrap requests aren't the same as every other asynchronous task, or that even when a bootstrap finishes, the hosts are not visible to Ambari until their agents have booted up and registered themselves.  So I overloaded the wait() method on those objects so that it just does the needful.
# wait until these hosts are in a ready state
ambari.hosts([hostname1, hostname2]).wait()
Similarly, adding a host to a cluster normally involves manually assigning all of the components, but an upcoming Ambari feature will make it so you simply pass in a blueprint and host_group and it will do the assignment for you automatically.  I preemptively smoothed this out in the client so you can do this now; it just involves a few more API requests being made automatically on your behalf.  Wherever things are inconsistent on the API server, the client makes them consistent for the user.
# add a new host to an existing host_group definition
ambari.clusters(cluster_name).hosts.create(host_name, blueprint=blueprint_name, host_group=host_group_name)
When the server side is updated to include support for this, I can simply pass the information along and let the server sort it out.  There are a few other cases where warts in the API were smoothed over, but for the most part the idioms in the client match up with the API server pretty well.

Where do we go from here?

There was one feature I really wanted that I wasn't able to wrap my head around sufficiently to implement in a clean, intuitive way: the ability to act on collections of collections.  Wouldn't it be awesome if this worked?
# restart all components on all hosts on all clusters
ambari.clusters.hosts.components.restart().wait()
The .wait() would get a list of clusters, then get a list of hosts per cluster in parallel, then get a list of components for each host in parallel, then call the restart API method for each of those, gobble up all the request objects, and wait until all of them completed before returning.  This should be possible, but it will require a bit more thought into how to implement it sanely, and there wasn't enough bang for the buck for our use-cases to justify spending the time right now.  But maybe I'll get back to it later.

What's it to me?

I realize Ambari is a niche product, and most of this post will be gobbledygook to most of you, but I think the general principles behind the client's design apply well to any REST-based API client.  I hope people find them useful and maybe lift a few of them for their own projects.  Most of all, I think this is probably the best client library I've ever written, and it embodies pretty much everything I've wanted in a client library in the past.  We plan on rewriting the client library for our own API in a similar fashion and releasing that to the public in the near future*.

* Usual disclaimer about forward-looking statements and all that.  I make no guarantee that this will actually happen.

Monday, November 17, 2014

Promise-based REST API clients

In software development, the concept of promises (sometimes also called futures) is deceptively simple.  There's a pretty good Wikipedia article that explains it better than I can, and its opening line is about as good a summary as I can think of:
In computer science, future, promise, and delay refer to constructs used for synchronizing in some concurrent programming languages. They describe an object that acts as a proxy for a result that is initially unknown, usually because the computation of its value is yet incomplete.
What that means in normal people English is that you call a method to do something (e.g. compute a value, gather some data) and, instead of doing what you requested immediately and making you wait, it returns a "promise" that it will eventually do what you requested.  While this is generally done to make concurrency easier, I think the concept works well for hierarchical REST API clients as well.
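If you want to play with the concept directly, Python has it in the standard library as concurrent.futures (available on Python 2 via the 'futures' backport).  A trivial example:

from concurrent.futures import ThreadPoolExecutor

def slow_computation():
    # stand-in for an expensive call
    return 42

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(slow_computation)  # returns a promise immediately
# ... do other work here while the computation runs ...
print future.result()  # blocks only now, when the value is needed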

Let me explain.  No, no, there is too much.  Let me sum up.

REST APIs are generally hierarchical.  You often see structures like:

GET /artists - get a list of artists
GET /artists/metallica - get the artist Metallica
GET /artists/metallica/albums - get a list of albums by Metallica

And so on.  What I wanted from an API client was to do something along the lines of this Python snippet:
for album in client.artists('metallica').albums:
    album.delete() # take that Metallica

What I didn't want to happen in the above snippet was for it to fire off all of these HTTP requests:

GET /artists
GET /artists/metallica
GET /artists/metallica/albums
DELETE /artists/metallica/albums/killemall
DELETE /artists/metallica/albums/ridethelightning
...

But instead just do:

DELETE /artists/metallica/albums

(or if the API didn't support bulk-delete in that way, just do a single GET followed by a series of DELETE calls)

So, enter promises.  If each method in that chain simply returned the promise of fetching the underlying data, I could chain off of it and only fire off HTTP requests for the things I actually needed to get.
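A toy sketch of the idea (all names here are invented): each accessor just extends a URL path, and nothing touches the network until a method actually needs the server.

# a toy sketch of lazy chaining; all names here are invented
class LazyCollection(object):
    def __init__(self, http, url):
        self._http = http
        self._url = url

    def __call__(self, identifier):
        # chaining just extends the URL; no request fires yet
        return LazyCollection(self._http, '%s/%s' % (self._url, identifier))

    def __getattr__(self, name):
        # sub-collections chain the same way
        return LazyCollection(self._http, '%s/%s' % (self._url, name))

    def delete(self):
        # only now does an HTTP request actually happen
        return self._http.request('DELETE', self._url)

# client.artists('metallica').albums.delete() would issue just:
# DELETE /artists/metallica/albums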

I should've taken the blue pill

So, that makes pretty good sense, but that's not all.  For just 3 easy payments of $29.95, you get not only this nice, easy-to-use method chaining and minimal HTTP request overhead, but also implicit parallelized requests.  What if, given the above example, you could also do something like:
for album in client.artists.albums:
    print album.tracklisting
And that just automatically did basically this:
  1. GET /artists
  2. for every $artist, GET /artists/$artist/albums (in parallel)
  3. for every album from every $artist, print the tracklisting
Now, let's see how deep this rabbit hole can go:
for track in client.artists.albums.tracks:
    print "%d - %s" % (track.number, track.name)
  1. GET /artists
  2. for every $artist, GET /artists/$artist/albums (in parallel)
  3. for every $album from every $artist, GET /artists/$artist/albums/$album/tracks (in parallel)
  4. for every track for every $album from every $artist, print the number and name
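That fan-out is the part I haven't built yet, but the machinery for it exists.  Here's a hedged sketch of step 2 using concurrent.futures, with invented helpers (client.get, artist.url):

from concurrent.futures import ThreadPoolExecutor

# a rough sketch of the parallel fan-out, with invented helper names
def all_albums(client, artists):
    with ThreadPoolExecutor(max_workers=10) as pool:
        # GET each artist's albums concurrently
        album_lists = list(pool.map(
            lambda artist: client.get(artist.url + '/albums'), artists))
    # flatten the per-artist lists into one list of albums
    return [album for albums in album_lists for album in albums]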

I am serious, and stop calling me Shirley

I think this is a really powerful idea that could make an API client very easy to use and actually, you know, fun.  But maybe I'm alone here, since every API client ever is just the same old boring bunch of methods, sometimes in namespaces, sometimes with actual model-style classes for the resources if you're lucky.  Granted, this idiom could be abused to fire off hundreds of HTTP requests in parallel and potentially overwhelm the server, either by accident or malice.  But in my opinion, the ease of use outweighs the potential for mass destruction.  I'd like to see a client that works this way, even if I have to build it myself (especially if I get to build it).  I've managed to build a client that does the first half of this, and does it pretty well, IMO (blog post coming soon).  I have a good idea of how to accomplish the second part, but I haven't yet overcome the effort-to-payoff ratio on that one.  It would be cool, and potentially useful, but it's a lot of work for such a small benefit.  Still, that itch needs scratching, and one day it shall be scratched.  Even though I realize that maybe it means I'm insane.

Friday, November 7, 2014

Call me Elmer

My wife commented to me the other day that I'm really good at picking up the slack.  She's kind of slammed with a lot of things right now, and I just do what needs to be done to make things work at home.  It got me thinking, and I think that's really me in a nutshell, both personally and professionally.  When I was interviewing with Rackspace, they asked me what role I filled on teams and I said I was The Glue.  I'm not sure my interviewer really understood what I meant by that and whether it was a good thing, so let me elaborate.  I make sure what needs to get done gets done; I bind together the pieces and make sure things stick.  

To me, there are a few archetypes that most good programmers seem to fall into.  Many end up having attributes from multiple archetypes, but generally one is the most prominent.  I'm not just The Glue; I'm also some other things, but what's most obvious, and where my biggest contributions come from, is being The Glue.  I'm going to avoid talking about the negative archetypes like The Hero, The Recluse, The Sheep, and The Manchild, but certainly even good programmers can revert into those at times, too.

The Glue


A glue programmer is very valuable.  They are skilled in a wide variety of tasks, they pick things up quickly, and they're fearless; they wouldn't be able to be The Glue otherwise.  The Glue binds everything together and makes sure everything holds up.  The Glue is selfless, caring more about the project or team's success than about their own individual satisfaction.  They'll do the mundane work that nobody else wants to do, not because they can't do more interesting things, but because it needs to be done for the project to succeed.  They tend to flit about between systems and responsibilities, making sure the details are being covered.  Their weakness is that their feeling of responsibility for everything can make them avoid getting too deep in any one particular area, lest something else falter.  Because of that, they're not the best choice for upfront architectural work, which requires a lot of dedicated focus on a specific problem for a long period of time.  They can lose sight of the big picture while they're focusing on keeping everything together.

The Architect


Architects really enjoy thinking about problems and coming up with solutions.  They're good at looking at things from a distance and seeing how all the components will need to work together to make the whole function.  They're great at starting projects, but often not as good at finishing them.  When they get down to the details and have to make hard decisions about compromising their vision, they can become paralyzed.  All projects require some tradeoffs to get finished, and they often feel that making them betrays their artistic ideals.  They're a great resource to have, but when you have too many, you'll notice a lot of grand, lofty ideas being discussed and planned with little actual end result to show for it.  These are the types who do well in interviews, because they like coming up with solutions to challenging problems, but if they don't also have some bits of other archetypes in them, they'll end up spending all their time devising the perfect theoretical solution without actually shipping software.

The Builder


A builder is perfect to pair with an architect, because they're great at seeing another person's vision and bringing it to life.  They are good at tracking the details and ensuring constant progress on the project.  They can be known as finishers, and are very task-oriented.  They tend to be pretty reliable at making visible progress on projects and getting things out the door on time.  They know when to make tradeoffs to ensure the project gets completed, but sometimes they can be known to cut corners too much.  The flipside is that, in their rush to get something functional out there, they can build something out of duct tape and glue, or something so convoluted and cumbersome to maintain that nobody else dares work on it.  When guided by a quality architect, though, they're extremely useful, and will make sure that you have something to show for all those lofty ideas.

The Firefighter


Production's down?  Who you gonna call?  The Firefighter, that's who.  They're great at jumping into a tense situation, keeping their composure, finding the problem, and fixing it.  They're not afraid to put a band-aid solution in place to keep things going while they work on a more permanent one.  That can break down when there are too many fires for them to ever get to the permanent solution, as the adrenaline of the chase will keep them putting band-aids on everything rather than solving fundamental problems.  They are great at debugging systems and understanding their complex interactions, so they can quickly see where the problem is and resolve it.  They should not be confused with The Hero, who rushes to get things out and then gets praised for quickly band-aiding the problem he created to begin with.  I often refer to that as jumping on the grenade you threw yourself.

The Fixer


This is the guy you call in when there's blood and guts everywhere and the cops are on the way.  You can add a fixer to a delayed project and actually negate the mythical man-month, as they will turn things around.  They're not afraid to step on toes or even run people over if they have to, as long as it serves to move things forward.  In doing that, they can burn bridges and drive people away, even though their intent is more altruistic and less personal than how it's received.  They're still useful to have, as they can rescue a doomed project and turn it into something useful for the company; just be aware that they might also drive some people from the team in the process.  Then again, if those people were finishing their projects, you wouldn't have had to call in The Fixer to begin with.

What are you?


I think I fall pretty well into The Glue for the most part, with a good chunk of The Builder, and maybe a touch of the others.  I don't enjoy having to be The Fixer, as I know I've offended some coworkers when I've needed to do that (usually when asked, but sometimes of my own volition).  Then again, at other times it's worked out wonderfully when those on the doomed project really wanted the help.  To me, the team is paramount.  If your lack of progress is going to prevent the team from succeeding, I'll try to give you a chance to correct course by offering help.  I can't stop you from hanging yourself once you have enough rope, though, and I'll be the jerk who takes over your project if I have to.  I'd prefer you didn't make that necessary, and hope that you don't take it personally.  Maybe that makes me a bad person.

Where do you fit?  Do you disagree with my self-assessment (assuming you've worked with me)? Or am I way off-base in my over-generalized archetypes?