Monday, June 23, 2014

Designing an Access Control System

Designing the Proper Thing

One thing I've learned in my career is that in order to build something that will outlast its original intention, you need to boil the idea down to a fundamental use case and design the interface to your system around that. A clean interface, whether that's a user interface or an API, makes all the difference. It's really the Unix philosophy of "do one thing and do it well". Once you have that clean, well-defined interface, the actual implementation can be done and redone or built upon as needed without forcing consumers of your code to make any changes to compensate. And not forcing users to redesign around your short-sightedness goes a long way toward making them want to use your software.

Shortly after I started at Liquid Web, I was tasked with building a system that would allow us to selectively restrict access to portions of our internal applications to the employees whose jobs required that access. This idea is common, and there are systems out there designed to handle similar workloads, LDAP being the most notable. However, what I found while researching options was that many of them foisted the bulk of the actual restriction effort onto the user interface, which I found problematic from a maintenance perspective.

Imagine you're writing an interface, and you want to selectively hide parts of the UI so people who can't do those actions aren't distracted by them being there. You can go about that in a few ways. For a group-based approach, you can say "Is Joe a member of the flibbertygibbit group? Ok, he sees the flibbertygibbit widget". You can take a role-based approach, and then the question becomes "Is Nancy a SuperHeadAdminPerson? Ok, she can see the 'delete user' button". In either of these cases, what happens when the organization restructures itself and suddenly the flibbertygibbit group has all new responsibilities that don't include the flibbertygibbit widget? What if a new level is inserted above SuperHeadAdminPerson (SuperDuperHeadAdminPerson), and now SuperHeadAdminPerson is no longer allowed to delete users? Why, you get to wait until the developers have time to retool the entire interface to fix the issues with people seeing the wrong things. That's just bad voodoo, so I wasn't too keen on taking that approach. I had to find a better way.

After much pondering, some caffeine, probably a nap or two, and much wasting time on Reddit, I found a light bulb and turned it on. What access control boils down to, after you strip away all the groups, and roles, and everything else, is: "Can this person do this action?". It's deceptively simple. Can Jimbob give this customer a $1 billion credit? Can Suzie delete this SuperUser account? Can Dr. Evil get sharks with friggin laser beams on their heads? It's really that simple. So, given that simple idea, could we design a system around "Can $user do $action?" I should note at this point that LDAP could do that at the time, but the way it did it wasn't very natural, and it wouldn't scale to the level of thousands of possible 'actions' we were envisioning. It's possible that has since changed, but I haven't had to revisit it so I don't know. So we went about designing a system that would answer this question quickly, frequently, and with maximum flexibility.

Designing the Thing Proper

They say when you have a nail, you want to hit it with a hammer. Wait, no, what was it again? I can't remember. Anyway, when we went to design the system, we opted to use a) Perl and b) Postgres, because both are excellent tools for building performant, scalable systems, and they just happened to be what everything else at the company was using (except for some crappy legacy systems using MySQL). Oh, and I forgot one other critical piece: memcached. Now we had a stew going. Actually, not quite yet. We didn't want a situation where we had to go in for each user and every possible action and flip a toggle. That just wouldn't scale. Thousands of users, potentially tens of thousands of actions, that's just begging to grind to a halt. I was wishing that there was just some nice hierarchical data structure where we could define blanket permit/deny statements, then give more granular exceptions to that, like "Jimbob can do anything to an account... except delete it". While I was pondering this, my boss kindly walked over and yelled something in my ear. "LTREE you idiot" he said. His version of reality may vary on that. If you don't know ltree, the Postgres data type for hierarchical, dot-separated label paths, go check it out, it's amazing. It's basically exactly what I wanted. With that tidbit, I was off to the races. Now I could set permissions:

jimbob CAN: Account
jimbob CAN'T: Account.Delete

And then when I ask the question:

Can jimbob create an Account? - yes!
Can jimbob update an Account? - yes!
Can jimbob delete an Account? - no!

But I didn't have to specifically tell the system about Account.Create and Account.Update, because jimbob can already do Account (there's an implicit .*, so Account = Account.*).
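To make the lookup concrete, here's a rough Python sketch of the idea. It is not the real implementation (that lived in Perl and Postgres with ltree); the `resolve` function and the plain dict of rules are hypothetical stand-ins, and it assumes that the most specific matching rule wins, which is what the jimbob example implies.

```python
from typing import Dict, Optional

def resolve(rules: Dict[str, bool], action: str) -> Optional[bool]:
    """Return the verdict of the most specific rule covering `action`.

    `rules` maps a dotted path such as "Account" or "Account.Delete" to
    True (can) or False (can't). Returns None when no rule applies.
    Assumption: the longest matching prefix wins.
    """
    labels = action.split(".")
    # Walk from the full path up towards the root:
    # "Account.Delete" first, then "Account".
    for depth in range(len(labels), 0, -1):
        prefix = ".".join(labels[:depth])
        if prefix in rules:
            return rules[prefix]
    return None

jimbob = {"Account": True, "Account.Delete": False}

print(resolve(jimbob, "Account.Create"))  # True
print(resolve(jimbob, "Account.Delete"))  # False
print(resolve(jimbob, "Billing.Refund"))  # None
```

The `None` result matters: it lets a caller tell "explicitly denied" apart from "nobody said anything about this", and decide what to do in the second case.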

So going back to my original comment about clean interfaces, as long as the system continued to answer "Can $user do $action?", it didn't matter how much complexity went into producing those answers. All any consumer of my system needed to know was the answer to that question, and its thousands of siblings.

So, despite my earlier comments about groups and roles, they are still useful constructs for defining what people can or can't do, as long as you're not exposing that level of detail through the interface.  So, to make a long story just a wee shorter, this was the basic structure we came up with for how the system would answer the ultimate question:

Action - the thing that can or can't be done
Role - a collection of rules about which actions can or can't be done
Group - a collection of users that are assigned the same roles
User - can have roles, be assigned to groups that have other roles, and can also have user-specific rules about additional actions outside the scope of its other roles

So, answering the question was potentially a multi-step process:

1. Can this user do this action?
2. Can any role this user has do this action?
3. Does any role on any group assigned to this user have permission to do this action?
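The three steps above can be sketched in Python roughly like this. Everything here is hypothetical naming (`Role`, `Group`, `User`, `any_role`, `can`), and it leans on a few assumptions the post doesn't spell out: the first step that has an opinion decides, a step says yes if any of its roles permits the action, and the answer is no when nothing matches at all.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Rules = Dict[str, bool]  # dotted action path -> True (can) / False (can't)

def resolve(rules: Rules, action: str) -> Optional[bool]:
    """Most specific matching rule wins; None if no rule applies."""
    labels = action.split(".")
    for depth in range(len(labels), 0, -1):
        verdict = rules.get(".".join(labels[:depth]))
        if verdict is not None:
            return verdict
    return None

@dataclass
class Role:
    rules: Rules

@dataclass
class Group:
    roles: List[Role] = field(default_factory=list)

@dataclass
class User:
    rules: Rules = field(default_factory=dict)
    roles: List[Role] = field(default_factory=list)
    groups: List[Group] = field(default_factory=list)

def any_role(roles: List[Role], action: str) -> Optional[bool]:
    """True if any role permits, False if roles only deny, None if silent."""
    verdicts = [v for v in (resolve(r.rules, action) for r in roles) if v is not None]
    return any(verdicts) if verdicts else None

def can(user: User, action: str) -> bool:
    steps = (
        lambda: resolve(user.rules, action),   # 1. user-specific rules
        lambda: any_role(user.roles, action),  # 2. the user's own roles
        lambda: any_role([r for g in user.groups for r in g.roles], action),  # 3. group roles
    )
    for step in steps:
        verdict = step()
        if verdict is not None:
            return verdict
    return False  # assumption: deny when nothing matches

support = Role({"Account": True, "Account.Delete": False})
support_dept = Group(roles=[support])
jimbob = User(groups=[support_dept])
suzie = User(rules={"Account.Delete": True}, groups=[support_dept])

print(can(jimbob, "Account.Update"))  # True
print(can(jimbob, "Account.Delete"))  # False
print(can(suzie, "Account.Delete"))   # True
```

Suzie's user-specific rule is the "exception where needed": she sits in the same department group as Jimbob, but her own rule is consulted first.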

This setup gave us a huge amount of flexibility in defining permissions for various users, and we then mapped all the company departments to groups, and put their employees (users) in those groups.   Now we had the best of both worlds: broad definition of permissible actions to whole departments, with the ability to make exceptions where needed.  And the consumer of the system still only cared about one, simple thing.

We made pretty extensive use of memcached to prevent repeated calculations of the same data in quick succession, and from that we had a system that still uses very few resources despite powering access control for both our public API and internal intra-department API, as well as many other internal systems. Not bad for a few days' work (ok ok, it really took about a month).
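As an illustration of the caching idea only, here's a small Python sketch that puts a time-limited cache in front of an expensive check. A plain in-process dict stands in for memcached, and the `CachedCheck` class, the 30-second lifetime, and the `slow_check` function are all made up for the example.

```python
import time
from typing import Callable, Dict, List, Tuple

class CachedCheck:
    """Wrap a slow can(user, action) check with a short-lived cache.

    Assumption: answers may be reused for `ttl` seconds. A dict stands in
    for memcached here; a real deployment would share the cache between
    processes and let the cache server handle expiry.
    """

    def __init__(self, check: Callable[[str, str], bool], ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._cache: Dict[Tuple[str, str], Tuple[float, bool]] = {}

    def can(self, user: str, action: str) -> bool:
        key = (user, action)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the expensive lookup
        answer = self._check(user, action)
        self._cache[key] = (now + self._ttl, answer)
        return answer

calls: List[Tuple[str, str]] = []

def slow_check(user: str, action: str) -> bool:
    calls.append((user, action))  # record how often the real check runs
    return action != "Account.Delete"

cached = CachedCheck(slow_check, ttl=30.0)
cached.can("jimbob", "Account.Update")
cached.can("jimbob", "Account.Update")
print(len(calls))  # 1
```

The trade-off is the usual one: a permission change can take up to `ttl` seconds to be noticed unless the affected keys are invalidated when rules change.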

What's the big deal?

So, I keep going on about how a clean interface enables you to become one with the universe or something. Due to the flexible design of the system and the single clear point of interaction, we were able to adapt it for use by customer accounts with our public API and web UI with about 2 hours of work to allow for running multiple copies pointing at different databases.  When we needed to add rate-limiting to the API, we knew that it was really just a slight adjustment on "Can $user do $action" to be "Can $user do $action... again?".  All of the same idioms lined up, so we simply added a layer to track requests to the methods in a performant way so that asking that question became cheap enough to be useful for that purpose.
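The "... again?" variant can be sketched as a counter per user, action, and time window. This is only one way to do it (a fixed-window counter); the post doesn't say which scheme was actually used, and the `RateLimiter` class, its `can_again` method, and the limits below are hypothetical.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, Tuple

class RateLimiter:
    """Fixed-window limiter: at most `limit` calls per `window` seconds.

    Assumption: counts are kept in process memory and old windows are never
    cleaned up; a shared store with expiring keys would handle both.
    """

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window
        self._clock = clock
        self._counts: DefaultDict[Tuple[str, str, int], int] = defaultdict(int)

    def can_again(self, user: str, action: str) -> bool:
        # Requests in the same window share one counter.
        bucket = int(self._clock() // self._window)
        key = (user, action, bucket)
        if self._counts[key] >= self._limit:
            return False
        self._counts[key] += 1
        return True

# A frozen clock keeps the example deterministic.
limiter = RateLimiter(limit=2, window=60.0, clock=lambda: 0.0)
answers = [limiter.can_again("jimbob", "Account.Update") for _ in range(3)]
print(answers)  # [True, True, False]
```

The caller still asks a yes/no question about a user and an action, so the calling code looks the same as it does for a permission check.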

This was really a fluke project. I've never had another project that lasted as long as it has with as few modifications required to keep up with the needs of the company. In hindsight, I should have made a simpler way to define the roles, as that was the biggest stumbling block for people who worked on the system later, and I was maybe a tad over-aggressive with the memcached use, but besides that, it's held up extremely well. Perhaps those who are now maintaining it will disagree with that assessment.
