The cache debacle

February 18, 2009

It has been some time since I have had a chance to write anything for my blog. This doesn’t mean that I haven’t been keeping an eye on the odd posting on a number of my favourite work-related blogs, and I couldn’t help but notice the sudden spate of comments in response to a series of posts by Ashraf Motiwala.

There have been many intelligent responses to these posts, and for the most part I had no intention of getting into the debate. That was until I noticed a post titled “why cache and virtual directories???” by Tim Paul. In this post, Tim argues that using a cache can significantly reduce the lag that he believes could result from placing a virtual directory server (VDS) in front of a backend data repository. Tim’s argument is a mathematical one that seems very logical. However, what his argument really brings to light is his own lack of experience with, and understanding of, how these systems work.

Let’s first look at how Tim presents his argument.

  • I have a directory that performs at 5000 q/sec, roughly equivalent to 0.2 milliseconds per query
  • The VDS will add, at least, a 2 millisecond “overhead”
  • Each query will now take 2.2 milliseconds to complete
  • Now instead of 5000 q/sec when I access my directory I get only 455 q/sec
  • Therefore, to remove the 2 millisecond overhead and to return to optimal performance it is necessary to implement a cache.

At first glance Tim seems to have presented a thoroughly convincing argument for using a cache. Tim argues that queries will be resolved 11 times slower than if they were handled by the data source directly. If making use of a VDS results in such a dramatic performance hit, it stands to reason that using a cache is the only way out of the doldrums.
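To make the arithmetic concrete, here is a minimal sketch that reproduces the numbers in the bullets above. It deliberately bakes in the assumption the argument relies on, namely that exactly one query is in flight at a time; the constant and function names are mine, for illustration only.

```python
# Reproduce the one-query-at-a-time arithmetic from the bullets above.
DIRECTORY_QPS = 5000      # directory throughput quoted in the argument
VDS_OVERHEAD_MS = 2.0     # VDS overhead quoted in the argument


def serial_qps(latency_ms: float) -> float:
    """Throughput if exactly one query is in flight at any moment."""
    return 1000.0 / latency_ms


directory_latency_ms = 1000.0 / DIRECTORY_QPS                  # 0.2 ms
combined_latency_ms = directory_latency_ms + VDS_OVERHEAD_MS   # 2.2 ms
combined_qps = serial_qps(combined_latency_ms)                 # about 455
slowdown = DIRECTORY_QPS / combined_qps                        # about 11

print(f"{combined_latency_ms:.1f} ms per query, "
      f"{combined_qps:.0f} q/sec, {slowdown:.0f}x slower")
```

The figures of 455 q/sec and 11 times slower only follow if `serial_qps` is the right model, which is exactly the point in dispute.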

What Tim has failed to include in his reasoning is that all of these systems process queries asynchronously, with many queries in flight at the same time. By calculating the speed of each query at 0.2 ms (a ridiculously low figure for any TCP transaction) from the 5000 q/sec figure, Tim is assuming that queries are handled synchronously, one after another, which is never the case. Tim goes on to add the 2 ms of VDS latency to each query. This assumes that the VDS, along with the backend, is also handling each query synchronously, resulting in a massive performance hit. Of course, this scenario is somewhat absurd; if it were the case, caching would make a whole lot of sense. However, as all of these systems work asynchronously, the added latency per query does not translate into a proportional drop in throughput, and the math of Tim’s argument does not hold water.
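One rough way to see the difference is Little's law, which relates throughput to the number of queries in flight divided by the latency of each. The sketch below is a back-of-the-envelope model, not a benchmark: it assumes the backend tops out at the quoted 5000 q/sec, that the VDS itself is not the bottleneck, and that latency stays at 2.2 ms regardless of load. The function name and the concurrency levels are hypothetical.

```python
def throughput_qps(in_flight: int, latency_ms: float, backend_cap_qps: float) -> float:
    """Little's law estimate: in-flight queries / latency, capped by the backend."""
    return min(in_flight * 1000.0 / latency_ms, backend_cap_qps)


LATENCY_MS = 2.2           # 0.2 ms directory + 2 ms VDS, from the argument above
BACKEND_CAP_QPS = 5000.0   # assumed ceiling: the directory's quoted throughput

# Illustrative concurrency levels; a single in-flight query is the serial case.
for in_flight in (1, 4, 11, 32):
    qps = throughput_qps(in_flight, LATENCY_MS, BACKEND_CAP_QPS)
    print(f"{in_flight:>2} in flight -> {qps:.0f} q/sec")
```

Under these assumptions, about eleven concurrent queries are enough to get back to the directory's original throughput. Each individual query still takes 2 ms longer, but the system as a whole is not eleven times slower.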

Tim goes on to address the freshness of data in a fairly flippant manner, suggesting that most organizations will not have a problem if identity data in their applications is out of date as a result of the cache. Bizarrely, Tim suggests that having incorrect data for anywhere between a few minutes and a few hours should be acceptable to most organizations. After his scathing and somewhat misguided attack on the performance hit resulting from the use of a VDS, it seems odd to suggest that it is somehow acceptable for data to be wrong for significant lengths of time. I know that when designing an application I would like to know that, even if it took a slight performance hit, it was always working with the correct data. In Tim’s world, it seems okay that things are wrong sometimes, and perhaps this is the same approach he applies to his math.
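To illustrate what "wrong for a few minutes" can mean in practice, here is a small sketch of a time-to-live cache sitting in front of a backend. Everything in it is hypothetical: the `TTLCache` class, the five-minute TTL, the `jsmith` entry and the simulated clock are assumptions chosen for illustration, not a description of how any particular VDS product caches.

```python
from typing import Callable, Dict, Tuple


class TTLCache:
    """Hypothetical read-through cache: entries are served until ttl seconds old."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 clock: Callable[[], float]) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]                 # may be stale relative to the backend
        value = self._fetch(key)             # miss or expired: go to the backend
        self._entries[key] = (now, value)
        return value


# Simulated backend and clock so the example runs instantly.
backend = {"jsmith": "active"}
now = [0.0]
cache = TTLCache(fetch=lambda key: backend[key], ttl=300.0, clock=lambda: now[0])

first = cache.get("jsmith")        # loaded from the backend
backend["jsmith"] = "disabled"     # the account is disabled at the source
now[0] = 60.0
during_ttl = cache.get("jsmith")   # one minute later: served from the cache
now[0] = 301.0
after_ttl = cache.get("jsmith")    # entry has expired: refetched from the backend
print(first, during_ttl, after_ttl)
```

In this sketch an application asking about `jsmith` one minute after the account was disabled is still told the account is active. Whether that window is acceptable is a business decision, not a performance one.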

Perhaps in this post I have fired a shot across the bows and made some fairly harsh statements, but the sort of misinformation that all too often gets circulated on these topics is damaging to a genuine understanding of how these systems work and what they are capable of doing.


OpenID in the world of Federated Identity

July 11, 2008

In my last post, I promised that I would have a look into OpenID as an alternative means of setting up an Identity Federation. While this is a small adventure away from my usual home among the more established specs, OpenID is not so distant that I haven’t already been sucked in by the hype and made use of it. So is OpenID a real alternative that can be used in the enterprise as a means of achieving SSO and of provisioning user data across a federation in a secure manner?

The answer to this question is a little convoluted, so I’ll start out by quoting Andy Dale, who says: “In my bitchier moments I have been heard to say… “OpenID; brought to you by people who didn’t want to read the SAML spec””. And in many ways I agree with him. OpenID has burst onto the internet like a wild horse, and is rapidly gaining popularity. Certainly, it has wide adoption, as everyone from Google to Facebook fights to become the worldwide identity silo. But realistically, OpenID is a relatively immature approach to identity management and, as such, many of its specifications are under revision. As Andy points out, it is likely that over time the OpenID spec will evolve into something that is fully SAML compliant.

Part of OpenID’s success has been that it is built around an “easy-to-setup, easy-to-use” framework. From an end-user perspective it is ridiculously easy to set up an OpenID URI and get busy logging onto any site that accepts OpenID as a means to authenticate. Administrators and developers find it exceedingly simple to build their own OpenID systems, and there are a number of ready-to-use systems already out there that can be fired up on any web server and become home to your OpenID users.

Of course, there are a number of downsides to all of this promiscuity. As Paul Madsen, one of the Liberty Alliance architects, has pointed out on his blog, as Service Providers require more security in their transactions with an Identity Provider (IdP), they will become more discerning or selective in their choice of IdP. This means that to log in to your favourite blog, it is unlikely that there will be much selectivity over your choice of IdP. However, your bank is more likely to be deeply concerned about which IdP you make use of, and will limit your selection to those that it has approved.
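As a rough illustration of that selectivity, a Service Provider's policy can be thought of as either "any IdP" or "only IdPs on an approved list". The sketch below is a simplified model and not part of the OpenID or SAML specifications; the function name, the policy variables and the example hostnames are all hypothetical.

```python
from typing import Optional, Set
from urllib.parse import urlsplit


def idp_is_acceptable(idp_endpoint: str, approved_hosts: Optional[Set[str]]) -> bool:
    """Return True if this Service Provider's policy allows the given IdP.

    approved_hosts=None models a permissive SP that accepts any IdP;
    a set of hostnames models a selective SP with an approved list.
    """
    host = urlsplit(idp_endpoint).hostname   # lowercased by urlsplit, or None
    if host is None:
        return False                         # not a usable absolute URL
    if approved_hosts is None:
        return True
    return host in approved_hosts


# Hypothetical policies: the blog takes anyone, the bank takes one approved IdP.
BLOG_POLICY = None
BANK_POLICY = {"idp.approved-by-bank.example"}

user_idp = "https://openid.example.org/server"
print(idp_is_acceptable(user_idp, BLOG_POLICY))   # the blog accepts it
print(idp_is_acceptable(user_idp, BANK_POLICY))   # the bank does not
```

A real deployment would rest on far more than a hostname check (certificates, metadata exchange, contractual trust), but the shape of the decision is the same: the more sensitive the service, the shorter the list.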

And this is largely where the split between OpenID and SAML lies. OpenID provides a quick and easy method of achieving SSO. SAML is more complicated, but it is built around a robust security model that can be trusted by large enterprises.

In simpler terms, SAML is better suited to handling identities within organizations that need to maintain control over user data and that have security concerns. In essence, the organization needs to be able to determine whom it trusts within its identity framework. SAML makes more sense in these environments, especially when coupled with other specifications such as those provided by Liberty Alliance, as it facilitates secure data transactions beyond the scope of SSO.

OpenID is better suited to users outside of an organization who wish to achieve SSO in less security-sensitive environments. These users want to be able to choose their own Identity Providers and have more control over their own data. Essentially, as Stefan Brands writes in his scorching critique of OpenID at The Identity Corner: “OpenID was designed as a lightweight solution for “trivial” use cases in identity management: its primary goal is to enable Internet surfers to replace self-generated usernames and passwords by a single login credential, without needing more than their browser.”

This is not to say that OpenID does not have a space in the world of Federated Identity, only that it caters to a different market. Perhaps it is best to think of OpenID in exactly the terms that the group behind it presents the technology: OpenID is a lightweight method of identifying individuals that uses the same technology framework that is used to identify websites.