It has been some time since I have had a chance to write anything for my blog. This doesn’t mean that I haven’t been keeping an eye on the odd posting on a number of my favourite work-related blogs, and I couldn’t help but notice the sudden spate of comments in response to a series of posts by Ashraf Motiwala.
There have been many intelligent responses to these posts, and for the most part I had no intention of getting into the debate. That was until I noticed a post titled “why cache and virtual directories???” by Tim Paul. In this post, Tim argues that using a cache can remove a significant amount of lag that he believes could result from placing a VDS in front of a backend data repository. Tim’s argument is a mathematical one that seems very logical. However, what his argument really brings to light is his own lack of experience and understanding of how these systems work.
Let’s first look at how Tim presents his argument.
- I have a directory that performs at 5000 q/sec, roughly equivalent to 0.2 milliseconds per query
- The VDS will add, at least, a 2 millisecond “overhead”
- Each query will now take 2.2 milliseconds to complete
- Now instead of 5000 q/sec when I access my directory I get only 455 q/sec
- Therefore, to remove the 2 millisecond overhead and to return to optimal performance it is necessary to implement a cache.
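To make the arithmetic in the list above explicit, here is a minimal Python sketch. The function name `serial_throughput` is my own, and the sketch bakes in the assumption that Tim's numbers rely on: each query must complete before the next one starts.

```python
def serial_throughput(latency_ms: float) -> float:
    """Queries per second if each query must finish before the next begins."""
    return 1000.0 / latency_ms


# Figures taken from the list above.
backend_ms = 1000.0 / 5000      # 0.2 ms per query, derived from 5000 q/sec
vds_overhead_ms = 2.0           # the claimed VDS overhead

combined_qps = serial_throughput(backend_ms + vds_overhead_ms)

print(round(combined_qps))           # 455
print(round(5000 / combined_qps))    # 11, the "11 times slower" figure
```

The numbers only come out this way because the model handles one query at a time.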
At first glance Tim seems to have presented a thoroughly convincing argument for using a cache. Tim argues that queries will be resolved 11 times slower than if they were handled by the data source directly. If making use of a VDS results in such a dramatic performance hit, it stands to reason that using a cache is the only way out of the doldrums.
What Tim has failed to include in his reasoning is that all of these systems process queries asynchronously, with many queries in flight at the same time. By calculating the speed of each query at 0.2ms (a ridiculously low figure for any TCP transaction) based on the 5000 q/sec figure, Tim is assuming that the queries are handled synchronously, one after another, which is never the case. Tim goes on to add the 2ms latency for the VDS to each query. This assumes that the VDS, along with the backend, is also handling each query synchronously, resulting in a massive performance hit. Of course, this scenario is somewhat absurd, and if it were the case then caching would seem to make a whole lot of sense. However, as all of these systems work asynchronously, the added latency applies to each individual query rather than reducing the number of queries completed per second in proportion, and the math of Tim’s argument does not hold any water at all.
Tim goes on to argue about the freshness of data in a fairly flippant manner, suggesting that most organizations will not have a problem if identity data is not up to date within their applications as a result of the cache. Bizarrely, Tim suggests that having incorrect data for anywhere between a few minutes and a few hours should be acceptable to most organizations. After his scathing and somewhat misguided attack on the performance hit resulting from the use of a VDS, it seems odd to suggest that it is somehow acceptable for data to be wrong for significant lengths of time. I know that when designing an application I would like to know that even if it took a slight performance hit, it was always working with the correct data. In Tim’s world, it seems okay that things are wrong sometimes, and perhaps this is the same approach he applies to his math.
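As a rough illustration of why the serial math breaks down, the sketch below uses a deliberately simplified model: throughput is the number of queries in flight divided by the per-query latency, capped at what the backend can sustain. The function `throughput_qps`, the in-flight counts, and the assumption that the VDS itself is not the bottleneck are all mine, chosen for illustration rather than measured from any real system.

```python
def throughput_qps(in_flight: int, latency_ms: float, capacity_qps: float) -> float:
    """Approximate q/sec with `in_flight` concurrent queries.

    Simplified model (an assumption, not a benchmark): each in-flight
    query completes every `latency_ms`, and the backend cannot exceed
    `capacity_qps` however many queries are outstanding.
    """
    offered = in_flight * 1000.0 / latency_ms
    return min(offered, capacity_qps)


latency_ms = 2.2          # backend 0.2 ms + VDS 2 ms, from Tim's figures
capacity_qps = 5000.0     # the directory's stated throughput

# Hypothetical numbers of concurrent queries.
for in_flight in (1, 4, 11, 32):
    print(in_flight, round(throughput_qps(in_flight, latency_ms, capacity_qps)))
```

With a single query in flight the model reproduces Tim's 455 q/sec. Under these assumptions, with around eleven or more concurrent queries the 2ms of latency is hidden and throughput is back at the backend's 5000 q/sec. Each individual query still takes longer, but the number of queries completed per second does not fall.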
Perhaps in this post I have shot across the bows and made some fairly hard statements, but in reality the sort of misinformation that, all too often, gets circulated on these sorts of topics is damaging to a genuine understanding of how these systems work and what they are capable of doing.
Posted by rowanp01