When There is Too Much of a Good Thing
Jul 01, 2022 2:01 pm
Also, happy July 4th weekend! I thought I'd share an update on my bees. I set up my first two beehives at the end of March, and I've been checking on them every week. Still, I'm a new beekeeper, there's no substitute for practical experience, and I've made mistakes.
I was a few days late adding a new box (called a super) to my hive. Those few days were all it took for the bees to feel too cramped, and they swarmed. A swarm is when the bees basically split their numbers in half and one of those halves goes off to a new home. This is something most beekeepers try to avoid.
I also received my first sting about a week ago. I went to work on a hive and they were in a terrible mood.
Despite all that, the bees have been healthy and productive, and it looks like I'll have a honey harvest this first year.
I'm totally not prepared for that!
I see a similarity between my surprise at the bees and what happens with a lot of groups I consult with. The situation goes something like this:
The team is working on a new project and they are in a planning meeting where they are talking about logging. You see, previous projects always had a lot of issues around not logging enough, or logging too much, and this time they want to get it right.
They make a decision. Time skips ahead to the project's completion, and then they hear that the system is down.
It turns out the hard drives filled up with log messages.
In this story I chose logging, but it could just as easily be database sizes, cloud infrastructure costs, or general service scalability.
Teams and managers want to get it right from the start, but they can't predict the future and they need to move quickly, so a compromise is made.
The part they miss in those conversations, and where I chime in, is this: how will they know or detect that they're outgrowing that solution? Often the means to do this are very simple, but many groups don't have practice setting any kind of thresholds for monitoring.
And that's when you wind up with too much of a good thing. It's great that the system is out there running, but it has grown past what it was set up to handle, and that creates outages or degraded service.
So if you recognize some of this story in your groups, here are some thresholds I like to get teams to start with:
- Page Load Time
- API Response Time
- Hard Drive/Volume utilization
- API Throughput
- Cloud Infra costs
- CPU/RAM/JVM utilization
Almost all of these are trivial to implement. You set a threshold so that when a metric reaches 80% of its max, the team gets an alert. That alert prompts the team to figure out what to do next before there is an outage.
I'd love to hear your experiences doing something similar, or about a time when you wish you had.
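To make the 80% idea concrete, here is a minimal sketch in Python of what such a check might look like for hard drive utilization. The names `check_threshold`, `check_disk`, and `ALERT_RATIO` are made up for illustration, and the "alert" is just a printed line. In practice you would more likely configure this in whatever monitoring tool you already use.

```python
import shutil
from dataclasses import dataclass

# Hypothetical constant: alert once a metric reaches 80% of its max.
ALERT_RATIO = 0.80


@dataclass
class ThresholdResult:
    name: str
    utilization: float  # fraction of the max in use, e.g. 0.85
    alert: bool


def check_threshold(name, current, maximum, ratio=ALERT_RATIO):
    """Report whether `current` has reached `ratio` of `maximum`."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    utilization = current / maximum
    return ThresholdResult(name, utilization, utilization >= ratio)


def check_disk(path="/"):
    """Apply the threshold check to the volume that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return check_threshold(f"disk:{path}", usage.used, usage.total)


if __name__ == "__main__":
    result = check_disk("/")
    status = "ALERT" if result.alert else "ok"
    # Stand-in for a real notification (email, chat message, pager).
    print(f"{status}: {result.name} at {result.utilization:.0%} of capacity")
```

The same `check_threshold` helper works for any item on the list, as long as you can state a current value and a max. For example, a 850 ms response time against a 1000 ms budget would trigger the alert.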