Lingering duplicate systems and the expense of weeding them out (an illustration)
We have been operating a fileserver environment for a long time now, since back before we used ZFS. When you operate fileservers in a traditional general Unix environment, one of the things you need is disk usage information. So a very long time ago, before I even arrived, people built a very Unix-y system to do this. Every night, raw usage information was generated for each filesystem (for a while with 'du'), written to a special system directory in the filesystem, and then used to create a text file with a report showing current usage and the daily and weekly change in everyone's usage. A local 'report disk usage' script would then basically run your pager on this file.
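As an illustration of the 'current usage plus daily change' step of such a report, here is a minimal Python sketch. It assumes snapshots in the 'disk-space login' line format mentioned later; all of the function names are my own and this is not our actual script:

```python
# Hypothetical sketch: compute current usage and daily change from two
# snapshots, each made of 'disk-space login' lines (one user per line).

def parse_snapshot(text):
    """Parse lines of 'disk-space login' into a {login: space} dict."""
    usage = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        space, login = line.split()
        usage[login] = int(space)
    return usage

def usage_report(today_text, yesterday_text):
    """Yield (login, current usage, daily change), biggest users first."""
    today = parse_snapshot(today_text)
    yesterday = parse_snapshot(yesterday_text)
    for login in sorted(today, key=today.get, reverse=True):
        yield login, today[login], today[login] - yesterday.get(login, 0)

if __name__ == "__main__":
    today = "12000 cks\n9000 fred\n"
    yesterday = "10000 cks\n9500 fred\n"
    for login, cur, delta in usage_report(today, yesterday):
        print(f"{login:10} {cur:>10} {delta:+d}")
```

The weekly change works the same way against a week-old snapshot, which is one reason keeping the generated files around in the filesystem was convenient.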
After a while, we were able to improve this system by using native ZFS commands to get per-user 'quota' usage information, which made it much faster than the old way (we couldn't do this originally because we started with ZFS before ZFS tracked this information). Later, this made it reasonable to generate a 'frequent' disk usage report every fifteen minutes (with it keeping a day's worth of data), which could be helpful to identify who had suddenly used a lot of disk space; we wrote some scripts to use this information, but never made them as public as the original script. However, all of this had various limitations, including that it stopped updating once the filesystem had filled up.
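The natural ZFS command for this sort of per-user information is 'zfs userspace'. A minimal Python sketch of wrapping it might look like the following; the helper names are my own, the filesystem name is illustrative, and I'm assuming the tab-separated, header-less output that '-H' produces:

```python
import subprocess

def parse_zfs_userspace(output):
    """Parse 'zfs userspace -Hp -o name,used' output into {user: bytes}.

    Assumes '-H' tab-separated scripted output and '-p' exact numbers.
    """
    usage = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, used = line.split("\t")
        usage[name] = int(used)
    return usage

def zfs_user_usage(filesystem):
    """Ask ZFS for per-user space usage on one filesystem.

    Requires a real ZFS filesystem; a name like 'tank/homes' is
    purely illustrative.
    """
    out = subprocess.run(
        ["zfs", "userspace", "-Hp", "-o", "name,used", filesystem],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_zfs_userspace(out)
```

Because ZFS maintains these per-user space totals itself, this is a fast metadata query instead of a 'du' style walk over every file, which is what made the fifteen-minute reports feasible.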
Shortly after we set up our Prometheus metrics system and actually had a flexible metrics system we could put things into, we started putting disk space usage information into it, giving us more fine-grained data, more history (especially fine-grained history, where we'd previously only had the past 24 hours), and the ability to put it into Grafana graphs on dashboards. Soon afterward it became obvious that sometimes the best way to expose information is through a command, so we wrote a command to dump out current disk usage information in a relatively primitive form.
Originally this 'getdiskusage' command produced quite raw output because it wasn't really intended for direct use. But over time, people (especially me) kept wanting more features and options and I never quite felt like writing some scripts to sit on top of it when I could just fiddle the code a bit more. Recently, I added some features and tipped myself over a critical edge, where it felt like I could easily re-do the old scripts to get their information from 'getdiskusage' instead of those frequently written files. One thing led to another and so now we have some new documentation and new (and revised) user-visible commands to go with it.
(The raw files were just lines of 'disk-space login', and this was pretty close to what getdiskusage produced already in some modes.)
However, despite replacing the commands, we haven't yet turned off the infrastructure on our fileservers that creates and updates those old disk usage files. Partly this is because I'd want to clean up all the existing generated files rather than leave them to become increasingly out of date, and that's a bit of a pain, and partly it's because of inertia.
Inertia is also a lot of why it took so long to replace the scripts. We've had the raw capability to replace them for roughly six years (since 'getdiskusage' was written, demonstrating that it was easily possible to extract the data from our metrics system in a usable form), and we'd said to each other that we wanted to do it for about that long, but it was always "someday". One reason for the inertia was that the existing old stuff worked fine, more or less, and also we didn't think very many people used it very often because it wasn't really documented or accessible. Perhaps another reason was that we weren't entirely sure we wanted to commit to the new system, or at least to the exact form we first implemented our disk space metrics in.