Should I templatize this analysis?
And if so, what's the best way?
Data analysts often deal with “ad-hoc” analyses where someone asks a seemingly new question and the analyst needs to figure out how to gather the data, aggregate it and display it to answer the question.
An example of this might occur in the e-commerce department of a big clothing company. A buyer, who is in charge of picking out the next season’s close, asks the analyst, “it seems like people who buy our jeans tend to be more loyal customers compared to our t-shirts. What’s the most valuable customer segment?”
The analyst clarifies some questions with the buyer (“how do you define ‘value?’ what time horizon?”) and gets back to her desk. As she’s writing out some SQL to pull groups of customers who bought different categories of clothes and their total spend over the past few years, she thinks, “should I be saving this code so that I can re-use it if this question comes up again?”
The answer to that is always “yes,” and I’ve found you can do okay saving code in SQL files with descriptive names if you have a reasonably organized file system.
But if the analyst has been on the job for a few months, she’s probably tired of digging into code like this to change hard-coded values when re-running an analysis. She wants to templatize the analysis! Ideally, by adding variables for things like date ranges, and categories of clothing to include, and maybe the metrics that are output.
This sort of templatization takes work because it usually requires doing things outside of vanilla SQL. It’s a tension between doing the easier thing right now and needing to re-do it by hand later, and doing something 50%-100% harder right now and having an easy-to-run analysis template later.
The questions to consider when evaluating this tradeoff are:
How likely am I or someone on my team to need to run a similar analysis again?
Which variables are likely to change if I need to run the analysis again in the future?
How long will it take for me to templatize this vs. just running it with hardcoded values?
Are there tools I can use to templatize it that will be self-contained and relatively easy for someone else to pick up?
When I think about the ad-hoc questions I’ve been asked as an analyst, there are a few that were truly one-time only things but the majority have fit into themes that could have been templatized and re-used. So if you are unsure, you should assume that you will see this question again.
For most businesses & organizations, there are patterns to the questions that come up and if an analyst can understand the business enough to recognize these patterns and templatize the analyses needed to answer them, that is a very valuable thing.
The technology needed to templatize things can be harder. Many dashboarding tools have tools where the user can specify things like the date range and the metrics being output. These are useful but are often limited by the complexity of operation you can have powering the dashboard.
I have seen Python scripts which write SQL scripts based on command line inputs; these work but are very hard for other analysts to pick up. I have seen Excel sheets hooked up with VBA to SQL scripts where data inputs entered into the Excel sheet are variables in the SQL query. This is messy but it worked.
I personally like to run SQL from within a Jupyter notebook and use Python to format which variables to change for templated scripts, but this comes with its own issues, as a lot of analysts or business users will have access to a SQL client but won’t have Python installed.
There are lot of different approaches to this. It is one of the skills that make people effective analysts, and if you are explicit in thinking about it you’ll improve faster.
