Data tool design as a sign of field dynamism
Hand a data person a data file to explore and analyze, and you'll find that the world generally splits into two groups: one group will try to view the contents of the file directly by popping it into a text editor or spreadsheet, while the other will open up their data environment of choice (R, a Jupyter notebook, etc.) and do most of their exploring with summary and visualization tools before pulling samples.
There's obviously no objectively right or wrong way to go about the initial data exploration task; it all comes down to personal preference and habits. You'll still find plenty of examples of a faux "holy war" if you search for things like "notebook vs spreadsheet" on the internet, because the sort of people who like to debate these things love writing about them. The rest of us seem pretty chill about letting people do whatever works, because we all have quirky "inefficiencies" in our own work habits no matter what we use.
As someone who's older and comes from a heavily Excel-grounded analyst background instead of a modern ML-focused one, I prefer exploring data tables in raw form first. I find I notice potential issues much faster this way. My mental images for how to transform data involve almost physically pivoting, filtering, and transforming data in table form. While I'll occasionally explore data "the other way" when the data requires it, I find it's rarely necessary. I expect that for the remainder of my career, I'm not going to change this preference.
From the surveys I've glanced at about tool usage, like this 2023 one from JetBrains, my archetype seems to be in the minority now, at least amongst the folk in their survey pool (which seems a bit skewed towards machine learning people). That jibes with my sense that more and more people are joining the data world from an ML angle, with training that's usually grounded in notebooks and R. So I'm getting left behind the times.
Signals of dynamism in our tooling
There's another thing in my typical data workflow that has also stuck with me for a long time – I continue to actively avoid all-in-one analysis solutions. I'll specifically use a SQL client or custom code to pull data from my data source and save it to a file, something like the sketch below. I'll then do my exploration and prototyping using a spreadsheet or notebook. I'll make reports using other tools. Pipelines get done in whatever special system is set up for them.
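To make that concrete, here's a minimal sketch of the pull-then-save step in Python. Everything specific in it is made up for illustration: the SQLite database, the events table, the query, and the output file name all stand in for whatever source and shape your data actually has.

```python
import sqlite3

import pandas as pd

# Pull data out of the source system with plain SQL.
# "analytics.db" and the events table are hypothetical stand-ins.
conn = sqlite3.connect("analytics.db")
df = pd.read_sql_query(
    "SELECT user_id, event_name, event_ts FROM events LIMIT 10000",
    conn,
)
conn.close()

# Save to a neutral file so the next tool in the chain can pick it up.
# CSV here, because a spreadsheet can open it directly.
df.to_csv("events_sample.csv", index=False)
```

From there, the exploration happens wherever it's most comfortable: Excel opens the CSV directly, and a notebook can read it back with a single read_csv call.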
I'm the BI tool vendor's worst nightmare because I'm constantly uncomfortable, and often annoyed, when BI tools that try to do everything trip up my flow. My number one most used feature in any BI tool is the "Export to ..." function – specifically because it lets me exit their little walled garden.
I don't have numbers to prove it, but my suspicion is that this habit of constantly swapping tools according to the task at hand continues to be the dominant workflow pattern for data work. I've previously discussed how product design tends to push towards all-in-one products, but instead of rehashing product design, I want to celebrate the fact that our work resists being put into a single end-to-end experience. Things are still too early to neatly package into a universal box. There's still plenty of new experimentation and innovation going on.
Just think of all the times your favorite workflow tool got replaced by something better or different and forced you to change your habits. We all have stories like that, and instinctively we know that something new is just over the horizon. That's why so many of us demand that our tools have clear boundaries that interoperate at least at the file format level. Almost everything speaks CSV, Parquet, or JSON, because anything that didn't would be too annoying to integrate into our existing workflows and would get no traction. When tools don't give us this basic import/export functionality, we hate it because we feel locked in. We know some new and better tool will be coming, and we want our stuff to work with that new thing too. I can't think of a more optimistic view of the future prospects of innovation in the field.
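As a tiny illustration of that file-format lingua franca, here's the handoff idea in pandas with a made-up DataFrame: write the file in whichever of the three formats the next tool prefers, and read it back on the other side. The file names and columns are hypothetical.

```python
import pandas as pd

# A stand-in table; in practice this is whatever the last tool produced.
df = pd.DataFrame({"user_id": [1, 2, 3], "event_name": ["view", "click", "view"]})

# The three formats nearly everything speaks, any of which can serve
# as the handoff point between two tools in the chain.
df.to_csv("handoff.csv", index=False)
df.to_json("handoff.json", orient="records")
df.to_parquet("handoff.parquet")  # needs pyarrow or fastparquet installed

# The next tool picks up where this one left off, e.g. in a notebook:
df_again = pd.read_parquet("handoff.parquet")
```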
One day, we'll figure out enough best practices that the dynamism moves on to some other part of the analysis toolchain. I think we'll see the signs of that when we stop worrying about how our tools hand work off to each other and let distinct steps get more integrated out of convenience. I can't imagine what that looks like now, but inevitably that day will come.