When to Excel and When Not To: Why Specialists Don't Use Excel for Transformations
01-Feb-2020
So, you received a big dataset from your team in a CSV or TXT file, and you're about to prepare it for analysis. You've noticed that the data is a little muddy, and you're pretty sure that you'll need to clean or transform the data somehow.
For the majority of people, this journey starts with opening the file in Excel. I mean, when people look at the file on their desktop, it even has an Excel (or Excel-ish like) icon on it, so it must be the right place to start, right? Just open it, right?
Wrong.
In my opinion, the reason that the icon appears that way is because people are not pro data manipulators, they don't have other tools on their desktop, and they need at least some way to look at the data in a tabular format. The fact that you are reading this blog means you are not, or do not want to be in that category of analyst.
In my 25 years of experience in designing, deploying, and integrating data solutions, with a wide range of users, from aerospace engineers, to accountants, to researchers, one thing stood out for me:
Considering the alternatives, Excel is probably the most dangerous way to start preparing data for complex projects. This may be a very unpopular opinion, and I can hear all the Excel guys gasping as I write. However, there are many reasons for this.
Don't get me wrong, Excel is one of the most powerful tools we have available to us for analysis. I've done Excel projects with all kinds of cool statistical regression models, hundreds of thousands of lines of code in macros, and a kaleidoscope of graphs and charts for eager users to peruse.
Plain and simple, Excel is for analysis, not data transformation. The data doesn't start there - it comes, pre-prepared; and, if you can pre-prepare data for Excel, you can prepare data for other amazing tools like Power BI, Tableau, and others.
The problem with opening raw (csv,txt) data in Excel:
1. Unexpected Translation Excel will "translate" the data for you without asking. Not just in one column, but in individual cells, often far down in your dataset where you don't notice for some time.
2. Variable Column Types Will allow multiple datatypes in a column. In small datasets you might see this. In large ones, it can turn into a big issue.
3. Unexpected Data Truncation Related to #1, but saving a CSV into Excel can actually cause key information to change from the original. One example that I've seen people struggle with Employee ID numbers with leading zeroes. If you don't know and anticipate that Excel will do this, you'll be surprised when you run an analysis joining to some other employee information only to find no results for some employees.
3. Excel limits your ability to perform joins between datasets. Though there are ways to perfrom simple joins (via VLOOKUPs etc.), things quickly degrade when things get serious.
4. Excel doesn't leverage the power of relational databases and SQL in the way other tools do. Your journey in data transformation will start with these.
This last point is the main point of this article. Using data tools that are designed for data transformation and loading will encourage you to think in that mindset. The more you think in that mindset during transformations, the more you will realise just how much further you can go.
You may use Excel for analysis already, and that's great. Excel is a killer application for what it does. It is ubiquitous, used by almost everyone in business around the globe. However, if you really want to move to the next level in your transformations, you need to learn the tools for that job. It is a case of "just because you can, it doesn't mean you should"..
You need to start to learn Microsoft Access on a deeper level to get started. Not for application development or anything like that. You need to learn Access because it is the gateway to getting serious with data transformation.
Learning to do transformations in a data-centric tool will be key to your success. You will eventually learn about data-access technologies, like linked-tables and ODBC that will open
your world and allow you to obtain and join data together in ways you never thought possible. Even greater, you'll eventually start to learn the power of automation, using languages like Python, VBA, and others.
Do you really want to become a pro at data transformation? You're going to have to get serious, learn some better tools, and drop the Excel training wheels in order to level-up.