The Lost Art of ETL

Specialization is a problem across computing. Devs and coders tend to learn one or two languages and try to make everything fit into what they know. While those languages might be flexible enough to do a lot of things, it doesn’t mean they do them well, nor does it mean that you SHOULD use them for what you are doing. We’ve touched on this in our datastores series already, but there is something to add here. There’s a lost art happening, something even really smart developers sometimes have not heard of: Extract, transform and load, or ETL.

ETL is the art of taking data from one or more sources, running it through a process, manipulating it, doing calculations on it, somehow transforming it and getting a result or relationship from it, and then loading it in another location. In reality, ETL is what happens in some cases when you’re simply moving data from a database and putting it into a cache. Or when you take data out, analyze it and put it into a search index.

ETL is a lost art because people look at their data stores and think they can figure out a way to make everything fit into it. Startups especially have this happen because they have a store that is sufficient for the first few things they needed to do, so they think it can be used for everything. But the reality is that doesn’t make sense. An even bigger problem happens when they realize this doesn’t make sense and instead of searching for something that does, they see a new sexy tool they want to try. But they don’t understand the implications of this new tool or how it can potentially cause a different set of problems.

Choosing the proper store is only one part of the process. The other important part is understanding that you need to use different stores and move data around. If you have a screw that has a hex head on it and you try to screw it in with a Phillips head, it’s not going to work. A wrench isn’t going to work. A socket wrench might work, but what would work best is an Allen wrench. So, for example, you  might need to take data and store some of it in an SQL store, some portion in a NoSQL store, cache it somewhere or put it in a queue to get it in and out.

You might need to use your entire toolbox to get a job done and might need to put glue and ETL in between to appropriately use the data. This is so that you’re not moving into the edges of different systems and deteriorating the entire application because you’re misusing the systems. Then you won’t be able to complete the task at hand or your client will have to wait too long. Other drawbacks include extra expenses and time spent on a project.

Computing is all about timing resources on all ends. When you start bloating or overloading systems, the effects can range in severity from a few minor problems all the way to systems crashing and everything coming to a screeching halt. So you want to stay out of those fringes and learn how to use the right tools for the job. You want to learn how to hand off data from one store to another properly so that data doesn’t become lost and can still be used and analyzed in the way it needs to be. And sometimes you will need to glue the data between stores, and that’s okay. Just don’t get stuck on it being easier to keep it all in one place.

Especially if you’re a startup, you must understand that there is a balancing point. You might be able to start with just one store, but you’re not leaving yourself room to grow. You don’t necessarily want to run out and purchase or start using all of these different data stores before you really know what you’re going to need. But you also don’t want to just put everything into the first store you come across because when you do start to grow, it’s going to break. If you’re unsure of what type of data store to use or what you might need for future growth, consult an expert. Someone who understands the type of data you’re going to have and what you need to do with it will be able to help you plan for growth.

At the end of the day, everything is not created equal in the computing world. Developers like to have the idea that they learned a language and therefore everything they ever need to do can be done in that language. But if you don’t look at how the language and data stores actually function and whether or not they make sense for what you are doing, when you get to scale, you’re often going to make the wrong choices. Using a combination of data stores and moving data with ETL will always have a better outcome than trying to shove every piece of information you have into an misappropriated datastore.

About the Author

Pieter VanIperen, Managing Partner of PWV Consultants, leads a boutique group of industry leaders and influencers from the digital tech, security and design industries that acts as trusted technical partners for many Fortune 500 companies, high-visibility startups, universities, defense agencies, and NGOs. He is a 20-year software engineering veteran, who founded or co-founder several companies. He acts as a trusted advisor and mentor to numerous early stage startups, and has held the titles of software and software security executive, consultant and professor. His expert consulting and advisory work spans several industries in finance, media, medical tech, and defense contracting. Has also authored the highly influential precursor HAZL (jADE) programming language.

Contact us

Contact Us About Anything

Need Project Savers, Tech Debt Wranglers, Bleeding Edge Pushers?

Please drop us a note let us know how we can help. If you need help in a crunch make sure to mark your note as Urgent. If we can't help you solve your tech problem, we will help you find someone who can.

1350 Avenue of the Americas, New York City, NY