Full Title: Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department
Document Note: Frustration among groups.
DS: Misalignment of motivations and a slow path into production
DE: Lack of tech debt vision and costs of productionizing
Infrastructure: Lack of business context
What went wrong?
On most occasions you don’t need a specialized DE to build solutions.
Everybody wants to be the thinker. In traditional departments there were only doers.
You need engineers to do engineer stuff, not to serve other roles.
Stitch Fix proposal:
A way that allows for autonomy in roles, true ownership all the way into production, and accountability for output.
The trick is to create an environment that allows for autonomy, ownership, and focus for everyone involved.
Engineers and data scientists are impassioned by very different tasks.
Nobody enjoys writing and maintaining data pipelines or ETL.
Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
Instead, give people end-to-end ownership of the work they produce (autonomy). In the case of data scientists, that means ownership of the ETL. It also means ownership of the analysis of the data and the outcome of the data science. The best-case outcome of many efforts of data scientists is an artifact meant for a machine consumer, not a human one. Autonomy means the data scientists own that code as well. All the way into production.
What is the role of an engineer in this new, horizontal world? To sum it up, engineers must deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy.
We are not optimizing the organization for efficiency, we are optimizing for autonomy.
It is absolutely essential for platform engineers to stay ahead of the data science teams. You need very sharp platform engineers who can make intuitive decisions about what services, frameworks, and capabilities need to be in place before they are desperately needed.
Here’s the thing. ETL engineers, Report Developers, and DBAs are all “Doers”. So, 10 years ago or so, when Big Data and data science started to become buzzwords, there were well-established BI departments who had plenty of Doers and not enough Thinkers. So, they made “Thinker” a role.
The fundamental flaw that prevents the Thinker and Doer model from living up to its recruiting hype is the assumption that there exists an army of soulless non-mediocre Doer engineers who eagerly implement the ideas and vision of data scientists. Does that sound like the profile of any talented engineers that you know?
Instead, you will hire mediocre engineers. They will create tremendously over complicated messes. This will exacerbate the contention. Welcome to the Vicious Cycle.
it is important to recognize that engineers and data scientists are impassioned by very different tasks:
Data scientists love working on problems that are vertically aligned with the business and make a big impact on the success of projects/organization through their efforts. They set out to optimize a certain thing or process or create something from scratch. These are point-oriented problems and their solutions tend to be as well. They usually involve a heavy mix of business logic, reimagining of how things are done, and a healthy dose of creativity. Thus, they require a deep understanding of how specific portions of the business operate and a high degree of partnership with business verticals.
Engineers excel in a world of abstraction, generalization, and finding efficient solutions in the places where they are needed. These problems are usually horizontally oriented in nature. They can be most impactful when applied broadly. They require a good overall understanding of how the business operates, but the abstracted nature of solutions mean they are light on business logic and do not require a heavy partnership with or deep understanding of verticals within the business.
In case you did not realize it, Nobody enjoys writing and maintaining data pipelines or ETL.
Instead, give people end-to-end ownership of the work they produce (autonomy). In the case of data scientists, that means ownership of the ETL. It also means ownership of the analysis of the data and the outcome of the data science.