How source system developers can help the Data team
Finally, I’ve gathered some advice for developers who design and build systems that can later serve as sources for Data Lakes/analytical platforms used by Data teams. Most of it concerns database design.
For simplicity, I will use the word Developer(s) (or development team) for the person/role responsible for source system design and development.
To be honest, there are a lot of small things that can be done during source system design and development that will:
- reduce the effort required from Data teams;
- reduce the number of requests from Data teams to the development team;
- increase overall performance (for Data teams and the development team alike).
The main secret, from my perspective, is that Developers have to understand that data from their systems will be used by other teams and systems, such as the Data team.
Here is a list of recommendations, each with the motivation (what happens if you don’t follow it):
- Do not reuse existing fields for a different purpose, even if the field is no longer used and, in your opinion, is no longer needed → Other systems, especially the Data team, can still use values from these fields;
- Do not delete or rename fields in tables → It will break the Data team’s ETL pipelines, which reference fields by name;
- Each table should have a Primary Key (PK) field (unique, not null), preferably based on a single field rather than a compound key → Otherwise it is difficult for the Data team to load/recalculate data incrementally, find duplicates, or troubleshoot;
- A PK field should have a meaningful name, not just ID, and when you create a Foreign Key (FK) field, it is better to give it the same name it has in the parent table → For instance, for a table “Product” it is better to name the PK “product_id” than just “id”. Otherwise it is tough for the Data team to understand whether tables are related and on which fields they should be joined (many technologies used by Data teams either don’t support explicit foreign keys or leave them unused for performance reasons); see the first sketch after this list;
- Do not hesitate to add technical fields that store the last update/change timestamp and the last update/change user name → Based on last_change_timestamp, the Data team can load data incrementally and resolve integration issues (see the watermark sketch after this list);
- Normalize your schema (unless you have serious technical reasons not to), ideally to 3rd normal form;
- If you want to store structured text in a field, use JSON instead of inventing your own format, and if your DBMS supports a JSON field type, use it → Otherwise the Data team has to spend extra effort on parsing (see the JSON sketch after this list);
- Define a unified standard for field and table comments and make them mandatory. If possible, include in the comments the names of the corresponding entity and attributes from the logical data model → Such an approach helps to build a Data Catalog (where IT/business users can find the data they need, specify data classification such as PII, and understand the data);
- When you develop transformation logic (that persists data to disk) in your systems, transforming data from one table into another, it is useful to also publish metadata (for external users/applications) that describes the column-to-column mapping: how columns from the source table map to target table columns, with the logic described → It helps to build data lineage and upload it to a Data Catalog (some links: OpenLineage, OpenMetadata); a minimal mapping sketch is the last example below.
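To make the naming and technical-field recommendations concrete, here is a minimal sketch in Python with SQLite; the tables “product” and “sales_order_item” are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# PK is a single column named after the entity ("product_id", not just "id"),
# plus technical fields recording who changed the row and when.
conn.execute("""
    CREATE TABLE product (
        product_id            INTEGER PRIMARY KEY,
        name                  TEXT NOT NULL,
        last_change_timestamp TEXT NOT NULL,
        last_change_user      TEXT NOT NULL
    )
""")

# The FK column keeps the same name as the PK it references, so the join
# condition stays obvious even where explicit FK constraints are unsupported
# or disabled for performance.
conn.execute("""
    CREATE TABLE sales_order_item (
        sales_order_item_id   INTEGER PRIMARY KEY,
        product_id            INTEGER NOT NULL REFERENCES product (product_id),
        quantity              INTEGER NOT NULL,
        last_change_timestamp TEXT NOT NULL,
        last_change_user      TEXT NOT NULL
    )
""")
```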
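And here is a minimal watermark sketch showing the incremental load that last_change_timestamp enables: the Data team keeps a watermark (the highest timestamp already loaded) and fetches only rows changed after it. The data and the watermark value are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY,"
             " name TEXT, last_change_timestamp TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    (1, "hammer",      "2023-12-30T08:00:00Z"),
    (2, "screwdriver", "2024-01-02T09:15:00Z"),
])

# Watermark = the highest last_change_timestamp already loaded downstream;
# a real pipeline would persist it in the Data team's own metadata store.
watermark = "2024-01-01T00:00:00Z"

# ISO-8601 timestamps compare correctly as text, so only rows changed
# after the watermark are fetched.
changed = conn.execute(
    "SELECT product_id, name, last_change_timestamp"
    " FROM product WHERE last_change_timestamp > ?",
    (watermark,),
).fetchall()
print(changed)  # [(2, 'screwdriver', '2024-01-02T09:15:00Z')]
```

The same field also helps with troubleshooting: it tells exactly which rows changed and when.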
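A minimal JSON sketch, assuming a SQLite build that includes the JSON1 functions (json_extract). A home-grown format such as 'color=red;weight=2' would force the Data team to write a custom parser instead:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY,"
             " attributes TEXT)")  # SQLite stores JSON as TEXT

# Structured attributes stored as standard JSON.
conn.execute("INSERT INTO product VALUES (?, ?)",
             (1, json.dumps({"color": "red", "weight_kg": 2})))

# Built-in DBMS functions can read it directly; no custom parsing needed.
color = conn.execute(
    "SELECT json_extract(attributes, '$.color') FROM product"
).fetchone()[0]
print(color)  # red
```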
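Finally, a minimal sketch of what column-to-column mapping metadata could look like. The schema here is my own illustration, not the actual OpenLineage/OpenMetadata format; the point is only that the mapping is machine-readable:

```python
import json

# Hypothetical mapping for one transformation step; table and column
# names are made up for illustration.
column_mapping = {
    "source_table": "sales_order_item",
    "target_table": "daily_product_sales",
    "columns": [
        {"source": "product_id",
         "target": "product_id",
         "logic": "as is"},
        {"source": "quantity",
         "target": "total_quantity",
         "logic": "SUM(quantity) per product_id per day"},
        {"source": "last_change_timestamp",
         "target": "sales_date",
         "logic": "DATE(last_change_timestamp)"},
    ],
}

print(json.dumps(column_mapping, indent=2))
```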
Additionally, there are recommendations regarding CDC (change data capture: gathering only those records that have changed since the last extraction). You don’t need to use all of them together; they are listed in priority order:
- Use event sourcing / an event stream (e.g. send all changes to a message broker);
- Use a DBMS that is supported by CDC tools (e.g. Debezium);
- Create a separate table that stores all changes together with the change type (insert, delete, update); see the sketch below.
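For the third option, here is a minimal sketch of a change-log table kept in sync by triggers, again in Python with SQLite; the table and trigger names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")

# Separate change-log table: one row per change, with the change type.
conn.execute("""
    CREATE TABLE product_change_log (
        change_id   INTEGER PRIMARY KEY,
        product_id  INTEGER NOT NULL,
        change_type TEXT NOT NULL,   -- 'insert' | 'update' | 'delete'
        changed_at  TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

# Triggers keep the log in sync without changing application code.
conn.executescript("""
    CREATE TRIGGER product_insert_log AFTER INSERT ON product BEGIN
        INSERT INTO product_change_log (product_id, change_type)
        VALUES (NEW.product_id, 'insert');
    END;
    CREATE TRIGGER product_update_log AFTER UPDATE ON product BEGIN
        INSERT INTO product_change_log (product_id, change_type)
        VALUES (NEW.product_id, 'update');
    END;
    CREATE TRIGGER product_delete_log AFTER DELETE ON product BEGIN
        INSERT INTO product_change_log (product_id, change_type)
        VALUES (OLD.product_id, 'delete');
    END;
""")

conn.execute("INSERT INTO product VALUES (1, 'hammer')")
conn.execute("UPDATE product SET name = 'sledgehammer' WHERE product_id = 1")
conn.execute("DELETE FROM product WHERE product_id = 1")
print(conn.execute(
    "SELECT product_id, change_type FROM product_change_log").fetchall())
# [(1, 'insert'), (1, 'update'), (1, 'delete')]
```

The Data team can then read product_change_log incrementally by change_id.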
Any comments are welcome.