How source-system developers can help the Data team

Alexey Artemov
3 min read · Oct 18, 2021


Finally, I’ve gathered some advice for developers who design and build systems that later serve as sources for Data Lakes/Analytical platforms → Data teams. Most of the advice concerns database design.

To simplify, I will use the word Developer(s) (development team) for the person/role responsible for source-system design and development.

To be honest, there are a lot of small things that can be done during source-system design/development that will:

  • reduce effort for Data teams;
  • reduce the number of requests from Data teams to the Development team;
  • increase overall performance (for Data teams and the Development team alike).

The main secret, from my perspective, is that Developers have to understand that data from their systems will be used by other teams/systems, such as the Data team.

Here is a list of recommendations, each with the motivation (what happens if you don’t follow it):

  • Do not reuse existing fields for a different purpose, even if, in your opinion, the field is no longer needed → other systems, especially the Data team, may still rely on values from that field;
  • Do not delete/rename fields in tables → it will break the Data team’s ETL pipelines, because they reference fields by name;
  • Each table should have a Primary Key (PK) field (unique, not null), preferably based on a single field (not compound) → otherwise it is difficult for the Data team to load/recalculate data incrementally, find duplicates, and troubleshoot;
  • A PK field should have a meaningful name — not just ID — and when you create a Foreign Key (FK) field, it is better to give it the same name it has in the referenced table. For instance, if we have a table “Product”, it is better to name the PK “product_id” rather than just “id”. Otherwise it is tough for the Data team to understand whether tables are related and on which fields they should be joined (many technologies used by Data teams don’t support explicit foreign keys, or the keys are left unused for performance reasons);
  • Do not hesitate to add technical fields that store the last update/change timestamp and the last update/change user name → based on last_change_timestamp the Data team can build incremental integration and resolve integration issues;
  • Do normalization (unless you have serious technical reasons against it), preferably to 3rd Normal Form;
  • If you wish to store structured text in a field, use JSON instead of inventing your own format, and if your DBMS supports a JSON field type, use it → otherwise the Data team will have to spend additional effort on parsing;
  • Make a unified standard for field & table comments and define them as mandatory. If possible, include in the comments the names of the entity & attributes from a logical data model → such an approach helps to build a Data Catalog (where IT/Business can find the necessary data, specify data classification such as PII, and understand the data);
  • When you develop transformation logic that persists data to disk in your systems (transforming data from one table to another), it is useful to also add metadata (for external users/applications) that describes the column-to-column mapping (how columns from the source table map to target table columns, with a description of the logic) → this helps to build data lineage and upload it to a Data Catalog (some links: OpenLineage, OpenMetadata).
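Several of the recommendations above (meaningful single-field PKs, FKs that reuse the PK name, technical change-tracking fields, JSON for structured text) can be sketched in one schema. This is a minimal illustration using SQLite from Python; the table and field names (product, order_item, last_change_ts, last_change_user) are made up for the example, not a prescribed standard.

```python
import json
import sqlite3
from datetime import datetime, timezone

con = sqlite3.connect(":memory:")

con.execute("""
CREATE TABLE product (
    product_id       INTEGER PRIMARY KEY,  -- meaningful single-field PK, not just "id"
    name             TEXT NOT NULL,
    attributes       TEXT,                 -- structured text stored as JSON, not a custom format
    last_change_ts   TEXT NOT NULL,        -- technical field: last update/change timestamp
    last_change_user TEXT NOT NULL         -- technical field: last update/change user name
)""")

con.execute("""
CREATE TABLE order_item (
    order_item_id    INTEGER PRIMARY KEY,
    product_id       INTEGER REFERENCES product(product_id),  -- FK reuses the PK name
    quantity         INTEGER NOT NULL,
    last_change_ts   TEXT NOT NULL,
    last_change_user TEXT NOT NULL
)""")

now = datetime.now(timezone.utc).isoformat()
con.execute(
    "INSERT INTO product VALUES (1, 'Widget', ?, ?, 'app_service')",
    (json.dumps({"color": "red"}), now),
)

# Thanks to last_change_ts, the Data team can load data incrementally,
# fetching only rows changed since the previous load:
rows = con.execute(
    "SELECT product_id, attributes FROM product WHERE last_change_ts > ?",
    ("2021-01-01",),
).fetchall()
print(rows)
```

Because the FK column is also named `product_id`, anyone looking at the two tables can guess the join condition even when the DBMS has no explicit foreign-key metadata exposed.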

Additionally, there are recommendations regarding CDC (change data capture — capturing only the records that have changed since the last load). You don’t need to use all of them together; they are listed in priority order:

  • Use Event Sourcing/event stream (e.g. send all changes to a message broker);
  • Use a DBMS that is supported by CDC tools (e.g. Debezium);
  • Create a separate table that will store all changes with change type (insert, delete, update).
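The third option (a separate table that stores all changes with their change type) can be sketched as follows. This is a minimal example using SQLite triggers from Python; the table and trigger names are illustrative, and a real implementation would also cover schema evolution and retention of old change rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A change-log table plus triggers that append one row per insert/update/delete.
con.executescript("""
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);

CREATE TABLE product_changes (
    change_id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic offset for incremental reads
    change_type TEXT NOT NULL CHECK (change_type IN ('insert', 'update', 'delete')),
    product_id  INTEGER NOT NULL,
    name        TEXT                                -- row state after the change (NULL for deletes)
);

CREATE TRIGGER product_ai AFTER INSERT ON product BEGIN
    INSERT INTO product_changes (change_type, product_id, name)
    VALUES ('insert', NEW.product_id, NEW.name);
END;

CREATE TRIGGER product_au AFTER UPDATE ON product BEGIN
    INSERT INTO product_changes (change_type, product_id, name)
    VALUES ('update', NEW.product_id, NEW.name);
END;

CREATE TRIGGER product_ad AFTER DELETE ON product BEGIN
    INSERT INTO product_changes (change_type, product_id, name)
    VALUES ('delete', OLD.product_id, NULL);
END;
""")

con.execute("INSERT INTO product VALUES (1, 'Widget')")
con.execute("UPDATE product SET name = 'Widget v2' WHERE product_id = 1")
con.execute("DELETE FROM product WHERE product_id = 1")

# The Data team remembers the last change_id it processed and reads only newer rows:
last_seen = 0
changes = con.execute(
    "SELECT change_id, change_type, product_id, name FROM product_changes "
    "WHERE change_id > ?",
    (last_seen,),
).fetchall()
print(changes)
```

The monotonically increasing `change_id` plays the role of an offset, so incremental extraction never needs to diff full table snapshots.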

Any comments are welcome.



Written by Alexey Artemov

Staff Data Engineer | MLOps | Data Architect | AWS | Databricks | Data Governance. https://www.linkedin.com/in/aartemov/
