Factor 2: Weighing the pros and cons
If we want to provide the data in lakehouse format so Spark jobs can slice and dice the data, then either shared tiering or materialization is an option.
Shared tiering might be preferable if reducing storage cost (by avoiding data duplication) is the primary concern. However, other factors are also at play, as explained earlier in "1. The challenges of shared tiering".
Materialization might be preferable if:
The primary and secondary systems have completely different access patterns, such that maintaining two copies of the data, each in its own format, is best. The secondary can organize the data to optimize its own performance, while the primary uses internal tiering and maintains its own optimized copy.
The primary does not want to own the burden of long-term management of the secondary storage.
The primary does not have control over the secondary storage (to the point where it cannot fully manage its lifecycle).
Performance- and reliability-conscious folks prefer to avoid the inherent risks associated with shared tiering, such as maintaining conversion logic across multiple schemas over time and the performance constraints caused by data organization limitations.
The secondary only really needs a derived dataset. For example, the lakehouse just wants a primary key table rather than an append-only stream, so the materializer performs key-based upserts and deletes as part of the materialization process.
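To make the last point concrete, here is a minimal sketch of how a materializer might fold an append-only stream into a primary key table. It is illustrative only: the `ChangeEvent` type, the `materialize` function, and the convention that a `None` value means "delete this key" are assumptions made for this example, not the API of any particular system. A real materializer would write to a table format rather than an in-memory dict.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One record from the append-only stream (hypothetical shape)."""
    key: str
    # Assumption for this sketch: a value of None is a tombstone (delete).
    value: Optional[Dict[str, Any]]


def materialize(
    events: Iterable[ChangeEvent],
    table: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Apply events in stream order, keeping only the latest row per key."""
    table = {} if table is None else table
    for event in events:
        if event.value is None:
            # Delete: drop the row if present, ignore unknown keys.
            table.pop(event.key, None)
        else:
            # Upsert: insert a new row or overwrite the existing one.
            table[event.key] = event.value
    return table


stream = [
    ChangeEvent("user-1", {"plan": "free"}),
    ChangeEvent("user-2", {"plan": "pro"}),
    ChangeEvent("user-1", {"plan": "pro"}),  # update to user-1
    ChangeEvent("user-2", None),             # delete user-2
]

table = materialize(stream)
print(table)  # {'user-1': {'plan': 'pro'}}
```

The stream holds four records, but the derived table holds one row. That difference is why the secondary may not want a byte-for-byte copy of the primary's data in the first place.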
Data duplication avoidance is certainly a key consideration, but by no means always the most important.