Factor 2: Weighing the pros and cons
If we want to provide the data in lakehouse format so Spark jobs can slice and dice the data, then either shared tiering or materialization is an option.
Shared tiering might be preferable if reducing storage cost (by avoiding data duplication) is the primary concern. However, other factors are also at play, as explained earlier in "1. The challenges of shared tiering".
Materialization might be preferable if:
The primary and secondary systems have completely different access patterns, such that maintaining two copies of the data, each in its own format, is best. The secondary can organize the data for its own performance, while the primary uses internal tiering to maintain its own optimized copy.
The primary does not want to own the burden of long-term management of the secondary storage.
The primary does not have control over the secondary storage (to the point where it cannot fully manage its lifecycle).
Performance- and reliability-conscious folks prefer to avoid the inherent risks of shared tiering: conversion logic spanning multiple schemas over time, performance constraints due to data organization limitations, and so on.
The secondary only really needs a derived dataset. For example, the lakehouse just wants a primary key table rather than an append-only stream, so the materializer performs key-based upserts and deletes as part of the materialization process.
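The key-based upserts and deletes described in that last point can be sketched in a few lines. This is a minimal, hypothetical illustration (the event shapes and field names `op`, `pk`, and `row` are my own, not from any real materializer): an append-only changelog is collapsed into a primary-key table by applying last-write-wins upserts and removing deleted keys.

```python
# Hypothetical sketch: collapsing an append-only changelog into a
# primary-key table, as a key-based materializer might do.
# Event shapes and field names ("op", "pk", "row") are assumptions.

def materialize(changelog):
    """Apply each changelog event as an upsert or delete,
    yielding the latest row per primary key."""
    table = {}
    for event in changelog:
        key = event["pk"]
        if event["op"] == "delete":
            table.pop(key, None)   # deletes for absent keys are no-ops
        else:                      # "upsert": last write wins
            table[key] = event["row"]
    return table

changelog = [
    {"op": "upsert", "pk": 1, "row": {"name": "a", "v": 1}},
    {"op": "upsert", "pk": 2, "row": {"name": "b", "v": 1}},
    {"op": "upsert", "pk": 1, "row": {"name": "a", "v": 2}},  # overwrites pk 1
    {"op": "delete", "pk": 2},                                # removes pk 2
]
print(materialize(changelog))  # {1: {'name': 'a', 'v': 2}}
```

The point of the sketch is that the secondary never sees the full append-only history, only the derived primary-key state.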
Data duplication avoidance is certainly a key consideration, but it is by no means always the most important one.
Final thoughts
Storage unification (aka data virtualization) is a large and nuanced subject. You can place the virtualization layer predominantly client-side or server-side, each with its pros and cons. Data tiering and data materialization are both valid options, and they can even be combined: a primary system that chooses to materialize data in a secondary system still retains the benefits of internally tiering its own data.
Tiering can come in the form of Internal Tiering or Shared Tiering, where shared tiering is a kind of hybrid that serves both primary and secondary systems. Shared tiering links a single storage layer to both primary and secondary systems, each with its own query patterns, performance needs, and logical data model. This has advantages, such as reducing data duplication, but it also means lifecycle policies, schema changes, and format evolution must be coordinated (and battle tested) so that the underlying storage remains compatible with both primary and secondary systems. With clear ownership by the primary system and disciplined management, these challenges can be manageable. Without them, shared tiering becomes a liability rather than an advantage.
On paper, materialization may seem like more work, since two different systems must remain consistent, but the opposite is more likely true. Keeping the canonical data a private concern of the primary system frees it from potentially complex, friction-laden compatibility work: juggling competing concerns and different storage technologies with potentially diverging future evolution. I would like to underline that making consistent copies of data is a long-standing and well-understood problem in data systems.
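One well-understood technique behind consistent copies is checkpointing: the materializer records the offset of the last changelog event it durably applied, so that after a crash or redelivery it neither skips nor double-applies events. The sketch below is illustrative only (the class and method names are my own, and a real system would commit the table write and the offset atomically in one transaction rather than in-memory as here).

```python
# Illustrative sketch of offset checkpointing for idempotent apply.
# In a real materializer, the row write and the committed offset
# would be persisted atomically; here both live in memory.

class CheckpointedMaterializer:
    def __init__(self):
        self.table = {}
        self.committed_offset = -1  # last applied changelog offset

    def apply(self, offset, op, key, row=None):
        if offset <= self.committed_offset:
            return  # already applied: replayed events are no-ops
        if op == "delete":
            self.table.pop(key, None)
        else:
            self.table[key] = row
        self.committed_offset = offset  # "commit" alongside the write

m = CheckpointedMaterializer()
m.apply(0, "upsert", 1, {"v": 1})
m.apply(1, "upsert", 1, {"v": 2})
m.apply(1, "upsert", 1, {"v": 99})  # duplicate delivery: ignored
print(m.table)  # {1: {'v': 2}}
```

The design choice worth noting is that idempotence comes from the offset check, not from the events themselves, which is what makes replay after failure safe.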
The urge to simply remove all data copies is understandable, as storage cost is a factor. But many other, often more important, factors are involved: performance constraints, reliability, lifecycle management complexity, and so on. If reducing at-rest storage cost is the main concern, however, then shared tiering, with its additional complexity, may be worth it.
I hope this post has been food for thought. With this conceptual framework, I will be writing in the near future about how various systems in the data infra space perform storage unification work.