How we cut AWS bandwidth costs 95% with dm-cache: fast local SSD caching for network storage
The bandwidth billing challenge
When deploying infrastructure across multiple AWS availability zones (AZs), bandwidth costs can become a significant operational expense. Some of our Upsun infrastructure spans three AZs for high availability, but this architecture created an unexpected challenge with our Ceph-based storage system.
Since Ceph distributes data evenly across the cluster and AWS bills for inter-AZ traffic, approximately two-thirds of our disk I/O traffic crossed AZ boundaries: with data spread across three AZs, only about one-third of any instance's requests can be served from its own zone. With all disk operations flowing over the network rather than accessing local storage, we needed a solution that could reduce this costly network traffic without compromising our distributed storage benefits.
The local SSD caching experiment
Our AWS instance types included small amounts of local SSD storage that weren’t being utilized for primary storage. This presented an opportunity: what if we could use these fast, local disks as a read cache in front of our network-based Ceph storage?
We implemented a three-step caching strategy using the Linux device mapper's cache target (dm-cache):
1. Volume partitioning: Used LVM to split the local SSD into small 512MB cache volumes
2. Read-only caching: Configured dm-cache to place these volumes in front of our Ceph RBD (RADOS Block Device) volumes, caching reads while passing writes directly through to the network storage
3. Container integration: Exposed the dm-cache devices to our containers as their primary storage interface
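The three steps above can be sketched as a small shell sequence. This is an illustrative outline rather than our exact production tooling: the device paths (`/dev/nvme1n1` for the local SSD, `/dev/rbd0` for the Ceph volume), the volume group name, and the mount point are all hypothetical, and the commands need root plus real block devices to run.

```shell
# 1) Volume partitioning: carve the local SSD into small cache volumes
#    with LVM (one data LV per cached RBD, plus a metadata LV).
pvcreate /dev/nvme1n1
vgcreate vg_cache /dev/nvme1n1
lvcreate -L 512M -n rbd0_cache vg_cache   # cache data volume
lvcreate -L 16M  -n rbd0_meta  vg_cache   # cache metadata volume

# 2) Read-only caching: assemble a dm-cache device in writethrough mode,
#    so reads are served from (and populate) the local SSD while every
#    write goes straight through to the RBD origin device.
#    Table format: start len cache <meta> <data> <origin> <block-size>
#                  <#features> <features> <policy> <#policy-args>
dmsetup create rbd0_cached --table "0 $(blockdev --getsz /dev/rbd0) cache \
  /dev/vg_cache/rbd0_meta /dev/vg_cache/rbd0_cache /dev/rbd0 \
  512 1 writethrough default 0"

# 3) Container integration: expose the cached device to the container
#    as its primary storage interface.
mount /dev/mapper/rbd0_cached /var/lib/containers/app-data
```

Writethrough mode is what makes this safe for a small, disposable local disk: the cache never holds dirty data, so losing the instance's SSD costs only warm cache contents, never writes.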
Understanding dm-cache architecture
The dm-cache kernel module was originally designed to address a classic storage trade-off: placing small, expensive SSDs in front of large, affordable HDDs to create hybrid storage with both capacity and performance. Our use case follows the same pattern—except instead of slow HDDs, we’re caching in front of network-attached storage.
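Because dm-cache is an ordinary device-mapper target, its behavior can be observed with standard `dmsetup` tooling. A quick way to see whether the cache is earning its keep is to read the device's table and status lines (the device name `rbd0_cached` here is hypothetical):

```shell
# The table line shows the assembled stack: the metadata device, the
# cache (SSD) device, the origin (network) device, and the cache policy.
dmsetup table rbd0_cached

# The status line reports cache occupancy plus read-hit/read-miss and
# write-hit/write-miss counters. A high ratio of read hits to read
# misses means the local SSD is absorbing reads that would otherwise
# have crossed the network (and, in our case, an AZ boundary).
dmsetup status rbd0_cached
```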