We built a storage cluster in downtown SF to store 90 million hours’ worth of video data. Why? We’re pretraining models to solve computer use. Compared to text LLMs like LLaMa-405B, which require ~60 TB of text data to train, video is large enough that we need 500 times more storage. Instead of paying the $12 million per year it would cost to store all of this on AWS, we rented space from a colocation center in San Francisco, bringing that cost down ~40x to $354k per year, including depreciation.
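As a rough sanity check on the scale involved, the figures above can be turned into a total capacity and an average per-hour footprint. This is a back-of-envelope sketch, not the team's own sizing: it assumes decimal units (1 PB = 10^15 bytes) and treats "500 times" as an exact multiplier, and the variable names are illustrative.

```python
# Back-of-envelope sizing from the figures quoted in the introduction.
# Assumptions (not from the original post): decimal units, exact 500x multiplier.
TEXT_CORPUS_TB = 60        # ~60 TB of text data
VIDEO_MULTIPLIER = 500     # "500 times more storage"
VIDEO_HOURS = 90_000_000   # 90 million hours of video

total_bytes = TEXT_CORPUS_TB * 1e12 * VIDEO_MULTIPLIER
total_pb = total_bytes / 1e15

# Average footprint of one hour of video, and the bitrate that implies.
mb_per_hour = total_bytes / VIDEO_HOURS / 1e6
avg_bitrate_mbps = total_bytes * 8 / (VIDEO_HOURS * 3600) / 1e6

print(f"total storage:   {total_pb:.0f} PB")
print(f"per video hour:  {mb_per_hour:.0f} MB")
print(f"average bitrate: {avg_bitrate_mbps:.2f} Mbit/s")
```

Under these assumptions the cluster holds about 30 PB, which works out to roughly 333 MB per hour of video, or an average bitrate of about 0.74 Mbit/s.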
Why
Our use case for data is unique. Most cloud providers place a premium on redundancy, availability, and data integrity, which are largely unnecessary for ML training data. Since pretraining data is a commodity (we can lose any individual 5% with minimal impact), we can tolerate far more data corruption than enterprises that need guarantees that their user data isn’t going anywhere. In other words, we don’t need AWS’s 13 nines of reliability; 2 is more than enough.
Additionally, storage tends to be priced substantially above cost. Most companies use relatively small amounts of storage (even ones like Discord still use under a petabyte for messages), and the companies that use petabytes are so large that storage remains a tiny fraction of their total compute spend.
Data is one of our biggest constraints, and storing it in the cloud would be prohibitively expensive. As long as the cost projections favored a local datacenter and the build would not consume too much of the core team’s time, it made sense to stack hard drives ourselves. [1]

[1] We talked to some engineers at the Internet Archive, which had basically the same problem as us; even after massive friends & family discounts on AWS, it was still 10 times more cost-effective to buy racks and store the data themselves!
The Cost Breakdown: Cloud Alternatives vs In-House
Internet and electricity are our only recurring expenses, totaling $17.5k per month (the price of colocation space, cooling, etc. is bundled into the electricity cost). One-time costs were dominated by hard drive capex. [2]

[2] When deciding on the datacenter location, we had multiple options across the Bay Area, including Fremont through Hurricane Electric for around $10k in setup fees and $12.8k per month, which would have saved us $38.5k initially and $4.7k per month. We ended up opting for a datacenter only a couple of blocks from our office in SF. Though this came at a premium, it was extremely helpful for getting the initial nodes set up and for ongoing maintenance. Our team is just 5 people, so any friction in going to the datacenter would come at a noticeable cost to team productivity.
Table 1: Cost comparison of cloud alternatives vs in-house. AWS is $1,130,000/month including estimated egress, Cloudflare is $270,000/month (with bulk-discounted pricing), and our datacenter is $29,500/month (including recurring costs and depreciation).
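The monthly figures in Table 1 can be cross-checked against the annual cost and the ~40x savings quoted in the introduction. The sketch below only restates the numbers given in the text; the dictionary and function names are illustrative, and the "implied depreciation" is simply the in-house total minus the $17,500 of recurring costs, not a figure the post states directly.

```python
# Monthly costs in USD, as quoted in the Table 1 caption.
MONTHLY_COST_USD = {
    "AWS": 1_130_000,        # includes estimated egress
    "Cloudflare": 270_000,   # bulk-discounted pricing
    "In-house": 29_500,      # recurring costs plus depreciation
}
IN_HOUSE_RECURRING_USD = 17_500  # internet + electricity per month


def annual(monthly_usd: int) -> int:
    """Convert a monthly cost to a yearly cost."""
    return monthly_usd * 12


def ratio_vs_in_house(provider: str) -> float:
    """How many times more expensive a provider is than the in-house cluster."""
    return MONTHLY_COST_USD[provider] / MONTHLY_COST_USD["In-house"]


# Derived value (an inference, not stated in the post): monthly depreciation.
implied_depreciation_usd = MONTHLY_COST_USD["In-house"] - IN_HOUSE_RECURRING_USD

for name, monthly in MONTHLY_COST_USD.items():
    print(f"{name:<10} ${annual(monthly):>12,}/yr  {ratio_vs_in_house(name):5.1f}x in-house")
print(f"implied depreciation: ${implied_depreciation_usd:,}/month")
```

The in-house figure annualizes to $354,000, matching the introduction, and AWS comes out at about 38 times the in-house cost, consistent with the "~40x" claim.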
Monthly Recurring Costs
| Item | Cost | Notes |
|---|---|---|
| Internet | $7,500/month | 100Gbps DIA from Zayo, 1yr term. |
| Electricity | $10,000/month | 1 kW/PB, $330/kW. Includes cabinet space & cooling. 1yr term. |
| Total Monthly | $17,500/month | |
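The electricity line follows from the power density and rate given in the notes. The sketch below reproduces that arithmetic under one assumption that is not stated in the table itself: a capacity of 30 PB, inferred from the introduction's "60 TB times 500". The function and constant names are illustrative.

```python
# Rates quoted in the recurring-cost notes.
KW_PER_PB = 1.0            # power draw per petabyte
USD_PER_KW_MONTH = 330     # bundled rate: power, cabinet space, cooling
INTERNET_USD_MONTH = 7_500

# Assumption: ~30 PB of capacity, inferred from 60 TB x 500 in the introduction.
CAPACITY_PB = 30


def electricity_cost(capacity_pb: float) -> float:
    """Monthly electricity cost in USD for a given capacity in petabytes."""
    return capacity_pb * KW_PER_PB * USD_PER_KW_MONTH


electricity = electricity_cost(CAPACITY_PB)
total_recurring = electricity + INTERNET_USD_MONTH

print(f"electricity:     ${electricity:,.0f}/month")
print(f"total recurring: ${total_recurring:,.0f}/month")
```

Under the 30 PB assumption this gives $9,900 per month for electricity and $17,400 in total, which lines up with the rounded $10,000 and $17,500 figures in the table.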