We built a storage cluster in downtown SF to store 90 million hours of video data. Why? We're pretraining models to solve computer use. Compared to text LLMs like LLaMa-405B, which require ~60 TB of text data to train, video is so much heavier that we need roughly 500 times more storage (~30 PB). Instead of paying the $12 million/year it would cost to store all of this on AWS, we rented space from a colocation center in San Francisco to bring that cost down ~40x, to $354k per year including depreciation.

## Why

Our use case for data is unusual. Most cloud providers care a great deal about redundancy, availability, and data integrity, which tends to be unnecessary for ML training data. Since pretraining data is a commodity (we can lose any individual 5% of it with minimal impact), we can tolerate far more data corruption than enterprises that need guarantees their user data isn't going anywhere. In other words, we don't need AWS's eleven nines of durability; two is more than enough.

Additionally, storage tends to be priced substantially above cost. Most companies use relatively small amounts of storage (even ones like Discord still use under a petabyte for messages), and the companies that use petabytes are so large that storage remains a tiny fraction of their total compute spend. Data is one of our biggest constraints, and storing it in the cloud would be prohibitively expensive. As long as the cost predictions worked out in favor of a local datacenter, and it would not consume too much of the core team's time, it made sense to stack hard drives ourselves. [1]

[1]: We talked to some engineers at the Internet Archive, who have basically the same problem as us; even after massive friends & family discounts on AWS, it was still 10 times more cost-effective for them to buy racks and store the data themselves!

## The Cost Breakdown: Cloud Alternatives vs In-House

Internet and electricity, totaling $17.5k/month, are our only recurring expenses (colocation space, cooling, etc. are bundled into the electricity costs). One-time costs were dominated by hard drive capex. [2]

[2]: When deciding on a datacenter location we had multiple options across the Bay Area, including space in Fremont through Hurricane Electric for around $10k in setup fees and $12.8k per month, saving us $38.5k initially and $4.7k per month, but we ended up opting for a datacenter only a couple of blocks from our office in SF. Though this came at a premium, it was extremely helpful for getting the initial nodes set up and for ongoing maintenance. Our team is just 5 people, so any friction in getting to the datacenter would come at a noticeable cost to team productivity.

Table 1: Cost comparison of cloud alternatives vs in-house. AWS is $1,130,000/month including estimated egress, Cloudflare is $270,000/month (with bulk-discounted pricing), and our datacenter is $29,500/month (including recurring costs and depreciation).

### Monthly Recurring Costs

| Item | Cost | Notes |
| --- | --- | --- |
| Internet | $7,500/month | 100Gbps DIA from Zayo, 1-year term. |
| Electricity | $10,000/month | 1 kW/PB at $330/kW. Includes cabinet space & cooling. 1-year term. |
| **Total** | **$17,500/month** | |

### One-Time Costs

| Category | Item | Cost | Details |
| --- | --- | --- | --- |
| Storage | Hard drives (HDDs) | $300,000 | 2,400 drives, mostly 12TB used enterprise drives (3/4 SATA, 1/4 SAS). The DS4246 JBODs work with either. |
| Storage infrastructure | NetApp DS4246 chassis | $35,000 | 100 dual SATA/SAS chassis, 4U each. |
| Compute | CPU head nodes | $6,000 | 10 Intel R2000s from eBay. |
| Datacenter setup | Install fee | $38,500 | One-off datacenter install fee. |
| Labor | Contractors | $27,000 | Contractors to help physically install and screw in racks and wire cables. |
| Networking & misc | Install expenses | $20,000 | Power cables, 100GbE QSFP28 ConnectX-4 NICs, Arista switch, copper jumpers, one-time internet install fee. |
| **Total** | | **$426,500** | |
Our price assuming three-year depreciation (including the one-off install fees) is $17.5k/month in fixed costs (internet, power, etc.) plus $12k/month in depreciation, for $29.5k/month overall.

We compare our costs against two main providers: AWS's public pricing as a baseline, and Cloudflare's discounted pricing for 30 PB of storage. It's important to note that AWS egress would be substantially lower if we used AWS GPUs. This is not reflected in the comparison because AWS GPUs are priced substantially above market and large clusters are difficult to obtain, making them untenable at our compute scale. Here are the pricing breakdowns:

### AWS Pricing Breakdown

| Cost Component | Rate | Monthly Cost | Notes |
| --- | --- | --- | --- |
| Storage | $0.021/GB/month | $630,000 | Rate for data over 500 TB. |
| Egress | $0.05/GB | $500,000 | Entire dataset egressed quarterly (10 PB/month). |
| **Total** | | **$1,130,000** | |

### Cloudflare R2 Pricing

| Pricing Tier | Rate | Monthly Cost | Notes |
| --- | --- | --- | --- |
| Published rate | $0.015/GB/month | $450,000 | No egress fees. |
| Estimated private pricing [3] | $0.009/GB/month | $270,000 | Estimated rate at >20 PB scale. |

[3]: Cloudflare has a more reasonable estimate for the 30 PB, placing the overall monthly cost at $270k with no egress fees. We also have bulk-discounted pricing estimates from quotes; this was our main point of comparison for the datacenter.

That brings monthly costs to $38/TB/month for AWS, $10/TB/month for Cloudflare, and $1/TB/month for our datacenter: about 38x and 10x lower, respectively. (At the very cheapest end of the spectrum, Backblaze has a $6/TB product that is unsuitable for model training due to egress speed limitations; their $15/TB Overdrive AI-specific storage product is closer to Cloudflare's in price and performance.)
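If you want to check the math yourself, the whole comparison reduces to a few divisions. A throwaway sketch using the rounded figures from the tables above:

```rust
// Sanity-check of the monthly cost math above (all figures in USD).
fn main() {
    let one_time = 426_500.0; // total one-time costs
    let fixed_monthly = 17_500.0; // internet + electricity
    let depreciation = one_time / 36.0; // three-year straight-line: ~$11.8k/month
    let ours_monthly = fixed_monthly + depreciation; // ~$29.3k ($29.5k rounding depreciation to $12k)

    let capacity_tb = 30_000.0; // 30 PB
    for (name, monthly) in [
        ("AWS", 1_130_000.0),
        ("Cloudflare R2 (private pricing)", 270_000.0),
        ("Our datacenter", ours_monthly),
    ] {
        println!("{name}: ${:.0}/TB/month", monthly / capacity_tb);
    }
    // Prints roughly $38, $9, and $1 per TB-month respectively.
}
```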
While we use Cloudflare as a comparison point, we've sometimes generated more load than their R2 servers could take. In particular, during large model training runs in the past we've produced enough load that they rate-limited us, later confirming that we were saturating their metadata layer and that the rate limit wasn't synthetic. Because the metadata on our heap is so simple, and we have a 100Gbps DIA connection, we haven't run into any such issues. [4]

[4]: We love Cloudflare and use many of their products often; we include this anecdote as a fact about our scale being difficult to handle, not as a dig!

This setup was and is necessary for our video data pipelines, and we're extremely happy that we made this investment. By gathering large-scale data at low cost, we can be competitive with frontier labs with billions of dollars in capital.

## Setup/The Process

We cared a lot about getting this built fast, because this kind of project can easily stretch on for months if you're not careful. Hence Storage Stacking Saturday, or S3. We threw a hard drive stacking party in downtown SF and got our friends to come, offering food and custom-engraved hard drives to all who helped. The hard drive stacking started at 6am and continued for 36 hours (with a break to sleep), and by the end of that time we had 30 PB of functioning hardware racked and wired up. We brought in contractors for additional help and professional installation later on in the event.

*[Photos: people at the hard drive stacking party, and some shots of the servers.]*

Our software is 200 lines of Rust for writing (to determine which drive each piece of data is written to) and an nginx webserver for reading, with a simple SQLite database tracking metadata like which heap node each file is on and which data split it belongs to. We kept this obsessively simple instead of using MinIO or Ceph because we didn't need any of the features they provide; it's much, much simpler to debug a 200-line program than to debug Ceph, and we weren't worried about redundancy or sharding. All our drives were formatted with XFS.
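For flavor, here's a minimal sketch of what a writer in this style can look like. This is not our actual code: the schema, the `/mnt/driveN` mount paths, and the least-used placement policy are illustrative assumptions, and we assume the `rusqlite` crate for the metadata database.

```rust
// Toy heap writer: pick a drive, write the file, record its location.
// Assumes rusqlite = "0.31" and drives mounted at /mnt/drive0 .. /mnt/drive23.
use rusqlite::{params, Connection};
use std::{error::Error, fs, path::PathBuf};

const DRIVES: usize = 24; // one DS4246's worth, for illustration

fn main() -> Result<(), Box<dyn Error>> {
    let db = Connection::open("heap.sqlite")?;
    db.execute(
        "CREATE TABLE IF NOT EXISTS files (
            name  TEXT PRIMARY KEY,
            drive INTEGER NOT NULL,
            bytes INTEGER NOT NULL,
            split TEXT NOT NULL
        )",
        [],
    )?;

    // Example payload; in practice this would be a chunk of video.
    let (name, payload, split) = ("clip_000001.mp4", vec![0u8; 1 << 20], "train");

    // Place the file on the drive holding the fewest bytes so far, according
    // to our own metadata (statvfs on each mount would work equally well).
    let mut used = vec![0i64; DRIVES];
    let mut stmt = db.prepare("SELECT drive, SUM(bytes) FROM files GROUP BY drive")?;
    for row in stmt.query_map([], |r| Ok((r.get::<_, i64>(0)?, r.get::<_, i64>(1)?)))? {
        let (drive, bytes) = row?;
        used[drive as usize] = bytes;
    }
    let drive = (0..DRIVES).min_by_key(|&d| used[d]).unwrap();

    let dir = PathBuf::from(format!("/mnt/drive{drive}"));
    fs::write(dir.join(name), &payload)?;
    db.execute(
        "INSERT INTO files (name, drive, bytes, split) VALUES (?1, ?2, ?3, ?4)",
        params![name, drive as i64, payload.len() as i64, split],
    )?;
    Ok(())
}
```

Reads then need no cleverness on top: nginx serves the drive mounts directly, and the metadata table tells you which node and path to fetch.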
The storage software landscape offers many options, but every one of them comes with drawbacks. People experienced with Ceph strongly warned us to avoid it unless we were willing to hire dedicated Ceph specialists, and our research confirmed this advice: Ceph appears far more complex than is justified for most use cases, worthwhile only for companies that absolutely need maximum performance and customizability and are prepared to invest heavily in tuning. MinIO is an interesting option if S3 compatibility is essential, but otherwise remains a bit too fancy for us and similar use cases. Weka and VAST are absurdly expensive, at roughly $2k/TB/year, and are primarily designed for NVMe drives, not spinning disks.

## Post-Mortem

Building the datacenter was a large endeavor, and we definitely learned lessons, both good and bad.

### Things That We Got Correct

- We think the redundancy & capability tradeoffs we made are very reasonable at our disk speeds. We're able to approximately saturate our 100G network for both reads and writes.
- Doing this locally, a couple of blocks away, was well worth it because of the amount of debugging and manual work needed.
- eBay is good for finding vendors but bad for actually buying things. After you find vendors, they can often individually supply all the parts you need and provide warranties, which are extremely valuable.
- 100G dedicated internet is pretty important, and it's much, much easier to debug issues with than cloud products.
- High-quality cable management during the racking process saved us a ton of debugging time in the long run; making it easy to change the networking around saved us a lot of headache.
- We had a very strong simplicity prior, and this saved an immense amount of effort. We are quite happy that we didn't use Ceph or MinIO; unlike e.g. nginx, they do not work out of the box. We were willing to write a simple Rust script, and we roughly saturated our network reads & writes at 100 Gbps without any fancy code.
- We were basically right about the price and the advantages this offered, and did not substantially overestimate the amount of time and effort it would take. While the list of improvements is longer than this one, most of them are minor; fundamentally, we built a cluster rivaling massive clouds for 40x cheaper.

### Difficult Bits

A map of reality only gets you so far; while setting up the datacenter we ran into a few problems and unexpected challenges:

- We used frontloaders instead of toploaders for our server racks. This meant we had to screw in every single drive individually: tedious for 2,400 HDDs.
- Our storage was not dense. We could have saved 5x the work on physical placement and screwing by having a denser array of hard drives.
- Shortcuts like daisy-chaining are usually a bad idea. We could have gotten substantially higher read/write speeds by skipping daisy-chained networked nodes and giving each chassis its own HBA (Host Bus Adapter, not a significant cost).
- Compatibility is key. In networking, functionally everything is locked to a specific brand, and we had many pain points here. Fiber transceivers will ~never work unless used with the matching brand, but copper cables are much more forgiving. FS.com is pretty good and well priced (though their speed estimates were pretty inconsistent); Amazon will also often have the parts you need quickly.
- Networking was a substantial cost and required experimentation. We did not use DHCP, as most enterprise switches don't support it and we wanted public IPs on the nodes for convenient and performant access from our servers. While this is an area where we would have saved time with a cloud solution, we had our networking up within days and the kinks ironed out within ~3 weeks.
- We were often bottlenecked by easy access to servers via monitor/keyboard; idle crash carts during setup are helpful.

### Ideas Worth Trying

- Working KVMs are extremely useful, and you shouldn't go without them or good IPMI. Physically going to a datacenter is really inconvenient, even if it's a block away. IPMI is good, but only if you have pretty consistent machines.
- Think through your management Ethernet network as much as your real network. It's really nice to be able to SSH into servers while configuring the network, and IPMI is great!
- Overprovision your network: if doable, it's worth having 400 Gigabit internally (you can use 100G cards etc. for this!).
- We could have substantially increased density, at additional upfront cost, by buying 90-drive SuperMicro SuperServers and putting 20TB drives into them. This would have let us use 2 racks instead of 10, given us roughly the equivalent of 20 AMD 9654s in total CPU capacity, and used less total power.

## How You Can Build This Yourself

Here's what you need to replicate our setup.

### Storage

- 10 CPU head nodes. We used Intel R2000s with dual Intel Xeon Gold 6148s and 128GB of DDR4 ECC RAM per server (which are incredibly cheap and roughly worked for our use case), but you have a lot of flexibility in what you use. With the above configuration you likely won't be able to do anything CPU-intensive on the servers (like on-device data processing or ZFS compression/deduplication, which is valuable if you're storing non-video data). Our CPU nodes cost $600 each; it seems quite reasonable to spend up to $3k each if you want ZFS/compression or the ability to do data processing on-CPU.
- 100 DS4246 chassis; each can hold 24 hard drives.
- 2,400 3.5-inch HDDs, which need to be all SATA or all SAS within each chassis. We would recommend SAS drives if possible, as they roughly double the speed of similar SATA drives. [5] We used a mix of 12TB and 14TB drives; basically any size should work, and roughly the larger the better holding price constant (density makes stacking easier and generally increases resale value).
- Physical parts to mount the chassis: you'll need rails or L-brackets. We used L-brackets, which worked well since we haven't needed to take a chassis out to slot hard drives. If you buy toploaders, you'll need rails.
- Multiple "crash carts" with monitors and keyboards that let you physically connect to your CPU head nodes and configure them; these are invaluable when you're debugging network issues.

[5]: If you use SAS drives you'll need to deal with (or disable) multipathing, which is reasonably simple.
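If you want to resize this build, the capacity arithmetic is simple. A quick sketch (our drive count and chassis layout; the per-cabinet figures match the datacenter requirements below):

```rust
// Back-of-envelope sizing for a DS4246-based build like ours.
fn main() {
    let chassis = 100u64;
    let drives_per_chassis = 24u64;
    let drive_tb = 12u64; // mostly 12TB drives; we mixed in some 14TB

    let drives = chassis * drives_per_chassis; // 2,400
    let raw_pb = drives * drive_tb / 1_000; // ~28-30 PB depending on drive mix

    // Ten 4U chassis fit in a 42U cabinet alongside a 2U head node, which is
    // where the "one cabinet per ~3 PB" rule of thumb below comes from.
    let cabinets = chassis / 10;
    println!("{drives} drives, ~{raw_pb} PB raw, {cabinets} cabinets");
}
```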
### Network

- A 100 GbE switch. A used Arista is fine; it should be QSFP28 and should cost about $1-2k.
- HBAs (Host Bus Adapters), which connect your head nodes to your DS4246 chassis. The best configuration we tried was Broadcom 9305-16E HBAs, 3 per server (make sure your server has physical space for them!), with SFF-8644 to QSFP mini-SAS cables. There are 4 ports per HBA, so you can cable each DS4246 chassis directly to an HBA. [6]

  [6]: The option we ended up going with, for convenience, was putting LSI SAS9207-8e HBAs, which have 2 ports each, into the CPU head nodes, then daisy-chaining the DS4246s together with QSFP+ to QSFP+ DACs. We deployed this on Storage Stacking Saturday; while debugging speeds, we tried the direct-attach method above on one of the servers and got to ~4 Gbps per chassis, but didn't find it worth the pure labor of swapping everything out, since some of our head nodes were set up in a way that made them difficult to take out. Since it's reasonably cheap to start with the direct-attach approach and we've tested that it works, you should probably do as we say, not as we did in this case!

- Network cards (NICs). We used Mellanox ConnectX-4 100GbE. Make sure they come in Ethernet mode, not InfiniBand mode, for ease of configuration.
- DAC (Direct Attach Copper) or AOC (Active Optical) cables, to connect the NICs in your head nodes to your switch and therefore the internet. You almost certainly want DACs if your racks are close together, as they are far more compatible with arbitrary networking equipment than AOCs.

We would recommend finding a supplier who will sell you the CPU head nodes with the HBAs and NICs already installed; a number of used datacenter/enterprise parts suppliers are willing to do this. This is a substantial positive because it means you don't have to spend hours installing the HBAs/NICs yourself, and you can have a substantially higher degree of confidence in your operations.

- Serial cables; you'll need these to connect to your switch!
- Optional but recommended: an Ethernet management network of some kind. If you can't easily get Ethernet, we'd recommend getting a WiFi adapter and a small Ethernet switch; it's substantially easier to set up than the 100GbE, is a great backup for when that's not working, and will let you do ~everything over SSH from the comfort of the office instead of in the datacenter.

### Datacenter Requirements

- 3.5 kW of usable power per cabinet, with 10 4U chassis + 1 2U head node per cabinet (cabinets are 42U tall).
- 1 spare cabinet for the 1U or 2U 100GbE switch (you can obviously also just swap out one of the 4U chassis in another cabinet for the switch).
- 1 42U cabinet per 3 PB of storage.
- A dedicated 100G connection (it will probably come in as a fiber pair, likely via QSFP28 LR4, but confirm with your datacenter provider before buying parts!).
- Ideally physically near your office; there is a lot of value in being able to walk over and debug issues instead of e.g. dealing with remote-hands services to get internet to the nodes.

Some setup tips:

- Make sure to properly configure your switch first. Depending on your switch model this should be relatively straightforward: you'll need to physically connect to the switch and then configure the specific port your 100GbE comes in on. (You'll get a fiber cross-connect from your datacenter that you should plug into a QSFP28 transceiver. Make sure you get a transceiver that is compatible in form with the ISP's, probably LR4, and specifically branded for your switch; otherwise it is very unlikely to work.)
- Depending on your ISP, you might have to talk to them to make sure you can get "light" through the fiber from both ends, which might involve rolling the fiber and otherwise verifying it's working properly.
- If your switch isn't working, or you haven't configured one before, we'd suggest directly plugging the fiber cable from the ISP into one of your 10 heap servers, making sure to buy a transceiver compatible with your NIC brand (e.g. Mellanox). Once you get it working there, move over to your switch and get it working.
- Once you can connect to the internet from your switch (simply ping 1.1.1.1 to check), you are ready to set up the netplans for the individual nodes. This is most easily done during the Ubuntu setup process, which will walk you through setting up internet for your CPU head nodes, but it's also doable outside of that.
- Once you have internet access to your nodes and have properly connected one cable to each DS4246, you should format & mount the drives on each node, test that all of them are working properly, and then you're ready to deploy any software you want.
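For that last step, a "check every drive" pass can be as simple as the sketch below. The mount paths and per-node drive count are assumptions (we assume /mnt/drive0 through /mnt/drive23); adjust them to your layout.

```rust
// Probe each mounted drive by writing and reading back a small file.
use std::fs;

fn main() {
    let mut bad = Vec::new();
    for i in 0..24 {
        let probe = format!("/mnt/drive{i}/.probe");
        let ok = fs::write(&probe, b"ok").is_ok()
            && fs::read(&probe).map(|d| d == b"ok").unwrap_or(false);
        let _ = fs::remove_file(&probe); // best-effort cleanup
        if !ok {
            bad.push(i);
        }
    }
    if bad.is_empty() {
        println!("all 24 drives writable");
    } else {
        println!("failed drives: {bad:?}");
    }
}
```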
If you end up building a similar storage cluster based on this writeup, we'd love to hear from you; we're very curious what can be improved, both in our guidance and in the object-level process. You can reach us at [email protected].

If you came away from this post excited about our work, we'd love to chat. We're a research lab currently focused on pretraining models to use computers, with the long-term goal of building general models that can learn in-context and do arbitrary tasks while staying aligned with human values; we're hiring top researchers and engineers to help us train them. If you're interested in chatting, shoot us an email at [email protected].