Norway's 2 petabytes of Huawei flash storage and LLM training

Norway’s National Library is developing a large language model (LLM) that understands the Norwegian language and is using 2 PB of Huawei OceanStor Dorado flash storage in its AI training data pipeline.


Marius Husnes, Head of IT Platform at the library (Nasjonalbiblioteket), discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a Norwegian-language LLM. He asserted that any country with its own language but no sovereign LLM trained in that language was at a disadvantage, because a globally trained, English-speaking LLM would not know about the country’s history, news and culture as described in the local language.

Norway’s Ministry of Culture tasked the National Library with building a sovereign AI (LLM) because the library has the single largest digital collection of Norwegian books, newspapers, web pages and so forth in the country. Like many state libraries, it is entitled to receive a copy of every published book and of broadcast content. Its legal deposit mandate extended beyond books, as it was duty-bound to collect and preserve all of Norway’s cultural heritage.


An agreement with Norwegian newspapers permitted LLM training on copyrighted content and, Husnes said: “No private company has this.”

The library was also well-placed to do this as it had been digitizing its collection since 2005 and had amassed 20 PB of unique data stored in 3-2-1 form (3 copies, 2 media types, 1 off-site), meaning some 60 PB overall. Digitizing the raw text, sound, moving pictures, still images and web content involved extensive OCR scanning and generated a large amount of metadata, as well as APIs for online access.
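The 60 PB figure follows directly from the 3-2-1 scheme. As a minimal sketch of that arithmetic, assuming each of the three copies is a full copy of the 20 PB of unique data (the function and variable names here are illustrative, not from the library):

```python
# Assumption: "3 copies" in the 3-2-1 scheme means three full copies
# of the unique data set, with no deduplication across copies.
UNIQUE_DATA_PB = 20  # unique digitized data, from the article
COPIES = 3           # the "3" in 3-2-1


def total_footprint_pb(unique_pb, copies):
    """Raw capacity needed to hold `copies` full copies of `unique_pb`."""
    return unique_pb * copies


total_pb = total_footprint_pb(UNIQUE_DATA_PB, COPIES)
print(f"{UNIQUE_DATA_PB} PB unique x {COPIES} copies = {total_pb} PB stored")
```

The other two rules (2 media types, 1 off-site copy) govern where the copies live, not how much capacity they consume, so they do not change the total.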

The bulk of the data was deposited in a preservation system, a digital archive built on disk plus tape. Husnes’ task was to get this data to the LLM training system. He said the bottleneck was not compute; it was data quality, cleaning and pipeline throughput. There were two main processing stages. The first was in-house computation, using an Nvidia DGX H200 system, a 384-core CPU cluster and multiple Huawei OceanStor Dorado all-flash arrays totalling 2 PB of flash capacity. The flash arrays provide low-latency storage for the data pipelines and training preparation.


