Last week, Jake Stid, a postdoctoral research associate at Michigan State University, announced Ground-Mounted Solar Energy in the United States (GM-SEUS). This is a 15K-array, 2.9M-panel dataset of utility and commercial-grade solar farms across the lower 48 states plus the District of Columbia. The dataset was constructed by a team of researchers including alumni from NOAA, NASA and the USGS.

Below is a heatmap of the assets catalogued in this dataset.

GM-SEUS is broken up into two datasets, one for arrays and another for panels. Below you can see a solar farm with the array outlined in red and its panels shaded in purple.

In this post, I'll explore GM-SEUS' solar farm data.

My Workstation

I'm using a 5.7 GHz AMD Ryzen 9 9950X CPU. It has 16 cores and 32 threads and 1.2 MB of L1, 16 MB of L2 and 64 MB of L3 cache. It has a liquid cooler attached and is housed in a spacious, full-sized Cooler Master HAF 700 computer case.

The system has 96 GB of DDR5 RAM clocked at 4,800 MT/s and a 5th-generation, Crucial T700 4 TB NVMe M.2 SSD which can read at speeds up to 12,400 MB/s. There is a heatsink on the SSD to help keep its temperature down. This is my system's C drive.

The system is powered by a 1,200-watt, fully modular Corsair Power Supply and is sat on an ASRock X870E Nova 90 Motherboard.

I'm running Ubuntu 24 LTS via Microsoft's Ubuntu for Windows on Windows 11 Pro. In case you're wondering why I don't run a Linux-based desktop as my primary work environment, I'm still using an Nvidia GTX 1080 GPU which has better driver support on Windows, and ArcGIS Pro only supports Windows natively.

Installing Prerequisites

I'll use GDAL 3.9.3 and a few other tools to help analyse the data in this post.

$ sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
$ sudo apt update
$ sudo apt install \
    gdal-bin \
    jq

I'll use DuckDB v1.4.1, along with its H3, JSON, Lindel, Parquet and Spatial extensions, in this post.

$ cd ~
$ wget -c https://github.com/duckdb/duckdb/releases/download/v1.4.1/duckdb_cli-linux-amd64.zip
$ unzip -j duckdb_cli-linux-amd64.zip
$ chmod +x duckdb

$ ~/duckdb

INSTALL h3 FROM community;
INSTALL lindel FROM community;
INSTALL json;
INSTALL parquet;
INSTALL spatial;

I'll set up DuckDB to load every installed extension each time it launches.

$ vi ~/.duckdbrc

.timer on
.width 180

LOAD h3;
LOAD lindel;
LOAD json;
LOAD parquet;
LOAD spatial;

The maps in this post were mostly rendered with QGIS version 3.44. QGIS is a desktop application that runs on Windows, macOS and Linux. The application has grown in popularity in recent years and sees ~15M application launches from users all around the world each month.

I used QGIS' Tile+ plugin to add basemaps from Esri to the maps in this post.

Analysis-Ready Data

I'll download a dataset containing the US Census state codes. This will let me map the state IDs in the arrays dataset to their state names.

$ wget https://gist.github.com/a8dx/2340f9527af64f8ef8439366de981168/raw/81d876daea10eab5c2675811c39bcd18a79a9212/US_State_Bounding_Boxes.csv

I'll download the ZIP file of deliverables for GM-SEUS.

$ wget -O GMSEUS_v1_0.zip \
    'https://zenodo.org/records/14827819/files/GMSEUS_v1_0.zip?download=1'
$ unzip GMSEUS_v1_0.zip
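If you want a quick look at what's inside the GeoPackages before converting them, GDAL's ogrinfo summary mode lists each layer's feature count, extent, spatial reference system and field definitions. This is just an optional sanity check and assumes the paths below match the extracted archive.

$ ogrinfo -so -al GMSEUS_v1_0/GPKG/GMSEUS_Arrays_Final.gpkg
$ ogrinfo -so -al GMSEUS_v1_0/GPKG/GMSEUS_Panels_Final.gpkg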
I'll extract the projection used. This proj4 string will be used below to re-project the data into EPSG:4326.

$ gdalsrsinfo \
    -o proj4 \
    GMSEUS_v1_0/GPKG/GMSEUS_Arrays_Final.gpkg

+proj=aea +lat_0=37.5 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs

I'll use DuckDB to clean up the values and produce both a geometry field and a bounding box for each feature in this dataset. This will make working with this dataset remotely, such as from AWS S3, much easier.

$ ~/duckdb

The following produces a ZStandard-compressed, spatially-sorted Parquet file of the arrays dataset. I dropped the Z dimension as it was unused and the unknown values (-9999) have been turned into NULLs. The original GPKG file was 108 MB and the resulting Parquet file is 37 MB.

COPY (
    WITH a AS (
        SELECT * EXCLUDE (geom),
               ST_FORCE2D(
                  ST_FLIPCOORDINATES(
                     ST_TRANSFORM(
                        geom,
                        '+proj=aea +lat_0=37.5 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs',
                        'EPSG:4326'))) geometry
        FROM ST_READ('GMSEUS_v1_0/GPKG/GMSEUS_Arrays_Final.gpkg')
    )
    SELECT a.* EXCLUDE (geometry,
                        tilt,
                        tiltEst,
                        instYr,
                        instYrLT,
                        effInit,
                        avgAzimuth,
                        avgLength,
                        avgSpace,
                        avgWidth),

           {'xmin': ST_XMIN(ST_EXTENT(geometry)),
            'ymin': ST_YMIN(ST_EXTENT(geometry)),
            'xmax': ST_XMAX(ST_EXTENT(geometry)),
            'ymax': ST_YMAX(ST_EXTENT(geometry))} AS bbox,

           ST_ASWKB(geometry) geometry,

           CASE WHEN instYr::INT     = -9999 THEN NULL ELSE instYr::INT   END AS instYr,
           CASE WHEN instYrLT::INT   = -9999 THEN NULL ELSE instYrLT::INT END AS instYrLT,
           CASE WHEN numRow::INT     = -9999 THEN NULL ELSE numRow::INT   END AS numRow,
           CASE WHEN tilt::INT       = -9999 THEN NULL ELSE tilt::INT     END AS tilt,
           CASE WHEN tiltEst::INT    = -9999 THEN NULL ELSE tiltEst::INT  END AS tiltEst,
           CASE WHEN effInit::INT    = -9999 THEN NULL ELSE effInit       END AS effInit,
           CASE WHEN avgAzimuth::INT = -9999 THEN NULL ELSE avgAzimuth    END AS avgAzimuth,
           CASE WHEN avgLength::INT  = -9999 THEN NULL ELSE avgLength     END AS avgLength,
           CASE WHEN avgSpace::INT   = -9999 THEN NULL ELSE avgSpace      END AS avgSpace,
           CASE WHEN avgWidth::INT   = -9999 THEN NULL ELSE avgWidth      END AS avgWidth,

           b.NAME state_name
    FROM a
    JOIN 'US_State_Bounding_Boxes.csv' b ON a.STATEFP = b.STATEFP
    ORDER BY HILBERT_ENCODE([ST_Y(ST_CENTROID(geometry)),
                             ST_X(ST_CENTROID(geometry))]::double[2])
) TO 'arrays.parquet' (
    FORMAT            'PARQUET',
    CODEC             'ZSTD',
    COMPRESSION_LEVEL 22,
    ROW_GROUP_SIZE    15000);

The original GPKG file for the panels dataset was 1.1 GB and the resulting Parquet file is 334 MB.

COPY (
    WITH a AS (
        SELECT * EXCLUDE (geom),
               ST_FORCE2D(
                  ST_FLIPCOORDINATES(
                     ST_TRANSFORM(
                        geom,
                        '+proj=aea +lat_0=37.5 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs',
                        'EPSG:4326'))) geometry
        FROM ST_READ('GMSEUS_v1_0/GPKG/GMSEUS_Panels_Final.gpkg')
    )
    SELECT * EXCLUDE (geometry, rowSpace),

           {'xmin': ST_XMIN(ST_EXTENT(geometry)),
            'ymin': ST_YMIN(ST_EXTENT(geometry)),
            'xmax': ST_XMAX(ST_EXTENT(geometry)),
            'ymax': ST_YMAX(ST_EXTENT(geometry))} AS bbox,

           ST_ASWKB(geometry) geometry,

           CASE WHEN rowSpace::INT = -9999 THEN NULL ELSE rowSpace END AS rowSpace
    FROM a
    ORDER BY HILBERT_ENCODE([ST_Y(ST_CENTROID(geometry)),
                             ST_X(ST_CENTROID(geometry))]::double[2])
) TO 'panels.parquet' (
    FORMAT            'PARQUET',
    CODEC             'ZSTD',
    COMPRESSION_LEVEL 22,
    ROW_GROUP_SIZE    15000);
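The bbox column, the Hilbert ordering and the 15,000-row row groups are what make remote access practical: a spatial filter on the bbox fields lets DuckDB skip row groups whose Parquet min/max statistics fall outside the area of interest, so only a fraction of the file needs to be fetched. Below is a sketch of such a filter against a copy of arrays.parquet hosted on S3. The bucket name is hypothetical, the httpfs extension needs to be installed and loaded, and any credentials the bucket requires must be configured first.

-- The S3 location below is a placeholder; point it at your own copy.
INSTALL httpfs;
LOAD httpfs;

SELECT COUNT(*)
FROM   READ_PARQUET('s3://example-bucket/arrays.parquet')
WHERE  bbox.xmin > -80.5 AND bbox.xmax < -79.5
AND    bbox.ymin >  32.5 AND bbox.ymax <  33.5;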
$ echo "SELECT * EXCLUDE(bbox, geometry), bbox::JSON bbox FROM 'arrays.parquet' LIMIT 1" \ | ~/duckdb -json \ | jq -S . [ { "COUNTYFP" : "019" , "GCR1" : 0.6996 , "GCR2" : 0.614 , "STATEFP" : "45" , "Source" : "OSM" , "arrayID" : 2807 , "avgAzimuth" : 170.63 , "avgLength" : 47.76166666666666 , "avgSpace" : 3.003333333333333 , "avgWidth" : 4.776666666666666 , "bbox" : { "xmax" : -79.97229830431786 , "xmin" : -79.97325770533094 , "ymax" : 32.87833627192598 , "ymin" : 32.87808294640646 }, "capMW" : 0.246 , "capMWest" : 0.246 , "effInit" : 0.197963503102977 , "instYr" : 2021 , "instYrLT" : 2021 , "latitude" : 32.87818725544087 , "longitude" : -79.97276617375104 , "modType" : "c-si" , "mount" : "fixed_axis" , "nativeID" : "9324" , "newBound" : 1 , "numRow" : 6.0 , "numRow_1" : 6 , "state_name" : "South Carolina" , "tilt" : 30 , "tiltEst" : 30 , "totArea" : 1779.0 , "totRowArea" : 1244.93 , "version" : "v1.0" } ] Below are the field names, data types, percentages of NULLs per column, number of unique values and minimum and maximum values for each column. $ ~/duckdb SELECT column_name , column_type , null_percentage , approx_unique , min , max FROM ( SUMMARIZE FROM READ_PARQUET ( 'arrays.parquet' )) WHERE column_name != 'geometry' AND column_name != 'bbox' ORDER BY 1 ; ┌─────────────┬─────────────┬─────────────────┬───────────────┬─────────────────────┬────────────────────┐ │ column_name │ column_type │ null_percentage │ approx_unique │ min │ max │ │ varchar │ varchar │ decimal(9,2) │ int64 │ varchar │ varchar │ ├─────────────┼─────────────┼─────────────────┼───────────────┼─────────────────────┼────────────────────┤ │ COUNTYFP │ VARCHAR │ 0.00 │ 235 │ 001 │ 810 │ │ GCR1 │ DOUBLE │ 0.00 │ 5057 │ 0.1047 │ 1.0 │ │ GCR2 │ DOUBLE │ 0.00 │ 5013 │ 0.1245 │ 0.988 │ │ STATEFP │ VARCHAR │ 0.00 │ 49 │ 01 │ 56 │ │ Source │ VARCHAR │ 0.00 │ 6 │ CCVPV │ USPVDB │ │ arrayID │ BIGINT │ 0.00 │ 13155 │ 1 │ 15017 │ │ avgAzimuth │ DOUBLE │ 32.84 │ 4295 │ 25.0 │ 269.27 │ │ avgLength │ DOUBLE │ 39.79 │ 8358 │ 4.02 │ 449.5004 │ │ avgSpace │ DOUBLE │ 39.79 │ 8623 │ 0.024 │ 20.0 │ │ avgWidth │ DOUBLE │ 39.79 │ 9185 │ 0.67625 │ 29.80222222222222 │ │ capMW │ DOUBLE │ 0.00 │ 5280 │ 0.001250225184651 │ 1051.703 │ │ capMWest │ DOUBLE │ 0.00 │ 7863 │ 0.004 │ 3170.1 │ │ effInit │ DOUBLE │ 0.49 │ 39 │ 0.132210289727273 │ 0.205484167047619 │ │ instYr │ INTEGER │ 0.00 │ 24 │ 1985 │ 2024 │ │ instYrLT │ INTEGER │ 0.24 │ 17 │ 2009 │ 2023 │ │ latitude │ DOUBLE │ 0.00 │ 16986 │ 25.53796582594631 │ 48.99547137225406 │ │ longitude │ DOUBLE │ 0.00 │ 15656 │ -124.10440474967092 │ -67.15066374183608 │ │ modType │ VARCHAR │ 0.00 │ 3 │ c-si │ thin-film │ │ mount │ VARCHAR │ 0.00 │ 10 │ dual_axis │ unknown │ │ nativeID │ VARCHAR │ 0.00 │ 15141 │ 1 │ York Solar │ │ newBound │ BIGINT │ 0.00 │ 2 │ 0 │ 1 │ │ numRow │ DOUBLE │ 0.00 │ 1461 │ 0.0 │ 56782.0 │ │ numRow_1 │ INTEGER │ 0.00 │ 1117 │ 0 │ 56782 │ │ state_name │ VARCHAR │ 0.00 │ 57 │ Alabama │ Wyoming │ │ tilt │ INTEGER │ 55.46 │ 47 │ 0 │ 83 │ │ tiltEst │ INTEGER │ 55.46 │ 30 │ 10 │ 43 │ │ totArea │ DOUBLE │ 0.00 │ 13182 │ 54.0 │ 13735113.0 │ │ totRowArea │ DOUBLE │ 0.00 │ 15396 │ 44.97 │ 7223924.662 │ │ version │ VARCHAR │ 0.00 │ 1 │ v1.0 │ v1.0 │ ├─────────────┴─────────────┴─────────────────┴───────────────┴─────────────────────┴────────────────────┤ │ 29 rows 6 columns │ └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ I'll generate a heatmap of the asset locations in this dataset. 
I'll generate a heatmap of the asset locations in this dataset.

CREATE OR REPLACE TABLE h3_4_stats AS
    SELECT H3_LATLNG_TO_CELL(bbox.ymin, bbox.xmin, 4) AS h3_4,
           COUNT(*) num_buildings
    FROM   READ_PARQUET('arrays.parquet')
    WHERE  bbox.xmin BETWEEN -178.5 AND 178.5
    GROUP BY 1;

COPY (
    SELECT ST_ASWKB(H3_CELL_TO_BOUNDARY_WKT(h3_4)::geometry) geometry,
           num_buildings
    FROM   h3_4_stats
) TO 'h3_4_stats.gpkg'
    WITH (FORMAT GDAL,
          DRIVER 'GPKG',
          LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');

Normally I would produce a Parquet file as, even with tens of thousands of records, it generates in seconds versus a minute or so for a GPKG file. But ArcGIS Pro 3.5 didn't want to open the Parquet file I generated. QGIS 3.44 was fine with it but I wanted to use Esri's Nova basemap for the rendering below. ArcGIS Pro 3.6 should be released sometime in the next few weeks so I'll re-examine this issue when it's out.

Below is the relationship between the sources of data and the installation year.

PIVOT    'arrays.parquet'
ON       Source
USING    COUNT(*)
GROUP BY instYr
ORDER BY instYr;

┌────────┬───────┬───────┬───────────────┬───────┬───────┬────────┐
│ instYr │ CCVPV │ CWSD  │ GMSEUSgeorect │ OSM   │ SAM   │ USPVDB │
│ int32  │ int64 │ int64 │ int64         │ int64 │ int64 │ int64  │
├────────┼───────┼───────┼───────────────┼───────┼───────┼────────┤
│   1985 │     0 │     0 │             0 │     0 │     0 │      1 │
│   1986 │     0 │     0 │             0 │     1 │     0 │      0 │
│   2002 │     0 │     0 │             0 │     0 │     0 │      1 │
│   2005 │     0 │     0 │             0 │    26 │     0 │      0 │
│   2006 │     0 │     0 │             0 │     2 │     0 │      1 │
│   2007 │     0 │     0 │             0 │    44 │     0 │      5 │
│   2008 │     0 │     0 │             0 │    58 │     1 │     11 │
│   2009 │     5 │     0 │             0 │    10 │     5 │     19 │
│   2010 │    20 │     0 │             0 │    71 │    20 │     37 │
│   2011 │    24 │     0 │             2 │   193 │    30 │    102 │
│   2012 │    59 │     0 │             2 │   267 │    88 │    157 │
│   2013 │    83 │     0 │             3 │   259 │    82 │    209 │
│   2014 │   102 │     0 │             1 │   335 │   119 │    291 │
│   2015 │   107 │     3 │             0 │   532 │   125 │    320 │
│   2016 │   145 │     1 │             2 │   564 │   170 │    412 │
│   2017 │   135 │     0 │             1 │   661 │   167 │    476 │
│   2018 │    66 │    34 │             4 │   644 │   210 │    414 │
│   2019 │    28 │    39 │             6 │   467 │   178 │    453 │
│   2020 │    10 │    75 │             1 │   437 │   186 │    496 │
│   2021 │     5 │    33 │             6 │   406 │   241 │    446 │
│   2022 │     1 │   173 │             3 │   231 │   354 │    166 │
│   2023 │     0 │     0 │             3 │   176 │   722 │    134 │
│   2024 │     0 │     0 │             0 │    31 │  1571 │      0 │
├────────┴───────┴───────┴───────────────┴───────┴───────┴────────┤
│ 23 rows                                                7 columns │
└─────────────────────────────────────────────────────────────────┘

Below is the relationship between the mount type and the module type.

PIVOT    'arrays.parquet'
ON       modType
USING    COUNT(*)
GROUP BY mount
ORDER BY mount;

┌─────────────┬───────┬───────┬───────────┐
│ mount       │ c-si  │ csp   │ thin-film │
│ varchar     │ int64 │ int64 │ int64     │
├─────────────┼───────┼───────┼───────────┤
│ dual_axis   │   301 │    18 │         1 │
│ fixed_axis  │  6057 │    32 │       208 │
│ mixed       │     2 │     0 │         0 │
│ mixed_df    │   189 │     7 │         0 │
│ mixed_dfs   │    94 │     0 │         0 │
│ mixed_ds    │    38 │     1 │         0 │
│ mixed_fs    │    60 │     0 │         1 │
│ single_axis │  2876 │    11 │       231 │
│ unknown     │  4885 │     5 │         0 │
└─────────────┴───────┴───────┴───────────┘
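DuckDB's PIVOT statement also accepts an IN list, which limits the values that become columns. If you only care about how the three main mount types have trended over time, a sketch along the same lines as the pivots above would be:

PIVOT    'arrays.parquet'
ON       mount IN ('fixed_axis', 'single_axis', 'dual_axis')
USING    COUNT(*)
GROUP BY instYr
ORDER BY instYr;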
Below are the array counts with their capacities rounded to the nearest 100 MW and broken down by source.

WITH a AS (
    SELECT Source,
           ROUND(capMW / 100) * 100 AS capacity,
           COUNT(*) num_recs
    FROM   'arrays.parquet'
    GROUP BY 1, 2
)
PIVOT    a
ON       Source
USING    SUM(num_recs)
GROUP BY capacity
ORDER BY capacity;

┌──────────┬────────┬────────┬───────────────┬────────┬────────┬────────┐
│ capacity │ CCVPV  │ CWSD   │ GMSEUSgeorect │ OSM    │ SAM    │ USPVDB │
│ double   │ int128 │ int128 │ int128        │ int128 │ int128 │ int128 │
├──────────┼────────┼────────┼───────────────┼────────┼────────┼────────┤
│      0.0 │    790 │    356 │            33 │   5295 │   4022 │   3669 │
│    100.0 │   NULL │      2 │          NULL │     67 │    143 │    350 │
│    200.0 │   NULL │   NULL │             1 │     22 │     49 │     73 │
│    300.0 │   NULL │   NULL │          NULL │     17 │     21 │     49 │
│    400.0 │   NULL │   NULL │          NULL │      6 │     13 │      7 │
│    500.0 │   NULL │   NULL │          NULL │      4 │     11 │      2 │
│    600.0 │   NULL │   NULL │          NULL │      2 │      3 │   NULL │
│    700.0 │   NULL │   NULL │          NULL │      1 │      3 │   NULL │
│    800.0 │   NULL │   NULL │          NULL │   NULL │      2 │      1 │
│    900.0 │   NULL │   NULL │          NULL │   NULL │      1 │   NULL │
│   1000.0 │   NULL │   NULL │          NULL │   NULL │      1 │   NULL │
│   1100.0 │   NULL │   NULL │          NULL │      1 │   NULL │   NULL │
├──────────┴────────┴────────┴───────────────┴────────┴────────┴────────┤
│ 12 rows                                                     7 columns │
└───────────────────────────────────────────────────────────────────────┘
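To wrap up, the state_name field joined in earlier makes per-state roll-ups straightforward. The sketch below sums the estimated capacity for the ten states with the most capacity; I haven't included its output here.

SELECT   state_name,
         COUNT(*)                AS num_arrays,
         ROUND(SUM(capMWest), 1) AS total_cap_mw_est
FROM     READ_PARQUET('arrays.parquet')
GROUP BY 1
ORDER BY total_cap_mw_est DESC
LIMIT    10;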