A somewhat recurring problem I encounter in things I work on is the need to compute simplified geographic polygons, or more specifically, simplified hulls of geographic polygons. Here’s an overview of the currently used approach; maybe someone has pointers to better algorithms for this.
Coverage polygons
Geographic polygons are used in a few different places:
The Transport API Repository and consumers of it like KPublicTransport use them for describing areas covered by a public transport routing service, to automatically pick a suitable one at a given location.
The Emergency and Weather Aggregation Service uses them to describe areas affected by an emergency. Alerts that don’t carry explicit geometry but refer to an external dataset using a geographic code are particularly affected here, as those datasets (e.g. derived from OSM boundaries) are used for many different purposes and are thus often extremely detailed.
Meter-resolution geometry isn’t needed for any of those use cases; hundreds of meters or even a kilometer are more than sufficient. And using a higher resolution does come at a cost: larger geometry needs to be stored and transferred, and rendering or intersection tests become significantly more expensive to compute.
Simplified coverage
Applying generic geometry simplification algorithms here isn’t ideal, as we would want a result that still covers the original geometry, i.e. a simplified hull. In the above use cases covering more area is merely inefficient or maybe mildly annoying, while covering a smaller area can be catastrophic (e.g. someone potentially affected by an emergency not getting alerted).
A bounding box is the extreme case fulfilling that requirement while minimizing the resulting geometry, but we are obviously looking for a more reasonable trade-off between additionally covered area and geometry complexity.
Also, not all additionally covered area is equal here: covering extra area on land potentially impacts more people than covering extra area at sea. Or even more generally, additionally covered area hurts more the more densely it is populated, with the sea just being the extreme (although not uncommon) case. None of the below takes this into consideration though, but it’s one aspect where I’d expect some room for improving the result.
Douglas-Peucker
A common way to simplify polylines and polygons is the Douglas-Peucker algorithm. It checks whether the maximum distance of any intermediate point to the straight line connecting the first and last point is below a given threshold. If so, all intermediate points can be discarded; otherwise the procedure is repeated recursively for the two parts obtained by splitting at the point with the maximum distance.
This is fairly easy to implement, and we have multiple copies of this around in different languages and for different polygon or polyline data structures, e.g. in Marble, KPublicTransport, KPublicAlerts, FOSS Public Alert Server and Transport API Repository.
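For illustration, here’s a minimal standalone Python sketch of that recursion (not one of the implementations linked above):

```python
import math

def perpendicular_distance(p, a, b):
    """Distance of point p to the line through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # twice the area of the triangle (a, b, p), divided by the base length
    return abs(dx * (a[1] - p[1]) - dy * (a[0] - p[0])) / length

def douglas_peucker(points, threshold):
    """Simplify a polyline given as a list of (x, y) tuples."""
    if len(points) < 3:
        return list(points)
    # find the intermediate point furthest away from the line
    # between the first and the last point
    dists = [perpendicular_distance(p, points[0], points[-1])
             for p in points[1:-1]]
    split = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[split - 1] <= threshold:
        # everything in between is close enough to the line, drop it
        return [points[0], points[-1]]
    # otherwise split at the furthest point and recurse into both halves
    left = douglas_peucker(points[:split + 1], threshold)
    right = douglas_peucker(points[split:], threshold)
    return left[:-1] + right
```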
For sufficiently small thresholds and relatively “smooth” geometries this works quite well, although on its own it doesn’t guarantee the result is a hull of the input.
Results however deteriorate when the threshold is in the same order of magnitude as the features in the geometry, say a one-kilometer threshold on a geometry containing a highly detailed fjord-like coastline.
2.6MB GeoJSON input with highly detailed, complex coastlines.
Douglas-Peucker result with a threshold of ~1km (0.5% of the overall area width).
And if you push this further, it can also result in self-intersecting polygons, which is something we generally want to avoid here.
Polygon offsetting
To deal with the fact that Douglas-Peucker doesn’t produce a hull, we could offset or buffer the polygon, that is, enlarge it by moving all points to the “outside” by a given amount.
Implementing that is a bit more involved, so this is usually done using the Clipper2 library, which seems to be somewhat of a reference implementation for some of the more complex polygon manipulation algorithms.
There are a few small surprises in there, such as it working with integer coordinates and typos in its API (“childs”), but beyond that it’s reasonably easy to use both from C++ and Python (examples in Marble, KPublicTransport, FOSS Public Alert Server and Transport API Repository).
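For a rough impression of what the operation looks like in code, here’s a sketch using shapely’s GEOS-based buffering instead of Clipper2 (shorter to show, but the idea is the same; with Clipper2 itself this corresponds to its InflatePaths function):

```python
from shapely.geometry import Polygon

# a hypothetical polygon in a projected, meter-based coordinate system
poly = Polygon([(0, 0), (1000, 0), (1000, 1000), (0, 1000)])

# enlarge it by 250m to the outside; corners get rounded off by default
buffered = poly.buffer(250)

# a negative distance shrinks the polygon instead
shrunk = poly.buffer(-250)
```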
Simplifying by offsetting
These two steps can also be combined in another way to obtain a simplified hull (a code sketch of the full pipeline follows after the steps):
Offset the polygon first, by a value large enough that unwanted details get merged together. E.g. narrow concave areas such as fjords disappear when offset by half their width, islands get “swallowed” by nearby land, etc.
Apply Douglas-Peucker, with a threshold below the offset size. The result is then still a hull of the original geometry, and since the previous step removed small details that would interfere, we get a “smoother” result.
Apply a negative offset, to shrink the polygon closer to its original size again. The maximum value to use for that is the initial offset minus half the Douglas-Peucker threshold.
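Put together, a sketch of that pipeline (again using shapely for brevity, assuming projected meter-based coordinates; the parameter values in the usage example are made up):

```python
from shapely.geometry import Polygon

def simplified_hull(poly: Polygon, offset: float, threshold: float) -> Polygon:
    """Offset, then Douglas-Peucker, then negative offset, as described above.

    Assumes projected (meter-based) coordinates and offset > threshold.
    """
    # 1. inflate, so that narrow concave details like fjords merge away
    grown = poly.buffer(offset)
    # 2. preserve_topology=False applies plain Douglas-Peucker simplification
    simplified = grown.simplify(threshold, preserve_topology=False)
    # 3. shrink again, keeping enough margin for the result to stay a hull
    return simplified.buffer(-(offset - threshold / 2))

# e.g. merge away concave details narrower than ~1km, then simplify at 250m:
# hull = simplified_hull(poly, offset=500.0, threshold=250.0)
```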
This has proven to be very effective for polygons with highly detailed, relatively small concave features, such as land polygons with a fjord-heavy coastline. And as a welcome side effect it often also automatically fixes self-intersecting geometry in the input data.
Simplified hull, reduced to just 17kB.
It’s not quite as effective on the inverse though, i.e. geometry with fine-grained convex features, such as the sea polygon in the above example.
Truncating numbers
There’s a technical simplification orthogonal to the above that is also worth mentioning when using GeoJSON or other textual formats such as CAP. Printed out naively, floating point numbers easily end up with 12 or 13 decimals. That’s enough for sub-millimeter resolution, which is pointless when working with kilometer-scale features.
It’s therefore also useful to implement adaptive rounding and truncation of the coordinates based on the size of the overall geometry. In our use cases this usually results in just four to six decimals, a two- to three-fold reduction in (textual) size (example).
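For illustration, a hypothetical helper deriving the number of decimals from the overall geometry extent could look like this (the 1e-4 relative precision is a made-up value here, not necessarily what the linked code uses):

```python
import math

def adaptive_decimals(extent_degrees: float, relative_precision: float = 1e-4) -> int:
    """Number of decimals needed to keep the rounding error below
    relative_precision times the geometry's overall extent."""
    step = extent_degrees * relative_precision
    return max(0, math.ceil(-math.log10(step)))

def round_ring(ring, decimals):
    """Round all (lon, lat) coordinates of a polygon ring."""
    return [(round(lon, decimals), round(lat, decimals)) for lon, lat in ring]

# e.g. a country-sized geometry spanning ~5 degrees -> 4 decimals
# rounded = round_ring(ring, adaptive_decimals(5.0))
```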
This only helps with storage and transfer volume of course; the computational complexity of processing the geometry doesn’t change.
Can this be done better?
As this isn’t really such an uncommon problem, are there better ways to do this? I.e. are there ways to achieve higher-quality geometry at the same size, or ways to further reduce the size without compromising quality? Are there other algorithms and approaches worth trying?