It is incorrect to "normalize" // in HTTP URL paths

Why This Matters

This article clarifies that collapsing double slashes '//' into a single slash '/' in HTTP URL paths is not considered normalization according to RFC 3986. Recognizing that empty path segments and double slashes are syntactically valid is crucial for proper URL handling, security, and resource identification in web development and API design.

Key Takeaways

(See discussion on Lobsters.)

Collapsing // to / inside an HTTP URL path is not normalization.

The URI syntax permits empty path segments

RFC 3986 defines the path component and the segment grammar in a way that allows for empty segments. A double slash is therefore syntactically meaningful. It represents a zero-length segment between two separators.

3.3. Path

The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component (Section 3.4), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any). The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI.

If a URI contains an authority component, then the path component must either be empty or begin with a slash ("/") character. If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//"). In addition, a URI reference (Section 4.1) may be a relative-path reference, in which case the first path segment cannot contain a colon (":") character. The ABNF requires five separate rules to disambiguate these cases, only one of which will match the path substring within a given URI reference. We use the generic term “path component” to describe the URI substring matched by the parser to one of these rules.

    path          = path-abempty    ; begins with "/" or is empty
                  / path-absolute   ; begins with "/" but not "//"
                  / path-noscheme   ; begins with a non-colon segment
                  / path-rootless   ; begins with a segment
                  / path-empty      ; zero characters

    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    path-empty    = 0<pchar>

    segment       = *pchar
    segment-nz    = 1*pchar
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                  ; non-zero-length segment without any colon ":"

    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

A path consists of a sequence of path segments separated by a slash ("/") character. A path is always defined for a URI, though the defined path may be empty (zero length). Use of the slash character to indicate hierarchy is only required when a URI will be used as the context for relative references. For example, the URI mailto:fred@example.com has a path of “fred@example.com”, whereas the URI foo://info.example.com?fred has an empty path.

The path segments “.” and “..”, also known as dot-segments, are defined for relative reference within the path name hierarchy. They are intended for use at the beginning of a relative-path reference (Section 4.2) to indicate relative position within the hierarchical tree of names. This is similar to their role within some operating systems’ file directory structures to indicate the current directory and parent directory, respectively. However, unlike in a file system, these dot-segments are only interpreted within the URI path hierarchy and are removed as part of the resolution process (Section 5.2).

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax.

Because segment = *pchar, the empty string is a valid segment. Therefore, path-abempty = *( "/" segment ) allows a slash followed by an empty segment. Any transformation that collapses // to / removes a syntactically valid segment and thus changes the parsed sequence of segments.
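The effect on the segment sequence can be seen in a minimal Python sketch. The helper names path_segments and collapse_slashes are hypothetical and exist only for this illustration; the splitting rule follows path-abempty = *( "/" segment ), where every slash introduces exactly one (possibly empty) segment.

```python
import re


def path_segments(path: str) -> list[str]:
    """Split a path-abempty string into segments, keeping empty ones.

    Hypothetical helper for illustration; it does not validate pchar.
    """
    if path == "":
        return []  # zero repetitions of "/" segment
    if not path.startswith("/"):
        raise ValueError("path-abempty must be empty or begin with '/'")
    # Every "/" introduces one segment, so drop the "" before the first slash.
    return path.split("/")[1:]


def collapse_slashes(path: str) -> str:
    """The transformation under discussion: runs of '/' become a single '/'."""
    return re.sub(r"/{2,}", "/", path)


original = "/a//b"
collapsed = collapse_slashes(original)

print(path_segments(original))   # ['a', '', 'b']
print(path_segments(collapsed))  # ['a', 'b']
```

The two lists differ in length, so the collapsed path is a different sequence of segments rather than a different spelling of the same one.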

HTTP uses RFC 3986 path grammar

HTTP (RFC 9110) uses the RFC 3986 path grammar for request targets.

4.1. URI References

URI references are used to target requests, indicate redirects, and define relationships.

The definitions of “URI-reference”, “absolute-URI”, “relative-part”, “authority”, “port”, “host”, “path-abempty”, “segment”, and “query” are adopted from the URI generic syntax. An “absolute-path” rule is defined for protocol elements that can contain a non-empty path component. (This rule differs slightly from the path-abempty rule of RFC 3986, which allows for an empty path, and path-absolute rule, which does not allow paths that begin with “//”.) A “partial-URI” rule is defined for protocol elements that can contain a relative URI but not a fragment component.

    URI-reference = <URI-reference, see [URI], Section 4.1>
    absolute-URI  = <absolute-URI, see [URI], Section 4.3>
    relative-part = <relative-part, see [URI], Section 4.2>
    authority     = <authority, see [URI], Section 3.2>
    uri-host      = <host, see [URI], Section 3.2.2>
    port          = <port, see [URI], Section 3.2.3>
    path-abempty  = <path-abempty, see [URI], Section 3.3>
    segment       = <segment, see [URI], Section 3.3>
    query         = <query, see [URI], Section 3.4>

    absolute-path = 1*( "/" segment )
    partial-URI   = relative-part [ "?" query ]
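The difference between absolute-path, path-abempty, and a path containing "//" can be checked mechanically. The following is a rough sketch that transcribes the two rules into Python regular expressions. The character sets for unreserved, pct-encoded, and sub-delims are not quoted above; they are taken from RFC 3986 Sections 2.1 to 2.3, and the constant names are assumptions made for this example.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# unreserved and sub-delims are folded into one character class;
# pct-encoded is "%" followed by two hex digits (RFC 3986, Sections 2.1-2.3).
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"

# segment = *pchar  (zero or more, so the empty segment is allowed)
SEGMENT = rf"{PCHAR}*"

# path-abempty  = *( "/" segment )   (RFC 3986)
PATH_ABEMPTY = re.compile(rf"(?:/{SEGMENT})*")

# absolute-path = 1*( "/" segment )  (RFC 9110)
ABSOLUTE_PATH = re.compile(rf"(?:/{SEGMENT})+")

for candidate in ["", "/", "/a/b", "/a//b", "//a", "/a b"]:
    print(
        repr(candidate),
        bool(PATH_ABEMPTY.fullmatch(candidate)),
        bool(ABSOLUTE_PATH.fullmatch(candidate)),
    )
```

Under these assumptions, "/a//b" and "//a" match both rules, the empty string matches only path-abempty, and "/a b" matches neither because a raw space is not a pchar.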

4.2.1. http URI Scheme

    http-URI = "http" "://" authority path-abempty [ "?" query ]

The origin server for an “http” URI is identified by the authority component, which includes a host identifier ([URI], Section 3.2.2) and optional port number ([URI], Section 3.2.3). If the port subcomponent is empty or not given, TCP port 80 (the reserved port for WWW services) is the default.

The hierarchical path component and optional query component identify the target resource within that origin server’s namespace.
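Since the path component is part of what identifies the target resource, two http URIs that differ only by an empty segment have different paths. A short sketch using urlsplit from Python's standard urllib.parse module, which splits a URL into components without rewriting the path, illustrates this. The host example.com, the sample paths, and the helper http_path_segments are assumptions made for the example.

```python
from urllib.parse import urlsplit


def http_path_segments(url: str) -> list[str]:
    """Return the path segments of an http URI, keeping empty ones.

    Hypothetical helper; it only handles the "http" scheme quoted above.
    """
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("expected an http URI")
    # path-abempty: either empty or a sequence of "/" segment pairs.
    return parts.path.split("/")[1:] if parts.path else []


single = "http://example.com/a/b?x=1"
double = "http://example.com/a//b?x=1"

print(urlsplit(double).path)       # /a//b
print(http_path_segments(single))  # ['a', 'b']
print(http_path_segments(double))  # ['a', '', 'b']
```

Whether an origin server chooses to serve the same resource for both URIs is its own decision; the sketch only shows that the paths, as parsed, are not the same.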
