Goals
The URL standard takes the following approach towards making URLs fully interoperable:
-
Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process. (E.g. spaces, other "illegal" code points, query encoding, equality, canonicalization, are all concepts not entirely shared, or defined.) URL parsing needs to become as solid as HTML parsing. [RFC3986] [RFC3987]
-
Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest.
-
Supplanting Origin of a URI [sic]. [RFC6454]
-
Define URL’s existing JavaScript API in full detail and add enhancements to make it easier to work with. Add a new URL object as well for URL manipulation without usage of HTML elements. (Useful for JavaScript worker environments.)
As the editors learn more about the subject matter the goals might increase in scope somewhat.
1. URLs
A URL is a universal identifier.
A URL consists of components, namely a scheme, scheme data, username, password, host, port, path, query, and fragment.
http://username:password@example.com:8000/path?query#fragment contains values for all URL components except scheme data.
javascript:doSomething() contains values for only scheme and scheme data.
A URL’s scheme is a string that identifies the type of URL and can be used to dispatch a URL for further processing after parsing. It is initially null.
A URL’s scheme data is a string holding the contents of a URL. It is initially null.
A URL’s scheme data will be null if its initial scheme is a relative scheme, and otherwise will be the only component other than scheme that differs from its initial value.
A URL’s username is a string identifying a user. It is initially the empty string.
A URL’s password is either null or a string identifying a user’s credentials. It is initially null.
A URL’s host is either null or a host. It is initially null.
A URL’s port is a string that identifies a networking port. It is initially the empty string.
A URL’s path is a list of zero or more strings holding data, usually identifying a location in hierarchical form. It is initially the empty list.
A URL’s query is either null or a string holding data. It is initially null.
A URL’s fragment is either null or a string holding data that can be used for further processing on the resource the URL’s other components identify. It is initially null.
A URL also has an associated object that is either null or a Blob. It is initially null. [FILEAPI]
At this point this is used primarily to support "blob" URLs, but others can be added going forward, hence "object".
A relative scheme is a scheme listed in the first column of the following table. A default port is a relative scheme’s optional corresponding port and is listed in the second column on the same row.
scheme | port |
---|---|
"ftp" | "21" |
"file" | |
"gopher" | "70" |
"http" | "80" |
"https" | "443" |
"ws" | "80" |
"wss" | "443" |
A URL includes credentials if either its username is not the empty string or its password is non-null.
A URL can be designated as base URL.
A base URL is useful for the URL parser when the input is potentially a relative URL.
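As a rough illustration of the component model described in this section, the following Python sketch holds the components with their initial values, the relative scheme table, and the "includes credentials" check. The names `URLRecord`, `RELATIVE_SCHEMES`, `is_relative_scheme`, and `default_port` are hypothetical and not defined by this specification; the associated object is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Relative schemes and their default ports, as listed in the table.
# "file" is a relative scheme without a default port.
RELATIVE_SCHEMES = {
    "ftp": "21",
    "file": None,
    "gopher": "70",
    "http": "80",
    "https": "443",
    "ws": "80",
    "wss": "443",
}


@dataclass
class URLRecord:
    """Hypothetical container for URL components and their initial values."""
    scheme: Optional[str] = None
    scheme_data: Optional[str] = None
    username: str = ""
    password: Optional[str] = None
    host: Optional[str] = None
    port: str = ""
    path: List[str] = field(default_factory=list)
    query: Optional[str] = None
    fragment: Optional[str] = None

    def includes_credentials(self) -> bool:
        # True if the username is non-empty or the password is non-null.
        return self.username != "" or self.password is not None


def is_relative_scheme(scheme: str) -> bool:
    return scheme in RELATIVE_SCHEMES


def default_port(scheme: str) -> Optional[str]:
    # Returns None for "file" and for schemes that are not relative schemes.
    return RELATIVE_SCHEMES.get(scheme)
```

Note that an empty-string password still counts as credentials, because the check is against null rather than against the empty string.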
1.1. Authoring Requirements
Two types of parse errors are defined. Parse exceptions terminate parsing and must be implemented by all conforming implementations. By contrast, user agents are encouraged, but not required, to expose conformance errors somehow. If parsing a given input results in the detection of multiple conformance errors, user agents may choose to report only a subset of the errors detected.
A URL must be written as either a relative URL or an absolute URL, optionally followed by "#" and a fragment.
An absolute URL must be a scheme, followed by ":", followed by either a scheme-relative URL, if scheme is a relative scheme, or scheme data otherwise, optionally followed by "?" and a query.
A scheme must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, "+", "-", and ".". A scheme must be registered ....
The syntax of scheme data depends on the scheme and is typically defined alongside it. Standards must define scheme data within the constraints of zero or more URL units, excluding "?".
A relative URL must be either a scheme-relative URL, an absolute-path-relative URL, or a path-relative URL that does not start with a scheme and ":", optionally followed by a "?" and a query.
At the point where a relative URL is parsed, a base URL must be in scope.
A scheme-relative URL must be "//", optionally followed by userinfo and "@", followed by a host, optionally followed by ":" and a port, optionally followed by an absolute-path-relative URL.
Userinfo must be a username, optionally followed by a ":" and a password.
A username must be zero or more URL units, excluding "/", ":", "?", and "@".
A password must be zero or more URL units, excluding "/", "?", and "@".
A host must be either a domain, or an IPv4 address, or "[" followed by an IPv6 address followed by "]".
A domain must be a string that is a valid domain.
Textual representation of IPv4 address does not appear to be defined by an RFC. See Textual Representation of IPv4 and IPv6 Addresses for some history.
An IPv6 address is defined in the "Text Representation of Addresses" chapter of IP Version 6 Addressing Architecture. [RFC4291]
A port must be zero or more ASCII digits.
An absolute-path-relative URL must be "/", followed by a path-relative URL that does not start with "/".
A path-relative URL must be zero or more path segments separated from each other by a "/". The first segment (if any) of a path-relative URL must not contain a colon (U+003A).
A path segment must be zero or more URL units, excluding "/" and "?".
A query must be zero or more URL units.
A fragment must be zero or more URL units.
The URL code points are ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E0000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
Code points higher than U+009F will be converted to percent-encoded bytes by the URL parser, except for code points appearing in fragments.
The URL units are URL code points and percent-encoded bytes.
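The URL code point definition can be checked mechanically. The Python sketch below is one possible membership test; the function name `is_url_code_point` and the helper tables are illustrative assumptions, not part of this specification.

```python
import string

# ASCII alphanumeric plus the punctuation listed in the definition.
_ASCII_ALNUM = set(string.ascii_letters + string.digits)
_PUNCTUATION = set("!$&'()*+,-./:;=?@_~")

# U+00A0..U+D7FF, U+E000..U+FDCF, U+FDF0..U+FFFD, then for each of the
# planes 1 through 16 the range U+x0000..U+xFFFD.
_RANGES = [(0x00A0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)] + [
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 0x11)
]


def is_url_code_point(ch: str) -> bool:
    """Return True if the single character ch is a URL code point."""
    if ch in _ASCII_ALNUM or ch in _PUNCTUATION:
        return True
    cp = ord(ch)
    return any(low <= cp <= high for low, high in _RANGES)
```

The percent sign is deliberately absent from the set: it only appears in URL units as part of a percent-encoded byte.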
1.2. Parsers
The URL parser takes a string input, optionally with a base URL base, and optionally with an encoding encoding override, and then runs these steps:
-
Let url be the result of running the basic URL parser on input with base, and encoding override as provided.
-
If url is failure, return failure.
-
If url’s scheme is not "blob", return url.
-
If url’s scheme data is not in the blob URL store, return url. [FILEAPI]
-
Set url’s object to a structured clone of the entry in the blob URL store corresponding to url’s scheme data. [HTML]
-
Return url.
The basic URL parser takes a string input, optionally with a base URL base, optionally with an encoding encoding override, optionally with a URL url and a state override state override, and then runs these steps:
The encoding override argument is a legacy concept only relevant for HTML. The url and state override arguments are only for use by methods of objects implementing the URLUtils interface. [HTML]
When the url and state override arguments are not passed the basic URL parser returns either a URL or failure. If they are passed the algorithm simply modifies the passed url and can terminate without returning anything.
-
If url is not given:
-
Set url to a new URL.
-
If leading or trailing ASCII whitespace codepoints are present in the input:
-
Indicate a conformance error.
-
Remove leading and trailing ASCII whitespace from input.
-
If base is not given, set it to a new URL with scheme set to about.
-
If encoding override is not given, set it to utf-8.
-
Invoke url on the input passing url, base and encoding override.
-
Return url.
1.3. Parsing Rules
These railroad diagrams, as modified by the accompanying text, define grammar production rules for URLs. They are to be evaluated sequentially, first left-to-right then top-to-bottom, backtracking as necessary, until a complete match against the input provided is found.
Each rule defines a function that can be invoked individually. Rules can invoke one another.
Parsing a given input according to a railroad diagram produces a number of intermediate values that can be referenced individually as local variables within the function. The names of those local variables aren’t specified by this specification; instead, the names used in the railroad diagrams are referenced.
If a given input doesn’t match the railroad diagram, failure is returned instead. If failure is returned by an alternative, evaluation continues with the next alternative.
The following conventions are used in descriptions of parsing logic for clarity and conciseness:
-
The phrase "x is present" is to be interpreted to mean "the grammar rule for x matches some part of the input when that input is parsed according to the given railroad diagram". Note that if the railroad diagram contains alternatives where multiple alternatives could potentially match the input, only the first matching alternative is considered to be present.
-
The phrase "value of x" is to be interpreted to mean "the value returned by the function x(input) when the function x is passed some part of the original input during the parsing of that original input according to the given railroad diagram".
Note: extracting the railroad diagrams from this specification and interpreting them in isolation will produce incorrect results. In particular, the definitions provided in the prose modify these parsing algorithms in important ways, including returning failure and early termination.
1.3.1. url(input)
returns: { scheme, scheme-data, username, password, host, port, path, query, fragment }
- Parse input according to the above railroad diagram.
- Let result be the value of file-url, non-relative-url, or relative-url depending on which is present.
- If query is present, set result.query to the value of query.
- If fragment is present, set result.fragment to the value of fragment.
- If result.scheme has a default port, and if result.port is equal to that default, then delete the port property from result.
- Return result.
1.3.2. file-url(input)
returns: { scheme, host, path }
"file" is to be matched case insensitively.
- Parse input according to the above railroad diagram.
- Let result be an empty object.
- Three rows of production rules are defined for files, numbered from top to bottom. Examples and evaluation instructions for each:
  - file:c:\foo\bar.html
    - Set result.scheme to "file".
    - Set result.path to the value of path.
    - Remove the first element from result.path if it is an empty string and if there is a second element which has a non-empty value.
    - Construct a string using the ASCII alpha following the first ":" in the input concatenated with a ":". Prepend this string to result.path.
  - /C|\foo\bar
    - Set result.scheme to "file".
    - If the host is present, set result.host to the value of host.
    - If the host is not present and no slashes precede the path in the input, then prepend base.path minus the last element to the result.path.
    - Set result.path to the value of path.
  - file:/example.com/
    - Indicate a conformance error.
    - Set result.scheme to "file".
    - Set result.path to the value of path.
    - Remove the first element from result.path if it is an empty string and if there is a second element which has a non-empty value.
    - Construct a string consisting of the code point following the initial "/" (if any) in the production rule concatenated with a ":". Prepend this string to the result.path array.
- Return result.
At the present time, file URLs are generally not interoperable, and therefore are effectively implementation defined. Furthermore, the parsing rules in this section have not enjoyed wide review, and therefore are more likely to be subject to change than other parts of this specification.
Bug 27518 proposes to remove all normative definitions for file URLs that are not known to be interoperable. Bug 23550 and bug 23717 suggest less drastic changes.
1.3.3. non-relative-url(input)
returns: { scheme, scheme-data }
javascript:alert("Hello, world!");
- Parse input according to the above railroad diagram.
- If the value of scheme does not match any relative scheme then return failure.
- Set encoding override to "utf-8".
- Initialize result to be a JSON object with scheme set to be the result returned by scheme, and schemeData set to the value of scheme-data.
- Return result.
The resolution of bug 26338 may change how encoding override is handled.
The resolution of bug 27233 may add support for relative URLs for unknown schemes.
1.3.4. relative-url(input)
returns: { scheme, username, password, host, port, path }
- Parse input according to the above railroad diagram.
- Four rows of production rules are defined for relative URLs, numbered from top to bottom. Examples and evaluation instructions for each:
  - http://user:pass@example.org:21/foo/bar
    - If anything other than two forward solidus code points ("//") immediately follows the first colon in the input, indicate a conformance error.
    - Let result be the value of authority.
    - If non-file-relative-scheme is present in the input, then set result.scheme to the value of non-file-relative-scheme.
    - If non-file-relative-scheme is not present, then set result.scheme to the value of base.scheme.
    - If path is present, set result.path to the value of path.
  - ftp:/example.com/ parsed using a base of http://example.org/foo/bar
    - If the value of scheme equals base.scheme then return failure.
    - Indicate a conformance error.
    - Let result be the value of authority.
    - Set result.scheme to the value of non-file-relative-scheme.
    - If result.host is either an empty string or contains a colon (U+003A), then terminate parsing with a parse exception.
    - If path is present, set result.path to the value of path.
  - http:foo/bar
    - Indicate a conformance error.
    - Let result be an empty object.
    - Set result.scheme to the value of non-file-relative-scheme.
    - Set result.scheme to the value of scheme.
    - Set result.host to base.host.
    - Set result.path by the path concatenation of base.path and path.
  - /foo/bar
    - Let result be an empty object.
    - Set result.scheme to base.scheme.
    - Set result.host to base.host.
    - If the length of path is greater than zero, and the first segment in path contains a colon (U+003A), indicate a conformance error.
    - Replace result.path with the path concatenation of base.path and path.
- Return result.
1.3.5. non-file-relative-scheme(input)
returns: String
Schemes are to be matched against the input in a case insensitive manner.
- Parse input according to the above railroad diagram.
- Set encoding override to "utf-8" if the scheme matches "wss" or "ws".
- Return the scheme as a lowercased string.
The resolution of bug 26338 may change how encoding override is handled.
1.3.6. scheme(input)
returns: String
A scheme consists of an ASCII alpha, followed by zero or more ASCII alpha or any of the following code points: hyphen-minus (U+002D), plus sign (U+002B) or full stop (U+002E).
Return the results as a lowercased string.
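The scheme rule can be approximated with a regular expression. The Python sketch below assumes the rule is matched as a prefix of the input; `parse_scheme` is a hypothetical name, and the character class follows the prose of this section (ASCII alpha, hyphen-minus, plus sign, full stop), which is narrower than the authoring requirement that also allows ASCII digits.

```python
import re
from typing import Optional

# One ASCII alpha, then zero or more of ASCII alpha, "+", "-", ".".
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z+\-.]*")


def parse_scheme(text: str) -> Optional[str]:
    """Return the lowercased scheme matched at the start of text,
    or None if text does not start with an ASCII alpha."""
    match = _SCHEME_RE.match(text)
    if match is None:
        return None
    return match.group(0).lower()
```

For example, `parse_scheme("HTTP://example.org/")` yields `"http"`, while an input starting with a digit yields `None`.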
1.3.7. authority(input)
returns: { username, password, host, port }
- Parse input according to the above railroad diagram.
- Let result be an empty object.
- If user is present, set result.username to the value of user.
- If password is present, set result.password to the value of password.
- Set result.host to the value of host up to the first "@" sign, if any. If no "@" signs are present in the value of host, then set result.host to the value of host.
- If one or more "@" signs are present in the value of host, then perform the following steps:
  - Indicate a conformance error.
  - Initialize info to the value of "%40" plus the remainder of the host after the first "@" sign. Replace all remaining "@" signs in info with the string "%40".
  - If password is present, append info to result.password.
  - If password is not present and user is present, append info to result.username.
  - If user is not present, set result.username to info.
- If port is present, set result.port to its value.
- Return result.
1.3.8. user(input)
returns: String
Consume all code points until either a solidus (U+002F), a reverse solidus (U+005C), a question mark (U+003F), a number sign (U+0023), a commercial at (U+0040), a colon (U+003A), or the end of string is encountered. Return the cleansed result using the default encode set.
1.3.9. password(input)
returns: String
Consume all code points until either a solidus (U+002F), a reverse solidus (U+005C), a question mark (U+003F), a number sign (U+0023), a commercial at (U+0040), or the end of string is encountered. Return the cleansed result using the default encode set.
1.3.10. host(input)
returns: String
- Parse input according to the above railroad diagram.
- If ipv6addr is present, return the value of ipv6addr.
- If ipv4addr is present, return the value of ipv4addr.
- Let result be the characters matched by the railroad diagram.
- If any U+0009, U+000A, U+000D, U+200B, U+2060, or U+FEFF code points are present in result, remove those code points and indicate a conformance error.
- Let domain be the result of host parsing result. If this results in a failure, terminate processing with a parse exception. If host parsing returned a value that was different than what was provided as input, indicate a conformance error.
- Try parsing domain as an ipv4addr. If this succeeds, replace domain with the result.
- Validate domain as follows:
  - Split the string at U+002E (full stop) code points.
  - If any of the pieces, other than the first one, are empty strings, indicate a conformance error.
- Return domain.
The resolution of bug 25334 may change what code points are allowed in a domain.
The resolution of bug 27266 may change the way domain names and trailing dots are handled.
1.3.11. ipv6addr(input)
returns: String
- Let pre be the set of h16 values before the double colon, if present.
- Let post be the remaining h16 value before the last value.
- Let last be the trailing h16 or ls32 value.
- If there are no consecutive colon code points in the input string, indicate a parse exception and terminate processing unless there are exactly six h16 values and one ls32 value.
- If there are consecutive colon code points present in the input, indicate a parse exception and terminate processing if the total number of values (h16 or ls32) is more than six.
- Unless there is a ls32 value present, indicate a parse exception and terminate processing if consecutive colon code points are present in the input or if there are more than one ls32 value after the consecutive colons.
- Append "0" values to pre while the sum of the lengths of the pre and post arrays is less than six.
- Append a "0" value to pre if no ls32 item is present in the input and the sum of the lengths of the pre and post array is seven.
- Append last to pre.
- Return '[' plus the ipv6 serialized value of pre as a string, plus ']'.
The resolution of bug 27234 may add support for link-local addresses.
1.3.12. ipv4addr(input)
returns: String
- If any but the last number is greater than or equal to 256, terminate processing with a parse exception.
- If the last number is greater than or equal to 256 to the power of (5 minus the number of numbers present in the input), terminate processing with a parse exception.
- Unless four numbers are present, indicate a conformance error.
- Let n be the last number.
- If the first number is present, add its value times 256^3 to n.
- If the second number is present, add its value times 256^2 to n.
- If the third number is present, add its value times 256 to n.
- Let result be an empty array.
- Four times do the following:
  - Prepend the value of n modulo 256 to result.
  - Set n to the value of the integer quotient of n divided by 256.
- Join the values in result with a Full Stop (U+002E) code point, and return the results as a string.
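The arithmetic in these steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the railroad diagram is not reproduced here, so the sketch assumes the input is one to four numbers separated by full stops; "first", "second" and "third" number are read as the numbers preceding the last one; parse exceptions are modeled as `ValueError`; conformance errors are not reported. The names `parse_number` and `parse_ipv4` are hypothetical.

```python
import re

_HEX = re.compile(r"0[xX]([0-9A-Fa-f]+)")
_OCT = re.compile(r"0([0-7]+)")
_DEC = re.compile(r"0|[1-9][0-9]*")


def parse_number(text: str) -> int:
    """Parse a hexadecimal (0x prefix), octal (leading 0) or decimal number."""
    match = _HEX.fullmatch(text)
    if match:
        return int(match.group(1), 16)
    match = _OCT.fullmatch(text)
    if match:
        return int(match.group(1), 8)
    if _DEC.fullmatch(text):
        return int(text, 10)
    raise ValueError("not a number: %r" % text)


def parse_ipv4(text: str) -> str:
    numbers = [parse_number(part) for part in text.split(".")]
    if not 1 <= len(numbers) <= 4:
        raise ValueError("expected one to four numbers")
    *leading, last = numbers
    # Every number but the last must fit in one byte.
    if any(value >= 256 for value in leading):
        raise ValueError("number out of range")
    # The last number fills all remaining bytes.
    if last >= 256 ** (5 - len(numbers)):
        raise ValueError("last number out of range")
    n = last
    for index, value in enumerate(leading):
        n += value * 256 ** (3 - index)
    result = []
    for _ in range(4):
        result.insert(0, n % 256)
        n //= 256
    return ".".join(str(byte) for byte in result)
```

With fewer than four numbers the last one spreads over the remaining bytes, so `parse_ipv4("127.1")` gives `"127.0.0.1"`.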
The resolution of bug 26431 may change this definition.
1.3.13. number(input)
returns: Integer
Three production rules, with uppercase and lowercase variants, are defined for numbers. Parse the values as hexadecimal, octal, and decimal integers respectively. Indicate a conformance error if the value is hexadecimal or octal. Return the result as an integer.
1.3.14. h16(input)
returns: String
Return up to four ASCII hex digits as a string.
1.3.15. ls32(input)
returns: String
Return four decimal 8-bit pieces separated by full stop code points as a string.
1.3.16. decimal-byte(input)
returns: String
Decimal bytes are a string of up to three decimal digits. If the results converted to an integer are greater than 255, terminate processing with a parse exception.
1.3.17. port(input)
returns: String
- Consume all code points until either a solidus (U+002F), a reverse solidus (U+005C), a question mark (U+003F), or the end of string is encountered.
- Let result be the cleansed set of code points using null as the encode set.
- Remove leading U+0030 code points from result until either the leading code point is not U+0030 or result consists of exactly one code point.
- If any code points in result are not ASCII digits:
  - If input was not set, terminate processing with a parse exception.
  - Truncate result starting with the first non-digit code point.
  - Indicate a conformance error.
- Return the result as a string.
The resolution of bug 26446 may change port from a string to a number.
1.3.18. path(input)
returns: Array of Strings
- If any of the path separators are a reverse solidus ("\"), indicate a conformance error.
- Extract all the pathnames into an array. Process each name as follows:
  - Cleanse the name using the default encode set as the encode set.
  - If the name is "." or "%2e" (case insensitive), then process this name based on the position in the array:
    - If the position is other than the last, remove the name from the list.
    - If the array is of length 1, replace the entry with an empty string.
    - Otherwise, leave the entry as is.
  - If the name is "..", ".%2e", "%2e.", or "%2e%2e" (all to be compared in a case insensitive manner), then process this name based on the position in the array:
    - If the position is the first, then remove it.
    - If the position is other than the last, then remove it and the one before it.
    - If the position is the last, then remove it and the one before it, then append an empty string.
- Return the array.
The resolution of bug 24163 may change what code points to escape in the path.
1.3.19. scheme-data(input)
returns: String
Consume all code points until either a question mark (U+003F), a number sign (U+0023), or the end of string is encountered. Return the cleansed result using null as the encode set.
The resolution of bug 24246 may change what code points to escape in the scheme data.
1.3.20. query(input)
returns: String
Consume all code points until either a number sign (U+0023) or the end of string is encountered. Return the cleansed result using the query encode set.
The resolution of bug 27280 may change how code points below 0x20 are handled.
1.3.21. fragment(input)
returns: String
- Let result be the remaining code points in the input.
- If any U+0000 characters are present in result, remove those characters from result and indicate a conformance error.
- If any characters in result are neither a Percent Sign (U+0025) nor a URL code point, indicate a conformance error.
- Return the cleansed result using null as the encode set.
Unfortunately not using percent-encoding is intentional as implementations with majority market share exhibit this behavior.
The resolution of bug 26988 may add support for parsing URLs without decoding the fragment identifier.
1.4. Setter Rules
URLUtils and URLUtilsReadOnly members invoke the following setter rules with url set to a non-null value.
1.4.1. set-protocol(input)
Set url.scheme to the value returned by scheme.
1.4.2. set-username(input)
If url.scheme_data is not null, return.
Set url.username to the percent encoded value using the username encode set.
1.4.3. set-password(input)
If url.scheme_data is not null, return.
Set url.password to the percent encoded value using the password encode set.
1.4.4. set-host(input)
If url.scheme_data is not null, return.
Set url.host to the value returned by host.
If port is present, set result.port to its value.
1.4.5. set-hostname(input)
If url.scheme_data is not null, return.
Set url.host to the value returned by host.
1.4.6. set-port(input)
If url.scheme_data is not null or url.scheme is "file", return.
If url.scheme has a default port, and if port is equal to that default, then set the port property of url to the empty string.
Otherwise, set url.port to the value returned by port.
1.4.7. set-pathname(input)
If url.scheme_data is not null, return.
Set url.path to the value returned by path.
1.4.8. set-search(input)
Set url.query to the percent encoded value after the initial question mark (U+003F), if any, using the query encode set.
1.4.9. set-hash(input)
Set url.fragment to the percent encoded value after the initial number sign (U+0023), if any, using the simple encode set.
1.5. Common Functions
To percent encode a byte into a percent-encoded byte, return a string consisting of "%", followed by a double-digit, uppercase, hexadecimal representation of byte.
To percent decode a byte sequence input, run these steps:
Using anything but a utf-8 decoder when the input contains bytes outside the range 0x00 to 0x7F might be insecure and is not recommended.
-
Let output be an empty byte sequence.
-
For each byte byte in input, run these steps:
-
If byte is not `%`, append byte to output.
-
Otherwise, if byte is `%` and the next two bytes after byte in input are not in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66, append byte to output.
-
Otherwise, run these substeps:
-
Let bytePoint be the two bytes after byte in input, decoded, and then interpreted as hexadecimal number.
-
Append a byte whose value is bytePoint to output.
-
Skip the next two bytes in input.
-
-
-
Return output.
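The percent decode steps above translate almost directly into code. The following Python sketch is one possible rendering; the function name `percent_decode` is an assumption made for illustration.

```python
_HEX_DIGITS = b"0123456789ABCDEFabcdef"


def percent_decode(data: bytes) -> bytes:
    """Decode %XX sequences; a "%" not followed by two hex digits is kept."""
    output = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        pair = data[i + 1:i + 3]
        if byte == 0x25 and len(pair) == 2 and all(b in _HEX_DIGITS for b in pair):
            # Interpret the two bytes after "%" as a hexadecimal number.
            output.append(int(pair, 16))
            i += 3  # skip the "%" and the two hex digits
        else:
            output.append(byte)
            i += 1
    return bytes(output)
```

Malformed sequences pass through unchanged, so `percent_decode(b"%41%zz")` returns `b"A%zz"`.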
To utf-8 percent encode a code point, using an encode set, run these steps:
-
If code point is not in encode set, return code point.
-
Let bytes be the result of running utf-8 encode on code point.
-
Percent encode each byte in bytes, and then return them concatenated, in the same order.
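A short Python sketch of percent encode and utf-8 percent encode follows. It models an encode set as a predicate over single characters; that modeling choice, the function names, and the `demo_encode_set` used in the usage note are assumptions for illustration only, since the encode sets themselves are not defined in this section.

```python
from typing import Callable


def percent_encode_byte(byte: int) -> str:
    """Return "%" followed by the two-digit uppercase hex form of byte."""
    return "%{:02X}".format(byte)


def utf8_percent_encode(code_point: str, in_encode_set: Callable[[str], bool]) -> str:
    """Percent encode code_point's UTF-8 bytes if it is in the encode set."""
    if not in_encode_set(code_point):
        return code_point
    encoded = code_point.encode("utf-8")
    return "".join(percent_encode_byte(b) for b in encoded)


def demo_encode_set(ch: str) -> bool:
    # Illustrative only: everything outside printable ASCII.
    return not (0x20 <= ord(ch) <= 0x7E)
```

With this demo set, `utf8_percent_encode("é", demo_encode_set)` produces `"%C3%A9"` and `"a"` is returned unchanged.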
The domain to ASCII function given a domain domain, runs these steps:
-
Let result be the result of running Unicode ToASCII with domain_name set to domain, UseSTD3ASCIIRules set to false, processing_option set to Transitional_Processing, and VerifyDnsLength set to false.
-
If result is a failure value, return failure.
-
Return result.
The domain to Unicode function given a domain domain, runs these steps:
-
Let result be the result of running Unicode ToUnicode with domain_name set to domain, UseSTD3ASCIIRules set to false.
-
Return result, ignoring any returned errors.
User agents are encouraged to report errors through a developer console.
A domain is a valid domain if these steps return success:
-
Let result be the result of running Unicode ToASCII with domain_name set to domain, UseSTD3ASCIIRules set to true, processing_option set to Nontransitional_Processing, and VerifyDnsLength set to true.
-
If result is a failure value, return failure.
-
Set result to the result of running Unicode ToUnicode with domain_name set to result, UseSTD3ASCIIRules set to true.
-
If result contains any errors, return failure.
-
Return success.
Ideally we define this in terms of a sequence of code points that make up a valid domain rather than through a whack-a-mole: bug 25334.
The host parser takes a string input and then runs these steps:
-
If input is the empty string, return failure.
-
Let domain be the result of utf-8 decode without BOM on the percent decoding of utf-8 encode on input.
-
Let asciiDomain be the result of running domain to ASCII on domain.
-
If asciiDomain is failure, return failure.
-
If asciiDomain contains one of U+0000, U+0009, U+000A, U+000D, U+0020, "#", "%", "/", ":", "?", "@", "[", "\", and "]", return failure.
-
Return asciiDomain.
To cleanse a string given an encode set, run these steps:
- If any character in the string is not a URL code point or a percent sign (U+0025), indicate a conformance error.
- If the string includes a percent sign (U+0025) that is not immediately followed by two hexadecimal characters, indicate a conformance error.
- If any U+0009, U+000A or U+000D characters are present in the string, remove those characters and indicate a conformance error.
- If the encode set is non-null, first encode the result using the encoding override, then percent encode that result using the provided encode set.
- Return the result as a string.
To do path concatenation given a base array of path names, and a path array of names, run the following steps:
- If base is null, set base to an empty array. Otherwise make a local copy of the base array.
- If the first element on path is ".", remove this first element from path as well as the last element (if any) of base.
- If path.length is one, and the first and only element on the path is an empty string, set path to the value of base.
- Otherwise if path.length is greater than one, and the first element on the path is the empty string, remove the first element from the path.
- Otherwise, remove the last element (if any) of base and then prepend the values of the base array to the path array.
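The path concatenation steps can be transcribed literally, as in the Python sketch below. It assumes the algorithm returns the resulting path, which the steps imply but do not state, and it copies both arrays so the caller's lists are left untouched. The name `path_concatenation` is hypothetical.

```python
from typing import List, Optional


def path_concatenation(base: Optional[List[str]], path: List[str]) -> List[str]:
    # Step 1: a null base becomes an empty array; otherwise work on a copy.
    base = [] if base is None else list(base)
    path = list(path)
    # Step 2: a leading "." drops itself and the last element of base.
    if path and path[0] == ".":
        path.pop(0)
        if base:
            base.pop()
    # Step 3: a path consisting of one empty string resolves to base.
    if len(path) == 1 and path[0] == "":
        return base
    # Step 4: a longer path starting with an empty string drops that element.
    if len(path) > 1 and path[0] == "":
        path.pop(0)
        return path
    # Step 5: replace the last element of base with the whole path.
    if base:
        base.pop()
    return base + path
```

For instance, concatenating base `["foo", "bar"]` with path `["baz"]` gives `["foo", "baz"]`.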
1.6. Serializers
The URL serializer takes a URL url, optionally an exclude fragment flag, and then runs these steps:
-
Let output be url’s scheme and ":" concatenated.
-
If url’s scheme data is unset:
-
Append "//" to output.
-
If url’s username is not the empty string or url’s password is non-null, run these substeps:
-
Append url’s host, serialized, to output.
-
If url’s port is not the empty string, append ":" concatenated with url’s port to output.
-
Append "/" concatenated with the strings in url’s path (including empty strings), separated from each other by "/" to output.
-
Otherwise, append url’s scheme data to output.
-
If url’s query is non-null, append "?" concatenated with url’s query to output.
-
If the exclude fragment flag is unset and url’s fragment is non-null, append "#" concatenated with url’s fragment to output.
-
Return output.
The host serializer takes null or a host host and then runs these steps:
-
If host is null, return the empty string.
-
If host is an IPv6 address, return "[", followed by the result of running the IPv6 serializer on host, followed by "]".
-
Otherwise, host is a domain or an IPv4 address, return host.
The IPv6 serializer takes an IPv6 address address and then runs these steps:
-
Let output be the empty string.
-
Let compress pointer be a pointer to the first 16-bit piece in the first longest sequences of address’s 16-bit pieces that are 0.
In 0:f:0:0:f:f:0:0 it would point to the second 0.
-
If there is no sequence of address’s 16-bit pieces that are 0 longer than one, set compress pointer to null.
-
For each piece in address’s pieces, run these substeps:
-
If compress pointer points to piece, append "::" to output if piece is address’s first piece and append ":" otherwise, and then run these substeps again with all subsequent pieces in address’s pieces that are 0 skipped or go to the next step in the overall set of steps if that leaves no pieces.
-
Append piece, represented as the shortest possible lowercase hexadecimal number, to output.
-
If piece is not address’s last piece, append ":" to output.
-
-
Return output.
This algorithm requires the recommendation from A Recommendation for IPv6 Address Text Representation. [RFC5952]
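As an illustration, the host serializer and IPv6 serializer can be sketched in JavaScript. The representation is an assumption of this sketch: an IPv6 address is modeled as an array of eight numbers (its 16-bit pieces), a domain or IPv4 address as a string, and the names serializeIPv6 and serializeHost are hypothetical.

```javascript
// Sketch only: `pieces` is assumed to be an array of eight 16-bit numbers.
function serializeIPv6(pieces) {
  // Find the first longest run of zero pieces; runs of length one do not count.
  let compress = null;
  let bestLength = 1;
  for (let i = 0; i < pieces.length; ) {
    if (pieces[i] !== 0) { i++; continue; }
    let j = i;
    while (j < pieces.length && pieces[j] === 0) j++;
    if (j - i > bestLength) { compress = i; bestLength = j - i; }
    i = j;
  }

  let output = "";
  for (let i = 0; i < pieces.length; i++) {
    if (i === compress) {
      output += i === 0 ? "::" : ":";
      i += bestLength - 1; // skip the rest of the zero run
      continue;
    }
    output += pieces[i].toString(16);
    if (i !== pieces.length - 1) output += ":";
  }
  return output;
}

// Sketch only: null, a string (domain or IPv4 address), or an array (IPv6).
function serializeHost(host) {
  if (host === null) return "";
  if (Array.isArray(host)) return "[" + serializeIPv6(host) + "]";
  return host;
}

serializeIPv6([0, 0xf, 0, 0, 0xf, 0xf, 0, 0]); // "0:f::f:f:0:0"
serializeHost([0, 0, 0, 0, 0, 0, 0, 1]);       // "[::1]"
```

In the first call the compress pointer lands on the second 0 (index 2), matching the example given in the steps.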
1.7. Origin
See origin’s definition in HTML for the necessary background information. [HTML]
A URL’s origin is the origin returned by running these steps, switching on URL’s scheme:
- "blob"
  Let url be the result of parsing URL’s scheme data.
  If url is failure, return a new globally unique identifier. Otherwise, return url’s origin.
  The origin of blob:https://whatwg.org/d0360e2f-caee-469f-9a2f-87d5b0456f6f is the tuple (https, whatwg.org, 443).
- "ftp"
- "gopher"
- "http"
- "https"
- "ws"
- "wss"
  Return a tuple consisting of URL’s scheme, its host, and its default port if its port is the empty string, and its port otherwise.
- "file"
  Unfortunate as it is, this is left as an exercise to the reader. When in doubt, return a new globally unique identifier.
- Otherwise
  Return a new globally unique identifier.
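The switch above can be sketched in JavaScript. Several things here are assumptions made for illustration: the record shape, the name originOf, a caller-supplied parse function that returns a URL record or null on failure, the use of a fresh Symbol to stand in for a globally unique identifier, and the default port table, which is taken from the relative scheme table defined elsewhere in this standard.

```javascript
// Default ports per relative scheme (defined outside this section).
const DEFAULT_PORTS = { ftp: "21", gopher: "70", http: "80", https: "443", ws: "80", wss: "443" };

// Sketch only: `parse` is a hypothetical stand-in for the URL parser and
// returns a record such as { scheme, host, port } or null on failure.
function originOf(url, parse) {
  switch (url.scheme) {
    case "blob": {
      const inner = parse(url.schemeData);
      return inner === null ? Symbol("unique origin") : originOf(inner, parse);
    }
    case "ftp":
    case "gopher":
    case "http":
    case "https":
    case "ws":
    case "wss":
      return [url.scheme, url.host, url.port === "" ? DEFAULT_PORTS[url.scheme] : url.port];
    default:
      // Covers "file" (when in doubt) and every other scheme.
      return Symbol("unique origin");
  }
}

// Minimal stub parser used only for this example.
const stubParse = (input) =>
  input.startsWith("https://whatwg.org/") ? { scheme: "https", host: "whatwg.org", port: "" } : null;

originOf({ scheme: "blob", schemeData: "https://whatwg.org/d0360e2f-caee-469f-9a2f-87d5b0456f6f" }, stubParse);
// ["https", "whatwg.org", "443"]
```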
2. APIs
[Constructor(USVString url, optional USVString base = "about:blank"),
 Exposed=(Window,Worker)]
interface URL {
  static USVString domainToASCII(USVString domain);
  static USVString domainToUnicode(USVString domain);
};
URL implements URLUtils;

[NoInterfaceObject,
 Exposed=(Window,Worker)]
interface URLUtils {
  stringifier attribute USVString href;
  readonly attribute USVString origin;
  attribute USVString protocol;
  attribute USVString username;
  attribute USVString password;
  attribute USVString host;
  attribute USVString hostname;
  attribute USVString port;
  attribute USVString pathname;
  attribute USVString search;
  attribute URLSearchParams searchParams;
  attribute USVString hash;
};

[NoInterfaceObject,
 Exposed=(Window,Worker)]
interface URLUtilsReadOnly {
  stringifier readonly attribute USVString href;
  readonly attribute USVString origin;
  readonly attribute USVString protocol;
  readonly attribute USVString host;
  readonly attribute USVString hostname;
  readonly attribute USVString port;
  readonly attribute USVString pathname;
  readonly attribute USVString search;
  readonly attribute USVString hash;
};
Except where they differ, objects implementing URLUtilsReadOnly are identical to objects implementing URLUtils.
Since all members are readonly and certain members from URLUtils are not exposed, a number of potential optimizations are possible compared to objects implementing URLUtils. These are left as an exercise to the reader.
Specifications defining objects implementing URLUtils or URLUtilsReadOnly must define a get the base algorithm, which must return the appropriate base URL for the object.
Specifications defining objects implementing URLUtils may define update steps to make it possible for an underlying string (such as an attribute value) to be updated. The update steps are passed a string value for this purpose.
An object implementing URLUtils or URLUtilsReadOnly has an associated input (a string), query encoding (an encoding), query object (a URLSearchParams object or null), and a url (a URL or null).
Unless stated otherwise, query encoding is utf-8 and query object is null. The others follow from the set the input algorithm.
The associated query encoding is a legacy concept only relevant for HTML. [HTML]
Specifications defining objects implementing URLUtils or URLUtilsReadOnly must use the set the input algorithm to set input, url, and query object. To set the input given input and optionally a url, run these steps:
- If url is given, set url to url and input to input.
- Otherwise, run these substeps:
  - Set url to null.
  - If input is null, set input to the empty string.
  - Otherwise, run these subsubsteps:
    - Set input to input.
    - Let url be the result of running the URL parser on input with base URL being the result of running get the base and query encoding as encoding override.
    - If url is not failure, set url to url.
- Let query be url’s query if url is non-null, and the empty string otherwise.
- If query object is null, set query object to a new URLSearchParams object using query, and then append the context object to query object’s list of url objects.
- Otherwise, set query object’s list to the result of parsing query.
To run the pre-update steps for an object implementing URLUtils, optionally given a value, run these steps:
- If value is not given, let value be the result of serializing the associated url.
- Run the update steps with value.
2.1. Constructors
The URL(url, base) constructor, when invoked, must run these steps:
- Let parsedBase be the result of running the basic URL parser on base.
- If parsedBase is failure, throw a TypeError exception.
- Let parsedURL be the result of running the basic URL parser on url with parsedBase.
- If parsedURL is failure, throw a TypeError exception.
- Let result be a new URL object.
- Let result’s get the base return parsedBase.
- Run result’s set the input given the empty string and parsedURL.
- Return result.
To parse a string into a URL without using a base URL, invoke the constructor with a single argument:
var input = "https://example.org/💩",
    url = new URL(input)
url.pathname // "/%F0%9F%92%A9"
Alternatively you can use the base URL of a document through baseURI:
var input = "/💩",
    url = new URL(input, document.baseURI)
url.href // "https://url.spec.whatwg.org/%F0%9F%92%A9"
2.2. URL statics
The domainToASCII(domain) static method, when invoked, must run these steps:
- Let asciiDomain be the result of invoking host with domain as input.
- If asciiDomain matches ipv6addr or ipv4addr or failure, return the empty string.
- Return asciiDomain.
The domainToUnicode(domain) static method, when invoked, must run these steps:
- Let asciiDomain be the result of invoking domainToASCII with domain as input.
- Return the result of running domain to Unicode on asciiDomain.
Add domainToUI() which follows the UA conventions for when to use the Unicode representation?
2.3. URLUtils and URLUtilsReadOnly members
The URLUtils and URLUtilsReadOnly interfaces are not exposed on the global object. They are meant to augment other interfaces, such as URL.
The href attribute’s getter must run these steps:
- Return the serialization of url.
The href attribute’s setter must run these steps:
- Let input be the given value.
- If the context object is a URL object, run these substeps:
  - Let parsedURL be the result of running the basic URL parser on input with base URL being the result of running get the base.
  - If parsedURL is failure, throw a TypeError exception.
  - Run set the input given the empty string and parsedURL.
- Otherwise, run these substeps:
  - Run the set the input algorithm for input.
  - Run the pre-update steps with the input.
  This means that if the href attribute is set to a value that would cause the URL parser to return failure, that value is still passed through unchanged. This is one of those unfortunate legacy incidents.
  var a = document.createElement("a"),
      input = "https://test:test/" // invalid port makes the parser return failure
  a.href = input
  a.href === input // true
The origin attribute’s getter must run these steps:
- If url is null, return the empty string.
- Return the Unicode serialization of url’s origin. [HTML]
It returns the Unicode rather than the ASCII serialization for compatibility with HTML’s MessageEvent feature. [HTML]
The protocol attribute’s getter must run these steps:
The protocol attribute’s setter must run these steps:
- If url is null, terminate these steps.
- Invoke set-protocol on the input with url as url.
- Run the pre-update steps.
The username attribute’s getter must run these steps:
The username attribute’s setter must run these steps:
- If url is null, or its scheme data is set, terminate these steps.
- Set url’s username to the empty string.
- For each code point in username, utf-8 percent encode it using the username encode set, and append the result to url’s username.
- Run the pre-update steps.
The password attribute’s getter must run these steps:
The resolution of bug 27516 may remove the ability to access passwords from scripts.
The password attribute’s setter must run these steps:
- If url is null, or its scheme data is set, terminate these steps.
- If password is the empty string, set url’s password to null.
- Otherwise, run these substeps:
  - Set url’s password to the empty string.
  - For each code point in password, utf-8 percent encode it using the password encode set, and append the result to url’s password.
- Run the pre-update steps.
The host attribute’s getter must run these steps:
- If url is null, return the empty string.
- If port is the empty string, return host, serialized.
- Return host, serialized, ":", and port concatenated.
The host attribute’s setter must run these steps:
- If url is null, or its scheme data is set, terminate these steps.
- Invoke set-host on the input with url as url.
- Run the pre-update steps.
The hostname attribute’s getter must run these steps:
- If url is null, return the empty string.
- Return host, serialized.
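The difference between the host and hostname getters is easiest to see side by side. The sketch below assumes an environment that exposes a global URL constructor (current browsers and Node.js do); the variable names are illustrative only.

```javascript
// host includes the port when one is present; hostname never does.
const withPort = new URL("https://example.org:8080/path");
withPort.host;     // "example.org:8080"
withPort.hostname; // "example.org"

// With no explicit port the two getters return the same string.
const withoutPort = new URL("https://example.org/path");
withoutPort.host;     // "example.org"
withoutPort.hostname; // "example.org"

// An IPv6 host is serialized inside square brackets in both getters.
const ipv6 = new URL("http://[::1]:3000/");
ipv6.host;     // "[::1]:3000"
ipv6.hostname; // "[::1]"
```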
The hostname attribute’s setter must run these steps:
- If url is null, or its scheme data is set, terminate these steps.
- Invoke set-hostname on the input with url as url.
- Run the pre-update steps.
The port attribute’s getter must run these steps:
The port attribute’s setter must run these steps:
- If url is null, its scheme data is set, or its scheme is "file", terminate these steps.
- Otherwise, invoke set-port on the input with url as url.
- Run the pre-update steps.
The pathname attribute’s getter must run these steps:
- If url is null, return the empty string.
- If the scheme data is set, return scheme data.
- Return "/" concatenated with the strings in path (including empty strings), separated from each other by "/".
The pathname attribute’s setter must run these steps:
- If url is null, or its scheme data is set, terminate these steps.
- Empty path.
- Invoke set-pathname on the input with url as url.
- Run the pre-update steps.
The search attribute’s getter must run these steps:
- If url is null, or its query is either null or the empty string, return the empty string.
- Return "?" concatenated with query.
The search attribute’s setter must run these steps:
- If url is null, terminate these steps.
- If the given value is the empty string, set query to null, empty query object’s list, run its update steps, and terminate these steps.
- Let input be the given value with a single leading "?" removed, if any.
- Set query to the empty string.
- Invoke set-search on the input with url as url, and the associated query encoding as encoding override.
- Set query object’s list to the result of parsing input.
- Run query object’s update steps.
The update steps of query object are run to ensure all url objects remain synchronized.
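A short usage sketch of the search getter and setter, and of how they stay synchronized with the query object. It assumes a global URL constructor as found in current browsers and Node.js; the variable name is illustrative.

```javascript
const url = new URL("https://example.org/?a=1");

// The setter accepts the value with or without a leading "?".
url.search = "?b=2";
const afterSet = url.search;                // "?b=2"
const viaParams = url.searchParams.get("b"); // "2"

// Setting the empty string removes the query altogether.
url.search = "";
const afterClear = url.href;                // "https://example.org/"

// Changes made through the query object are reflected in search.
url.searchParams.append("q", "x y");
const afterAppend = url.search;             // "?q=x+y"
```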
The searchParams attribute’s getter must return the query object.
The searchParams attribute’s setter must run these steps:
- Let object be the given value.
- Remove the context object from query object’s list of url objects.
- Append the context object to object’s list of url objects.
- Set query object to object.
- Set query to the serialization of the query object’s list.
- Run the pre-update steps.
The hash attribute’s getter must run these steps:
- If url is null, or its fragment is either null or the empty string, return the empty string.
- Return "#" concatenated with fragment.
The hash attribute’s setter must run these steps:
- If url is null, or its scheme is "javascript", terminate these steps.
- If the given value is the empty string, set fragment to null, run the pre-update steps, and terminate these steps.
- Let input be the given value with a single leading "#" removed, if any.
- Set fragment to the empty string.
- Invoke set-hash on the input with url as url.
- Run the pre-update steps.
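The hash getter and setter behave analogously to search. A brief usage sketch, again assuming a global URL constructor as in current browsers and Node.js:

```javascript
const page = new URL("https://example.org/#top");
const initial = page.hash;       // "#top"

// A single leading "#" is optional when setting.
page.hash = "section-2";
const withoutMarker = page.href; // "https://example.org/#section-2"
page.hash = "#section-3";
const withMarker = page.hash;    // "#section-3"

// The empty string removes the fragment.
page.hash = "";
const cleared = page.href;       // "https://example.org/"

// An empty fragment is reported as the empty string, not "#".
const emptyFragment = new URL("https://example.org/#").hash; // ""
```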
2.4. Interface URLSearchParams
[Constructor(optional (USVString or URLSearchParams) init = ""),
 Exposed=(Window,Worker)]
interface URLSearchParams {
  void append(USVString name, USVString value);
  void delete(USVString name);
  USVString? get(USVString name);
  sequence<USVString> getAll(USVString name);
  boolean has(USVString name);
  void set(USVString name, USVString value);
  iterable<USVString, USVString>;
  stringifier;
};
A URLSearchParams object has an associated list of name-value pairs, which is initially empty.
A URLSearchParams object has an associated list of zero or more url objects, which is initially empty.
URLSearchParams objects always use utf-8 as encoding, despite the existence of concepts such as query encoding. This is to encourage developers to migrate towards utf-8, which they really ought to have done a long time ago now.
To create a new URLSearchParams object, optionally using init, run these steps:
- Let query be a new URLSearchParams object.
- If init is the empty string or null, return query.
- If init is a string, set query’s list to the result of parsing init.
- If init is a URLSearchParams object, set query’s list to a copy of init’s list.
- Return query.
A URLSearchParams object’s update steps are to run these steps for each associated url object urlObject, in order:
- Set urlObject’s url’s query to the serialization of URLSearchParams object’s list.
- Run urlObject’s pre-update steps.
The URLSearchParams(init) constructor, when invoked, must return a new URLSearchParams object using init if given.
The append(name, value) method, when invoked, must run these steps:
- Append a new name-value pair whose name is name and value is value, to list.
- Run the update steps.
The delete(name) method, when invoked, must run these steps:
- Remove all name-value pairs whose name is name from list.
- Run the update steps.
The get(name) method, when invoked, must return the value of the first name-value pair whose name is name in list, and null if there is no such pair.
The getAll(name) method, when invoked, must return the values of all name-value pairs whose name is name, in list, in list order, and the empty sequence otherwise.
The set(name, value) method, when invoked, must run these steps:
- If there are any name-value pairs whose name is name, in list, set the value of the first such name-value pair to value and remove the others.
- Otherwise, append a new name-value pair whose name is name and value is value, to list.
- Run the update steps.
The has(name) method, when invoked, must return true if there is a name-value pair whose name is name in list, and false otherwise.
The value pairs to iterate over are the name-value pairs of list, with the key being the name and the value being the value.
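The methods above can be exercised as follows. This is a usage sketch that assumes a global URLSearchParams constructor, as provided by current browsers and Node.js; the variable names are illustrative.

```javascript
const params = new URLSearchParams("a=1&b=2&a=3");

const first = params.get("a");      // "1" (first matching pair)
const all = params.getAll("a");     // ["1", "3"] (list order)
const missing = params.get("c");    // null
const hasC = params.has("c");       // false

params.append("c", "4");            // adds a pair at the end
params.set("a", "9");               // updates the first "a", removes the others
const afterSet = params.toString(); // "a=9&b=2&c=4"

params.delete("b");                 // removes every pair named "b"
const pairs = [...params];          // [["a", "9"], ["c", "4"]]
```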
The stringifier must return the serialization of the URLSearchParams object’s list.
2.5. URL APIs elsewhere
A standard that exposes URLs should expose the URL as a string (by serializing an internal URL). A standard should not expose a URL using a URL object. URL objects are meant for URL manipulation. In IDL the USVString type should be used.
The higher-level notion here is that values are to be exposed as immutable data structures.
If a standard decides to use a variant of the name "URL" for a feature it defines, it should name such a feature "url" (i.e. lowercase and with an "l" at the end). Names such as "URL", "URI", and "IRI" should not be used. However, if the name is a compound, "URL" (i.e. uppercase) is preferred, e.g. "newURL" and "oldURL".
The EventSource and HashChangeEvent interfaces in HTML are examples of proper naming. [HTML]
3. application/x-www-form-urlencoded
The application/x-www-form-urlencoded format is a simple way to encode name-value pairs in a byte sequence where all bytes are in the 0x00 to 0x7F range.
While this description makes application/x-www-form-urlencoded sound dated (and really, it is), the format is in widespread use due to the prevalence of HTML forms. [HTML]
3.1. application/x-www-form-urlencoded parsing
The features provided by the application/x-www-form-urlencoded parser are mainly relevant for server-oriented implementations. A browser-based implementation only needs what the application/x-www-form-urlencoded string parser requires.
The application/x-www-form-urlencoded parser takes a byte sequence input, optionally with an encoding encoding override, optionally with a use _charset_ flag, and optionally with an isindex flag, and then runs these steps:
- If encoding override is not given, set it to utf-8.
- If encoding override is not utf-8 and input contains bytes whose value is greater than 0x7F, return failure.
  This can only happen if input was not generated through the serializer or URLSearchParams.
- Let sequences be the result of splitting input on `&`.
- If the isindex flag is set and the first byte sequence in sequences does not contain a `=`, prepend `=` to the first byte sequence in sequences.
- Let pairs be an empty list of name-value pairs where both name and value hold a byte sequence.
- For each byte sequence bytes in sequences, run these substeps:
  - If bytes is the empty byte sequence, run these substeps for the next byte sequence.
  - If bytes contains a `=`, then let name be the bytes from the start of bytes up to but excluding its first `=`, and let value be the bytes, if any, after the first `=` up to the end of bytes. If `=` is the first byte, then name will be the empty byte sequence. If it is the last, then value will be the empty byte sequence.
  - Otherwise, let name have the value of bytes and let value be the empty byte sequence.
  - Replace any `+` in name and value with 0x20.
  - If use _charset_ flag is set and name is `_charset_`, run these substeps:
    - Let result be the result of getting an encoding for value, decoded.
    - If result is not failure, unset use _charset_ flag and set encoding override to result.
  - Add a pair consisting of name and value to pairs.
- Let output be an empty list of name-value pairs where both name and value hold a string.
- For each name-value pair in pairs, append a name-value pair to output where the new name and value appended to output are the result of running encoding override’s decoder on the percent decoding of the name and value from pairs, respectively.
- Return output.
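A minimal JavaScript sketch of the parser follows, restricted to the common case: utf-8 only, with the encoding override, use _charset_ flag, and isindex flag all left out. The function names (percentDecode, parseUrlencoded, parseUrlencodedString) are hypothetical, input is assumed to be a Uint8Array, and the sketch relies on the TextEncoder and TextDecoder globals available in current browsers and Node.js.

```javascript
const isHexDigit = (b) =>
  (b >= 0x30 && b <= 0x39) || (b >= 0x41 && b <= 0x46) || (b >= 0x61 && b <= 0x66);

// Turns each "%XX" triplet into the byte it names; other bytes pass through.
function percentDecode(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === 0x25 && i + 2 < bytes.length && isHexDigit(bytes[i + 1]) && isHexDigit(bytes[i + 2])) {
      out.push(parseInt(String.fromCharCode(bytes[i + 1], bytes[i + 2]), 16));
      i += 2;
    } else {
      out.push(bytes[i]);
    }
  }
  return Uint8Array.from(out);
}

// Sketch of the parser for utf-8 input without the optional flags.
function parseUrlencoded(input) {
  const decoder = new TextDecoder("utf-8");
  const plusToSpace = (seq) => seq.map((b) => (b === 0x2b ? 0x20 : b));
  const output = [];
  let start = 0;
  for (let i = 0; i <= input.length; i++) {
    if (i < input.length && input[i] !== 0x26) continue; // split on `&`
    const bytes = input.subarray(start, i);
    start = i + 1;
    if (bytes.length === 0) continue; // skip empty sequences
    const eq = bytes.indexOf(0x3d);   // first `=`
    const name = eq === -1 ? bytes : bytes.subarray(0, eq);
    const value = eq === -1 ? new Uint8Array(0) : bytes.subarray(eq + 1);
    // `+` is replaced before percent decoding, so "%2B" still yields "+".
    output.push([name, value].map((seq) => decoder.decode(percentDecode(plusToSpace(seq)))));
  }
  return output;
}

// The string parser hook: utf-8 encode, then parse.
const parseUrlencodedString = (string) => parseUrlencoded(new TextEncoder().encode(string));

parseUrlencodedString("a=1&b=x+y&&c&d=%F0%9F%92%A9");
// [["a", "1"], ["b", "x y"], ["c", ""], ["d", "💩"]]
```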
3.2. application/x-www-form-urlencoded serializing
The application/x-www-form-urlencoded byte serializer takes a byte sequence input and then runs these steps:
- Let output be the empty string.
- For each byte in input, depending on byte:
  - 0x20
    Append U+002B to output.
  - 0x2A
  - 0x2D
  - 0x2E
  - 0x30 to 0x39
  - 0x41 to 0x5A
  - 0x5F
  - 0x61 to 0x7A
    Append a code point whose value is byte to output.
  - Otherwise
    Append byte, percent encoded, to output.
- Return output.
The application/x-www-form-urlencoded serializer takes a list of name-value pairs pairs, optionally with an encoding encoding override, and then runs these steps:
- If encoding override is not given, set it to utf-8.
- Let output be the empty string.
- For each pair in pairs, run these substeps:
  - Let outputPair be a copy of pair.
  - Replace outputPair’s name and value with the result of running encode on them using encoding override, respectively.
  - Replace outputPair’s name and value with their serialization.
  - If pair is not the first pair in pairs, append "&" to output.
  - Append outputPair’s name, followed by "=", followed by outputPair’s value to output.
- Return output.
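The byte serializer and the serializer can be sketched together in JavaScript. The sketch assumes utf-8 (no encoding override), models pairs as an array of [name, value] string arrays, writes percent-encoded bytes with uppercase hex digits, and uses the hypothetical names serializeBytes and serializeUrlencoded together with the TextEncoder global of current browsers and Node.js.

```javascript
// Byte serializer: space becomes "+", a small safe set is copied verbatim,
// and every other byte is percent encoded.
function serializeBytes(input) {
  let output = "";
  for (const byte of input) {
    if (byte === 0x20) {
      output += "+";
    } else if (
      byte === 0x2a || byte === 0x2d || byte === 0x2e || byte === 0x5f ||
      (byte >= 0x30 && byte <= 0x39) ||
      (byte >= 0x41 && byte <= 0x5a) ||
      (byte >= 0x61 && byte <= 0x7a)
    ) {
      output += String.fromCharCode(byte);
    } else {
      output += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return output;
}

// Serializer for a list of [name, value] pairs, utf-8 only.
function serializeUrlencoded(pairs) {
  const encoder = new TextEncoder();
  return pairs
    .map(([name, value]) => serializeBytes(encoder.encode(name)) + "=" + serializeBytes(encoder.encode(value)))
    .join("&");
}

serializeUrlencoded([["a b", "1"], ["c", "💩"], ["d", "*-._~"]]);
// "a+b=1&c=%F0%9F%92%A9&d=*-._%7E"
```

Note that "~" (0x7E) is not in the verbatim set, so it is percent encoded.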
3.3. Hooks
The application/x-www-form-urlencoded string parser takes a string input, utf-8 encodes it, and then returns the result of application/x-www-form-urlencoded parsing it.
4. Terminology
Some terms used in this specification are defined in the DOM, Encoding, IDNA, and Web IDL Standards. [DOM] [ENCODING] [IDNA] [WEBIDL]
The ASCII digits are code points in the range U+0030 to U+0039.
The ASCII hex digits are ASCII digits or are code points in the range U+0041 to U+0046 or in the range U+0061 to U+0066.
The ASCII alpha are code points in the range U+0041 to U+005A or in the range U+0061 to U+007A.
The ASCII alphanumeric are ASCII digits or ASCII alpha.
A percent-encoded byte is "%", followed by two ASCII hex digits. Sequences of percent-encoded bytes, after conversion to bytes, should not cause a utf-8 decoder to run into any errors.
The simple encode set are all code points less than U+0020 (i.e. excluding U+0020) and all code points greater than U+007E.
The default encode set is the simple encode set and code points U+0020, '"', "#", "<", ">", "?", and "`".
The password encode set is the default encode set and code points "/", "@", and "\".
The username encode set is the password encode set and code point ":".
The query encode set is defined to be code points that are less than U+0021, greater than U+007E, or one of U+0022, U+0023, U+003C, U+003E, and U+0060.
A host is a network address in the form of a domain, an IPv4 address, or an IPv6 address.
A domain identifies a realm within a network.
An IPv4 address is a 32-bit identifier and for the purposes of this specification represented as an ordered list of four 8-bit pieces. [RFC791]
An IPv6 address is a 128-bit identifier and for the purposes of this specification is either represented as an ordered list of eight 16-bit pieces, or as a list of seven 16-bit pieces followed by a single 32-bit piece. [RFC4291]
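The encode sets defined in this section nest inside one another, which a few JavaScript predicates make explicit. The sketch below also shows one plausible way to utf-8 percent encode a string using such a set. The predicate and function names are hypothetical, percent-encoded bytes are written with uppercase hex digits, and the code relies on the TextEncoder global of current browsers and Node.js.

```javascript
// Each predicate answers: is this code point in the encode set?
const inSimpleSet = (cp) => cp < 0x20 || cp > 0x7e;
const inDefaultSet = (cp) =>
  inSimpleSet(cp) || [0x20, 0x22, 0x23, 0x3c, 0x3e, 0x3f, 0x60].includes(cp);
const inPasswordSet = (cp) => inDefaultSet(cp) || [0x2f, 0x40, 0x5c].includes(cp);
const inUsernameSet = (cp) => inPasswordSet(cp) || cp === 0x3a;

// Percent encodes the utf-8 bytes of every code point that is in the set
// and copies all other code points unchanged.
function utf8PercentEncode(string, inEncodeSet) {
  const encoder = new TextEncoder();
  let output = "";
  for (const char of string) { // iterates by code point
    if (!inEncodeSet(char.codePointAt(0))) {
      output += char;
      continue;
    }
    for (const byte of encoder.encode(char)) {
      output += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return output;
}

utf8PercentEncode("user:name@x", inUsernameSet); // "user%3Aname%40x"
utf8PercentEncode("p:w/d", inPasswordSet);       // "p:w%2Fd"
utf8PercentEncode("a b💩", inDefaultSet);        // "a%20b%F0%9F%92%A9"
```

The only difference between the username and password results is ":", which the username encode set adds.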
Acknowledgments
There have been a lot of people that have helped make URLs more interoperable over the years and thereby furthered the goals of this standard. Likewise many people have helped make this standard what it is today.
With that, many thanks to Adam Barth, Albert Wiersch, Alexandre Morgaut, Arkadiusz Michalski, Behnam Esfahbod, Bobby Holley, Boris Zbarsky, Brandon Ross, Dan Appelquist, Daniel Bratell, David Håsäther, David Sheets, David Singer, Erik Arvidsson, Gavin Carothers, Geoff Richards, Glenn Maynard, Henri Sivonen, Ian Hickson, James Graham, James Manger, James Ross, Joshua Bell, Kevin Grandon, Larry Masinter, Mark Davis, Marcos Cáceres, Martin Dürst, Mathias Bynens, Michael Peick, Michael™ Smith, Michel Suignard, Peter Occil, Rodney Rehm, Roy Fielding, Santiago M. Mola, Simon Pieters, Simon Sapin, Tab Atkins, Tantek Çelik, Tim Berners-Lee, Vyacheslav Matva, and 成瀬ゆい (Yui Naruse) for being awesome!
This standard is written by Anne van Kesteren (Mozilla, annevk@annevk.nl) and Sam Ruby (IBM, rubys@intertwingly.net).
Per CC0, to the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work.