HTML discovery: SGML entities and charsets
Drummond Reed
drummond.reed at cordance.net
Thu May 24 00:10:18 UTC 2007
>Peter Watkins wrote:
>
<snip>
>
>My concrete suggestion: replace the current language
>
>Other characters that would not be valid in the HTML document or that
>cannot be represented in the document's character encoding MUST be escaped
>using the percent-encoding (%xx) mechanism described in [RFC3986].
>
>with this:
>
>Any character in the href attributes MAY be represented as UTF-8 data
>escaped using the percent-encoding (%xx) mechanism described in [RFC3986].
>Characters with Unicode values greater than u007E MUST be represented as
>UTF-8 data escaped using the percent-encoding (%xx) mechanism described in
>[RFC3986]. For instance, the character "ä" (umlaut a, Unicode u00E4) MUST be
>represented as a six-character string like "%C3%A4" as suggested by RFC2718.
Peter, I agree UTF-8 encoding before percent-encoding must be specified, as
otherwise you don't know how to interpret the percent-encoded characters.
However, since RFC 3987 (the IRI spec) already specifies UTF-8 encoding
before percent-encoding, couldn't we just specify it by reference to both
RFC 3986 and 3987, e.g.:
Any character in the href attributes MUST be a valid URI character as
specified by [RFC3986]. If any character outside the valid URI character set
is included, it MUST be encoded using the percent-encoding (%xx) mechanism
defined in section 2.1 of [RFC3986] after first being UTF-8 encoded as
specified in [RFC3987]. For instance, the character "ä" (umlaut a, Unicode
u00E4) MUST be represented as a six-character string like "%C3%A4" as
suggested by RFC2718.
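
To illustrate why the UTF-8 step has to be pinned down, here is a minimal sketch in Python. It is not part of the proposed spec text; the helper name utf8_percent_encode is hypothetical, and it relies only on the standard library's urllib.parse.quote and unquote. The same character produces different percent-encoded strings depending on which character encoding is applied first, so a consumer cannot decode "%xx" sequences reliably unless the encoding is specified.

```python
from urllib.parse import quote, unquote


def utf8_percent_encode(text: str) -> str:
    """UTF-8 encode the text, then percent-encode every byte that is not
    an RFC 3986 unreserved character (hypothetical helper for illustration)."""
    return quote(text, safe="", encoding="utf-8")


# "ä" (U+00E4) becomes two UTF-8 bytes, hence a six-character string.
encoded = utf8_percent_encode("ä")
print(encoded)                             # %C3%A4

# Decoding with the same encoding recovers the original character.
print(unquote(encoded, encoding="utf-8"))  # ä

# Without an agreed encoding the result is ambiguous: the same character
# percent-encoded from Latin-1 bytes is a different string.
print(quote("ä", safe="", encoding="latin-1"))  # %E4
```

Passing safe="" here escapes reserved characters such as "/" as well, which is appropriate for a single character or path segment but not for a complete href; a real implementation would presumably leave the URI delimiters intact.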
=Drummond