It’s just data

Character Encoding and HTML Forms

Joe Gregorio: lacking any other indications, a browser will submit the data from a form using the same character encoding that the page is served in.

This mind-blowing statement was embedded in an otherwise interesting article on Atom and Wikis.  It has caused me to rethink how I serve pages on my weblog, and to begin the switch to utf-8.  Here's why:

Meanwhile, I've been receiving a lot of good input on my i18n survival guide; once the dust settles, this information will be factored in.
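
To make Joe Gregorio's observation concrete, here is a minimal Python sketch (not part of the original post) of what the same form value looks like on the wire under two page encodings. The field name `comment` and the sample text are made up for illustration, and `urlencode` merely stands in for what a browser does when it serializes a form.

```python
from urllib.parse import urlencode, parse_qs

text = "café"  # hypothetical form input

# The same value, serialized as a UTF-8 page and as an ISO-8859-1 page would send it.
as_utf8 = urlencode({"comment": text}, encoding="utf-8")
as_latin1 = urlencode({"comment": text}, encoding="iso-8859-1")

print(as_utf8)    # comment=caf%C3%A9
print(as_latin1)  # comment=caf%E9

# The server has to decode with the encoding the browser actually used.
right = parse_qs(as_latin1, encoding="iso-8859-1")["comment"][0]
wrong = parse_qs(as_latin1, encoding="utf-8", errors="replace")["comment"][0]

print(right)        # café
print(repr(wrong))  # 'caf\ufffd' (the é is lost to the replacement character)
```

Nothing in the request body itself says which of the two encodings was used, which is why the encoding the page is served in matters so much.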

The best way to control how the web browser will send back data is to use the accept-charset attribute on the <form> element.  Without that attribute, all kinds of weird things can happen (e.g., if the user forces the browser to use a non-default character encoding to display the page, the form might get submitted in that encoding).

Another nice thing is that if you set accept-charset="UTF-8", Internet Explorer will send back curly quotes with the correct Unicode values rather than as Latin-1 control characters (which it will do if you ask for ISO-8859-1).

Posted by James at
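
James's point about curly quotes arriving as Latin-1 control characters can be reproduced in a few lines of Python. This is only an illustrative sketch: it assumes the browser really sent Windows-1252 bytes while the server decoded them as ISO-8859-1, and the variable names are invented here.

```python
import unicodedata

left_quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK

# Windows-1252 places the curly quote at byte 0x93.
raw = left_quote.encode("cp1252")
print(raw)  # b'\x93'

# Decoded as ISO-8859-1, that byte becomes the C1 control character U+0093.
misread = raw.decode("iso-8859-1")
print(hex(ord(misread)))               # 0x93
print(unicodedata.category(misread))   # Cc

# Sent as UTF-8, the quote keeps its real code point.
utf8 = left_quote.encode("utf-8")
print(utf8)  # b'\xe2\x80\x9c'

# One possible server-side repair for text mislabelled as ISO-8859-1.
repaired = misread.encode("iso-8859-1").decode("cp1252")
```

The repair on the last line only works if every byte in the input is defined in Windows-1252; a few byte values (0x81, for example) are not.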

That assumes you want fancy curly quotes and the like.  In some cases, for instance if you want to send email, you're going to want to convert your utf-8 data to iso-8859-1 or whatever the common encoding is for email in your language, and you'll have to make sure that your utf-8->iso-8859-1 recoder can handle those characters.  Some recoders will replace such characters with ?, some will just fail to convert the whole thing, etc.

The main characters to look out for are the various hyphens and dashes, the various special spaces, the ellipsis, and of course the fancy single and double quotes (including the lower quotes).

You can see these in utf-8 forms even without the accept-charset if people do things like copy & paste text from Word into a form in IE.

Posted by Brandon at
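
The recoder behaviours Brandon describes map onto Python's codec error handlers, so they are easy to demonstrate. The sketch below is an assumption-laden illustration, not anyone's production recoder: the `FALLBACKS` table and the `to_latin1` helper are hypothetical, and the ASCII substitutes chosen are just one reasonable option.

```python
text = "He said \u201cwait\u2026\u201d \u2013 then left."

# A strict recoder fails on the whole string.
try:
    text.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A lenient recoder swaps each unencodable character for "?".
lossy = text.encode("iso-8859-1", errors="replace")
print(lossy)  # b'He said ?wait?? ? then left.'

# Hypothetical ASCII fallbacks for the usual trouble-makers.
FALLBACKS = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": ",",    # single quotes, low quote
    "\u201c": '"', "\u201d": '"', "\u201e": '"',    # double quotes, low quote
    "\u2010": "-", "\u2013": "-", "\u2014": "--",   # hyphen, en dash, em dash
    "\u2026": "...",                                # ellipsis
    "\u2002": " ", "\u2003": " ", "\u2009": " ",    # en, em and thin spaces
})

def to_latin1(value: str) -> bytes:
    """Substitute known characters first, then replace whatever is left."""
    return value.translate(FALLBACKS).encode("iso-8859-1", errors="replace")

print(to_latin1(text))  # b'He said "wait..." - then left.'
```

Substituting before encoding keeps the text readable, while `errors="replace"` remains as a safety net for anything the table does not cover.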

In Mozilla, if you declare iso-8859-1, characters not encodable in that scheme are silently converted to Windows-1252 if they are encodable there, and converted to NCRs (numeric character references) only if they aren't in 1252 either. Apparently that evil behavior made sense for some situation at some time.

As to accept-charset, it allows a space-separated list of options, though apparently only Opera actually tells the server which one it used. Unless, that is, this old 1999 bug still describes the current situation, and adding a hidden form field with the name "{underscore}charset{underscore}" really will cause both Moz and IE to populate it with the charset they are actually using. Now that would be a useful, and incredibly hidden, thing: to actually know what you are getting.

(Meta: bloody wiki-like markup. What about people who need to say {underscore}word{underscore} but are too lazy to look up the entity for the underscore character?)
(Mo-meta: preview is rather newline-happy: after one preview, a blank line between paragraphs becomes three blank lines, after another preview, it's up to seven blank lines.)

Posted by Phil Ringnalda at
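
If the hidden charset field Phil mentions does get populated by the browser, the server could use it along these lines. This is a rough Python sketch under that assumption; `decode_form`, the `ALLOWED` set, and the `comment` field are all hypothetical names, and the body is assumed to be an ordinary urlencoded (and therefore ASCII) form submission.

```python
from urllib.parse import parse_qs

# Encodings this sketch is willing to trust; anything else falls back to the default.
ALLOWED = {"utf-8", "iso-8859-1", "windows-1252"}

def decode_form(body: bytes, default: str = "utf-8") -> dict:
    """Decode a urlencoded body using the encoding named in _charset_, if present."""
    raw = body.decode("ascii")  # percent-encoded bodies should be pure ASCII
    # First pass as ISO-8859-1 never fails and is enough to read the ASCII label.
    first_pass = parse_qs(raw, encoding="iso-8859-1")
    charset = first_pass.get("_charset_", [default])[0].lower()
    if charset not in ALLOWED:
        charset = default
    return parse_qs(raw, encoding=charset, errors="replace")

form = decode_form(b"_charset_=windows-1252&comment=%93hi%94")
print(form["comment"][0])  # “hi”
```

The allow-list matters because the field is user-controlled input like any other; an unknown or malicious label should not be passed straight to the decoder.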

<cite>If you declare iso-8859-1 (a common encoding covering western Europe and Latin countries)</cite>

That's not correct. iso-8859-1, or Latin1, does cover the Western European languages, but not Latin countries, whatever you call that. Romanian, like Polish, Hungarian, Turkish and other non-Western but Latin-alphabet languages, is not well supported by Latin1.

Posted by Gabriel Radic at

James, apparently IE only respects accept-charset in some rather limited circumstances.

Brandon, here is a list of problem characters.

Phil, I first saw NCRs in Moz when I tried some of the original extended ASCII characters such as ♥.  With the configuration I had tried, IE would send the characters as single bytes.  I'm not sure which behavior I like least.  Oh, and I've fixed the whitespace problem, but the underscore problem is more problematic.  Oh, and don't try entering &#95; or &#x5F; as I will simply escape those.

Gabriel, thanks.  I'll try to be more precise in the future.

Posted by Sam Ruby at
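
The NCR fallback that Phil describes and Sam observed can be approximated with Python's `xmlcharrefreplace` error handler. This is a sketch of the described behavior, not Mozilla's actual code, and the sample string is invented for the purpose.

```python
import html

typed = "I \u2665 \u201cquotes\u201d"  # a heart plus curly quotes

# Windows-1252 where possible, numeric character references for the rest.
sent = typed.encode("cp1252", errors="xmlcharrefreplace")
print(sent)  # b'I &#9829; \x93quotes\x94'

# Server side: decode as Windows-1252, then resolve the references.
received = html.unescape(sent.decode("cp1252"))
print(received == typed)  # True
```

The catch is that the server cannot tell a browser-generated reference from a user who literally typed the seven characters of a reference, which is one reason serving and accepting UTF-8 avoids the problem altogether.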


It never hurts to come back here from time to time; even less so if the reason is a certain movement in...... [more]

Trackback from kusor dhtml weblog


Form Layout Variations

Once again I have come across two pages that deal with the layout of web forms: the Form Layout experiments of the Man in Blue as we... [more]

Trackback from Timo Gnambs


On my plate

On my plate, in my browser tabs: /~distler/blog/files/ Musings: MTStripControlChars Sketchbook: m[iA]cro: On NoHTMLEntities and application/xhtml+xml Sam Ruby: Character Encoding......

Excerpt from phil ringnalda dot com at

Unicode and weblogs

Hossein Derakhshan:  Spread the meme  Please test your clients, servers, comments, and feeds. Hossein Derakhshan:  I'm doing my part.  It took only a few lines of code for me to convert my weblog over to utf-8 (plus changing the content type in a few... [more]

Trackback from Sam Ruby


Suddenly, it all makes sense...

I've been struggling with the problem of encoding and HTML forms for quite some time now. I should have read this post on Sam Ruby's site, who also wrote the i18n survival guide. It's really simple once you set the...... [more]

Trackback from Notes from my terminal


ⓐⓢⓒⓘⓘ ⓢⓣⓤⓟⓘⓓ ⓠⓤⓔⓢⓣⓘⓞⓝ ⒢⒠⒯ ⒜ ⒮⒯⒰⒫⒤⒟ ⒜⒩⒮⒤. ;-)

Posted by Ned at

Google, Atom, SixApart, and Longhorn

Thanks, the link has been updated. As for the problems with the comment script please accept my apologies. I know how to fix the problem, but I am actually spending my time on migrating from Bulu to pyblosxom....

Excerpt from Google, Atom, SixApart, and Longhorn at

Anne van Kesteren : Character Encoding and HTML Forms - Important, read it please...

Excerpt from HotLinks - Level 1 at

Sam Ruby: Character Encoding and HTML Forms

Sam Ruby- Character Encoding and HTML this site | 2 links...

Excerpt from blogdex - the weblog diffusion index at

Sam Ruby: Character Encoding and HTML Forms


Excerpt from at

Sam Ruby: Character Encoding and HTML Forms


Excerpt from Linklog at

Sam Ruby: Character Encoding and HTML Forms

Tags: howto, html, tip, character, entity, encode, forms, unicode, i18n, utf-8, imported...

Excerpt from Ma.gnolia: Recent Bookmarks Tagged With "entities" or Similar at

Encoding in forms

HTML forms let you determine the encoding of the submitted data by means of the accept-charset attribute. This attribute, as the specification says, lets you declare a list of permitted encodings (a list separated by...

Excerpt from a.css at

HTM Forms and Character Encoding Detection

Basically, not much has changed with HTML form processing and input character encoding detection in the last 10 years. It’s still a mess. Some browsers are consistent in using the form data encoding that matches the one in the meta tag within...

Excerpt from James' World at

Character encoding of form data

Problem: the data of a form embedded in an ISO-8859-1 encoded page is to be sent UTF-8 encoded to a (different) URL. As a rule, browsers always submit their form data in the encoding in which...

Excerpt from michi bloggt! (das spannende Leben eines Technischen Direktors) at

Thanks everyone. This page has been incredibly useful.

Posted by anonymous at

Why you should absolutely go utf8 with your web pages

Why you should absolutely go utf8 with your web pages...

Excerpt from ab·surd·li at

Thanks a lot! accept-charset="ISO-8859-1" helped!

Posted by Kirill at

Sam Ruby: Character Encoding and HTML Forms


Excerpt from Delicious/andreeagheorghe at

HTML encoding of form inputs

I suppose this is common knowledge amongst professional web developers but I just discovered myself that if a user enters characters into a HTML form input that is not representable in the character set of the page the form is in, browsers will...

Excerpt from The Other Kelly Yancey at
