Getting weird characters like Â or â€™ where a space or an apostrophe should be? Most likely there is a character set problem. It can occur when MySQL and PHP are upgraded, when data has been stored incorrectly, or when the application sends an incorrect (or missing) character set to the browser. PHP’s core string functions still work on bytes rather than UTF-8 characters (the Unicode-aware PHP 6 that was meant to change this was never released), so multibyte text needs a little care.
The Short of it…
1. Don’t use mysql_real_escape_string(). Be careful using strlen(): it counts bytes, not characters.
2. Send a UTF-8 header from PHP before you send any of the page’s content:
header("Content-type: text/html; charset=utf-8");
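3. As soon as you connect to MySQL, set the connection’s encoding to UTF-8, which is often necessary in PHP/MySQL apps (on PHP 7 and later, where the mysql_* functions no longer exist, the mysqli equivalent is $mysqli->set_charset('utf8')):
mysql_query("SET NAMES 'utf8'");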
4. To be absolutely safe, also put this meta tag in the <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
5. Good luck!
The long of it… Why is it happening?
First, it’s handy to know a bit about UTF-8. Skip this if you’re already familiar.
UTF-8 uses one to four 8-bit bytes to store a single character, unlike ASCII and friends, which use only one byte per character. It is more space-efficient than its cousins (UTF-16, UTF-32) when the majority of the characters can be encoded as a single byte, as is the case with most English text, with the added benefit that you can still store any character under the sun should you need to. The most significant bits of each byte act as markers: the first byte of a multibyte character signals how many bytes follow, and each following byte is flagged as a continuation of the same character. It is for this reason that improperly displayed UTF-8 results in weird characters: software that assumes one byte per character shows each of those bytes as a separate symbol.
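To see those marker bits for yourself, here is a minimal sketch that prints each byte of a string in binary. The helper name utf8_bits() is made up for this illustration, and the characters are written as hex escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper: return the bits of each byte in $text, space-separated.
function utf8_bits(string $text): string
{
    $bits = [];
    foreach (str_split($text) as $byte) {
        // ord() gives the byte value, decbin() its binary form, padded to 8 bits.
        $bits[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return implode(' ', $bits);
}

echo utf8_bits("A"), "\n";            // 01000001
echo utf8_bits("\xC3\xA9"), "\n";     // é: 11000011 10101001
echo utf8_bits("\xE2\x82\xAC"), "\n"; // €: 11100010 10000010 10101100
```

A single-byte character starts with 0, the first byte of a two-byte or three-byte character starts with 110 or 1110, and every byte after the first starts with 10.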
UTF-8 is backwards-compatible with ASCII — all characters up to 127 are identical in both encodings. This at least makes English text legible if the UTF-8 is interpreted incorrectly as ASCII or ISO 8859 character sets. However, it’s these incorrect interpretations that cause the odd characters to appear.
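You can reproduce the odd characters on purpose. The sketch below assumes the mbstring extension is available and uses a made-up helper, misread_as_cp1252(), to treat UTF-8 bytes as if they were Windows-1252, which is roughly what a browser does when it assumes a single-byte Western encoding.

```php
<?php
// Hypothetical helper: reinterpret the bytes of $utf8 as Windows-1252 text
// and return the result as UTF-8 so it can be printed.
function misread_as_cp1252(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
}

$apostrophe = "\xE2\x80\x99"; // ’ (U+2019) encoded as UTF-8
$nbsp       = "\xC2\xA0";     // non-breaking space (U+00A0) encoded as UTF-8

echo misread_as_cp1252($apostrophe), "\n"; // â€™
echo misread_as_cp1252($nbsp), "\n";       // Â followed by a non-breaking space
```

Each byte of the original character is shown as its own symbol, which is where â€™ and the stray Â come from.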
Unfortunately, PHP’s numerous string handling functions don’t support UTF-8 natively (PHP 6 was meant to add this but was never released), but that doesn’t mean you can’t work with it; you just have to be a bit careful. Let’s take strlen() as an example: with plain ASCII text, strlen() returns the number of characters in a string. It does this by counting the number of bytes used to hold the data. It doesn’t know about (and cannot detect) UTF-8 and will blindly count the number of bytes, not the actual number of characters. Hence, the presence of any multibyte characters in your string will give you an incorrect length. The mbstring extension’s mb_strlen() counts characters instead.
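A quick way to see the difference, assuming the mbstring extension is installed (the variable name $word is just for illustration):

```php
<?php
$word = "caf\xC3\xA9"; // "café" encoded as UTF-8; the é takes two bytes

echo strlen($word), "\n";             // 5, because strlen() counts bytes
echo mb_strlen($word, 'UTF-8'), "\n"; // 4, because mb_strlen() counts characters

// The same trap applies to cutting strings: substr() can slice a character in half.
echo bin2hex(substr($word, 0, 4)), "\n";             // 636166c3 (é cut after its first byte)
echo bin2hex(mb_substr($word, 0, 4, 'UTF-8')), "\n"; // 636166c3a9 (whole word intact)
```

Passing the encoding explicitly to the mb_* functions avoids relying on whatever default the server happens to be configured with.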
A problem you will inevitably face is when a user takes advantage of another application to create some text which gets pasted into your HTML form and submitted. Microsoft Word, for example, uses Unicode internally and converts characters like quotes and dashes into “smart quotes” and em- and en-dashes automatically. These are typographically correct, but the symbols lie outside the ASCII character set so when copied and pasted, the text is sent as UTF-8 and you end up with multibyte characters all over the place. If you store this text and later send it back to a browser without informing it that you are sending UTF-8, extra characters will appear.