Everything You Need to Know About Character Encoding

    In addition to DTDs, one of the first things you need to know about making web pages is why character encoding matters & how to use it.

    Purpose

    Setting the character encoding tells web browsers how to turn the bytes in your file into characters, and therefore which writing systems and characters can appear correctly on the webpage.
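
    To see why this matters, here is a minimal sketch in Python (used only for illustration; it is not part of the class materials). It takes one set of bytes and decodes it twice, once with the encoding it was saved in and once with a wrong guess. The sample word is an arbitrary choice.

```python
# Encode some text to bytes, then decode those bytes two different ways.
text = "café"
data = text.encode("utf-8")            # the bytes as they would be stored in the file

as_utf8 = data.decode("utf-8")         # reader assumes the right encoding
as_latin1 = data.decode("iso-8859-1")  # reader assumes the wrong encoding

print(as_utf8)    # café
print(as_latin1)  # cafÃ©
```

    The bytes never changed; only the assumed encoding did. That garbled second result is what visitors see when a browser guesses wrong.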

    Some Character Encodings

    There are lots of different character encodings you could potentially use on your webpages. In this section, I’ll look at the biggies you should know.

    US-ASCII

    Around since the 1960s (the first edition of the standard was published in 1963), the American Standard Code for Information Interchange (ASCII, pronounced askee) is based on the English alphabet, along with digits, punctuation, and control codes, giving a total of 128 characters.

    The following figure shows the 128 characters found in ASCII:

    ASCII Character Set

    ASCII doesn’t provide for any special characters—like the Euro (€), anything that’s not English, or any formatting (nothing bold or italic)—so it’s often called plain text.

    Needless to say, don’t use ASCII as your character encoding—it’s way too limited!
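
    A quick way to feel how limited ASCII is: the sketch below, in Python, checks whether a piece of text can be written using only ASCII's 128 characters. The helper name fits_in_ascii and the sample strings are made up for this illustration.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text is one of ASCII's 128."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character has no ASCII code.
        return False
    return True


for sample in ["Hello, world!", "naïve", "€5"]:
    print(repr(sample), fits_in_ascii(sample))
```

    Only the first sample passes; a single accented letter or a Euro sign is enough to fail.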

    ISO-8859-1

    ISO-8859-1 is a standardized character encoding.

    The ISO part refers to the International Organization for Standardization, which publishes international standards in many fields.

    8859-1 is the number of the ISO standard (in this case, for a particular character encoding).

    ISO-8859-1 is also known as Latin-1.

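    ISO-8859-1 is a common character encoding on the Web. It contains the 128 ASCII characters plus additional characters used in Western European languages, such as accented letters.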

    You can see those additional characters in the figure below:

    ISO-8859-1 Character Set

    ISO-8859-1 used to be the recommended character encoding for webpages, but that time is long gone. Instead, use UTF-8, discussed next.
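
    One concrete limitation, sketched in Python under the assumption that Python's built-in "iso-8859-1" codec is a fair stand-in for the standard: every character takes exactly one byte, so there is only room for 256 of them, and the Euro sign is not among them. The helper name encode_latin1 is hypothetical.

```python
def encode_latin1(text: str):
    """Return the ISO-8859-1 bytes for text, or None if a character is missing."""
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None


print(encode_latin1("é"))   # a single byte, 0xE9
print(encode_latin1("€"))   # None: the Euro sign has no ISO-8859-1 code
```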

    In addition to ISO-8859-1, by the way, there are many other ISO-8859 encodings, each covering a different set of languages.
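
    The catch with a family of one-byte encodings is that the same byte means different things in each member. As a rough illustration in Python, the sketch below decodes one byte with three of them: ISO-8859-1, ISO-8859-5 (Cyrillic), and ISO-8859-7 (Greek). The byte value 0xE9 is an arbitrary pick.

```python
raw = bytes([0xE9])  # one byte, chosen arbitrarily for the demonstration

# Decode the same byte under three different ISO-8859 encodings.
decoded = {
    name: raw.decode(name)
    for name in ["iso-8859-1", "iso-8859-5", "iso-8859-7"]
}

for name, char in decoded.items():
    print(name, char)
```

    Each line prints a different letter, which is why a page has to say which encoding it uses.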

    UTF-8

    UTF-8 (8-bit Unicode Transformation Format) is a newer standard that dates from 1992-1993. It is an encoding of Unicode, a character set that aims to cover every character in every language in the world: at the time of writing, more than 107,000 characters found in 90 writing systems.

    Because it is so comprehensive, UTF-8 is now widely recommended & steadily becoming the standard way to represent text in files, email, webpages, & software.
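
    UTF-8 fits that many characters in by using a variable number of bytes per character, from one to four. Here is a small Python sketch; the four sample characters and the helper name utf8_bytes are just choices for illustration.

```python
def utf8_bytes(char: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return char.encode("utf-8")


# A plain letter, an accented letter, the Euro sign, and an emoji.
for char in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_bytes(char)
    print(f"U+{ord(char):04X}", len(encoded), "byte(s):", encoded.hex())
```

    The letter A comes out as the same single byte it has in ASCII, so plain ASCII text is already valid UTF-8.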

    How To Specify the Character Encoding

    There are several ways you can tell web browsers what character encoding your webpages are using.

    Web Server

    If your web server is set up to include the character encoding in the HTTP Content-Type header (hidden information that is transferred back and forth between a web browser & a web server), then you don’t need to add anything to your web pages. Instead, the following information is in the HTTP Content-Type header the web server sends out to browsers:

    Content-Type: text/html; charset=UTF-8
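
    For the curious, this is roughly how a program could pull the encoding out of that header value. It is a sketch in Python using the standard library's email.message.Message class, which understands this header format; the function name charset_from_content_type is made up here, and real browsers follow their own, more involved rules.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the charset named in a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

    The second call shows the case where the server sends no charset at all, which is when the browser has to look elsewhere.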
    

    Keep in mind that this would only work if:

    How do you know if these are true?

    Since the webpages we’re creating in class are on your local computer and not on a server, you’ll need to use the next method: an HTML META Element.

    HTML META Element

    In your webpage, you insert a META element like this inside the HEAD element:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    

    Put this META element very early in your code, even before the TITLE element, so the browser knows the encoding before it has to render any of the text that your users see.
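
    One thing the META element cannot do is change how the file is actually stored: it only announces the encoding, so the file itself has to be saved as UTF-8 too. Most text editors have a setting for that. As a sketch of the same idea in Python, the snippet below writes a page to disk as UTF-8 and checks the bytes; the file name, page content, and helper name save_page are invented for the example.

```python
import tempfile
from pathlib import Path

PAGE = """<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Test</title>
</head>
<body><p>café</p></body>
</html>
"""


def save_page(path: Path, html: str = PAGE) -> bytes:
    """Write html to path as UTF-8 and return the bytes that landed on disk."""
    path.write_text(html, encoding="utf-8")  # must match the declared charset
    return path.read_bytes()


with tempfile.TemporaryDirectory() as folder:
    saved = save_page(Path(folder) / "test.html")

print(b"caf\xc3\xa9" in saved)  # True: the accented letter was stored as two UTF-8 bytes
```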

    What You Should Use

    In this class, you should use a character encoding META element (again, since your webpages are on your local computer and not hosted on a web server). Which one you use, though, depends upon your DTD.

    HTML 4.01

    For HTML 4.01, use this META element:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    

    Your code will then look like this:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Test</title>
    </head>
    

    HTML5

    For HTML5, use this META element:

    <meta charset="utf-8">
    

    Your code will then look like this:

    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>Test</title>
    </head>
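
    If you want to double-check which encoding a page declares, the Python sketch below looks for either form of the META element shown above. It uses the standard library's html.parser module; the class name MetaCharsetFinder and function name declared_charset are hypothetical, and this is a simplification, not the algorithm browsers use.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first character encoding declared in a META element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if attributes.get("charset"):
            # HTML5 form: <meta charset="UTF-8">
            self.charset = attributes["charset"].lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # HTML 4.01 form: the charset sits inside the content attribute
            content = attributes.get("content")
            if content:
                msg = Message()
                msg["Content-Type"] = content
                self.charset = msg.get_content_charset()


def declared_charset(html: str):
    """Return the declared encoding in lowercase, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset


print(declared_charset('<head><meta charset="UTF-8"><title>Test</title></head>'))
```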
    

    A Note on DTDs & Character Encoding

    Before HTML 4.0 (published in December 1997 and revised in 1998), the HTML standard officially supported only the characters in ISO-8859-1.

    If you want to use Chinese, Cyrillic, Greek, Hebrew, Arabic, or other non-Latin characters, you have to use HTML 4.0 or later (but you should be doing that anyway!).

    Further Information

    IANA. “Character Sets”. Internet Assigned Numbers Authority (2007). http://www.iana.org/assignments/character-sets.

    Korpela, Jukka “Yucca”. “A tutorial on character code issues”. IT and communication (2009). http://www.cs.tut.fi/~jkorpela/chars.html.

    Spolsky, Joel. “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”. Joel on Software (2003). http://www.joelonsoftware.com/articles/Unicode.html.

    Wikipedia. “ASCII”. Wikipedia (2010). http://en.wikipedia.org/wiki/Ascii.

    Wikipedia. “Character encodings in HTML”. Wikipedia (2010). http://en.wikipedia.org/wiki/Character_encodings_in_HTML.

    Wikipedia. “ISO/IEC 8859-1”. Wikipedia (2010). http://en.wikipedia.org/wiki/Iso-8859-1.

    Wikipedia. “Unicode”. Wikipedia (2010). http://en.wikipedia.org/wiki/Unicode.

    Wikipedia. “Universal Character Set”. Wikipedia (2010). http://en.wikipedia.org/wiki/Universal_Character_Set.
