Along with DTDs, character encoding is one of the first things you need to understand when making web pages: what it is, why it matters, & how to declare it.
Purpose
Setting the character encoding tells web browsers how to turn the bytes in your file into characters, and therefore which writing systems and symbols your webpage can display correctly.
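To see why this matters, here is a minimal sketch in Python (used purely for illustration; any language would do). It reads the same bytes under two different encodings, which is roughly what happens when a browser guesses wrong.

```python
# The same bytes, read under two different character encodings.
text = "café"
raw = text.encode("utf-8")             # the bytes as stored in a UTF-8 file

as_utf8 = raw.decode("utf-8")          # the right encoding
as_latin1 = raw.decode("iso-8859-1")   # the wrong guess: every byte becomes one character

print(as_utf8)    # café
print(as_latin1)  # cafÃ©
```

The garbled "Ã©" is the classic symptom of a page saved in one encoding but read in another.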
Some Character Encodings
There are lots of different character encodings you could potentially use on your webpages. In this section, I’ll look at the biggies you should know.
US-ASCII
First published in 1963 (work on it began in 1960), the American Standard Code for Information Interchange (ASCII, pronounced askee) is based on the English alphabet, along with some other characters, giving a total of 128:
- 94 printable characters (a, A, 1, +)
- 33 non-printing control characters (most of which are now obsolete)
- 1 space
The following figure shows the 128 characters found in ASCII:
ASCII doesn’t provide for any special characters—like the Euro (€), anything that’s not English, or any formatting (nothing bold or italic)—so it’s often called plain text.
Needless to say, don’t use ASCII as your character encoding—it’s way too limited!
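The counts above are easy to check. The following Python sketch tallies the three groups and then tests whether a piece of text fits in ASCII; the helper name fits_in_ascii is my own, made up for this example.

```python
# Tally the three groups of ASCII characters by their code numbers (0-127).
codes = range(128)
control = [c for c in codes if c < 32 or c == 127]   # non-printing control characters
space = [c for c in codes if c == 32]                # the space character
printable = [c for c in codes if 33 <= c <= 126]     # letters, digits, punctuation

print(len(printable), len(control), len(space))  # 94 33 1


def fits_in_ascii(text):
    """Return True if every character in text exists in ASCII (hypothetical helper)."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(fits_in_ascii("Hello, world!"))  # True
print(fits_in_ascii("5 €"))            # False: the euro sign is not in ASCII
```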
ISO-8859-1
ISO-8859-1 is a standardized character encoding.
The ISO part refers to the International Organization for Standardization, the same group that has set standards for
- CD-ROMs & DVD-ROMs
- Film speed
- Paper sizes
- Screw threads
- Water-resistant watches
- Bicycle tires
- Shoe sizes
8859-1 is the number of the ISO standard: in this case, part 1 of ISO 8859, a family of 8-bit character encodings.
ISO-8859-1 is also known as
- Latin alphabet No. 1
- ISO Latin 1
ISO-8859-1 is a common character encoding on the Web. It contains:
- all the characters found in ASCII,
- the various accented characters and letters needed for writing Western European languages (like French & Spanish),
- along with some special characters.
You can see those additional characters in the figure below:
ISO-8859-1 used to be the recommended character encoding for webpages, but that time is long gone. Instead, use UTF-8, discussed next.
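As a rough illustration of what ISO-8859-1 adds to ASCII and where it stops, here is a small Python sketch. The helper can_encode is a made-up name for this example, not a standard function.

```python
def can_encode(text, encoding):
    """Return True if text can be written in the given encoding (hypothetical helper)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print("é".encode("iso-8859-1"))            # b'\xe9': one byte per character
print(can_encode("señor", "ascii"))        # False
print(can_encode("señor", "iso-8859-1"))   # True
print(can_encode("5 €", "iso-8859-1"))     # False: the euro sign is not in ISO-8859-1
```

The last line shows one reason ISO-8859-1 fell out of favor: even a common symbol like the euro sign doesn't fit in it.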
In addition to ISO-8859-1, by the way, there are many other ISO-8859 encodings, including these:
- ISO 8859-2: Central & East European
- ISO 8859-3: South European, Maltese & Esperanto
- ISO 8859-4: North European
- ISO 8859-5: Cyrillic
- ISO 8859-6: Arabic
- ISO 8859-7: Modern Greek
- ISO 8859-8: Hebrew & Yiddish
- ISO 8859-9: Turkish
- ISO 8859-10: Nordic (Lappish, Inuit, Icelandic)
- ISO 8859-11: Thai
- ISO 8859-13: Baltic Rim
- ISO 8859-14: Celtic
- ISO 8859-16: South-Eastern Europe
UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a newer standard that dates from 1992-1993. It can encode every character in the Unicode standard, which as of 2010 covers more than 107,000 characters found in 90 writing systems and is still growing.
Because it is so comprehensive, UTF-8 is now widely recommended & steadily becoming the standard way to represent text in files, email, webpages, & software.
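UTF-8 manages this by using a variable number of bytes per character, from one to four. The Python sketch below, with sample characters I picked for illustration, shows how many bytes a few characters need.

```python
# UTF-8 uses between one and four bytes per character.
samples = ["A", "é", "€", "😀"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, length in utf8_lengths.items():
    print(f"U+{ord(ch):04X} needs {length} byte(s): {ch.encode('utf-8').hex()}")

# Plain ASCII text is byte-for-byte identical in UTF-8.
print("Test".encode("utf-8") == "Test".encode("ascii"))  # True
```

That last point is why old ASCII-only pages keep working when you declare them as UTF-8.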
How To Specify the Character Encoding
There are several ways you can tell web browsers what character encoding your webpages are using.
Web Server
If your web server is set up to include the character encoding in the HTTP Content-Type header (hidden information that is transferred back and forth between a web browser & a web server), then you don’t need to add anything to your web pages. Instead, the following information is in the HTTP Content-Type header the web server sends out to browsers:
Content-Type: text/html; charset=UTF-8
Keep in mind that this would only work if:
- Your webpages are hosted and served via a web server, and
- Your web server is configured to send the HTTP Content-Type header
How do you know if these are true?
- Ask your hosting provider.
- Use the Live HTTP Headers extension for Firefox to view the hidden information transferred back and forth between web servers and web browsers.
- Use your browser's built-in developer tools.
- Check out some of the services & suggestions found at Checking HTTP Headers.
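Once you have a header value in hand, pulling the character encoding out of it can be sketched in a few lines of Python. The function name charset_from_content_type is my own invention; it leans on the standard library's email.message.Message class, which understands this header format.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased; None if no charset is given


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

If the result is None, the server isn't announcing an encoding and the browser has to look inside the page itself.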
Since the webpages we’re creating in class are on your local computer and not on a server, you’ll need to use the next method: an HTML META Element.
HTML META Element
In your webpage, you insert a META element like this inside the HEAD element:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Place this META element very early in your code, even before the TITLE element, so the browser knows which encoding to use before it reaches any of the text that your users see.
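To get a feel for why the position matters, here is a simplified Python sketch of how a program might look for the declaration near the top of a page. It is not what browsers actually run (their scanning rules are more elaborate), and the name sniff_meta_charset is invented for this example. The 1024-byte limit reflects the HTML5 requirement that the declaration appear within the first 1024 bytes of the file.

```python
import re

# Finds charset=... inside a META element; a simplification of real browser behavior.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(html_bytes, limit=1024):
    """Return the charset declared near the top of the page, or None (hypothetical helper)."""
    match = META_CHARSET.search(html_bytes[:limit])
    return match.group(1).decode("ascii").lower() if match else None


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>Test</title></head>'
print(sniff_meta_charset(page))                                   # utf-8
print(sniff_meta_charset(b"<html><head><title>Test</title>"))     # None
```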
What You Should Use
In this class, you should use a character encoding META element (again, since your webpages are on your local computer and not hosted on a web server). Which one you use, though, depends upon your DTD.
HTML 4.01
For HTML 4.01, use this META element:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Your code will then look like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Test</title>
</head>
HTML5
For HTML5, use this META element:
<meta charset="UTF-8">
Your code will then look like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Test</title>
</head>
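One thing to keep in mind: the META element only declares the encoding; it doesn't convert anything. The file itself has to be saved as UTF-8, usually via your text editor's save options. The Python sketch below shows the idea under a few assumptions of mine: the file name test.html, the BODY content, and the helper save_page are all made up for illustration.

```python
import os
import tempfile

page = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Test</title>
</head>
<body><p>Price: 5 €</p></body>
</html>
"""


def save_page(path, html, encoding="utf-8"):
    """Write html to path in the given encoding and return the bytes on disk (hypothetical helper)."""
    with open(path, "w", encoding=encoding, newline="\n") as f:
        f.write(html)
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as folder:
    raw = save_page(os.path.join(folder, "test.html"), page)

print(raw.decode("utf-8") == page)  # True: the bytes match the declared encoding
print(b"\xe2\x82\xac" in raw)       # True: the euro sign stored as three UTF-8 bytes
```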
A Note on DTDs & Character Encoding
Before HTML 4.0 (published in December 1997 and revised in April 1998), HTML only supported the characters found in ISO-8859-1.
If you want to use Chinese, Cyrillic, Greek, Hebrew, Arabic, or other non-Latin characters, you have to use HTML 4.0 or later (but you should be doing that anyway!).
Further Information
IANA. “Character Sets”. Internet Assigned Numbers Authority (2007). http://www.iana.org/assignments/character-sets.
Korpela, Jukka “Yucca”. “A tutorial on character code issues”. IT and communication (2009). http://www.cs.tut.fi/~jkorpela/chars.html.
Spolsky, Joel. “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”. Joel on Software (2003). http://www.joelonsoftware.com/articles/Unicode.html.
Wikipedia. “ASCII”. Wikipedia (2010). http://en.wikipedia.org/wiki/Ascii.
Wikipedia. “Character encodings in HTML”. Wikipedia (2010). http://en.wikipedia.org/wiki/Character_encodings_in_HTML.
Wikipedia. “ISO/IEC 8859-1”. Wikipedia (2010). http://en.wikipedia.org/wiki/Iso-8859-1.
Wikipedia. “Unicode”. Wikipedia (2010). http://en.wikipedia.org/wiki/Unicode.
Wikipedia. “Universal Character Set”. Wikipedia (2010). http://en.wikipedia.org/wiki/Universal_Character_Set.