This page was written by Steven J. DeRose, and was last updated on 2005-10-20.
This table is from pages 88-89 of my book, The SGML FAQ Book, which is really more about XML than SGML. The book is published by Kluwer Academic Publishers. Besides giving the historical and formal rationale that underlies XML, it includes an SGML primer (Appendix A) that people tell me is a very accessible introduction to SGML, and perhaps the most compact.
I tried to put a lot of little touches like this in there, since I'm forever referring to such information myself, and it can be a pain to find; having such reference information close at hand can be pretty useful.
FPIs (Formal Public Identifiers) are used in SGML and XML to identify resources by name rather than by location, much like URNs. They use a limited character set, for portability. Other systems concerned with portability impose similar restrictions.
URLs on the Web use a character set defined in IETF RFC 1738 "Uniform Resource Locators (URL)" (and several other RFCs that update or extend URLs). The characters allowed unencoded are A-Z, a-z, 0-9, plus these eleven special characters (space is not among them):
- $ _ . + ! * ' , ( )
Some additional characters (;/?:@=&) are reserved in some particular URL schemes but may be used otherwise. Some other characters are ruled out because they are used in other places in HTML files, even though they should not pose a conflict in URLs (for example, "<" and ">" pose no conflict so long as URLs are quoted as they should be). Any other character must be expressed using a mechanism similar to SGML character references: "%" followed by the two-digit hexadecimal code for the desired character (for example, a space becomes %20).
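As a rough illustration of the "%" escaping just described, here is a minimal Python sketch. The function name `percent_encode` is made up for this example, the unencoded set is RFC 1738's list `$-_.+!*'(),` plus letters and digits, and the choice of UTF-8 for characters beyond ASCII is an assumption (RFC 1738 itself does not say which encoding to use). Reserved characters such as "/" and "?" are escaped here too, so this suits a single URL component rather than a whole URL.

```python
import string

# Letters, digits, and the special characters RFC 1738 allows unencoded.
UNENCODED = set(string.ascii_letters + string.digits + "$-_.+!*'(),")

def percent_encode(text, encoding="utf-8"):
    """Escape every byte outside UNENCODED as '%' plus two hex digits."""
    out = []
    for byte in text.encode(encoding):
        ch = chr(byte)
        if ch in UNENCODED:
            out.append(ch)
        else:
            # e.g. space (decimal 32, hex 20) becomes %20
            out.append("%{:02X}".format(byte))
    return "".join(out)

print(percent_encode("a b<c>"))
```

Passing `encoding="latin-1"` instead gives one escape per Latin-1 character rather than the two that UTF-8 needs.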
The Text Encoding Initiative did extensive testing of what characters survive transfer through the many kinds of machines on the Internet without any special layers of re-encoding to hide them (say, as plain email content), and established a set of "safe" characters that is documented in the TEI Guidelines, Section 4.3. That set, called the ISO 646 subset (ISO 646 is the international standard that underlies ASCII), includes a-z, A-Z, 0-9, and these nineteen special characters:
" % & ' ( ) * + , - . / : ; < = > ? _
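A comparison of these sets is shown below. None of these sets permits characters outside of the printable 7-bit range of characters defined in ISO 646. "+" indicates permissible characters; "-" indicates characters that are not permitted (or, in the HTML entity column, that have no entity listed); "~" indicates characters that are only reserved in some URL schemes, and may otherwise be usable in URLs; "(escape)" marks the percent sign, which introduces escapes in URLs.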
The names shown for characters below are taken from ISO 8859-1: Latin Alphabet Number 1 and ISO 646: Information Processing -- ISO 7-bit coded character set for information interchange (except that in those standards, they're all upper case). ISO 646 specifies that the number sign (decimal 35) may also represent the pound sterling sign. Likewise, decimal code 36 may represent either the dollar sign, or the "neutral currency sign".
Characters with an entry in the HTML entity column (shown in red in the original table) must be escaped in certain situations when they occur in HTML, XML, or SGML files.
Character | Name | Decimal | Hex | Octal | FPI | URL | TEI subset | HTML entity |
---|---|---|---|---|---|---|---|---|
BS | backspace | 08 | 0x08 | o010 | - | - | - | - |
TAB | horizontal tabulation | 09 | 0x09 | o011 | - | - | - | - |
LF | line feed | 10 | 0x0A | o012 | + | - | - | - |
CR | carriage return | 13 | 0x0D | o015 | + | - | - | - |
ESC | escape | 27 | 0x1B | o033 | - | - | - | - |
SPACE | SPACE | 32 | 0x20 | o040 | + | - | - | - |
! | exclamation mark | 33 | 0x21 | o041 | - | + | - | - |
" | quotation mark | 34 | 0x22 | o042 | - | - | + | quot |
# | number sign | 35 | 0x23 | o043 | - | - | - | - |
$ | dollar sign | 36 | 0x24 | o044 | - | + | - | - |
% | percent sign | 37 | 0x25 | o045 | - | (escape) | + | - |
& | ampersand | 38 | 0x26 | o046 | - | ~ | + | amp |
' | apostrophe | 39 | 0x27 | o047 | + | + | + | - |
( | left parenthesis | 40 | 0x28 | o050 | + | + | + | - |
) | right parenthesis | 41 | 0x29 | o051 | + | + | + | - |
* | asterisk | 42 | 0x2A | o052 | - | + | + | - |
+ | plus sign | 43 | 0x2B | o053 | + | + | + | - |
, | comma | 44 | 0x2C | o054 | + | + | + | - |
- | hyphen, minus sign | 45 | 0x2D | o055 | + | + | + | - |
. | full stop | 46 | 0x2E | o056 | + | + | + | - |
/ | solidus | 47 | 0x2F | o057 | + | ~ | + | - |
0 | digit 0 | 48 | 0x30 | o060 | + | + | + | - |
1 | digit 1 | 49 | 0x31 | o061 | + | + | + | - |
2 | digit 2 | 50 | 0x32 | o062 | + | + | + | - |
3 | digit 3 | 51 | 0x33 | o063 | + | + | + | - |
4 | digit 4 | 52 | 0x34 | o064 | + | + | + | - |
5 | digit 5 | 53 | 0x35 | o065 | + | + | + | - |
6 | digit 6 | 54 | 0x36 | o066 | + | + | + | - |
7 | digit 7 | 55 | 0x37 | o067 | + | + | + | - |
8 | digit 8 | 56 | 0x38 | o070 | + | + | + | - |
9 | digit 9 | 57 | 0x39 | o071 | + | + | + | - |
: | colon | 58 | 0x3A | o072 | + | ~ | + | - |
; | semicolon | 59 | 0x3B | o073 | - | ~ | + | - |
< | less-than sign | 60 | 0x3C | o074 | - | - | + | lt |
= | equals sign | 61 | 0x3D | o075 | + | ~ | + | - |
> | greater-than sign | 62 | 0x3E | o076 | - | - | + | gt |
? | question mark | 63 | 0x3F | o077 | + | ~ | + | - |
@ | commercial at | 64 | 0x40 | o100 | - | ~ | - | - |
A | capital letter A | 65 | 0x41 | o101 | + | + | + | - |
B | capital letter B | 66 | 0x42 | o102 | + | + | + | - |
C | capital letter C | 67 | 0x43 | o103 | + | + | + | - |
D | capital letter D | 68 | 0x44 | o104 | + | + | + | - |
E | capital letter E | 69 | 0x45 | o105 | + | + | + | - |
F | capital letter F | 70 | 0x46 | o106 | + | + | + | - |
G | capital letter G | 71 | 0x47 | o107 | + | + | + | - |
H | capital letter H | 72 | 0x48 | o110 | + | + | + | - |
I | capital letter I | 73 | 0x49 | o111 | + | + | + | - |
J | capital letter J | 74 | 0x4A | o112 | + | + | + | - |
K | capital letter K | 75 | 0x4B | o113 | + | + | + | - |
L | capital letter L | 76 | 0x4C | o114 | + | + | + | - |
M | capital letter M | 77 | 0x4D | o115 | + | + | + | - |
N | capital letter N | 78 | 0x4E | o116 | + | + | + | - |
O | capital letter O | 79 | 0x4F | o117 | + | + | + | - |
P | capital letter P | 80 | 0x50 | o120 | + | + | + | - |
Q | capital letter Q | 81 | 0x51 | o121 | + | + | + | - |
R | capital letter R | 82 | 0x52 | o122 | + | + | + | - |
S | capital letter S | 83 | 0x53 | o123 | + | + | + | - |
T | capital letter T | 84 | 0x54 | o124 | + | + | + | - |
U | capital letter U | 85 | 0x55 | o125 | + | + | + | - |
V | capital letter V | 86 | 0x56 | o126 | + | + | + | - |
W | capital letter W | 87 | 0x57 | o127 | + | + | + | - |
X | capital letter X | 88 | 0x58 | o130 | + | + | + | - |
Y | capital letter Y | 89 | 0x59 | o131 | + | + | + | - |
Z | capital letter Z | 90 | 0x5A | o132 | + | + | + | - |
[ | left square bracket | 91 | 0x5B | o133 | - | - | - | - |
\ | reverse solidus | 92 | 0x5C | o134 | - | - | - | - |
] | right square bracket | 93 | 0x5D | o135 | - | - | - | - |
^ | circumflex | 94 | 0x5E | o136 | - | - | - | - |
_ | underscore | 95 | 0x5F | o137 | - | + | + | - |
` | grave | 96 | 0x60 | o140 | - | - | - | - |
a | small letter a | 97 | 0x61 | o141 | + | + | + | - |
b | small letter b | 98 | 0x62 | o142 | + | + | + | - |
c | small letter c | 99 | 0x63 | o143 | + | + | + | - |
d | small letter d | 100 | 0x64 | o144 | + | + | + | - |
e | small letter e | 101 | 0x65 | o145 | + | + | + | - |
f | small letter f | 102 | 0x66 | o146 | + | + | + | - |
g | small letter g | 103 | 0x67 | o147 | + | + | + | - |
h | small letter h | 104 | 0x68 | o150 | + | + | + | - |
i | small letter i | 105 | 0x69 | o151 | + | + | + | - |
j | small letter j | 106 | 0x6A | o152 | + | + | + | - |
k | small letter k | 107 | 0x6B | o153 | + | + | + | - |
l | small letter l | 108 | 0x6C | o154 | + | + | + | - |
m | small letter m | 109 | 0x6D | o155 | + | + | + | - |
n | small letter n | 110 | 0x6E | o156 | + | + | + | - |
o | small letter o | 111 | 0x6F | o157 | + | + | + | - |
p | small letter p | 112 | 0x70 | o160 | + | + | + | - |
q | small letter q | 113 | 0x71 | o161 | + | + | + | - |
r | small letter r | 114 | 0x72 | o162 | + | + | + | - |
s | small letter s | 115 | 0x73 | o163 | + | + | + | - |
t | small letter t | 116 | 0x74 | o164 | + | + | + | - |
u | small letter u | 117 | 0x75 | o165 | + | + | + | - |
v | small letter v | 118 | 0x76 | o166 | + | + | + | - |
w | small letter w | 119 | 0x77 | o167 | + | + | + | - |
x | small letter x | 120 | 0x78 | o170 | + | + | + | - |
y | small letter y | 121 | 0x79 | o171 | + | + | + | - |
z | small letter z | 122 | 0x7A | o172 | + | + | + | - |
{ | left curly bracket | 123 | 0x7B | o173 | - | - | - | - |
| | vertical bar | 124 | 0x7C | o174 | - | - | - | - |
} | right curly bracket | 125 | 0x7D | o175 | - | - | - | - |
~ | tilde | 126 | 0x7E | o176 | - | - | - | - |
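
To check a given string against these sets mechanically, something like the following Python sketch may help. It is only an illustration: the names `CHARSETS`, `offending`, and `portable_everywhere` are invented for this example, and the sets are my transcription of the FPI column, RFC 1738's unencoded list `$-_.+!*'(),`, and the TEI list of nineteen special characters.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# Assumed transcriptions of the three sets compared in the table.
FPI = ALNUM | set(" \r\n'()+,-./:=?")
URL = ALNUM | set("$-_.+!*'(),")
TEI = ALNUM | set("\"%&'()*+,-./:;<=>?_")

CHARSETS = {"FPI": FPI, "URL": URL, "TEI": TEI}

def offending(text, charset_name):
    """Return characters of text not in the named set, in order of first appearance."""
    allowed = CHARSETS[charset_name]
    seen = []
    for ch in text:
        if ch not in allowed and ch not in seen:
            seen.append(ch)
    return seen

def portable_everywhere():
    """Special characters permitted by all three sets."""
    return sorted((FPI & URL & TEI) - ALNUM)

sample = "-//W3C//DTD HTML 4.01//EN"   # a well-known FPI, used as sample input
print(offending(sample, "FPI"))
print(offending(sample, "URL"))
print(portable_everywhere())
```

The sample FPI passes the FPI check but would need its solidus and space characters escaped before being used in a URL.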
The "upper half", or characters from 128-255 (decimal), are less dependable -- many different "code pages" exist, assigning different characters to different places in that range. Neither Windows™ nor Mac™ generally uses the standard "Latin-1" set (defined by ISO 8859-1). In UTF-8 Unicode files, even those standard Latin-1 characters require special encoding: each one is stored as two bytes rather than one. You can find the Windows character set here, the Mac character set here, and Latin-1 here. Microsoft provides a number of code pages here, and pointers to many other charts are here.
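
To see how undependable the upper half is, the short Python sketch below decodes the same byte under three single-byte encodings and shows the UTF-8 form of a Latin-1 character. The helper names are made up for this example; `latin-1`, `cp1252`, and `mac_roman` are codec names that ship with Python, standing in for the Latin-1, Windows, and Mac character sets.

```python
def decode_variants(byte_value):
    """Decode one byte (0-255) under three single-byte encodings."""
    raw = bytes([byte_value])
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    return {name: raw.decode(name, errors="replace")
            for name in ("latin-1", "cp1252", "mac_roman")}

def utf8_bytes(ch):
    """Return the UTF-8 byte sequence for a single character."""
    return ch.encode("utf-8")

# Byte 0xD2 is a capital O with grave in Latin-1 but an opening
# double quotation mark in the Mac character set.
print(decode_variants(0xD2))

# Latin-1 character 233 (small e with acute) takes two bytes in UTF-8.
print(utf8_bytes("\u00e9"))
```

A file written on one platform and read as Latin-1 on another is how curly quotation marks end up displayed as accented capital letters.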
The list of HTML entities is defined here.
Unix jargon names for the various characters also exist.
IETF RFC 2152, "UTF-7: A Mail-Safe Transformation Format of Unicode", defines an emailable transformation of Unicode, and discusses issues of character safety in detail.
The Unicode Standard defines UTF-8, a byte-oriented encoding form of Unicode. For details, see Section 2.5 "Encoding Forms" and Section 3.9 "Unicode Encoding Forms".
About.com provides a page listing the safe characters to use in email, and has the excellent good sense to cite the TEI "ISO 646 subset" list. Bravo.