The Compass DeRose Guide to which characters are safe in URLs, email, and XML public identifiers

This page was written by Steven J. DeRose, and was last updated on 2005-10-20.



This table is from pages 88-89 of my book, The SGML FAQ Book, which is really more about XML than SGML. The book is published by Kluwer Academic Publishers. It gives the historical and formal rationale that underlies XML, and people tell me the SGML primer in Appendix A is a very accessible introduction to SGML, and perhaps the most compact.

I tried to put a lot of little touches like this in there, since I'm forever referring to such information myself, and it can be a pain to find; having such reference information close at hand can be pretty useful.

Formal Public Identifiers (FPIs) are used in SGML and XML to identify resources by name, not location, much like URNs. They use a limited character set, for portability. Other systems concerned with portability make similar restrictions.
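
As a rough illustration, a check against that limited character set might look like the sketch below. The set is copied from the FPI column of the table in this guide; the name `is_fpi_safe` is made up for this example and comes from no standard or library.

```python
import string

# Characters marked "+" in the FPI column of the table in this guide:
# letters, digits, space, line feed, carriage return, and eleven marks.
FPI_CHARS = set(string.ascii_letters + string.digits + " \n\r" + "'()+,-./:=?")

def is_fpi_safe(identifier):
    """Return True if every character is in the FPI set above."""
    return all(ch in FPI_CHARS for ch in identifier)

print(is_fpi_safe("-//W3C//DTD XHTML 1.0 Strict//EN"))  # True
print(is_fpi_safe("ISO_8859"))  # False: underscore is not in the set
```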

URLs on the Web use a character set defined in IETF RFC 1738, "Uniform Resource Locators (URL)" (and several other RFCs that update or extend URLs). The characters allowed unescaped are A-Z, a-z, 0-9, plus these eleven special characters (a space is not allowed and must be escaped):

     - $ _ . + ! * ' , ( )

Some additional characters (;/?:@=&) are reserved in some particular URL schemes but may be used otherwise. Some other characters are ruled out because they are used in other places in HTML files, even though they should not pose a conflict in URLs (for example, "<" and ">" pose no conflict so long as URLs are quoted as they should be). Other characters need to be expressed using a mechanism similar to SGML character references: "%" plus the hexadecimal code for the desired character.
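
The "%" escape mechanism can be sketched in a few lines. This is only an illustration, not a full URL encoder: it keeps the alphanumerics and the special characters RFC 1738 lists as usable unescaped ($-_.+!*'(),), and escapes everything else, including the scheme-reserved characters. The function name `percent_escape` and the choice of UTF-8 for non-ASCII characters are assumptions of this sketch; RFC 1738 itself does not say which encoding to use.

```python
import string

# Alphanumerics plus the special characters RFC 1738 allows unescaped.
URL_UNRESERVED = set(string.ascii_letters + string.digits + "$-_.+!*'(),")

def percent_escape(text, keep=URL_UNRESERVED):
    """Replace every byte outside `keep` with "%" plus two hex digits."""
    out = []
    for byte in text.encode("utf-8"):  # UTF-8 is an assumption of this sketch
        ch = chr(byte)
        if byte < 128 and ch in keep:
            out.append(ch)
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)

print(percent_escape("a b<c>"))  # a%20b%3Cc%3E
print(percent_escape("50%"))     # 50%25
```

Note that "%" itself has to be escaped (as %25), since otherwise it would be read as the start of an escape.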

The Text Encoding Initiative did extensive testing of which characters survive transfer through the many kinds of machines on the Internet without any special layers of re-encoding to hide them (say, as plain email content), and established a set of "safe" characters that is documented in the TEI Guidelines, Section 4.3. That set, called the ISO 646 subset (ISO 646 is the international standard that underlies ASCII), includes a-z, A-Z, 0-9, and these nineteen special characters:

     " % & ' ( ) * + , - . / : ; < = > ? _

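A quick way to screen text against this subset is sketched below. The set is the one listed above; the name `find_unsafe` and the decision to ignore spaces and line feeds are assumptions of this example, not part of the TEI Guidelines.

```python
import string

# a-z, A-Z, 0-9, and the nineteen special characters listed above.
TEI_SAFE = set(string.ascii_letters + string.digits + "\"%&'()*+,-./:;<=>?_")

def find_unsafe(text, ignore=" \n"):
    """Return the sorted characters of `text` outside the TEI subset.

    Characters in `ignore` are skipped (an assumption of this sketch).
    """
    return sorted({ch for ch in text if ch not in TEI_SAFE and ch not in ignore})

print(find_unsafe("a#b @c"))  # ['#', '@']
```
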
A comparison of these sets is shown below. None of these sets permits characters outside of the printable 7-bit range of characters defined in ISO 646. "+" indicates permissible characters; "-" indicates characters that are not permitted; "~" indicates characters that are only reserved in some URL schemes, and may otherwise be usable in URLs.

The names shown for characters below are taken from ISO 8859-1: Latin Alphabet Number 1 and ISO 646: Information Processing -- ISO 7-bit coded character set for information interchange (except that in those standards, they're all upper case). ISO 646 specifies that the number sign (decimal 35) may also represent the pound sterling sign. Likewise, decimal code 36 may represent either the dollar sign, or the "neutral currency sign".

Those characters shown in red must be escaped in certain situations when they occur in HTML, XML, or SGML files.

Character Name Decimal Hex Octal FPI URL TEI subset HTML entity
BS backspace 08 0x08 o010 - - - -
TAB horizontal tabulation 09 0x09 o011 - - - -
LF line feed 10 0x0A o012 + - - -
CR carriage return 13 0x0D o015 + - - -
ESC escape 27 0x1B o033 - - - -
SPACE space 32 0x20 o040 + - - -
! exclamation mark 33 0x21 o041 - + - -
" quotation mark 34 0x22 o042 - - + quot
# number sign 35 0x23 o043 - - - -
$ dollar sign 36 0x24 o044 - + - -
% percent sign 37 0x25 o045 - (escape) + -
& ampersand 38 0x26 o046 - ~ + amp
' apostrophe 39 0x27 o047 + + + -
( left parenthesis 40 0x28 o050 + + + -
) right parenthesis 41 0x29 o051 + + + -
* asterisk 42 0x2A o052 - + + -
+ plus sign 43 0x2B o053 + + + -
, comma 44 0x2C o054 + + + -
- hyphen, minus sign 45 0x2D o055 + + + -
. full stop 46 0x2E o056 + + + -
/ solidus 47 0x2F o057 + ~ + -
0 digit 0 48 0x30 o060 + + + -
1 digit 1 49 0x31 o061 + + + -
2 digit 2 50 0x32 o062 + + + -
3 digit 3 51 0x33 o063 + + + -
4 digit 4 52 0x34 o064 + + + -
5 digit 5 53 0x35 o065 + + + -
6 digit 6 54 0x36 o066 + + + -
7 digit 7 55 0x37 o067 + + + -
8 digit 8 56 0x38 o070 + + + -
9 digit 9 57 0x39 o071 + + + -
: colon 58 0x3A o072 + ~ + -
; semicolon 59 0x3B o073 - ~ + -
< less-than sign 60 0x3C o074 - - + lt
= equals sign 61 0x3D o075 + ~ + -
> greater-than sign 62 0x3E o076 - - + gt
? question mark 63 0x3F o077 + ~ + -
@ commercial at 64 0x40 o100 - ~ - -
A capital letter A 65 0x41 o101 + + + -
B capital letter B 66 0x42 o102 + + + -
C capital letter C 67 0x43 o103 + + + -
D capital letter D 68 0x44 o104 + + + -
E capital letter E 69 0x45 o105 + + + -
F capital letter F 70 0x46 o106 + + + -
G capital letter G 71 0x47 o107 + + + -
H capital letter H 72 0x48 o110 + + + -
I capital letter I 73 0x49 o111 + + + -
J capital letter J 74 0x4A o112 + + + -
K capital letter K 75 0x4B o113 + + + -
L capital letter L 76 0x4C o114 + + + -
M capital letter M 77 0x4D o115 + + + -
N capital letter N 78 0x4E o116 + + + -
O capital letter O 79 0x4F o117 + + + -
P capital letter P 80 0x50 o120 + + + -
Q capital letter Q 81 0x51 o121 + + + -
R capital letter R 82 0x52 o122 + + + -
S capital letter S 83 0x53 o123 + + + -
T capital letter T 84 0x54 o124 + + + -
U capital letter U 85 0x55 o125 + + + -
V capital letter V 86 0x56 o126 + + + -
W capital letter W 87 0x57 o127 + + + -
X capital letter X 88 0x58 o130 + + + -
Y capital letter Y 89 0x59 o131 + + + -
Z capital letter Z 90 0x5A o132 + + + -
[ left square bracket 91 0x5B o133 - - - -
\ reverse solidus 92 0x5C o134 - - - -
] right square bracket 93 0x5D o135 - - - -
^ circumflex 94 0x5E o136 - - - -
_ underscore 95 0x5F o137 - + + -
` grave 96 0x60 o140 - - - -
a small letter a 97 0x61 o141 + + + -
b small letter b 98 0x62 o142 + + + -
c small letter c 99 0x63 o143 + + + -
d small letter d 100 0x64 o144 + + + -
e small letter e 101 0x65 o145 + + + -
f small letter f 102 0x66 o146 + + + -
g small letter g 103 0x67 o147 + + + -
h small letter h 104 0x68 o150 + + + -
i small letter i 105 0x69 o151 + + + -
j small letter j 106 0x6A o152 + + + -
k small letter k 107 0x6B o153 + + + -
l small letter l 108 0x6C o154 + + + -
m small letter m 109 0x6D o155 + + + -
n small letter n 110 0x6E o156 + + + -
o small letter o 111 0x6F o157 + + + -
p small letter p 112 0x70 o160 + + + -
q small letter q 113 0x71 o161 + + + -
r small letter r 114 0x72 o162 + + + -
s small letter s 115 0x73 o163 + + + -
t small letter t 116 0x74 o164 + + + -
u small letter u 117 0x75 o165 + + + -
v small letter v 118 0x76 o166 + + + -
w small letter w 119 0x77 o167 + + + -
x small letter x 120 0x78 o170 + + + -
y small letter y 121 0x79 o171 + + + -
z small letter z 122 0x7A o172 + + + -
{ left curly bracket 123 0x7B o173 - - - -
| vertical bar 124 0x7C o174 - - - -
} right curly bracket 125 0x7D o175 - - - -
~ tilde 126 0x7E o176 - - - -
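
The decimal, hex, and octal columns of the table can be regenerated for any character with a short sketch like this one. The name `table_row` is made up for this example, and the "o" prefix on octal values simply follows the notation used in the table.

```python
def table_row(ch):
    """Format a character's code in the table's decimal, hex, and octal notation."""
    code = ord(ch)
    return "{} {} 0x{:02X} o{:03o}".format(ch, code, code, code)

print(table_row("A"))  # A 65 0x41 o101
print(table_row("~"))  # ~ 126 0x7E o176
```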

The "upper half", or characters from 128-255 (decimal), are less dependable: many different "code pages" exist, assigning different characters to different places in that range. Neither Windows™ nor Mac™ generally uses the standard "Latin-1" set (defined by ISO 8859-1). In UTF-8 Unicode files, even those standard Latin-1 characters take two bytes each rather than one. You can find the Windows character set here; the Mac character set here, and Latin-1 here. Microsoft provides a number of code pages here, and pointers to many other charts are here.
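
A small sketch shows why the upper half is undependable: the same byte decodes to different characters under different code pages, and a Latin-1 character takes two bytes in UTF-8. The codec names are those of Python's standard library; using "cp1252" and "mac_roman" to stand in for "the Windows character set" and "the Mac character set" is an assumption of this example.

```python
raw = b"\x80"  # one byte from the upper half

windows_char = raw.decode("cp1252")   # Windows code page 1252
mac_char = raw.decode("mac_roman")    # classic Mac OS Roman
print(windows_char, mac_char)         # € Ä

latin1_bytes = "é".encode("latin-1")  # b'\xe9', one byte
utf8_bytes = "é".encode("utf-8")      # b'\xc3\xa9', two bytes
print(len(latin1_bytes), len(utf8_bytes))  # 1 2
```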

The list of HTML entities is defined here.

Unix jargon names for the various characters also exist.


IETF RFC 2152, "UTF-7: A Mail-Safe Transformation Format of Unicode", defines an emailable transformation of Unicode, and discusses issues of character safety in detail.
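
For a feel of what UTF-7 does, the sketch below uses the "utf-7" codec that ships with Python's standard library. The sample string is arbitrary. The point is only that the encoded form stays within 7-bit ASCII and decodes back to the original text.

```python
text = "naïve café"

mail_safe = text.encode("utf-7")        # bytes containing only 7-bit ASCII
is_seven_bit = all(b < 128 for b in mail_safe)
round_trip = mail_safe.decode("utf-7")  # back to the original string

print(mail_safe)
print(is_seven_bit, round_trip == text)  # True True
```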

The Unicode Standard defines UTF-8, a byte-oriented encoding form of Unicode. For details, see Section 2.5 "Encoding Forms" and Section 3.9 "Unicode Encoding Forms".

About.com provides a page listing the safe characters to use in email, and has the excellent good sense to cite the TEI "ISO 646 subset" list. Bravo.

