Funny File Names

 

Funny File Names ... and the Ugly URLs that Love Them .. I mean, that
Locate Them...

by Mitch Marks


http://cuip.uchicago.edu/~mitchell/funny-file-names/name%09with%09tabs.shtm
(This file has the name "name with tabs.shtm" on the server. What URL is showing in your browser's location or web-address window?)


We quite rightly urge people to choose short and simple filenames for the files carrying their web pages, and to avoid spaces and special punctuation within the filenames. It's a bit of a fib, however, if we say the reason is that "a Unix server can't have filenames like that." Such names (within limits!) aren't illegal, they're just hard to deal with, either on the server or in URLs used to refer to them.

For an illustration of the range of allowable names, take a look at a listing of the folder this page comes from: http://cuip.uchicago.edu/~mitchell/funny-file-names/ .

Some of those characters (space, newline, tab, control characters) are illegal, however, within URLs. Others are either strictly or by convention used only at particular places within URLs (tilde, question mark, ampersand), or always stand for something special (percent). If we've ended up for some reason stuck with a file whose name is oddly made in one or more of those ways, and a URL needs to make use of the filename, but the URL can't contain those characters (or not in precisely the way they appear in the file), then how can we link to those files? Or are they just inaccessible?

The answer is that a URL can carry an encoding for just about any character, which goes well beyond what is possible when the only way to refer to a character is to use it directly. The encoding is always a percent sign (%) followed by two hexadecimal digits. A hexadecimal digit is either a regular digit, in the range 0-9, or a letter from near the start of the alphabet, in the range a-f.

When used in the percent-sign encoding in a URL, hex digits that are letters can be either upper or lower case. Both hex digits are always written, even when the code begins with a zero. So each encoded character is represented by exactly three characters: a percent sign and then two hexadecimal digits.
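As a rough illustration of that three-character rule, here is a minimal Python sketch that checks whether a piece of text is one well-formed percent escape. The names `ESCAPE` and `is_percent_escape` are made up for this example and do not come from any library.

```python
import re

# A percent escape is exactly three characters: '%' then two hex digits.
# The character class accepts the letters a-f in either case.
ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def is_percent_escape(text: str) -> bool:
    """Return True if text is one complete percent escape and nothing else."""
    return ESCAPE.fullmatch(text) is not None


print(is_percent_escape("%7E"))  # True
print(is_percent_escape("%7e"))  # True: lower-case hex letters are fine
print(is_percent_escape("%9"))   # False: both digits are needed, so TAB is %09
print(is_percent_escape("%G1"))  # False: G is not a hex digit
```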

The two-digit hex codes are given in the third and seventh columns in the table near the end of this page -- the ones headed 'Hex'. With the percent sign to signal that an encoding is being used, the way to encode a space in a file's name into a URL for that file would be as '%20', since the table shows 20 as the hex for SPACE. Similarly, a TAB gets encoded as '%09', a tilde as '%7E', and a percent sign itself as '%25'. (A percent sign in a URL always signals the encoding is going on, and never simply stands for a percent sign in the file name.)
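The table lookup described above can also be done by asking for a character's code directly. The following is only a sketch, assuming plain 7-bit ASCII characters; `encode_char` is a hypothetical helper name chosen for this example.

```python
def encode_char(ch: str) -> str:
    """Percent-encode one ASCII character as '%' plus two hex digits."""
    code = ord(ch)  # the 'Dec' column of the ASCII table
    if code > 0x7F:
        raise ValueError("this sketch only handles 7-bit ASCII")
    # 02X pads to two digits, so TAB (decimal 9) comes out as 09, not 9.
    return "%{:02X}".format(code)


for name, ch in [("SPACE", " "), ("TAB", "\t"), ("tilde", "~"), ("percent", "%")]:
    print(name, encode_char(ch))
# SPACE %20
# TAB %09
# tilde %7E
# percent %25
```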

If you've gotten stuck with some files with funny names, and are making a link to one of them in some other page, your web editor should handle the URL percent encoding on its own, if you select the file to link to through some "pick file" popup in the editor program. So understanding the encoding is mostly just going to help you understand what's going on with that, and is not something you're going to need to use actively. Still, now that you know the system, you'll be able to handle it if some percent-encoded URL needs tinkering with. (Of course, a better way to solve that kind of problem would be to go back and rename those files to something less tricky.) You should also feel enabled to overrule an over-strenuous editing program which insists on percent-encoding

If you go back to the file listing of this directory, http://cuip.uchicago.edu/~mitchell/funny-file-names/, and click on one of the files, you can see in your browser's location bar or web-address space the encoding the browser has used to request that file. A similar encoding in a link would provide a portable way to link to the oddly named file. You can check whether the encoding shown by your browser agrees with what you could construct by hand using the table. Here are some of them:

http://cuip.uchicago.edu/~mitchell/funny-file-names/name%20with%20spaces.shtm for the file called "name with spaces.shtm"


http://cuip.uchicago.edu/~mitchell/funny-file-names/name%09with%09tabs.shtm for the file called "name with tabs.shtm"


http://cuip.uchicago.edu/~mitchell/funny-file-names/name%25with%25percents.shtm for the file called "name%with%percents.shtm".
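If you would rather have a program build such URLs than work them out from the table, Python's standard `urllib.parse.quote` function does the percent-encoding. This is a sketch only: `BASE` and `url_for` are names invented for the example, and the file names are the ones quoted in the list above.

```python
from urllib.parse import quote

# Directory URL taken from this page; url_for is a hypothetical helper.
BASE = "http://cuip.uchicago.edu/~mitchell/funny-file-names/"


def url_for(filename: str) -> str:
    """Build a URL for one file in the directory, encoding odd characters."""
    # safe="" asks quote() to encode every reserved character, including '/'.
    return BASE + quote(filename, safe="")


print(url_for("name with spaces.shtm"))
print(url_for("name\twith\ttabs.shtm"))
print(url_for("name%with%percents.shtm"))
```

Comparing the printed lines with the URLs your browser shows is one more way to check the encoding.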

We've seen that filenames can have percent signs. Sometimes these files get created unintentionally when some web editor or ftp program "sees" an encoded HTTP URL and mistakenly takes that to be the file name to upload and create on the server. Suppose the user named the file "First Page.html", and the editor properly encoded it as "First%20Page.html" for linkage purposes. Then somehow the publish or ftp program (maybe aided and abetted by the user copying something that shouldn't be copied there) uploads the file but gives it the name "First%20Page.html". That's not the file name that will be sought by the URL "First%20Page.html"! The URL "First%20Page.html" gets the file "First Page.html", because a percent sign in a URL is always for encoding. Exercise for the reader: What would be the proper encoded URL for a file named "First%20Page.html"?
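The mismatch can be reproduced with Python's standard `urllib.parse` functions. This sketch deliberately leaves the printed results out, so you can run it to check your answer to the exercise; the variable name `url_part` is just a label for this example.

```python
from urllib.parse import quote, unquote

url_part = "First%20Page.html"

# Decoding shows which file name a server would look for
# when it receives this text as part of a URL.
print(repr(unquote(url_part)))

# Encoding treats the same text as a literal file name on disk
# and shows what a URL for that file would have to contain.
print(quote(url_part))
```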

To get really fanciful, a percent-encoding is allowed anywhere in the file-path part of a URL, even where it is not needed. So (though there's no earthly reason to) we could if we wanted make a link spelled out as http://cuip.uchicago.edu/%77%69%74%2f%32%30%30%31 . (Can you predict before clicking where that goes?)
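Decoding by hand with the table works, but the rule is simple enough to sketch in a few lines of Python. The function name `percent_decode` is hypothetical, and the sketch assumes every escape stands for a single ASCII character. Try predicting the result before running it.

```python
import string


def percent_decode(text: str) -> str:
    """Replace each '%XX' escape with the character whose hex code is XX."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "%":
            pair = text[i + 1:i + 3]
            # Exactly two hex digits must follow the percent sign.
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError("bad escape at position %d" % i)
            out.append(chr(int(pair, 16)))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(percent_decode("%77%69%74%2f%32%30%30%31"))
```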


Here is that promised table:

ASCII is the American Standard Code for Information Interchange. It is a 7-bit
code. Many 8-bit codes (such as ISO 8859-1, the Linux default character set) contain
ASCII as their lower half. The international counterpart of ASCII is known as ISO
646.

The following table contains the 128 ASCII characters.

Oct   Dec   Hex   Char            Oct   Dec   Hex   Char
------------------------------------------------------------
000   0     00    NUL '\0'        100   64    40    @
001   1     01    SOH             101   65    41    A
002   2     02    STX             102   66    42    B
003   3     03    ETX             103   67    43    C
004   4     04    EOT             104   68    44    D
005   5     05    ENQ             105   69    45    E
006   6     06    ACK             106   70    46    F
007   7     07    BEL             107   71    47    G
010   8     08    BS              110   72    48    H
011   9     09    HT (tab)        111   73    49    I
012   10    0A    LF (newline)    112   74    4A    J
013   11    0B    VT              113   75    4B    K
014   12    0C    FF              114   76    4C    L
015   13    0D    CR (return)     115   77    4D    M
016   14    0E    SO              116   78    4E    N
017   15    0F    SI              117   79    4F    O
020   16    10    DLE             120   80    50    P
021   17    11    DC1             121   81    51    Q
022   18    12    DC2             122   82    52    R
023   19    13    DC3             123   83    53    S
024   20    14    DC4             124   84    54    T
025   21    15    NAK             125   85    55    U
026   22    16    SYN             126   86    56    V
027   23    17    ETB             127   87    57    W
030   24    18    CAN             130   88    58    X
031   25    19    EM              131   89    59    Y
032   26    1A    SUB             132   90    5A    Z
033   27    1B    ESC             133   91    5B    [
034   28    1C    FS              134   92    5C    \ '\\'
035   29    1D    GS              135   93    5D    ]
036   30    1E    RS              136   94    5E    ^
037   31    1F    US              137   95    5F    _
040   32    20    SPACE           140   96    60    `
041   33    21    !               141   97    61    a
042   34    22    "               142   98    62    b
043   35    23    #               143   99    63    c
044   36    24    $               144   100   64    d
045   37    25    %               145   101   65    e
046   38    26    &               146   102   66    f
047   39    27    '               147   103   67    g
050   40    28    (               150   104   68    h
051   41    29    )               151   105   69    i
052   42    2A    *               152   106   6A    j
053   43    2B    +               153   107   6B    k
054   44    2C    ,               154   108   6C    l
055   45    2D    -               155   109   6D    m
056   46    2E    .               156   110   6E    n
057   47    2F    /               157   111   6F    o
060   48    30    0               160   112   70    p
061   49    31    1               161   113   71    q
062   50    32    2               162   114   72    r
063   51    33    3               163   115   73    s
064   52    34    4               164   116   74    t
065   53    35    5               165   117   75    u
066   54    36    6               166   118   76    v
067   55    37    7               167   119   77    w
070   56    38    8               170   120   78    x
071   57    39    9               171   121   79    y
072   58    3A    :               172   122   7A    z
073   59    3B    ;               173   123   7B    {
074   60    3C    <               174   124   7C    |
075   61    3D    =               175   125   7D    }
076   62    3E    >               176   126   7E    ~
077   63    3F    ?               177   127   7F    DEL
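If no such table is at hand, the three numeric columns can be regenerated with a few lines of Python. This is a minimal sketch: `ascii_rows` is a made-up name, the output is a single column rather than the two-column layout above, and the names of the control characters (NUL, SOH and so on) are simply left blank.

```python
def ascii_rows():
    """Yield one 'Oct Dec Hex Char' line for each of the 128 ASCII codes."""
    for code in range(128):
        ch = chr(code)
        # Show only visible characters; control codes, SPACE and DEL stay blank.
        label = ch if ch.isprintable() and ch != " " else ""
        yield f"{code:03o} {code:<3d} {code:02X} {label}".rstrip()


for row in ascii_rows():
    print(row)
```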


(To get a listing like that while you're logged on to the server via telnet, type "man ascii".)

Answer to exercise: "First%2520Page.html"

The contents of the Web Institute Web Site, including the On-Line Curriculum, Web Tank, and Session Notes, are Copyright 1999-2001, Graham School of General Studies, University of Chicago. No one may print, copy, or otherwise reproduce these materials without the express written permission of the Director of the Web Institute for Teachers or the Dean of the Graham School. All rights reserved.