doc/catdoc.1.in

.TH catdoc 1  "Version @catdoc_version@" "MS-Word reader"
.SH NAME
catdoc \- reads an MS-Word file and writes its content as plain text to standard output
.SH SYNOPSIS
.BR catdoc " [" -vlu8btawxV "] [" -m " "
.IR number ]
[
.B -s
.IR charset ]
[
.B -d
.IR charset ]
[
.B -f
.IR output-format ]
.I file
.SH DESCRIPTION
.B catdoc
behaves much like
.BR cat (1),
but it reads an MS-Word file and produces human-readable text on standard output.
Optionally it can use
.BR latex (1)
escape sequences for characters which have special meaning for LaTeX.
It also makes some effort to recognize MS-Word tables, although it never
tries to write correct headers for the LaTeX tabular environment. Additional
output formats, such as HTML, can be easily defined.
.PP
.B catdoc
doesn't attempt to extract formatting information other than tables from
the MS-Word document, so different output modes mainly mean that different
characters are escaped and different ways are used to represent characters
missing from the output charset. See CHARACTER SUBSTITUTION below.
.PP
.B catdoc
uses an internal
.BR unicode (4)
representation of text, so it is able to convert text when the charset of
the source document doesn't match the charset of the target system.
See CHARACTER SETS below.
.PP
If no file names are supplied,
.B catdoc
processes its standard input unless it is a terminal. It is unlikely that
somebody would type a Word document from the keyboard, so if
.B catdoc
is invoked without arguments and stdin is not redirected, it prints a brief
usage message and exits.
Processing of standard input (even among other files) can be forced by using
a dash '-' as a file name.
.PP
By default,
.B catdoc
wraps lines which are more than 72 chars long and separates paragraphs by
blank lines. This behavior can be turned off by the
.B -w
switch. In
.I wide
mode
.B catdoc
prints each paragraph as one long line, suitable for import into
word processors which perform word wrapping themselves.
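.PP
As an illustration of the behavior described above, the following invocations
are a sketch only; the file name
.I report.doc
is hypothetical:
.PP
.nf
.RS
# wrapped text on standard output
catdoc report.doc
# one long line per paragraph, saved to a file
catdoc -w report.doc > report.txt
# read the document from standard input
catdoc - < report.doc
.RE
.fi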
.SH OPTIONS
.TP 8
.B -a
Shortcut for
.BR "-f ascii" .
Produces ASCII text as output and separates table columns with TAB.
.TP 8
.B -b
Process broken MS-Word files. Normally,
.B catdoc
checks whether the first 8 bytes of the file are the Microsoft OLE signature.
If so, it processes the file; otherwise it just copies it to stdout.
This is intended to allow using
.B catdoc
as a filter for viewing all files with the
.I .doc
extension.
.TP 8
.BI "-d " charset
Specifies the destination charset name. The charset file has the format
described in CHARACTER SETS below and should have the
.B .txt
extension and reside in the
.B catdoc
library directory (@libdir@/catdoc). By default, the current
locale charset is used if langinfo support is compiled in.
.TP 8
.BI "-f " format
Specifies the output format as described in CHARACTER SUBSTITUTION below.
.B catdoc
comes with two output formats,
.BR ascii " and " tex .
You can add your own if you wish.
.TP 8
.B -l
Causes
.B catdoc
to list the names of available charsets on stdout and exit successfully.
.TP 8
.BI "-m " number
Specifies the right margin for text (default 72).
.B -m 0
is equivalent to
.BR -w .
.TP 8
.BI "-s " charset
Specifies the source charset (the one used in the Word document), if the Word
document doesn't contain UTF-16 text. When reading RTF documents, it is
typically not necessary, because RTF documents contain an ansicpg
specification. But it can be set wrong by Word (I've seen RTF documents
in Russian where cp1252 was specified). In this case this option
takes precedence over the charset specified in the document. However, the
source_charset statement in the configuration file has lower priority
than the charset in the document.
.TP 8
.B -t
Shortcut for
.BR "-f tex" .
Converts all printable chars which have special meaning for
.BR latex (1)
into appropriate control sequences. Separates table columns by
.BR & .
.TP 8
.B -u
Declares that the Word document contains a UNICODE (UTF-16) representation
of text (as some Word-97 documents do). If catdoc fails to process a Word
document correctly with the default charset, try this option.
.TP 8
.B -8
Declares that the Word document is 8-bit, in case catdoc
recognizes the file format incorrectly.
.TP 8
.B -w
Disables word wrapping. By default
.B catdoc
output is split into lines not longer than 72 characters (or the number
specified by the
.B -m
option) and paragraphs
are separated by a blank line. With this option each paragraph is one
long line.
.TP 8
.B -x
Causes catdoc to output unknown UNICODE characters as \\xNNNN instead
of question marks.
.TP 8
.B -v
Causes catdoc to print some useless information about the Word document
structure to stdout before the actual start of text.
.TP 8
.B -V
Outputs the catdoc version.
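.PP
The options above might be combined as follows. This is a sketch only:
.I letter.doc
is a hypothetical file, and any charset name given must match a charset
actually installed (see the output of
.BR -l ).
.PP
.nf
.RS
# list the available charsets
catdoc -l
# treat 8-bit text as cp1252 and wrap at 60 columns
catdoc -s cp1252 -m 60 letter.doc
# escape LaTeX special characters, one paragraph per line
catdoc -t -w letter.doc > letter.tex
.RE
.fi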
.SH CHARACTER SETS
When processing an MS-Word file,
.B catdoc
uses information about two character sets, typically different:
input and output. They are stored in plain text files in the
.B catdoc
library directory. Each line of a character set file should contain two
whitespace-separated hexadecimal numbers: the 8-bit code in the character set
and the 16-bit Unicode code.
Anything from a hash mark to the end of the line is ignored, as are blank lines.
.PP
The
.B catdoc
distribution includes some of these character sets. Additional character set
definitions, directly usable by
.BR catdoc ,
can be obtained from ftp.unicode.org. Charset files have the
.B .txt
suffix, which shouldn't be specified on the command line or in configuration
files.
.PP
Note that
.B catdoc
is distributed with Cyrillic charsets as default. If you are not
Russian, you probably don't want that, and should reconfigure catdoc at
compile time or in the runtime configuration file.
.PP
When dealing with documents with charsets other than the default, remember
that Microsoft never uses ISO charsets. While letters in, say, cp1252 are
at the same positions as in ISO-8859-1, some punctuation signs would be
lost if you specify ISO-8859-1 as the input charset. If you use cp1252,
catdoc deals with those signs as described in CHARACTER
SUBSTITUTION below.
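.PP
For illustration, a fragment of a charset file in the format described above
might look like this. It is a sketch modeled on the mapping tables published
by unicode.org; the entries are standard cp1252 assignments and are not copied
from a file shipped with
.BR catdoc .
.PP
.nf
.RS
# 8-bit code   Unicode code
0x41    0x0041  # LATIN CAPITAL LETTER A
0x85    0x2026  # HORIZONTAL ELLIPSIS
0x93    0x201C  # LEFT DOUBLE QUOTATION MARK
.RE
.fi
.PP
The last two lines are examples of the punctuation signs that have no
counterpart at the same position in ISO-8859-1.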
.SH CHARACTER SUBSTITUTION
.B catdoc
converts an MS-Word file into the following internal Unicode representation:
.IP 1. 4
Paragraphs are separated by the ASCII Line Feed symbol (0x000A).
.IP 2. 4
Table cells within a row are separated by the ASCII Field Separator symbol
(0x001C).
.IP 3. 4
Table rows are separated by the ASCII Record Separator symbol (0x001E).
.IP 4. 4
All printable characters, including whitespace, are represented by their
respective UNICODE codes.
.PP
This UNICODE representation is subsequently converted into 8-bit text in the
target character set using the following four-step algorithm:
.IP 1. 4
The list of special characters is searched for the given Unicode character.
If it is found, the appropriate multi-character sequence is output instead of
the character.
.IP 2. 4
Otherwise, if there is an equivalent in the target character set, it is output.
.IP 3. 4
Otherwise, the replacement list is searched and, if there is a multi-character
substitution for this UNICODE char, it is output.
.IP 4. 4
If all of the above fails, the "Unknown char" symbol (question mark) is output.
.PP
The list of special characters and the list of substitutions are character
set-independent, because special chars should be escaped regardless of their
existence in the target character set (usually, they are part of US-ASCII, and
therefore exist in any character set), and the replacement list is searched only
for those characters which are not found in the target character set.
.PP
These lists are stored in the
.B catdoc
library directory in files whose names are prefixed with the format name.
These files have the following format:
.PP
Each line can be either a comment (starting with a hash mark) or contain a
hexadecimal UNICODE value, separated by whitespace from the string which
is substituted for it. If the string contains no whitespace, it
can be used as is; otherwise it should be enclosed in single or double
quotes. The usual backslash sequences like
.IR '\en' , '\et'
can be used in these strings.
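.PP
As an illustration of this file format, a replacement map might contain lines
like the following. This is a sketch only: the entries are hypothetical and
are not taken from the map files shipped with
.BR catdoc ,
and the 0x prefix on the code is an assumption borrowed from the unicode.org
charset tables.
.PP
.nf
.RS
# Unicode value   substitution string
0x00A9  (c)
0x2122  (TM)
0x00BD  " 1/2"
.RE
.fi
.PP
The last string contains whitespace, so it is quoted.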
.SH RUNTIME CONFIGURATION
Upon startup catdoc reads its system-wide configuration file
.RB ( catdocrc
in the
.B catdoc
library directory) and then the
user-specific configuration file
.BR ${HOME}/.catdocrc .
.PP
These files can contain the following directives:
.TP 8
.BI "source_charset = " charset-name
Sets the default source charset, which is used if no
.B -s
option is specified. Consult the configuration of a nearby Windows
workstation to find the one you need.
.TP 8
.BI "target_charset = " charset-name
Sets the default output charset. You probably know which one you use.
.TP 8
.BI "charset_path = " directory-list
Colon-separated list of directories which are searched for charset files.
This allows you to install additional charsets in your home directory.
If the first directory component of a path is ~, it is replaced by the
contents of the
.B HOME
environment variable.
On the MS-DOS platform, if a directory name starts with %s, it is replaced
with the directory of the executable file. An empty element in the list
(i.e. two consecutive colons) is considered the current directory.
.TP 8
.BI "map_path = " directory-list
Colon-separated list of directories which are searched for the special character
map and the replacement map.
The same substitution rules as for
.B charset_path
are applied.
.TP 8
.BI "format = " "format name"
Output format which is used by default.
.B catdoc
comes with two formats,
.BR ascii " and " tex ,
but nothing prevents you from writing your own format (a set of two map files:
special character map and replacement map).
.TP 8
.BI "unknown_char = " "character specification"
Sets the character to output instead of an unknown Unicode character
(default '?').
The character specification can have one of two forms: a character enclosed in
single quotes or a hexadecimal code.
.TP 8
.BI "use_locale = " (yes|no)
Enables or disables automatic selection of the output charset (default
.BR yes ),
based on
system locale settings (if enabled at compile time). If automatic
detection is enabled, then output charset settings in the configuration
files (but not on the command line) are ignored, and the current system
locale charset is used instead. There is no automatic choice of input
charset based on locale language, because most modern Word files (since
Word 97) are Unicode anyway.
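.PP
Putting the directives together, a user-specific
.B ${HOME}/.catdocrc
might look like this. It is a sketch only: the charset names and the
.I ~/catdoc/charsets
directory are hypothetical and must match files that actually exist on your
system.
.PP
.nf
.RS
source_charset = cp1252
target_charset = 8859-1
charset_path = ~/catdoc/charsets:@libdir@/catdoc
format = ascii
unknown_char = '?'
use_locale = no
.RE
.fi
.PP
Here
.B use_locale
is set to
.B no
so that the
.B target_charset
setting is not overridden by the system locale.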
.SH BUGS
Doesn't handle
fast-saves properly. Prints footnotes as separate paragraphs at the end of the
file, instead of producing correct LaTeX commands. Cannot distinguish
between an empty table cell and the end of a table row.
.SH "SEE ALSO"
.BR xls2csv (1),
.BR cat (1),
.BR strings (1),
.BR utf (4),
.BR unicode (4)
.SH AUTHOR
V.B.Wagner <vitus@45.free.net>