The Introducing AUB Document

1. What is aub?

More and more people are posting binary files to usenet these days. Some of these binaries are executables and audio data; a majority seem to be pictures of various things, typically landscapes, movie stars and naked people.

Because of limitations in the type of data that usenet can accommodate, binaries must be encoded into text, and because binary files are commonly very large relative to the text files usenet was designed to handle, they frequently must be broken up into pieces. Programs have been developed which take a given binary, encode it, and automatically post it in pieces with descriptive subject lines. When this data arrives at a remote site, users see subject lines that look something like this:

    12011 roadkill03.gif, part 1/4
    12012 roadkill03.gif, part 3/4
    12013 More pictures of tatooed children, please...
    12014 Re: roadkill02.gif -- I love the way the eyes bulge out
    12015 roadkill03.gif, part 4/4
    12016 roseanne_nude.jpg, part 02 of 02
    12017 Only BINARIES should be posted here, GOD DAMMIT
    12018 roadkill03.gif, part 2/4
    12019 HI, I'M BIFF!!!! THESE PIX ARE WAY COOL!!!!
    12020 roseanne_nude.jpg, part 01 of 02

While the process of encoding and splitting up binaries for posting to usenet is relatively straightforward, the process of retrieving, sorting, and decoding the pieces (which do not necessarily arrive in order) at receiving sites is less straightforward, tedious, time consuming, and very prone to human error.

aub, which stands for "assemble usenet binaries", automates this reassembly process for you. aub is intended for use in newsgroups to which binaries are posted exclusively. When run, it accesses news articles via either a disk-based news spool directory or an NNTP news server, determines whether or not any new binaries have appeared in selected newsgroups since the last time it was run, and if so, retrieves, organizes and decodes them, depositing them in a configurable location.
This process requires no human intervention once aub has been configured.

aub also keeps track of binaries for which it has seen some, but not all, of the pieces. It remembers how to find these old pieces, so that when new, previously missing pieces arrive at your site, it will build the entire binary the next time it is run. It also remembers which binaries it has already seen all of the pieces of, so that it does not waste time rebuilding the same binaries over and over again.

aub was created as a time saver; too many people at too many sites were spending way too much time manually unpacking binary files. Its ability to identify and assemble binary images depends on people posting images with subject lines that observe (loosely) established conventions. aub's recognition capabilities have been significantly improved since the earliest release.

2. How does aub work?

aub looks for subject lines containing strings like:

    N of N
    N / N
    N N
    N | N

where N is any number composed of one or more digits, and white space is optional. Once it sees such a line, it tries to figure out a name for the binary by looking at the rest of the subject line. These names are relevant only to aub's internal functioning; when unpacked, binaries are named according to the information they were encoded with. However, it's important that, whatever internal name aub decides on for the binary, that name be recognizable in the subject lines of all pieces.

aub ignores all news articles with null subject lines and subject lines that begin with "Re:", regardless of other content.

aub uses two files which are maintained in each user's home directory. One is $HOME/.aubconf, which is a configuration file that allows you to customize aub's behavior. See section 5 for a detailed explanation of the structure of configuration files.

The other file is $HOME/.aubrc. You should never need to modify this file; aub creates it and maintains it.
It's used to keep track of which articles in which groups aub has already resolved, and which articles aub believes to be pieces of binaries that it hasn't yet seen all of the pieces of.

3. What do I need on my system to run aub?

You will need Larry Wall's perl interpreter. Older versions of aub also required David Mack's uumerge program; this functionality has since been folded into aub for the sake of speed. perl is available via anonymous FTP from uunet.uu.net, tut.cis.ohio-state.edu, and jpl-decvax.jpl.nasa.gov.

Your machine must also have access to news, either via the NNTP protocol, or by being able to open raw news files on a disk somewhere. Previous versions of aub required that your news access be NNTP-based; this restriction has since been lifted.

4. How do I install aub?

There's really only one thing that you might need to configure. aub is a perl script. The first line of the program looks like this:

    #!/usr/local/bin/perl

This tells your shell where to find the perl interpreter. If the path of perl on your system is something else, you'll need to change this line, or create a link called /usr/local/bin/perl which points to where your perl executable actually resides. If you need to change this, you'll probably see a message like:

    'aub: Bad address.'

when you try to run aub.

5. How do I configure aub?

Older versions of aub made use of a configuration file which was normally called $HOME/.aubinit. But few interesting customizations could be accomplished with .aubinit files, because the configuration language was so primitive.

The configuration language has been redesigned to allow much greater flexibility. Old .aubinit files will no longer work, or be recognized by aub (except inasmuch as aub will notice them and point out to you that you need to create a new configuration file if you don't already have one.) The new configuration file for aub should be called $HOME/.aubconf.

Configuration files are line-oriented; each line is processed separately. If any line contains the '#' character, aub concludes that the character begins a comment, and discards the comment character and everything on the line that follows it. If for some reason you need to put a '#' character in your configuration file and do not want it to be interpreted as beginning a comment, you'll have to escape it by preceding it with a backslash character, e.g. '\#'.

Each non-blank line in a configuration file must begin with a keyword recognized by aub. The case of keywords is not significant. As far as aub is concerned, "keyword", "KEYWORD", "Keyword" and "KeYWorD" all mean the same thing. Some keywords require arguments, some require that no arguments appear, and some permit variable numbers of arguments. If aub sees keywords it doesn't understand in your .aubconf file, it will complain to you about them.

One of the keywords aub understands is the GROUP keyword. It's used to tell aub that you want to decode binaries from the newsgroup(s) which appear as argument(s) to the keyword. For example:

    GROUP alt.binaries.pictures.misc
    GROUP alt.binaries.pictures.misc alt.binaries.pictures.fractals

Every configuration file must contain at least one GROUP keyword to be correct.

In general, aub understands two types of keywords. One type is called 'position insensitive', which means that the keyword will have the same effect no matter where in the configuration file it appears. The other type is called 'position sensitive', which means that the keyword means something different when it appears before any GROUP keywords than it does when it appears after any given GROUP keyword.

One such position sensitive keyword is the DIRectory keyword. This keyword is used to tell aub what directory to put the binaries it decodes in. ("DIRectory" is spelled the way it is because only the 'DIR' part needs to appear in a configuration file for aub to recognize it.
In fact, aub will interpret any keyword beginning with the letters 'DIR' as being an instance of the DIRectory keyword.)

When a position sensitive keyword appears _before_ any GROUP keyword, the keyword is interpreted as being the default for all groups that appear later. When a position sensitive keyword appears _after_ any GROUP keyword, it is interpreted as applying *only* to that group, overriding any previous default which may have been established via use of the same keyword, or by the value of environment variables (see section 8.) Position sensitive keywords appearing after a GROUP keyword which lists multiple groups are applied only to the last group listed, not to all groups appearing on the group line.

For example, the following three configuration files are equivalent:

    # Sample .aubconf file no. 1 -- basic example
    #
    dir /tmp/aub                            # Default directory
    group alt.binaries.pictures.misc        # Process these
    group alt.binaries.pictures.fractals    # two groups

    # Sample .aubconf file no. 2 -- multiple group usage, mixed case
    #
    DiR /tmp/aub                            # Default directory
    gRoUp alt.binaries.pictures.misc alt.binaries.pictures.fractals

    # Sample .aubconf file no. 3 -- does not use defaults
    #
    group alt.binaries.pictures.misc
    directory /tmp/aub
    group alt.binaries.pictures.fractals
    direct-to /tmp/aub                      # 'dir' is all you need

The following three configuration files are also equivalent, though not equivalent to the previous three:

    # Sample .aubconf file no. 4 -- explicit placement of binaries
    #
    group alt.binaries.pictures.misc
    dir /tmp/aub/misc
    group alt.binaries.pictures.fractals
    dir /tmp/aub/fractals

    # Sample .aubconf file no. 5 -- explicit and default placement
    #
    dir /tmp/aub/misc                       # Default directory
    group alt.binaries.pictures.misc        # Use default directory
    group alt.binaries.pictures.fractals
    dir /tmp/aub/fractals                   # Override default

    # Sample .aubconf file no. 6 -- explicit and default placement revisited
    #
    dir /tmp/aub/fractals                   # Default directory
    group alt.binaries.pictures.misc
    dir /tmp/aub/misc                       # Override default
    group alt.binaries.pictures.fractals    # Use default directory

The configuration file:

    # Sample .aubconf file no. 7 -- invalid
    #
    group alt.binaries.pictures.misc
    dir /tmp/aub
    group alt.binaries.pictures.fractals    # No good

is invalid, because it specifies no directory in which aub can place binaries decoded from the newsgroup alt.binaries.pictures.fractals. The DIRectory keyword is unique in this regard; there must be some use of the keyword that enables aub to figure out where to put binaries for every group specified, or it will refuse to run. The easiest way to deal with this is to always establish a default directory by using the DIRectory keyword somewhere before any groups appear.

Other position sensitive keywords are available.

DESCription <file>

    This keyword causes aub to extract text from what it thinks is the text portion of posted articles, and append it to the file you specify. This is useful if you're interested in reading the text that describes what all the binaries aub is unpacking are about. A maximum of 60 lines per binary extracted will be put into the file you indicate. Each description is prefixed with the name of the decoded binary it refers to, and the group that binary was decoded from.

HOOK <program>

    This keyword enables you to select which binaries aub decodes using your own software. If the HOOK keyword is specified, aub will invoke the argument program and supply it, via standard input, with the subject line of the first piece of a binary that it can potentially decode. If the program returns true (zero), aub will decode the binary. If the program returns false (non-zero), aub will skip decoding the binary, and continue processing. It is not (yet) possible to specify arguments to the user program.
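The hook contract described above (one subject line on standard input, exit status zero to decode, non-zero to skip) can be sketched in Python as well. This is only an illustration of the decision logic: the names WANTED_SUFFIXES, wants and exit_status, and the choice of suffixes, are assumptions made for the example and are not part of aub.

```python
# Hypothetical decision logic for an aub hook program (illustrative only).

# Suffixes we are willing to decode; this list is an assumption for the sketch.
WANTED_SUFFIXES = (".gif", ".jpg", ".jpeg")


def wants(subject):
    """Return True if the subject line mentions a wanted suffix (case ignored)."""
    lowered = subject.lower()
    return any(suffix in lowered for suffix in WANTED_SUFFIXES)


def exit_status(subject):
    """Map the decision onto the hook convention: 0 = decode, 1 = skip."""
    return 0 if wants(subject) else 1
```

To turn the sketch into a real hook, a script would read one line with sys.stdin.readline() and pass exit_status() of that line to sys.exit(); since aub supplies no arguments to the hook, everything it needs must come from standard input.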
    For example, the following sample program returns true if standard input contains the string ".gif" (case insignificant), and false otherwise.

        #!/usr/local/bin/perl
        #
        # /tmp/sample_aub_hook: a simple, sample hook program
        #
        $sl = <STDIN>;                      # Get standard input
        exit(0) if ($sl =~ m/\.gif/i);      # Contains ".gif"
        exit(1);                            # Didn't see ".gif"

    Suppose this program were attached to aub via the configuration line:

        hook /tmp/sample_aub_hook

    Then aub would only decode binaries whose subject lines contain the string '.gif'. You can write hook programs in any language you choose.

POSTprocess <name> <suffix> ...

    This keyword enables you to postprocess binaries whose names end in the string <suffix> (you can list any number of these suffixes on a single line in the configuration file.) Case is not significant in <suffix>.

    Before a POSTprocess keyword can appear, <name> must first be defined using the DEFine keyword, which is position insensitive. The format of the DEFine keyword is

        DEFine <name> <command>

    <name> may be any string. It's recommended that you stick to alphanumerics. <command> is any UNIX command, with arguments. Simple substitutions are performed on <command> before it's executed, which happens when a POSTprocess keyword names it and a decoded binary's filename ends in one of the suffixes listed as arguments to that POSTprocess keyword.

    This all makes perfect sense but is a little difficult to explain. The following example should make things much clearer. Consider the following configuration file:

        # Sample aub configuration file demonstrating use of a postprocessor
        #
        dir /tmp/aubdir
        define jpg2gif djpeg -G $f > $h_.gif
        postprocess jpg2gif .jpg .jpeg
        group alt.binaries.pictures.misc

    The first line tells aub that it should decode binaries into the directory /tmp/aubdir. The second line defines a postprocessor for aub. The name of the postprocessor is specified as "jpg2gif". The third line says that the postprocessor will be invoked whenever a binary with a name ending in '.jpg' or '.jpeg' is decoded.
    The fourth line specifies the group that binaries are to be decoded from.

    Suppose the binary full_moon.jpeg is decoded from alt.binaries.pictures.misc. The binary name "full_moon.jpeg" can be thought of as consisting of three parts: the head part (everything before the last '.' character), the '.' character itself, and the tail part (everything after the last '.' character). aub uses the abbreviations '$h', '$t', and '$f' to refer to the head part, tail part, and entire filename, respectively. (If no '.' character appears in the name of a decoded binary, $h equals $f, the entire name of the binary, and $t is empty.)

    Because the binary name "full_moon.jpeg" ends in ".jpeg", one of the arguments specified on line three of the sample configuration file, aub invokes the postprocessor "jpg2gif". aub substitutes the appropriate values for '$f' and '$h', in this case "full_moon.jpeg" and "full_moon", into the postprocessor definition, and executes the resulting UNIX command, which in this case is

        'djpeg -G full_moon.jpeg > full_moon_.gif'

    Assuming that you have the djpeg program on your machine (this software is available via anonymous FTP from ftp.uu.net under the graphics/jpeg directory), this command will cause the .jpeg file to be automatically converted into a similarly named .gif file when it is decoded.
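The substitution just described can be sketched in a few lines of Python. This is a rough model under stated assumptions, not aub's actual code: the function names split_name, should_postprocess and expand are made up for the illustration, and the sketch only builds the command string (it does not run it, and it does no shell quoting).

```python
import re


def split_name(filename):
    """Split on the last '.', returning (head, tail); tail is '' if no '.' appears."""
    head, dot, tail = filename.rpartition(".")
    if not dot:
        # No '.' in the name: $h is the whole name and $t is empty.
        return filename, ""
    return head, tail


def should_postprocess(filename, suffixes):
    """True if the filename ends in any listed suffix, ignoring case."""
    return filename.lower().endswith(tuple(s.lower() for s in suffixes))


def expand(command, filename):
    """Replace $f, $h and $t in a postprocessor definition in a single pass."""
    head, tail = split_name(filename)
    values = {"f": filename, "h": head, "t": tail}
    return re.sub(r"\$([fht])", lambda match: values[match.group(1)], command)


if should_postprocess("full_moon.jpeg", [".jpg", ".jpeg"]):
    print(expand("djpeg -G $f > $h_.gif", "full_moon.jpeg"))
```

A plain textual replacement is used rather than Python's string.Template because the definition contains '$h_', which Template would read as a variable named 'h_' instead of '$h' followed by an underscore.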
    A few more examples, again based on the configuration file above:

        Filename of decoded binary    $h               $t     $f
        ------------------------------------------------------------------------------
        crescent_moon.jpg             crescent_moon    jpg    crescent_moon.jpg
        big.dog.gif                   big.dog          gif    big.dog.gif

        Filename of decoded binary    Postprocessed    Reason
        ------------------------------------------------------------------------------
        crescent_moon.jpg             yes              $f ends in '.jpg'
        big.dog.gif                   no               $f doesn't end in '.jpg' or in '.jpeg'

        Filename of decoded binary    UNIX command executed
        ------------------------------------------------------------------------------
        crescent_moon.jpg             djpeg -G crescent_moon.jpg > crescent_moon_.gif
        big.dog.gif                   (none executed)

    We could just as easily have written:

        define jpg2gif djpeg -G $f > $h_.gif ; rm -f $f

    to cause aub to remove the old .jpeg version of the binary after converting it to .gif format. I've added the extra underscore character in this example to decrease the chance that djpeg, when it runs, will clobber another binary which aub already unpacked with the name "full_moon.gif" or "crescent_moon.gif".

    Postprocessor definitions that can't be executed for some reason may cause you (and aub) some problems at run time.

The following keywords are, like DEFine, position insensitive:

NNTP <hostname>

    This tells aub that your news access is NNTP-based, and that it should use the specified host as an NNTP server.

SPOOL <directory>

    This tells aub that your news access is based on access to raw news files, and that <directory> is the root of the news spool tree.

    A single configuration file may not contain both the NNTP and SPOOL keywords. If neither the NNTP keyword nor the SPOOL keyword appears in your configuration file, aub will assume your news access is via NNTP and use your NNTPSERVER environment variable, if it is defined, to decide what server to connect to. If your NNTPSERVER environment variable is not defined, aub will try to figure out where you normally read news from.
    If it can't do that, it will ask you to supply the information.

    If you ever change the mechanism by which you access news, or the server you read news on, you'll need to remove the .aubrc file that aub maintains to keep track of what groups you have and have not read. Otherwise, because articles are numbered differently on different servers, aub will get hopelessly confused. (It's possible, though not recommended, to switch seamlessly back and forth between NNTP and SPOOL access to news on the same host.) This is probably the only time you'll ever want to tamper with a .aubrc file.

DEBUG N

    Sets the default debugging level aub runs at to N. N must be a non-negative integer. Debugging level 0 is the default; when run at debugging level zero, aub produces no output unless it runs into serious problems. Setting the debugging level to 1 will tell you about what aub is doing. Setting the debugging level to 2 will tell you even more about what aub is doing. Setting the debugging level to 3 or higher will show you more than you ever wanted to know.

RECognize <suffix> ...

    The recognition code (the part of aub that identifies binaries) maintains a list of common suffixes that it uses to recognize binaries while it scans subject lines. For example, many binaries have names ending in ".gif", so ".gif" is on aub's internal list of hints. The RECognize keyword allows you to add suffixes to this internal list of hints.

    Use this capability sparingly. You can really give aub a coronary by saying something like 'rec a b c d e f g ...'. Doing something foolish like that will cause your aub to lose the ability to assemble things that it would otherwise have been able to.

    The current list of common suffixes aub maintains is:

        ".gif", ".jpg", ".jpeg", ".gl", ".zip", ".au", ".zoo", ".exe",
        ".dl", ".snd", ".mpg", ".mpeg", ".tiff", ".lzh", ".wav"

NOXHDR

    This keyword is meaningful only if your news access is NNTP-based.
    It will cause aub to not use the XHDR command to access the subject lines of news articles, even if the NNTP server you're using has XHDR capability.

If the same keyword appears multiple times, and the second appearance is not a position sensitive override of some established default, then aub ignores the second instance of the keyword.

7. How do I use aub?

After you've built your configuration file, just run 'aub'.

If this is the first time you've run aub since v1.1, you may want to undefine any AUB-related environment variables you had set. These variables are interpreted differently now. See section 8. You will not need to remove your .aubrc file, but your .aubinit file is no longer useful and you'll probably want to get rid of it once you've created .aubconf.

If this is the first time you've run any version of aub, ever, you may want to use the '-c' command line option. Or you may not...see section 9.

8. Environment variables used by aub.

$AUBDIR

    Sets the default directory binaries are unpacked into. Equivalent to specifying a DIRectory keyword before any GROUP keywords. Will override any DIRectory keyword appearing before any GROUP keyword, but not those appearing after a GROUP keyword.

$AUBDESC

    Analogous to $AUBDIR

$AUBHOOK

    Analogous to $AUBDIR

$NNTPSERVER

    Specifies an NNTP server to use for news access if no NNTP keyword appears in the configuration file. If an NNTP keyword does appear, $NNTPSERVER is ignored.

Note that $AUBGROUPS is no longer used as of version 2.0.3.

If aub doesn't seem to be doing what you'd expect it to do based on your .aubconf file, it could be because your environment variables are causing defaults you've established there to be ignored.

9. Command line options supported by aub:

-c

    'Catch-up' mode; aub will bring its internal pointers (and your .aubrc file) up to date, but will not actually generate any binaries.
    This is useful when you run aub for the first time; it keeps it from generating megabytes and megabytes of binaries as it scans old news articles.

-n

    'No-checkpoint' mode; prohibits aub from updating its internal pointers (your .aubrc file). This option is primarily useful during debugging.

-dn

    'Debug' mode; sets the debugging level to n. This overrides the debugging level set in the configuration file, except that 'aub -d0' does not work...this is a bug.

-M

    Causes aub to print the long form of the documentation (this document.)

-m

    Causes aub to print a summary of the documentation.

-C

    Lists significant changes since the last major release of aub.

10. What do I do if I have problems installing or configuring aub?

See if you can figure out what the problem is. I've only set aub up on my local system, so it's possible you could have problems I haven't foreseen.

If you really can't get it to work, try talking to a friend who knows systems programming and administration type stuff. Offer your friend food -- systems people especially like dim sum and Heineken.

You could also send me mail. Whether or not I answer your mail will depend a lot on how busy I am. Sorry, but I have an obligation to get work done promptly for my client, who's paying me for my time. I can't really deal with supporting aub on the side for the entire net. Also, if your problem has to do with peculiarities of your local site, there may not be a lot I can do about it.

11. What else do I need to know?

In order to guarantee proper administration of the .aubrc file, you can only run one instance of aub at a time. In this respect aub is similar to most newsreaders.

The first time you run aub over a given group, if you choose not to use the -c option, it may take a long time to run. This is because it's looking at all of the articles in the group, and building lots of binaries. After you run it for the first time, it only needs to look at new stuff in the group. Things will go much faster after that.

If aub assembles two binaries with the same name, and wants to store them in the same place, it will compare them to see whether or not they're identical. If they are identical, it will discard the newer copy. If they're not identical, it will append '+' characters as necessary to the name of the second binary until the name is unique.

aub checkpoints its progress in the .aubrc file after processing each group. This keeps it from having to start all over again if it dies of a signal, expired CPU time limit, etc...

aub takes liberties with changing around the names of binaries that it doesn't particularly like. It may rename binaries to be called "Mangled" if people post things that are supposed to be unpacked to "." or "..", or something equally obnoxious, for instance. It will drop the leading "." off of binaries called ".something", and relativize pathnames so that your binaries always wind up in the directories you want them in.

It's unfriendly to run aub so often that you occupy too much of your news server's time.

It's pronounced "oww-buh", as in "S(au)di", not "awe-buh", as in "sl(aw)".

This software is offered as-is, with no guarantees or promises made by me whatsoever. I disclaim all responsibility for loss or damage caused by the program.

Mark Stantz
stantz@sierra.stanford.edu
stantz@sgi.com
8/92