HttpGet Documentation - Version 2.1.0

Copyright 2001-2003, David G. Holm, Berrien Springs, Michigan, USA.

HttpGet is a dual-mode Java application that gets a web page from a web host
and saves it with certain changes. This gives you a static copy of the page
for later viewing; the specific files that the page references are downloaded
when you view the saved copy. HttpGet is intended for use with web sites that
publish files or images on a regular schedule under names that vary with each
publication. HttpGet works with both Java 1 and Java 2.


Command Line Interface Mode:

Java 1 Syntax: jre -cp .;HttpGet.jar Main host [path [port [test]]]

Java 2 Syntax: java -jar HttpGet.jar host [path [port [test]]]

Where host is the host part of the URL to get, path is the path part of the
URL to get (the default is /), port is the host port (the default is 80), and
test generates debug output and gets the web page without translation. Here's
an example that saves the web page at http://www.sluggy.com/daily.php to a
file using Java 2:

	java -jar HttpGet.jar www.sluggy.com /daily.php > sluggy.html
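
The positional arguments and their defaults can be sketched as follows. This
is an illustration only, not HttpGet's actual source: the CliArgs class and
its field names are hypothetical, it assumes a current Java release, and it
assumes that the presence of any fourth argument turns on test mode.

```java
// Hypothetical holder for the arguments "host [path [port [test]]]".
final class CliArgs {
    final String host;
    final String path;
    final int port;
    final boolean test;

    private CliArgs(String host, String path, int port, boolean test) {
        this.host = host;
        this.path = path;
        this.port = port;
        this.test = test;
    }

    /** Applies the documented defaults: path "/" and port 80. */
    static CliArgs parse(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("host is required");
        }
        String host = args[0];
        String path = args.length > 1 ? args[1] : "/";
        // Integer.parseInt throws NumberFormatException for a non-numeric port.
        int port = args.length > 2 ? Integer.parseInt(args[2]) : 80;
        // Assumption: any fourth argument enables test mode.
        boolean test = args.length > 3;
        return new CliArgs(host, path, port, test);
    }
}
```

With the example command above, parse would yield host "www.sluggy.com",
path "/daily.php", port 80, and test mode off.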

Unless the optional test command line parameter is used, HttpGet makes the
following changes to the downloaded web page:

 1) If the web page does not have a BASE HREF tag, then HttpGet adds one to
    the HEAD section using the combined host and path values. For the earlier
    example, the tag ends up as: <BASE HREF="http://www.sluggy.com/daily.php">

 2) All SCRIPT sections are removed from the downloaded page.

 3) All onload, onclose, and onexit event names are removed from all BODY and
    FRAMESET tags.
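
The three changes listed above can be approximated with regular expressions.
The sketch below is not HttpGet's actual implementation: the PageCleaner
class is hypothetical, it relies on java.util.regex (so it needs a newer Java
release than Java 1), and it assumes that attribute values inside BODY and
FRAMESET tags do not contain a ">" character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mimics the three documented page changes.
final class PageCleaner {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern SCRIPT =
        Pattern.compile("<script\\b.*?</script\\s*>", FLAGS);
    private static final Pattern BASE =
        Pattern.compile("<base\\b[^>]*\\bhref\\b", FLAGS);
    private static final Pattern HEAD =
        Pattern.compile("<head\\b[^>]*>", FLAGS);
    private static final Pattern BODY_OR_FRAMESET =
        Pattern.compile("<(?:body|frameset)\\b[^>]*>", FLAGS);
    private static final Pattern EVENT = Pattern.compile(
        "\\s+on(?:load|close|exit)\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s>]+)",
        FLAGS);

    static String clean(String html, String host, String path) {
        String result = html;

        // 1) Add a BASE HREF tag right after the opening HEAD tag if the
        //    page does not already have one.
        if (!BASE.matcher(result).find()) {
            Matcher head = HEAD.matcher(result);
            if (head.find()) {
                String tag = "<BASE HREF=\"http://" + host + path + "\">";
                result = result.substring(0, head.end()) + tag
                       + result.substring(head.end());
            }
        }

        // 2) Remove all SCRIPT sections.
        result = SCRIPT.matcher(result).replaceAll("");

        // 3) Strip onload, onclose, and onexit attributes from BODY and
        //    FRAMESET tags only.
        Matcher tags = BODY_OR_FRAMESET.matcher(result);
        StringBuffer out = new StringBuffer();
        while (tags.find()) {
            String stripped = EVENT.matcher(tags.group()).replaceAll("");
            tags.appendReplacement(out, Matcher.quoteReplacement(stripped));
        }
        tags.appendTail(out);
        return out.toString();
    }
}
```

A real HTML parser would be more robust than regular expressions; the sketch
only shows the order and the effect of the three changes.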


Graphical User Interface Mode:

Java 1 Syntax: jre -cp .;HttpGet.jar;swing.jar Main

Java 2 Syntax: java -jar HttpGet.jar

When HttpGet starts up in GUI mode, it loads the contents of 'schedule.dat'
into a five-column table with the headings "Name" (a descriptive name for a
web page), "Address" (the address of a web page), "Port" (the web server port
number to use, with a default of 80), "Schedule" (see below for details), and
"Status" (one of "Setup", "Connect", "Fetch", "Convert", "Saving", "Done",
"Stopping", "Stopped", or "Failed"). The table will be empty if the file
'schedule.dat' does not exist. There are six (6) buttons, named Fetch, Add,
Change, Save, Delete, and Exit, with a selection status field located between
the Delete and Exit buttons.

The "Schedule" field is a positional field representing the seven days of the
week, starting with Sunday.  An "X" (or an "x") in any position indicates that
the web site is scheduled for retrieval on that day. Use any other characters
(other than a space, because the field is space trimmed when it is saved and
retrieved) to indicate an unscheduled day. You do not have to fill the field
out to seven characters (e.g., "X" means retrieve on Sunday only and "_X_X_X"
means to retrieve on Monday, Wednesday, and Friday).
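
The positional rule for the "Schedule" field can be expressed in a few lines
of Java. This is a minimal sketch, not HttpGet's own code: the ScheduleField
class and its method names are hypothetical, and day index 0 is taken to mean
Sunday, as described above.

```java
import java.util.Calendar;

// Hypothetical helper for interpreting the "Schedule" field.
final class ScheduleField {
    private ScheduleField() {}

    /**
     * Returns true if the schedule marks the given day for retrieval.
     * dayIndex is 0 for Sunday through 6 for Saturday.
     */
    static boolean isScheduled(String schedule, int dayIndex) {
        if (schedule == null || dayIndex < 0 || dayIndex > 6) {
            return false;
        }
        // The field is space trimmed when it is saved and retrieved.
        String trimmed = schedule.trim();
        if (dayIndex >= trimmed.length()) {
            return false;   // a short field leaves the later days unscheduled
        }
        char mark = trimmed.charAt(dayIndex);
        return mark == 'X' || mark == 'x';
    }

    /** Convenience check for the current day of the week. */
    static boolean isScheduledToday(String schedule) {
        // Calendar.SUNDAY is 1, so subtract 1 to get a zero-based index.
        int dayIndex = Calendar.getInstance().get(Calendar.DAY_OF_WEEK) - 1;
        return isScheduled(schedule, dayIndex);
    }
}
```

For example, isScheduled("_X_X_X", 1) is true (Monday) and
isScheduled("_X_X_X", 6) is false (Saturday).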

The Fetch button fetches the contents of the chosen web page(s) or all of the
web pages if none are chosen. The web pages are written to numbered files in
the 'html' subdirectory (which must exist - HttpGet will not create it). The
numbers correspond to the position of the web page in the table. For example,
the fifth web page in the table will be written to 'html/04.htm' (if chosen,
or if all scheduled pages are being saved). If no web pages are chosen, or if
all of the web pages are chosen, then HttpGet also creates a file named
'all.html' in the main directory, with links to all of the files in the 'html'
subdirectory. This file is designed to be used with the 'index.html' file that
is included in the HttpGet.ZIP archive and sets up two frames: A left frame
for the 'all.html' list and a right frame for each web page (this frame is
initially loaded with the file 'help.html'). If you stop a fetch of all web
sites before the fetch completes, then the 'all.html' file will only have
links to the web pages prior to the one that shows the "Stopped" status. A
total of
five (5) passes are made through the table. On the first pass, an attempt is
made to fetch each (or each chosen) web page. On subsequent passes, only the
failed web sites are attempted. This maximizes the chances of fetching all
(or all chosen) web pages successfully.
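
The multi-pass retry behavior and the numbered file names can be sketched as
follows. This is an illustration under stated assumptions, not HttpGet's own
code: the RetryPasses class and its Fetcher callback are hypothetical, a
current Java release is assumed, and the zero-based, two-digit file numbering
is inferred from the 'html/04.htm' example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the five-pass fetch loop.
final class RetryPasses {
    /** Hypothetical callback standing in for the real page fetch. */
    interface Fetcher {
        boolean fetch(int row);   // true if the page was fetched successfully
    }

    static final int MAX_PASSES = 5;

    /** Returns the rows that still failed after all passes. */
    static List<Integer> fetchAll(List<Integer> rows, Fetcher fetcher) {
        List<Integer> pending = new ArrayList<Integer>(rows);
        for (int pass = 1; pass <= MAX_PASSES && !pending.isEmpty(); pass++) {
            List<Integer> failed = new ArrayList<Integer>();
            for (Integer row : pending) {
                if (!fetcher.fetch(row.intValue())) {
                    failed.add(row);
                }
            }
            // Later passes retry only the rows that failed on this pass.
            pending = failed;
        }
        return pending;
    }

    /** Assumed naming rule: zero-based row 4 gives "html/04.htm". */
    static String fileNameFor(int row) {
        return "html/" + (row < 10 ? "0" : "") + row + ".htm";
    }
}
```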

The Add button adds one or more blank rows to the table. If one or more rows
are selected, a blank row is inserted ahead of each selected row. Otherwise,
a blank row is appended to the end of the table. Double click on the name
column to add a site name (for example, "Sluggy Freelance"). Double click on
the address column (or tab over to it) to add a web page address (for example,
"www.sluggy.com/daily.php"). You do not need to include an "http://" prefix.

The Change button brings up a dialog that lets you edit the Name, Address,
and Port values for the highlighted web page (or the first one if more than
one is highlighted). In addition to the three fields, this dialog has an OK
button and a Cancel button. The OK button accepts the changes without any
prompting, unless the port value is not an integer, in which case a warning
dialog is displayed and the change dialog remains open. The Cancel button
discards any changes without prompting, but if you use the window close
button after changes have been made to the web page settings, a save prompt
is displayed.

The Save button saves the contents of the table to the file 'schedule.dat',
using an intermediate file named 'schedule.da@' in order to reduce the risk
of data loss. (HttpGet first saves the schedule to the intermediate file,
then deletes the original file, and finally renames the intermediate file to
the original file name.)
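
The write, delete, and rename sequence can be sketched with java.io.File.
This is a minimal illustration, not HttpGet's own code: the ScheduleSaver
class and its save method are hypothetical names.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper that saves through an intermediate file.
final class ScheduleSaver {
    static void save(File target, File intermediate, String contents)
            throws IOException {
        // Step 1: write the new contents to the intermediate file.
        Writer out = new FileWriter(intermediate);
        try {
            out.write(contents);
        } finally {
            out.close();
        }
        // Step 2: delete the original file, if there is one.
        if (target.exists() && !target.delete()) {
            throw new IOException("Could not delete " + target);
        }
        // Step 3: rename the intermediate file to the original file name.
        if (!intermediate.renameTo(target)) {
            throw new IOException(
                "Could not rename " + intermediate + " to " + target);
        }
    }
}
```

If the write in step 1 fails, the original file is still untouched, which is
the point of going through the intermediate file. A call might look like
save(new File("schedule.dat"), new File("schedule.da@"), text).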

The Delete button deletes all of the selected rows. There is no confirmation,
because the deletion isn't permanent until you use the Save button.

The selection status field displays how many table rows are selected.

The Exit button exits the HttpGet program, unless the table has been changed,
in which case a confirmation prompt is displayed. Choosing Yes exits without
saving the changes. Choosing No keeps the program running.

Note: For Fetch, Add, and Delete, multiple consecutive and non-consecutive
selections are possible. Use Click and Shift+Click to select the first (or
only) consecutive range. Use Ctrl+Click for lone non-consecutive selections.
Use Ctrl+Click and Ctrl+Shift+Click for additional consecutive ranges.

The columns can be moved around and the window can be resized, but the new
positions and sizes are not saved and the window always starts up with the
same initial size and column positions.

Double-clicking on a row has the same effect as selecting that one row and
then clicking on the Fetch button.

Right-clicking on a row has the same effect as selecting that one row, but
has the additional effect of bringing up a context menu, from which you can
choose any one of Add, Change, Delete, and Fetch, which results in the same
action as selecting the one row and then clicking on the corresponding button.


Command Line Interface Schedule Mode:

Java 1 Syntax: jre -cp .;HttpGet.jar Main -schedule

Java 2 Syntax: java -jar HttpGet.jar -schedule

When HttpGet starts up in CLI schedule mode, it loads the contents of the
file 'schedule.dat' and processes it as if you were in the GUI mode and had
activated the "Fetch" button with no web pages selected, which fetches only
the web pages scheduled to be fetched today. But instead of operating in GUI
mode, the program operates in CLI mode and sends all status messages to the
stdout device.


CSV format description for schedule.dat file:

The schedule.dat file is a comma-separated values file, with quote marks around
each data field, commas separating consecutive data fields, and either a single
LF character or a CR LF character pair terminating each data record. Each record
is one entry in the schedule table when using the GUI mode. The first data field
is the name of the web site. The second data field is the URL of the web site.
The third data field is the optional port used to access the web site (if a port
value is not specified, then port 80 is used). The fourth data field is the web
site access schedule (see the description of the "Schedule" field in the GUI mode
section for details). The fifth data field is the last status that was assigned
to the web site. When adding a new record to the file for use with the command
line interface schedule mode, leave this data field blank.
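
As an illustration of the format, a record for the earlier example might look
like this (the schedule value is made up, and the status field is left blank):

"Sluggy Freelance","www.sluggy.com/daily.php","80","_X_X_X",""

A record like this can be split into its fields with a short Java sketch. The
ScheduleRecord class is hypothetical and is not part of HttpGet. It assumes a
current Java release and assumes that field values never contain a quote
mark, because this document does not describe any escaping rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for one record of schedule.dat.
final class ScheduleRecord {
    /** Returns the text between each pair of quote marks, in order. */
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<String>();
        int pos = 0;
        while (true) {
            int open = line.indexOf('"', pos);
            if (open < 0) {
                break;              // no more fields
            }
            int close = line.indexOf('"', open + 1);
            if (close < 0) {
                break;              // unterminated field; stop parsing
            }
            fields.add(line.substring(open + 1, close));
            pos = close + 1;
        }
        return fields;
    }
}
```

Because only the quoted text is kept, a trailing CR or LF on the line does not
affect the result.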