WebAwk 0.1

This is a proof-of-concept tool for automating web browsing and data collection.  Briefly, it works like AWK, except that instead of working on files and lines it works on HTML pages and hyperlinks.  "It works like AWK" is literally true insofar as the code generated by a2p (the AWK-to-Perl translator) "works like awk."  Perl source is here.  (Sorry, I had to use a .txt extension to make freeservers.com happy.)
 
For example, here is a WebAwk script that counts references to off-site URLs:

url !~ base_url { external++; }
url ~ base_url { add_links(); }
END { print external; }
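
For readers more at home in a general-purpose language, the pattern-action loop above can be sketched roughly in Python.  This is only an illustration, not WebAwk's actual implementation.  It assumes that each hyperlink is one record, that add_links() queues the links found on the page at url, and that each distinct URL is processed once.  FAKE_WEB and count_external are hypothetical names, and the in-memory dictionary stands in for real HTTP fetches.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> hyperlinks found on that page.
# It replaces real HTTP fetches so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": [
        "http://example.com/a.html",
        "http://other.org/",
    ],
    "http://example.com/a.html": [
        "http://example.com/",
        "http://example.net/x",
        "http://example.com/b.html",
    ],
    "http://example.com/b.html": [],
}


def count_external(base_url, web):
    """Process hyperlinks as records and count those that are off-site."""
    # AWK would treat base_url as a regex; here it is matched literally
    # (an assumption made to keep the sketch predictable).
    pattern = re.compile(re.escape(base_url))
    queue = deque([base_url])  # records still to be processed
    seen = {base_url}          # assumption: each URL is processed once
    external = 0

    while queue:
        url = queue.popleft()
        if not pattern.search(url):
            # url !~ base_url { external++; }
            external += 1
        else:
            # url ~ base_url { add_links(); }
            for link in web.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)

    # END { print external; }
    return external


if __name__ == "__main__":
    print(count_external("http://example.com/", FAKE_WEB))
```

The two branches correspond one-to-one to the two pattern-action rules, and the return value plays the role of the END block.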

WebAwk is invoked as
webawk <base_url> <base_path> [--proxy] [--verbose] [AWK options]

Note that AWK normally takes its program from the command line; -f is required to specify a file.  WebAwk behaves in the same way.

As a debugging aid, WebAwk currently prints the resulting Perl code to stdout instead of executing it directly.

Variables available to WebAwk scripts include

Functions include

As an example of this last function, the following WebAwk script "mirrors" a site:

{ save(); }
url ~ base_url { add_links(); }
 

Requirements

To Do

Please send any questions or comments to jbe28@email.byu.edu.  I'm particularly interested in ideas along the lines of, "It looks like it would be really easy to do such-and-such with WebAwk if only the following additional functionality were added ..."
 

Note to proxy users

Libwww-perl's proxy functionality is based on environment variables.  If you need to run WebAwk with --proxy, first set the environment variable http_proxy.  For example, here at BYU I have to run "export http_proxy=http://proxy.byu.edu:80" (where :80 specifies the port).
 

Acknowledgements