WebAwk 0.1
This is a proof-of-concept of a tool to automate web browsing / data collection.
Briefly, it works like AWK except that instead of working on files and
lines it works on HTML pages and hyperlinks. "It works like AWK"
is literally true insofar as the code generated by a2p "works like awk."
Perl source is here. (Sorry, had to
use .txt extension to make freeservers.com happy.)
For example, here is a WebAwk script that counts references to off-site
URLs:
url !~ base_url { external++; }
url ~ base_url { add_links(); }
END { print external; }
WebAwk is invoked as
webawk <base_url> <base_path> [--proxy] [--verbose] [AWK options]
Note that AWK normally takes its program from the command line; -f is
required to specify a file. WebAwk behaves in the same way.
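For instance, assuming the off-site counter above is saved as count.awk (a hypothetical filename) and run against an example site, an invocation might look like:

  webawk http://www.example.com /tmp/webawk-data --verbose -f count.awk

Here http://www.example.com becomes base_url and /tmp/webawk-data becomes base_path.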
As a debugging aid WebAwk currently prints resulting Perl code to stdout
instead of executing it directly.
Variables available to WebAwk scripts include
- base_url: the URL the script was initially invoked on
- base_path: the root of the saved data tree
- url: the URL currently being processed
- linked_from: the parent of the current URL
- content: the actual data corresponding to the current URL
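As a purely illustrative example of these variables in use, the following pattern-action pair would print each on-site page along with the page it was reached from:

  url ~ base_url { print linked_from, "->", url; add_links(); }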
Functions include
- add_links: add the links in the current page to the list to be processed
- save: save the current page to local storage
As an example of this last function, the following WebAwk script "mirrors"
a site:
{ save(); }
url ~ base_url { add_links(); }
Requirements
- libwww-perl (available at cpan.org)
- high-speed Internet access is strongly recommended, since WebAwk currently loads "content" for every URL regardless of whether it is actually needed. This is especially wasteful where images are concerned.
To Do
- Fix the content-loading scheme.
- Force Perl to free unused memory: currently, the memory usage of a WebAwk-generated program can grow quite large. (As in, "large enough to start your machine swapping like crazy and make all your other processes very unhappy.") Do not run WebAwk on a large site unattended; you have been warned. :)
- Integrate WebAwk's options with AWK's; having two sets of argument handlers seems confusing.
- Get my machine's power supply fixed so I can do the above. (Anyone want to donate a libwww-perl enabled account? :)
Please send any questions or comments to jbe28@email.byu.edu. I'm
particularly interested in ideas along the lines of, "It looks like it
would be really easy to do such-and-such with WebAwk if only the following
additional functionality was added . . ."
Note to proxy users
Libwww-perl's proxy functionality is based on environment variables.
If you need to run WebAwk with --proxy, first set the variable http_proxy.
E.g. here at BYU I have to "export http_proxy=http://proxy.byu.edu:80"
(where :80 specifies the port).
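Putting it together, a proxied run might look like the following (the proxy host, port, and script name here are examples only):

  export http_proxy=http://proxy.example.com:8080
  webawk http://www.example.com /tmp/webawk-data --proxy -f mirror.awk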
Acknowledgements
- WebAwk's awk handling is all from Tom Christiansen's "awk.tchrist" wrapper to a2p (copyright (c) Tom Christiansen 1999)
- Much of the URL processing code is loosely based on WebMirror 1.0 by Felix von Leitner, also available at CPAN if you look hard enough.