WebAwk 0.1
This is a proof-of-concept of a tool to automate web browsing / data collection.
Briefly, it works like AWK except that instead of working on files and
lines it works on HTML pages and hyperlinks. "It works like AWK"
is literally true insofar as the code generated by a2p "works like awk."
Perl source is here. (Sorry, had to
use .txt extension to make freeservers.com happy.)
For example, here is a WebAwk script that counts references to off-site
URLs:
url !~ base_url { external++; }
url ~ base_url { add_links(); }
END { print external; }
WebAwk is invoked as
webawk <base_url> <base_path> [--proxy] [--verbose] [AWK options]
Note that AWK normally takes its program from the command line; -f is
required to specify a file. WebAwk behaves in the same way.
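For instance, assuming the off-site counter above is saved as count.awk (a hypothetical filename) and run against an example site, an invocation might look like:

  webawk http://www.example.com /tmp/webawk-data --verbose -f count.awk

Here http://www.example.com becomes base_url and /tmp/webawk-data becomes base_path.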
As a debugging aid WebAwk currently prints resulting Perl code to stdout
instead of executing it directly.
Variables available to WebAwk scripts include
- base_url: the URL the script was initially invoked on
- base_path: the root of the saved data tree
- url: the URL currently being processed
- linked_from: the parent of the current URL
- content: the actual data corresponding to the current URL
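As a purely illustrative example of these variables in use, the following pattern-action pair would print each on-site page along with the page it was reached from:

  url ~ base_url { print linked_from, "->", url; add_links(); }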
Functions include
- add_links: add the links in the current page to the list to be processed
- save: save the current page to local storage
As an example of this last function, the following WebAwk script "mirrors"
a site:
{ save(); }
url ~ base_url { add_links(); }
Requirements
- libwww-perl (available at cpan.org)
- high-speed Internet access is strongly recommended, since WebAwk currently loads "content" for every URL regardless of whether it is actually needed. This is especially wasteful where images are concerned.
To Do
- Fix the content-loading scheme.
- Force Perl to free unused memory: currently, the memory usage of a WebAwk-generated program can grow quite large. (As in, "large enough to start your machine swapping like crazy and make all your other processes very unhappy.") Do not run WebAwk on a large site unattended; you have been warned. :)
- Integrate WebAwk's options with AWK's; having two sets of argument handlers seems confusing.
- Get my machine's power supply fixed so I can do the above. (Anyone want to donate a libwww-perl enabled account? :)
Please send any questions or comments to jbe28@email.byu.edu. I'm
particularly interested in ideas along the lines of, "It looks like it
would be really easy to do such-and-such with WebAwk if only the following
additional functionality was added . . ."
Note to proxy users
Libwww-perl's proxy functionality is based on environment variables.
If you need to run WebAwk with --proxy, first set the variable http_proxy.
E.g. here at BYU I have to "export http_proxy=http://proxy.byu.edu:80"
(where :80 specifies the port).
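Putting it together, a proxied run might look like the following (the proxy host, port, and script name here are examples only):

  export http_proxy=http://proxy.example.com:8080
  webawk http://www.example.com /tmp/webawk-data --proxy -f mirror.awk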
Acknowledgements
- WebAwk's awk handling is all from Tom Christiansen's "awk.tchrist" wrapper to a2p (copyright (c) Tom Christiansen 1999)
- Much of the URL processing code is loosely based on WebMirror 1.0 by Felix von Leitner, also available at CPAN if you look hard enough.