How to Extract URLs from a Web Page using Shell Script

Posted on 8:09 PM by Bharathvn

A straightforward application of lynx in a shell script is to extract the list of URLs found on a given web page, which can be quite helpful in a variety of situations.

The Code

#!/bin/sh
# getlinks - Given a URL, returns all of its internal and
#   external links.

if [ $# -eq 0 ] ; then
  echo "Usage: $0 [-d|-i|-x] url" >&2
  echo "-d=domains only, -i=internal refs only, -x=external only" >&2
  exit 1
fi

if [ $# -gt 1 ] ; then
  case "$1" in
    -d) lastcmd="cut -d/ -f3 | sort | uniq"
        shift
        ;;
    -i) basedomain="http://$(echo $2 | cut -d/ -f3)/"
        lastcmd="grep \"^$basedomain\" | sed \"s|$basedomain||g\" | sort | uniq"
        shift
        ;;
    -x) basedomain="http://$(echo $2 | cut -d/ -f3)/"
        lastcmd="grep -v \"^$basedomain\" | sort | uniq"
        shift
        ;;
     *) echo "$0: unknown option specified: $1" >&2
        exit 1
  esac
else
  lastcmd="sort | uniq"
fi

lynx -dump "$1" | \
sed -n '/^References$/,$p' | \
grep -E '[[:digit:]]+\.' | \
awk '{print $2}' | \
cut -d\? -f1 | \
eval $lastcmd

exit 0

How It Works
When displaying a page, lynx shows the text of the page, formatted as best it can, followed by a list of all hypertext references, or links, found on that page. This script extracts just the links by using a sed invocation to print everything from the "References" line onward in the lynx output, and then processes that list of links as needed based on the user-specified flags.
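To see what that pipeline is working with, here is a rough sketch of the tail end of lynx -dump output and the extraction stages applied to it. The page, URLs, and output shown are hypothetical, purely to illustrate the format:

$ lynx -dump http://www.example.com/ | sed -n '/^References$/,$p'
References

   1. http://www.example.com/about/
   2. http://www.example.com/contact/
   3. http://www.othersite.com/page.html

$ lynx -dump http://www.example.com/ | sed -n '/^References$/,$p' | \
  grep -E '[[:digit:]]+\.' | awk '{print $2}'
http://www.example.com/about/
http://www.example.com/contact/
http://www.othersite.com/page.html

The grep stage keeps only the numbered link lines (dropping the "References" header and any blank lines), and awk prints the second field of each, which is the URL itself.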

The one interesting technique demonstrated by this script is the way the variable lastcmd is set to a filter pipeline that trims the extracted list of links according to the flags specified by the user. Once lastcmd is set, the amazingly handy eval command is used to force the shell to interpret the content of the variable as if it were a command, not just a value.
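Here is a minimal illustration of that eval technique, using a made-up list of links and the same default filter the script uses:

links="http://www.example.com/a
http://www.othersite.com/b
http://www.example.com/a"
lastcmd="sort | uniq"
# Without eval, the shell would pass "|" and "uniq" as literal arguments to sort;
# with eval, the variable's contents are re-parsed and run as a real pipeline.
echo "$links" | eval $lastcmd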

Running the Script
By default, this script outputs a list of all links found on the specified web page, not just those that are prefaced with http:. There are three optional command flags that can be specified to change the results, however: -d produces just the domain names of all matching URLs, -i produces a list of just the internal references (that is, those references that are found on the same server as the current page), and -x produces just the external references, those URLs that point to a different server.
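Behind the -d, -i, and -x flags, the script identifies the host by grabbing the third slash-separated field of the URL with cut. A quick sketch with a hypothetical URL shows why that works (the first field is "http:" and the second is the empty string between the two slashes):

$ echo "http://www.example.com/some/page.html" | cut -d/ -f3
www.example.com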

The Results
A simple request is a list of all links on a specified website home page:

$ getlinks

Another possibility is to request a list of all domain names referenced at a specific site. This time let's first use the standard Unix tool wc to check how many links are found overall:

$ getlinks http://www.amazon.com/ | wc -l

Amazon has 136 links on its home page. Impressive! Now, how many different domains does that represent? Let's generate a full list with the -d flag:

$ getlinks -d http://www.amazon.com/

As you can see, Amazon doesn't tend to point anywhere else. Other sites are different, of course. As an example, here's a list of all external links in my weblog:

$ getlinks -x