NAME
scrape - display a snippet of text parsed from a web page.
SYNOPSIS
scrape -tokens TOKENS -dirs DIRECTIONS -url URL [options]
DESCRIPTION
Mandatory arguments:
-tokens <token_list_string>
A list of tokens (delimited by spaces) which the user has
determined to be sufficient to consistently point the
parser to the exact location of the required data. For
example
-tokens <title> </title>
would return all characters between these two HTML tags.
-dirs <direction_list_string>
A list of directions associated with each token. 0
instructs the parser to look in the forward direction,
while 1 instructs the parser to look in the reverse
direction. In the above example, we would have
-dirs 0 0
since we will look for <title> from the beginning of
the file in the forward direction, and then continue
in the forward direction to find </title>. However,
we could also specify
-tokens </title> <title -dirs 0 1
which would first find </title> and then reverse search
for <title. This is useful if an HTML tag may carry attributes,
since you then cannot assume that the ">" follows the tag name
directly, as it does in <title>.
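The token walk described above can be sketched in Python. This is only an illustration, not the actual scrape implementation. It assumes that each search starts from the previous match and that the text returned is whatever lies between the last two matches. The function name scrape_text is hypothetical.

```python
def scrape_text(page, tokens, dirs):
    """Return the text between the last two token matches, or None.

    Assumption (not confirmed by the scrape documentation): a forward
    search (0) starts at the end of the previous match, and a reverse
    search (1) ends at the start of the previous match.
    """
    if len(tokens) < 2 or len(tokens) != len(dirs):
        raise ValueError("need at least two tokens and one direction per token")

    prev = (0, 0)  # span of the match before the latest one
    last = (0, 0)  # span of the latest match; begins at the top of the page
    for token, direction in zip(tokens, dirs):
        if direction == 0:
            start = page.find(token, last[1])      # look forward
        else:
            start = page.rfind(token, 0, last[0])  # look backward
        if start == -1:
            return None                            # token not found
        prev, last = last, (start, start + len(token))

    # The last two matches may be in either order on the page.
    first, second = sorted([prev, last])
    return page[first[1]:second[0]]


page = "<html><title>Hello</title></html>"
print(scrape_text(page, ["<title>", "</title>"], [0, 0]))  # prints Hello
```

Under the same assumptions, a reverse search picks the nearest earlier match, so -tokens </ol> <ol> -dirs 0 1 would select the first list on a page that has several.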
-url <url>
The full URL of the desired web page. Write it in the same
format whether the variables are to be sent by GET or by POST.
Example: -url http://www.site.com?var1=${a}&var2=${b}
Example: -url http://www.site2.com?var=%s
Example: -url http://www.site3.com/%s.html
The website name and the first HTTP variable must be separated
by a "?", and each subsequent HTTP variable by a "&".
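As a rough sketch of how such a URL template might be filled in and then split at the "?", consider the Python below. The helper names fill_url and split_url are hypothetical, and the URL-encoding of the substituted values is an assumption. It does not show how YubNub itself performs the substitution.

```python
import re
from urllib.parse import quote_plus


def fill_url(template, positional="", named=None):
    """Replace %s and ${name} placeholders with URL-encoded values."""
    named = named or {}
    # %s takes the positional argument text (assumed URL-encoded).
    url = template.replace("%s", quote_plus(positional))
    # ${name} takes the matching named value; missing names become empty.
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: quote_plus(named.get(m.group(1), "")),
        url,
    )


def split_url(url):
    """Split a URL into the part before the first "?" and the variables after it."""
    base, _, query = url.partition("?")
    return base, query


url = fill_url("http://www.site.com?var1=${a}&var2=${b}", named={"a": "x y", "b": "z"})
print(url)             # prints http://www.site.com?var1=x+y&var2=z
print(split_url(url))  # prints ('http://www.site.com', 'var1=x+y&var2=z')
```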
Optional arguments:
-method <http_method = get>
If not included, the variables will be sent using the HTTP
GET method. If set to anything else (including, of course,
post), the variables will be sent using the HTTP POST method.
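The effect of -method can be pictured with the following Python sketch, which builds (but does not send) a request with the standard urllib library. The function name build_request is hypothetical. Treating the value case-insensitively and moving the variables into the request body for POST are assumptions about how scrape behaves.

```python
from urllib.request import Request


def build_request(url, method="get"):
    """Build a GET request, or a POST request carrying the URL's variables."""
    base, _, query = url.partition("?")
    if method.lower() == "get":
        # GET: variables stay in the URL after the "?".
        return Request(url)
    # Anything else: variables travel in the body, which makes urllib use POST.
    return Request(base, data=query.encode("utf-8"))


get_req = build_request("http://www.site.com?var1=a&var2=b")
post_req = build_request("http://www.site.com?var1=a&var2=b", "post")
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```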
-textonly <display_method = 1>
If not included, this defaults to 1 and the scraped text is
returned as a simple string of text, easily fed into other
YubNub functions. If set to anything else, scrape will return
the scraped text to a "mock" YubNub command line (see the
defw, defn, and postalcode commands).
-debug <debugger = 0>
If not included, no debugging information will be returned.
If set to 1, some debugging information will be returned.
This may help you see why a certain scrape is not working.
Forthcoming arguments:
- a character offset argument
- a length argument
- a word offset argument
- a word delimiter argument
- a numwords argument
EXAMPLE
1. The qpostal command sends the user to the Canada Post site
which displays the requested postal code:
qpostal -n 1708 -s charles -t court -c val caron -p ON
To scrape this postal code from this web page, you would
have to examine the HTML of the above Canada Post site, and
identify various tokens that will consistently guide the
parser to the postal code string. One could define a command
called qpostal_lite in this way:
scrape -tokens >Postal Code< <tr> </tr> tblcell <br> >
-dirs 0 0 0 1 0 1 -method post
-url http://www.canadapost.ca/...
(see the yndesturl variable of the
qpostal command for the full URL)
Typing:
qpostal_lite -n 1708 -s charles -t court -c val caron -p ON
will now return a simple string to your browser, which can
now be piped into other commands which expect a postal code
as an argument: cbc_pc {qpostal_lite ...}
2. The wikt command sends the user to the Wiktionary site
which displays the requested definition:
wikt eon
To scrape this definition from this page, you would again
examine the HTML of the above Wiktionary site, and
identify various tokens that will consistently guide the
parser to the definition string. This is often difficult.
The following command returns the first definition on the
page, which is not always the one wanted. Define wikt_lite as:
scrape -tokens </ol> <ol> -dirs 0 1 -textonly 0
-url http://en.wiktionary.org/wiki/%s
Typing wikt_lite eon will now return the definition to a "mock"
YubNub command line prompt. If the textonly switch is dropped,
the definition is returned to the browser, and can also be
piped into other YubNub commands expecting a string as an
argument: fspell {wikt_lite eon}
NOTES
This command is very beta at the moment. Please bear with
me. Comments and suggestions are welcome at the YubNub Google
group (type yubgroup at the YubNub command line).
AUTHOR
Sean O'Hagan