NAME

    HTML::TokeParser::CSSSelect::Content - easily get textual content from HTML with CSS selectors.


SYNOPSIS

    use HTML::TokeParser::CSSSelect::Content;
    
    my $parser = HTML::TokeParser::CSSSelect::Content->new(
        \$html_string,
    );
    
    my $parser2 = HTML::TokeParser::CSSSelect::Content->new(
        $html_file,
        options => {
            skip_inline => 1,
            skip_nested => 3,
            start_after => { seen => '#navigation' },
            stop_after  => { seen3 => '/table' },
        },
    );
    
    my $all_text = $parser->fetch();
    
    my @info_paragraphs = $parser->fetch('#content .info p');
    
    my $data_ref = $parser2->fetch(
        {
            col1 => 'table.data td.name',
            col2 => 'table.data td.syntax pre',
            col3 => 'table.data td.desc, table.data td.desc *',
        }
    );
    
    print join "\n--\n", @info_paragraphs;
    
    printf (
        "%5s %5s %10s\n",
        $data_ref->{col1}[ $_ ],
        $data_ref->{col2}[ $_ ],
        $data_ref->{col3}[ $_ ]
    ) for 0 .. $#{ $data_ref->{col1} };


DESCRIPTION

HTML::TokeParser::CSSSelect::Content is a module for easily fetching textual content from HTML using Cascading Style Sheets (CSS) selectors.

It can be configured to exclude text from inline-level elements, as well as to skip content below a certain nesting depth (see the OPTIONS HASHREF section of the new method).

It is also possible to start parsing only after "seeing" a certain element and to stop after seeing another.


METHODS

new

    my $p = HTML::TokeParser::CSSSelect::Content->new( \$html_string );
    my $p2 = HTML::TokeParser::CSSSelect::Content->new( $html_file );

HTML::TokeParser::CSSSelect::Content uses (it does NOT inherit from) HTML::TokeParser::Simple. The arguments passed to new() are passed on to the new() method of HTML::TokeParser::Simple, so new() accepts the same parameters plus one special options hashref.

    my $p = HTML::TokeParser::CSSSelect::Content->new(
        $html_file,
        options => {
            skip_inline => 1,
            start_after => { seen => '#navigation' },
            stop_after  => { seen3 => '/table' },
        },
    );

OPTIONS HASHREF

The options hashref is where you configure how the parser behaves. The available options are as follows:

skip_inline
    ->new( options => { skip_inline => 1 } );

Defaults to false.

When the skip_inline option is set to a true value, the parser will NOT include text from inline-level elements. In other words, if your HTML has:

    <p>Foo <em>Bar</em> <a href="#beer">Beers</a> </p>

The normal behaviour of fetching with selector 'p' would produce text:

    'Foo Bar Beers'

Whereas with skip_inline option set to a true value the result would be:

    'Foo'

(If you are wondering about the trailing spaces, see the keep_space option.)

skip_nested
    ->new( options => { skip_nested => 2 } );
    ->new( options => { skip_nested => 3 } );

Defaults to undef

This option tells the parser to not include content of nested elements.

When set to undef (the default value), all nested elements are included. Otherwise it accepts a non-negative integer indicating the nesting level to skip. Consider the following HTML snippet:

    <div class="foo">
        <!-- nest level 0 here -->
        <div>
            <!-- nest level 1 here -->
            <p>
                <!-- nest level 2 here -->
                Bar
            </p>
        </div>
        <!-- nest level 0 here -->
        <div>
            <!-- nest level 1 here -->
            <p>Foo</p> <!-- nest level 2 inside the <p> -->
            <div>
                <!-- nest level 2 here -->
                <p>Beer <em>Bars</em></p> <!-- Beer is on 3 and Bars is on level 4! -->
            </div>
        </div>
    </div>

And consider the parsing to be done with:

    my @data = $p->fetch('.foo');

By default, @data will contain one element, the string 'Bar Foo Beer Bars'.

If the skip_nested option is set to 0 or 1, the resulting string will be empty.

If the skip_nested option is set to 2, the resulting string will be 'Bar Foo'.

If the skip_nested option is set to 3, the resulting string will be 'Bar Foo Beer'.

If the skip_nested option is set to 4 or anything higher, the resulting string will be 'Bar Foo Beer Bars'.

Think of the value as the number of elements the parser passes through, after matching your selector, before it reaches the text.

Important Note: if skip_nested is set to 0, fetching the data with fetch('.foo *') (note the universal selector there) results in @data containing four elements: 'Bar', 'Foo', 'Beer' and 'Bars'. If skip_nested is left at the default undef, @data contains two elements: the string 'Bar' and the string 'Foo Beer Bars'.

The call to new will croak if the value of skip_nested is defined and doesn't match /^\d+$/.
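
The check described above can be sketched in plain Perl. This is only an illustration of the stated rule, not the module's own code: check_skip_nested is a hypothetical helper name and the error message is made up.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper (not part of the module's API): accepts undef
# or a string of digits, and croaks on anything else.
sub check_skip_nested {
    my ($value) = @_;
    return 1 unless defined $value;    # undef: include all nesting levels
    croak "skip_nested must match /^\\d+\$/"
        unless $value =~ /^\d+$/;
    return 1;
}

for my $value ( undef, 0, 3, -1, '2.5', 'two' ) {
    my $ok    = eval { check_skip_nested($value) } ? 'accepted' : 'rejected';
    my $label = defined $value ? "'$value'" : 'undef';
    print "$label: $ok\n";
}
```

Note that 0 passes the pattern, which is consistent with the examples in this document that set skip_nested to 0.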

start_after
    ->new( options => { start_after => { seen => '#foo'   } } );
    ->new( options => { start_after => { seen2 => '/#bar' } } );
    ->new(
        options => {
            start_after => {
                seen  => '#foo',
                seen2 => '/.bar',
                seen3 => 'pre,p',
                seen4 => [ 'span,b' ],
                seen5 => {
                    'and' => 'em',
                    'or'  => 'div',
                },
            },
        },
    );

Defaults to undef, meaning start at the beginning.

The start_after option tells the parser when to start getting data. It takes a hashref as a value. If set to undef or an empty hashref, the parser will start from the beginning of the document.

Currently only the seen option is implemented (but see the postfix trick below). Its value is a CSS selector, which can be prefixed with a '/' (slash) to indicate an end tag (see the is_end_tag() method of HTML::TokeParser::Simple). Otherwise it is taken to be a start tag (see the is_start_tag() method of HTML::TokeParser::Simple).

The seen key may be postfixed with a number to indicate the number of occurrences after which to start parsing. In other words:

    { start_after => { seen => 'div.foo' } }

Will start parsing after the first occurrence of <div class="foo">, whereas

    { start_after => { seen2 => 'div.foo' } }

Will start parsing after the second occurrence of <div class="foo"> (in other words: <div class="foo">This won't get parsed</div><div class="foo">But THIS will</div>).

    { start_after => { seen3 => '/div.foo' } }

And this will start parsing after the third closing </div> tag for <div class="foo">.
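
To make the two conventions above concrete (the numeric postfix on the key and the leading slash on the selector), here is a small sketch in plain Perl. parse_seen is a hypothetical helper, not something the module exports, and the returned structure is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: splits a "seenN" key into an occurrence count
# and a selector value into an end-tag flag plus the bare selector.
sub parse_seen {
    my ( $key, $selector ) = @_;
    my ($count) = $key =~ /^seen(\d*)$/
        or die "Unrecognised key '$key'";
    $count = 1 unless length $count;    # plain "seen": first occurrence
    my $is_end_tag = $selector =~ s{^/}{} ? 1 : 0;    # leading slash: end tag
    return { count => $count, end_tag => $is_end_tag, selector => $selector };
}

for my $pair ( [ seen => 'div.foo' ], [ seen3 => '/div.foo' ] ) {
    my $rule = parse_seen(@$pair);
    printf "%-5s => %-10s count=%d end_tag=%d selector=%s\n",
        @$pair, @{$rule}{qw( count end_tag selector )};
}
```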

It is possible to combine selectors and "seen" postfixes. For example:

    start_after => {
        seen  => 'div,p',
        seen2 => '.foo',
    }

Would start parsing after coming across either a <div> or a <p> element, or after seeing an element with class="foo" twice.

It is also possible to build "and" conditions.

If the value of a seen key is an arrayref, it is treated as "and". If the value is a hashref, the 'and' key should contain selectors to be joined with an 'AND' and the 'or' key should contain selectors to be joined with an 'OR'. The following examples make the concept clearer:

    start_after => { seen => [ 'div, p, pre' ] }
    # start parsing after coming across <div> AND <p> AND <pre>
    
    start_after => { seen => [ 'div, /.foo, pre' ] }
    # start parsing after coming across opening <div> AND closing
    # tag for an element with class="foo" AND opening tag for <pre>
    
    start_after => { seen => 'div', seen2 => [ 'pre,p' ] }
    # start parsing after coming across <div> OR seeing BOTH <pre>
    # AND <p> twice.
    
    start_after => { seen => [ 'div' ], seen2 => [ 'pre,p' ] }
    # start parsing after coming across a <div> AND seeing BOTH
    # <pre> AND <p> twice
    
    start_after => {
        seen  => { 'and' => 'div,p', 'or' => 'pre' },
        seen2 => [ 'span' ],
        seen3 => 'em',
    }
    # start parsing after satisfying EITHER ONE of following:
    # * - seen three <em>
    # * - seen one <pre>
    # * - seen one <div>, one <p> AND two <span>s

To explain further: if you disregard the number of times the parser needs to come across a certain element, all the selectors inside arrayrefs and 'and' keys can be put into one big list and joined with an 'AND', while the selectors that are not in an arrayref, together with the selectors in 'or' keys, are joined with an 'OR'.
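
The AND/OR rule described above can be modelled in a few lines of plain Perl. This is a sketch of the documented semantics only, not the module's implementation: condition_met is a hypothetical function, and selectors are compared as literal strings against a hash of counts instead of being matched as real CSS selectors.

```perl
use strict;
use warnings;

# Hypothetical model, not the module's code. %$seen maps a selector to
# the number of times it has matched so far; selectors are compared as
# literal strings, which real CSS matching would not do.
sub condition_met {
    my ( $rules, $seen ) = @_;
    my ( @and, @or );
    for my $key ( keys %$rules ) {
        my ($count) = $key =~ /^seen(\d*)$/ or next;
        $count = 1 unless length $count;
        my $value = $rules->{$key};
        my %group
            = ref $value eq 'ARRAY' ? ( 'and' => join ',', @$value )
            : ref $value eq 'HASH'  ? %$value
            :                         ( 'or' => $value );
        for my $type ( keys %group ) {
            # any key other than 'and' is treated as 'or' in this sketch
            my @pairs = map { [ $_, $count ] }
                grep { length } split /\s*,\s*/, $group{$type};
            push @{ $type eq 'and' ? \@and : \@or }, @pairs;
        }
    }
    my $reached = sub { ( $seen->{ $_[0][0] } // 0 ) >= $_[0][1] };
    return 1 if grep { $reached->($_) } @or;                 # any OR item
    return 1 if @and && !grep { !$reached->($_) } @and;      # every AND item
    return 0;
}

my $rules = {
    seen  => { 'and' => 'div,p', 'or' => 'pre' },
    seen2 => ['span'],
    seen3 => 'em',
};
print condition_met( $rules, { em  => 3 } ) ? "start\n" : "wait\n";
print condition_met( $rules, { div => 1, p => 1, span => 1 } )
    ? "start\n" : "wait\n";
```

With the rules from the last example above, three <em> elements satisfy the OR part on their own, while one <div>, one <p> and a single <span> do not yet satisfy the AND part.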

stop_after
    ->new( options => { stop_after => { seen => '#foo'   } } );
    ->new( options => { stop_after => { seen2 => '/.bar' } } );
    ->new(
        options => {
            stop_after => {
                seen  => '#foo',
                seen2 => '/.bar',
                seen3 => 'pre,p',
                seen4 => [ 'span,b' ],
                seen5 => {
                    'and' => 'em',
                    'or'  => 'div',
                },
            },
        },
    );

Defaults to undef, meaning finish at the end.

This option tells the parser when to stop parsing. It works exactly like the start_after option (see above), except that it marks the end of the parse.

    ->new( options => { stop_after => { seen => '.foo' } } );

Stop after seeing <div class="foo">.

    ->new( options => { stop_after => { seen2 => '.foo' } } );

Stop after seeing two <div class="foo">.

    ->new( options => { stop_after => { seen3 => '/.foo' } } );

Stop after seeing three closing </div> tags for <div class="foo">.

keep_space
    ->new( options => { keep_space => 1 } );

Defaults to false.

When the keep_space option is set to a false value, any leading or trailing white-space will be stripped. Thus, if we have:

    <div id="foo"><p> Foo </p><p> Bar </p> </div>

And we fetch it with:

    my @data = $p->fetch('#foo *');

@data will contain two elements, 'Foo' and 'Bar'.

However, the same fetch call with keep_space option set to a true value will result in @data containing ' Foo ' and ' Bar '.

Note: suppose the above fetch is called in scalar context (see the description of the fetch method) and the separator option (see below) is left at its default (a single space):


    my $data = $p->fetch('#foo *');

With keep_space set to a true value, the scalar ($data) would end up containing the string ' Foo   Bar ' (three spaces between the words Foo and Bar: two from the actual data and one from the separator). Think of it as a simple join on the list-context result.
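
The arithmetic of the spaces can be checked without the module. The sketch below only imitates the described behaviour with plain Perl: @raw stands in for the text of the two <p> elements, and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my @raw = ( ' Foo ', ' Bar ' );    # text of the two <p> elements, untouched

# keep_space false (the default): strip leading and trailing white-space
my @trimmed = map { my $t = $_; $t =~ s/^\s+|\s+$//g; $t } @raw;

my $separator = ' ';               # default value of the separator option
my $kept      = join $separator, @raw;
my $stripped  = join $separator, @trimmed;

print "[$kept]\n";        # [ Foo   Bar ]
print "[$stripped]\n";    # [Foo Bar]
```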

separator
    ->new( options => { separator => "\n" } );

Defaults to a single space character (' ').

This is IMO a useless option which I included for the sake of completeness. When the fetch method is called in scalar context and its argument is a scalar (see the fetch method for details), the return value is made with a join on the data you would get by calling fetch in list context. The separator option specifies what to join with. The following two examples print exactly the same result.

    my $p = HTML::TokeParser::CSSSelect::Content->new(
        \$html,
        options => { separator => "\n" },
    );
    my $data = $p->fetch('p');
    print $data;
    
    # Is exactly the same as
    
    my $p2 = HTML::TokeParser::CSSSelect::Content->new( \$html );
    my @data = $p2->fetch('p');
    my $joined = join "\n", @data;
    print $joined;

(Of course, one would choose not to be so explicit; this is just an example.)

fetch

    my $all_text = $p->fetch();
    my @rows = $p->fetch('tr *');
    # possibility, but better written as fetch('ol,ul')
    # or even fetch('li')
    my @lists = $p->fetch( [ 'ul', 'ol' ] );
    my $data_ref = $p->fetch(
        {
            name => 'td.name',
            code => 'td.code',
            desc => 'td.description',
        }
    );

This method fetches the content from the document. It accepts an optional argument which can be either a scalar, an arrayref or a hashref.

Calling without arguments

Without any arguments, fetch will strip all HTML code from the document and return all textual content as a scalar.

Scalar argument

When the argument is a scalar it is treated as a CSS selector. The method returns the textual content of the matching elements (make sure to check out the skip_inline and skip_nested options of the new method). In list context it returns a list of strings containing the text of the matching elements, starting with the outermost ancestor. In other words, calling fetch('div') on:

    <div>
        Foo
        <div>Bar</div>
    </div>

Will return a list with one element, the string 'Foo Bar'. It is possible to control how the returned list is populated with the skip_nested option of the new method. For example, using:

    my $p = HTML::TokeParser::CSSSelect::Content->new(
        \$html,
        options => { skip_nested => 0 },
    );
    my @data = $p->fetch('div');

With $html containing the HTML snippet above, this would return a two-item list containing the strings 'Foo' and 'Bar'.

Arrayref argument

If the argument to fetch is an arrayref, its items will be joined with a comma (',') to form a single selector. Thus the following two are equivalent:

    my @selectors = qw( pre li p );
    
    my @data = $p->fetch( \@selectors );
    # same as
    my @data = $p->fetch( join q|,|, @selectors);

Context

When fetch is called with a scalar or arrayref argument in scalar context, it returns a single string: the parsed data joined with whatever the separator option of the new method is set to. See the description of the separator option for more details and an example.

Performance

Performance note: each time fetch is called, the entire HTML document is parsed from start to end. Thus, if you want to split data depending on which selector matched, consider using fetch with a hashref argument (see below) instead of calling fetch several times with different selectors.

Hashref argument

When fetch is called with a hashref argument it will return a hashref. In the argument, the keys are the "data cell" names and the values are CSS selectors for the elements to get the data from. Consider the following HTML snippet:

    <p>Foo</p>
    <div>
        <p>Bar</p>
        <p>Beer</p>
    </div>

We would like to differentiate between the <p> element outside the <div> and the <p> elements inside it. First, we set skip_nested to 0 (see the OPTIONS HASHREF description in the new method), so that the data of nested elements does not end up in the outer element's data.

    my $p = HTML::TokeParser::CSSSelect::Content->new(
        \$html,
        options => { skip_nested => 0 },
    );
    my $data_ref = $p->fetch(
        {
            outer => 'p',
            inner => 'div p',
        }
    );

Now $data_ref will be a hashref with two keys: outer and inner. The value for outer will be a string 'Foo' and the value for inner will be a string 'Bar Beer'.

Note: data cells are independent, thus when using the following code:

    my $data_ref = $p->fetch( { col1 => 'div', col2 => 'div' } );

$data_ref will contain two keys, col1 and col2, with exactly the same values.


ACCESSORS/MUTATORS

All options from the OPTIONS HASHREF of the new method can be accessed and manipulated using the following accessors/mutators:

skip_inline
    my $are_we_skipping_inline = $p->skip_inline();
    $p->skip_inline(1);

When called with an argument, sets the skip_inline option to that value. See OPTIONS HASHREF in the new method for a description.

Returns the current value of the skip_inline option.

skip_nested
    my $skip_level = $p->skip_nested();
    $p->skip_nested(3);

When called with an argument, sets the skip_nested option to that value. See OPTIONS HASHREF in the new method for a description.

Returns the current value of the skip_nested option.

start_after
    my $start_after_ref = $p->start_after();
    $p->start_after( { seen => '#foo' } );

When called with an argument (which should be a hashref), sets the start_after option to that value. See OPTIONS HASHREF in the new method for a description.

Returns the current value of the start_after option.

stop_after
    my $stop_after_ref = $p->stop_after();
    $p->stop_after( { seen => '/.bar', seen2 => '.beer' } );

When called with an argument (which should be a hashref), sets the stop_after option to that value. See OPTIONS HASHREF in the new method for a description.

Returns the current value of the stop_after option.

keep_space
    my $do_keep_space = $p->keep_space();
    $p->keep_space(1);

When called with an argument, sets the keep_space option to that value. See OPTIONS HASHREF in the new method for a description.

Returns the current value of the keep_space option.

separator
    my $separator = $p->separator();
    $p->separator("\n -- \n");

When called with an argument, sets the separator option to that value. See OPTIONS HASHREF in the new method for a description.

Returns the current value of the separator option.


AUTHORS

Zoffix Znet