
Recursive patterns

Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can be done is to use a pattern that matches up to some fixed depth of nesting. It is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an experimental facility that allows regular expressions to recurse (among other things). The special item (?R) is provided for the specific case of recursion. This PCRE pattern solves the parentheses problem (assume the PCRE_EXTENDED option is set so that white space is ignored):

    \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of substrings which can either be a sequence of non-parentheses, or a recursive match of the pattern itself (i.e. a correctly parenthesized substring). Finally there is a closing parenthesis.
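
As a minimal sketch of the pattern described above, the following wraps it in a small function. The function name `first_balanced_group` and the sample subjects are assumptions made for illustration; only the pattern itself comes from the text.

```php
<?php
// Hypothetical helper: returns the first balanced "(...)" group found in
// $subject, or null when there is none.
function first_balanced_group(string $subject): ?string
{
    // The x modifier (PCRE_EXTENDED) makes PCRE ignore the white space
    // that is used here to keep the pattern readable.
    $pattern = '/ \( ( (?>[^()]+) | (?R) )* \) /x';

    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$nested     = first_balanced_group('call(a, (b + c) * (d)) + 1');
$unbalanced = first_balanced_group('(a(b');

echo $nested, "\n";              // (a, (b + c) * (d))
var_dump($unbalanced === null);  // bool(true)
```

The search is unanchored, so the match starts at the first opening parenthesis that has a balanced counterpart.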

This particular example pattern contains nested unlimited repeats, and so the use of a once-only subpattern for matching strings of non-parentheses is important when applying the pattern to strings that do not match. For example, when it is applied to (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() it yields "no match" quickly. However, if a once-only subpattern is not used, the match runs for a very long time indeed because there are so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported.
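
The effect of the once-only subpattern can be sketched as follows. This is an illustration under stated assumptions, not a benchmark: the patterns are anchored with ^ and $ and recurse through (?1) instead of (?R), because an unanchored search could still find the trailing "()" further along the subject. The run length of 50 is arbitrary. What the non-atomic variant returns depends on the PCRE version and on the pcre.backtrack_limit setting, so no fixed result is claimed for it.

```php
<?php
// An unclosed "(" followed by a run of letters and a trailing "()".
$subject = '(' . str_repeat('a', 50) . '()';

// Anchored variants (an assumption of this sketch). (?1) recurses into
// group 1 only, so ^ and $ stay outside the recursion.
$atomic = '/^(\((?:(?>[^()]+)|(?1))*\))$/';  // once-only subpattern
$plain  = '/^(\((?:[^()]+|(?1))*\))$/';      // nested unlimited repeats

// With the once-only subpattern the failure is reported after a few steps.
$fast = preg_match($atomic, $subject);

// Without it the engine has to try the many ways of splitting the run of
// letters. PHP normally aborts such a match once pcre.backtrack_limit is
// exceeded; preg_match() then returns false.
$slow      = preg_match($plain, $subject);
$slowError = preg_last_error();

var_dump($fast);  // int(0)
var_dump($slow, $slowError === PREG_BACKTRACK_LIMIT_ERROR);
```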

The values set for any capturing subpatterns are those from the outermost level of the recursion at which the subpattern value is set. If the pattern above is matched against (ab(cd)ef) the value for the capturing parentheses is "ef", which is the last value taken on at the top level. If additional parentheses are added, giving

    \( ( ( (?>[^()]+) | (?R) )* ) \)

then the string they capture is "ab(cd)ef", the contents of the top level parentheses.

If there are more than 15 capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a recursion, which it does by using pcre_malloc, freeing it via pcre_free afterwards. If no memory can be obtained, it saves data for the first 15 capturing parentheses only, as there is no way to give an out-of-memory error from within a recursion.
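
The two capture results described above can be checked with a short sketch. The variable names are assumptions; the patterns and the subject are the ones from the text.

```php
<?php
$subject = '(ab(cd)ef)';

// One capturing group around each repeated item: the group keeps the last
// value it took on at the top level of the recursion.
preg_match('/ \( ( (?>[^()]+) | (?R) )* \) /x', $subject, $inner);

// An additional group around the whole repeat captures everything between
// the outermost parentheses.
preg_match('/ \( ( ( (?>[^()]+) | (?R) )* ) \) /x', $subject, $outer);

echo $inner[1], "\n";  // ef
echo $outer[1], "\n";  // ab(cd)ef
```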

(?1), (?2) and so on can be used for recursive subpatterns too. It is also possible to use named subpatterns: (?P>name) or (?&name).
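
A minimal sketch of the numbered and named forms follows. It assumes the Python-style (?P>name) and the Perl-style (?&name) call syntax supported by PCRE. The group name `group`, the anchoring, and the test subjects are illustrative choices.

```php
<?php
// Three ways of recursing into the same subpattern. Because the recursion
// targets a group rather than the whole pattern, ^ and $ can anchor the match.
$patterns = [
    'by number'   => '/^(\((?:[^()]++|(?1))*\))$/',
    'Python name' => '/^(?P<group>\((?:[^()]++|(?P>group))*\))$/',
    'Perl name'   => '/^(?<group>\((?:[^()]++|(?&group))*\))$/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $results[$label] = [
        'balanced'   => preg_match($pattern, '(a(b)c)'),  // 1
        'unbalanced' => preg_match($pattern, '(a(b)c'),   // 0
    ];
}

print_r($results);
```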

If the syntax for a recursive subpattern reference (either by number or by name) is used outside the parentheses to which it refers, it operates like a subroutine in a programming language. An earlier example pointed out that the pattern (sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If instead the pattern (sens|respons)e and (?1)ibility is used, it does match "sense and responsibility" as well as the other two strings. Such references must, however, follow the subpattern to which they refer.
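
The difference between a back reference and a subroutine-style call can be sketched as follows, using the patterns and strings from the paragraph above. The variable names are assumptions.

```php
<?php
$backref    = '/(sens|respons)e and \1ibility/';     // \1 repeats the matched text
$subroutine = '/(sens|respons)e and (?1)ibility/';   // (?1) re-runs the subpattern

$subject = 'sense and responsibility';

$withBackref    = preg_match($backref, $subject);     // 0: "sens" != "respons"
$withSubroutine = preg_match($subroutine, $subject);  // 1

var_dump($withBackref, $withSubroutine);
```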

The maximum length of a subject string is the largest positive number that an integer variable can hold. However, PCRE uses recursion to handle subpatterns and indefinite repetition. This means that the available stack space may limit the size of a subject string that can be processed by certain patterns.

2 user contributed notes:

jonah at nucleussystems dot com
22.12.2010 3:18


An unexpected behavior introduced a very hard-to-track bug in some code I was working on. It has to do with the PREG_OFFSET_CAPTURE flag of preg_match_all. When you capture the offset of a sub-match, its offset is given _relative_ to its parent. For example, if you extract the value between < and > recursively in this string:

<this is a <string>>

You will get an array that looks like this:

Array
(
    [0] => Array
    (
        [0] => Array
        (
            [0] => <this is a <string>>
            [1] => 0
        )
        [1] => Array
        (
            [0] => this is a <string>
            [1] => 1
        )
    )
    [1] => Array
    (
        [0] => Array
        (
            [0] => <string>
            [1] => 0
        )
        [1] => Array
        (
            [0] => string
            [1] => 1
        )
    )
)

Notice that the offset in the last index is one, not the twelve we expected.  The best way to solve this problem is to run over the results with a recursive function, adding the parent's offset.
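
A minimal sketch of such a recursive function follows. It assumes that the nested matches are obtained by running the pattern again on each captured inner string, which is one way the relative offsets above can arise, since PREG_OFFSET_CAPTURE reports offsets relative to the subject passed to that call. The function name `extract_nested`, the pattern, and the result layout are illustrative and are not the note author's code.

```php
<?php
// Hypothetical helper: collects the content of every <...> pair, at any
// nesting depth, together with its byte offset in the ORIGINAL string.
function extract_nested(string $subject, int $base = 0): array
{
    $found = [];

    // Group 1 is the content between one balanced pair of < and >.
    $pattern = '/<((?:[^<>]++|(?R))*)>/';
    if (!preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE)) {
        return $found;
    }

    foreach ($m[1] as $capture) {
        list($inner, $offset) = $capture;

        // $offset is relative to $subject; adding the parent's offset
        // turns it into an absolute position.
        $absolute = $base + $offset;
        $found[]  = ['text' => $inner, 'offset' => $absolute];

        // Recurse into the inner content, passing its absolute offset on.
        $found = array_merge($found, extract_nested($inner, $absolute));
    }

    return $found;
}

$result = extract_nested('<this is a <string>>');
print_r($result);
// "this is a <string>" is reported at offset 1, "string" at offset 12.
```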

emanueledelgrande at email dot it
10.01.2010 0:47


Recursion is the only way for a regular expression to parse HTML code with nested tags of indefinite depth.

It does not seem to be a widespread practice yet: little material on regexp recursion is available on the web, and until now no user contributed notes have been published on this manual page.

I ran several tests with complex patterns to extract tags with specific attributes or namespaces, studying recursion into a subpattern only rather than into the full pattern.

Here's an example that may power a fast LL parser with recursive descent (http://en.wikipedia.org/wiki/Recursive_descent_parser):



$pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/xsm";



A preg_match or preg_match_all call over an average (X)HTML document is quite fast, which may lead you to choose this approach over the classic DOM object methods, which have many limitations and usually perform poorly once their workarounds are added.

Below is a sample application as a brief function (easy to turn into OOP), which returns an array of objects:



<?php
// test function:
function parse($html) {
    // The pattern is split across two lines only to keep the source lines short:
    $pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|".
    "(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/sm";
    preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);
    $elements = array();

    // one object per top-level element found in $html
    foreach ($matches[0] as $key => $match) {
        $elements[] = (object)array(
            'node' => $match[0],
            'offset' => $match[1],
            'tagname' => $matches[1][$key][0],
            'attributes' => isset($matches[2][$key][0]) ? $matches[2][$key][0] : '',
            'omittag' => ($matches[4][$key][1] > -1), // boolean: true for a self-closing tag
            'inner_html' => isset($matches[6][$key][0]) ? $matches[6][$key][0] : ''
        );
    }
    return $elements;
}

// random html nodes as example:
$html = <<<EOD
<div id="airport">
    <div geo:position="1.234324,3.455546" class="index">
        <!-- comment test:
        <div class="index_top" />
        -->
        <div class="element decorator">
                <ul class="lister">
                    <li onclick="javascript:item.showAttribute('desc');">
                        <h3 class="outline">
                            <a href="http://php.net/manual/en/regexp.reference.recursive.php" onclick="openPopup()">Link</a>
                        </h3>
                        <div class="description">Sample description</div>
                    </li>
                </ul>
        </div>
        <div class="clean-line"></div>
    </div>
</div>
<div id="omittag_test" rel="rootChild" />
EOD;

// application:
$elements = parse($html);

if (count($elements) > 0) {
    echo "Elements found: <b>".count($elements)."</b><br />";

    foreach ($elements as $element) {
        echo "<p>Tpl node: <pre>".htmlentities($element->node)."</pre>
        Tagname: <tt>".$element->tagname."</tt><br />
        Attributes: <tt>".$element->attributes."</tt><br />
        Omittag: <tt>".($element->omittag ? 'true' : 'false')."</tt><br />
        Inner HTML: <pre>".htmlentities($element->inner_html)."</pre></p>";
    }
}
?>
