Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
92 views
in Technique[技术] by (71.8m points)

php - Truncate text containing HTML, ignoring tags

I want to truncate some text (loaded from a database or text file), but it contains HTML so as a result the tags are included and less text will be returned. This can then result in tags not being closed, or being partially closed (so Tidy may not work properly and there is still less content). How can I truncate based on the text (and probably stopping when you get to a table as that could cause more complex issues).

substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

Would result in:

Hello, my <strong>name</st...

What I would want is:

Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

How can I do this?

While my question is for how to do it in PHP, it would be good to know how to do it in C#... either should be OK as I think I would be able to port the method over (unless it is a built in method).

Also note that I have included an HTML entity &acute; - which would have to be considered as a single character (rather than 7 characters as in this example).

strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.

question from:https://stackoverflow.com/questions/1193500/truncate-text-containing-html-ignoring-tags

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".

<?php
header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    // For UTF-8, we need to count multibyte sequences as one character.
    $re = $isUtf8
        ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[x80-xFF][x80-xBF]*}'
        : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);
        if ($printedLength >= $maxLength) break;

        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 multibyte sequence through unchanged.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}


printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("
");

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("
");

printTruncated(10, "<em><b>Hello</b>&#20;wxC3xB8rld!</em>"); print("
");

Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported, just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.

(You should always be using UTF-8, though.)

Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

56.8k users

...