I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags are counted and less visible text is returned. Truncating can also leave tags unclosed or cut in half (so Tidy may not work properly, and there is still less content). How can I truncate based on the text length only (probably stopping when I reach a table, as that could cause more complex issues)?
substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."
Would result in:
Hello, my <strong>name</st...
What I would want is:
Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
How can I do this?
While my question is about how to do it in PHP, it would be good to know how to do it in C# as well. Either should be OK, as I think I would be able to port the method over (unless it is a built-in method).
Also note that I have included an HTML entity, &acute;, which would have to be counted as a single character (rather than 7 characters, as in this example).
strip_tags is a fallback, but I would lose formatting and links, and it would still have the problem with HTML entities.
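To make the entity problem concrete, here is a small check. It assumes UTF-8 strings and the mbstring extension; the variable names are only for illustration.

```php
<?php
// Assumes UTF-8 strings and the mbstring extension.
$html = 'I&acute;m';

// Counting the raw markup: the entity alone takes 7 characters.
$rawLength = strlen($html);

// Decode entities first, then count characters rather than bytes.
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$textLength = mb_strlen($text, 'UTF-8');

echo $rawLength, ' vs ', $textLength, "\n"; // 9 vs 3
```

Decoding before truncating is not enough on its own, because the truncated text then has to be re-encoded and the tags still need handling.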
Here is another slight change to Søren Løvborg's printTruncated function, making it UTF-8 compatible (it needs mbstring) and making it return a string rather than print one. I think that's more useful. Also, my code doesn't use output buffering like Bounce's variant, just one more variable.
UPD: to make it work properly with UTF-8 characters in tag attributes you need the mb_preg_match function, listed below.
Many thanks to Søren Løvborg for that function, it's very good.
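As a rough illustration of the approach described here (walk the markup, count only visible characters with mbstring, and return a string), a minimal sketch might look like the following. truncateHtmlSketch and its parameters are hypothetical names, not Søren Løvborg's printTruncated or the variant from this answer. It assumes UTF-8 input and well-formed tags with no '>' inside attribute values.

```php
<?php
// Hypothetical sketch; assumes UTF-8 input, the mbstring extension,
// and well-formed tags with no '>' inside attribute values.
function truncateHtmlSketch(string $html, int $maxLength, string $ellipsis = '...'): string
{
    // Void elements never get a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Tokens: a tag or comment, an entity, a run of text, or a stray '<' / '&'.
    preg_match_all('/<[^>]+>|&#?[a-zA-Z0-9]+;|[^<&]+|[<&]/u', $html, $matches);

    $out = '';
    $count = 0;          // visible characters emitted so far
    $open = [];          // stack of open element names
    $truncated = false;

    foreach ($matches[0] ?? [] as $token) {
        if ($token[0] === '<' && strlen($token) > 1) {
            // Markup costs nothing; just keep the stack of open elements current.
            if (preg_match('/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>$/', $token, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    if (end($open) === $name) {
                        array_pop($open);
                    }
                } elseif ($m[3] !== '/' && !in_array($name, $void, true)) {
                    $open[] = $name;
                }
            }
            $out .= $token;
        } elseif ($token[0] === '&' && strlen($token) > 1) {
            // An entity such as &acute; counts as one character.
            if ($count >= $maxLength) {
                $truncated = true;
                break;
            }
            $out .= $token;
            $count++;
        } else {
            // Plain text: take as many characters as the budget allows.
            $remaining = $maxLength - $count;
            $length = mb_strlen($token, 'UTF-8');
            if ($length > $remaining) {
                $out .= mb_substr($token, 0, $remaining, 'UTF-8');
                $truncated = true;
                break;
            }
            $out .= $token;
            $count += $length;
        }
    }

    if ($truncated) {
        $out .= $ellipsis;
    }
    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo truncateHtmlSketch(
    'Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.',
    26
), "\n";
// Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
```

In this sketch the ellipsis goes before any closing tags, so a cut inside <strong> yields "na...</strong>". It does not stop at tables or skip invisible elements such as <script>.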
The CakePHP framework has an HTML-aware truncate() function in the TextHelper that works for me. See Core-Helpers/Text. MIT license.
Bounce added multi-byte character support to Søren Løvborg's solution. I've added handling for tags that don't get closed, such as <hr>, <br>, <col> etc. (in HTML a '/' is not required at the end of these, although it is in XHTML), and &hellip; (i.e. …) as the truncation marker. All this at Pastie.
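To show why tags like <br> need special treatment when working out what to close after a cut, here is a minimal sketch. unclosedTags is a hypothetical helper, not the code from Pastie, and it assumes well-formed tags with no '>' inside attribute values.

```php
<?php
// Hypothetical helper: names of the elements still open at the end of $html.
function unclosedTags(string $html): array
{
    // Void elements have no closing tag in HTML; XHTML writes them as <br />.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    $open = [];
    preg_match_all('/<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        // Skip void elements and XHTML-style self-closing tags.
        if (in_array($name, $void, true) || substr($tag[0], -2) === '/>') {
            continue;
        }
        if ($tag[1] === '/') {
            // A closing tag pops its matching opener, if that is on top.
            if (end($open) === $name) {
                array_pop($open);
            }
        } else {
            $open[] = $name;
        }
    }
    return $open;
}

print_r(unclosedTags('<div><p>One<br>two<hr><strong>three'));
// div, p, strong (br and hr are not reported)
```

Appending "</name>" for each entry in reverse order would then close the fragment without ever emitting an invalid </br> or </hr>.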
I've written a function that truncates HTML just as you suggest, but instead of printing the result it keeps it all in a string variable. It handles HTML entities as well.
You can use Tidy as well:
Use the function truncateHTML() from: https://github.com/jlgrall/truncateHTML
Example: truncate after 9 characters including the ellipsis:
Features: UTF-8, configurable ellipsis, include/exclude length of ellipsis, self-closing tags, collapsing spaces, invisible elements (<head>, <script>, <noscript>, <style>, <!-- comments -->), HTML &entities;, truncating at last whole word (with option to still truncate very long words), PHP 5.6 and 7.0+, 240+ unit tests, returns a string (doesn't use the output buffer), and well-commented code.
I wrote this function because I really liked Søren Løvborg's function above (especially how he managed encodings), but I needed a bit more functionality and flexibility.
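The whole-word option and the "include the ellipsis in the length" option from the feature list can be illustrated on plain text. This is only a sketch under my own assumptions: truncateAtWord and its parameters are hypothetical and are not the API of truncateHTML(). It needs mbstring and UTF-8 input.

```php
<?php
// Hypothetical helper on plain text; not the API of truncateHTML().
function truncateAtWord(string $text, int $max, string $ellipsis = '…', bool $ellipsisCounts = true): string
{
    if (mb_strlen($text, 'UTF-8') <= $max) {
        return $text; // nothing to cut
    }

    // Optionally reserve room for the ellipsis inside the limit.
    $budget = $ellipsisCounts ? $max - mb_strlen($ellipsis, 'UTF-8') : $max;
    $budget = max(0, $budget);

    $cut = mb_substr($text, 0, $budget, 'UTF-8');
    $next = mb_substr($text, $budget, 1, 'UTF-8');

    // If the cut fell inside a word, back up to the last space.
    if ($next !== ' ') {
        $space = mb_strrpos($cut, ' ', 0, 'UTF-8');
        if ($space !== false) {
            $cut = mb_substr($cut, 0, $space, 'UTF-8');
        }
        // No space at all: a very long word, so keep the hard cut.
    }

    return rtrim($cut) . $ellipsis;
}

echo truncateAtWord('Hello wonderful world', 9), "\n";  // Hello…
echo truncateAtWord('Supercalifragilistic', 9), "\n";   // Supercal…
```

With the ellipsis counted, a limit of 9 leaves 8 characters for text, which is why the second call keeps only "Supercal".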