I've found this very interesting function on internet:
CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
SELECT regexp_replace(regexp_replace($1, E'(?x)<[^>]*?(\s alt \s* = \s* ([\'"]) ([^>]*?) \2) [^>]*? >', E'\3'), E'(?x)(< [^>]*? >)', '', 'g')
$$ LANGUAGE SQL;
But it doesn't remove html codes like: "
Is it possible to remove them using regexp_replace?
Yes it is possible to replace HTML or other character entities with the respective characters using a function. First create a character entity table:
This is the function:
Use it like this:
It only works for UTF-8.
This classic quote may apply here: Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. Regex are useful, but HTML parsing is not a job they're well suited for. Jeff Atwood explains this well. To strip tags from HTML correctly some kind of parsing is necessary.
What I would recommend is that you use a more powerful PL like
PL/Perl
orPL/Pythonu
to invoke mature and well tested HTML-stripping libraries. For example, you could use Perl'sHTML::Strip
via aplperl
function that acceptstext
and returnstext
.The quick and dirty way to handle this would be to use another layer of
regexp_replace
expressions to convert entities. This will rapidly lead you down the path alluded to by Igor though, and is best avoided by using tools that aready exist. For example, if you useHTML::Strip
it'll useHTML::Entities
to convert entities for you as part of the process.