I have a table with ID,TEXT,etc columns
Here TEXT is clob column which contains data in HTML FORMAT
SELECT ID,TEXT FROM TABLE WHERE ID=1000
I'm getting output for text column as below
<p>NAME: XXX<br />Company Name: YYYYY<br />Location: ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND<br />Type: PrePA<br />
Team: Team1, Dues tamble <br />Date: January 25 – 26, 2016<br />Rating: Tr 2<br />Number: 8554342</p>
<p><u>Observ: <br /></u>There were (6) major and (2) minor .<br /> <br />MAJOR</p><ul> <li>Sample Text_1.</li> <li>Sample Text_2.</li>
<li>Sample Text_33.</li> <li>Sample Text_4.</li> <li>Sample Text_5.</li></ul><p>MINOR</p><ul> <li>Sample Text_7</li>
<li>Sample Text_8<br /> </li></ul><p><b> Background</b><br /> </p>
I need output like below
NAME: XXX
Company Name: YYYYY
Location: ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND
Type: PrePA
Team: Team1, Dues tamble
Date: January 25 – 26, 2016
Rating: Tr 2
Number: 8554342
Observ:
There were (6) major and (2) minor .
MAJOR
Sample Text_1.
Sample Text_2.
Sample Text_33.
Sample Text_4.
Sample Text_5.
MINOR
Sample Text_7
Sample Text_8
Background
Which means the query should handle HTML data and should provide out like exact data like above.
I have a query beolw which handling somwhat okey but it's not handling the
exactly the HTML tags.
select ID,
trim(regexp_replace (trim
(regexp_replace( TEXT ,'<[^>]+/>|</[^>]+>|<p>',CHR(13)||CHR(10)))
,'<[^>]+>',''))
from TABLE where ID='1000'
Acually i need to handle all the html tags in the data not only the which i showed is sample.
Searching for a 'perfect' query might be a lost cause; browser (including text browsers, as recommended elsewhere) have spent years working out all the kinks and edge cases.
If you really can't use an external resource you could walk the DOM via the dbms_xmldom
package. Here's a demo using an anonymous block and your sample value, which prints to the screen (if you have that enabled); but you could easily adapt it in to a function that returns a CLOB, or whatever you need as an end result.
set serveroutput on
set define off
declare
l_element xmldom.domelement;
l_document xmldom.domdocument;
l_text clob;
procedure print_node (p_node xmldom.domnode) is
l_nodes dbms_xmldom.domnodelist;
begin
-- print out any plain text
if dbms_xmldom.getnodetype(p_node) = 3
and dbms_xmldom.getnodename(p_node) = '#text' then
dbms_output.put_line(dbms_xmldom.getnodevalue(p_node));
end if;
-- just to match your expected output, add a blank line for p tags
if dbms_xmldom.getnodetype(p_node) = 1
and dbms_xmldom.getnodename(p_node) = 'p' then
dbms_output.new_line;
end if;
-- get any child nodes
l_nodes := dbms_xmldom.getchildnodes(p_node);
-- process each node in turn, recursively
for i in 0..dbms_xmldom.getlength(l_nodes) - 1 loop
print_node(dbms_xmldom.item(l_nodes, i));
end loop;
end print_node;
begin
l_text := '<p>NAME: XXX<br />Company Name: YYYYY<br />Location: ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND<br />Type: PrePA<br />Team: Team1, Dues tamble <br />Date: January 25 – 26, 2016<br />Rating: Tr 2<br />Number: 8554342</p>
<p><u>Observ: <br /></u>There were (6) major and (2) minor .<br /> <br />MAJOR</p><ul> <li>Sample Text_1.</li> <li>Sample Text_2.</li>
<li>Sample Text_33.</li> <li>Sample Text_4.</li> <li>Sample Text_5.</li></ul><p>MINOR</p><ul> <li>Sample Text_7</li>
<li>Sample Text_8<br /> </li></ul><p><b> Background</b><br /> </p>';
-- wrap the fragment in a root node so it parses, and run through
-- utl_i18n.unescape_reference to get rid of etc.
l_document := dbms_xmldom.newdomdocument('<html>'
|| utl_i18n.unescape_reference(l_text) || '</html>');
-- get the root element (which is now the added html)
l_element := dbms_xmldom.getdocumentelement(l_document);
-- call the recursive procedure to process this node
print_node(dbms_xmldom.makenode(l_element));
end;
/
Which produces:
NAME: XXX
Company Name: YYYYY
Location: ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND
Type: PrePA
Team: Team1, Dues tamble
Date: January 25 – 26, 2016
Rating: Tr 2
Number: 8554342
Observ:
There were (6) major and (2) minor .
MAJOR
Sample Text_1.
Sample Text_2.
Sample Text_33.
Sample Text_4.
Sample Text_5.
MINOR
Sample Text_7
Sample Text_8
Background
Create a (simple) HTML parser.
Oracle Setup:
SET DEFINE OFF;
CREATE OR REPLACE AND COMPILE JAVA SOURCE NAMED HTMLTOTEXT AS
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;
public class HTMLToText {
private HTMLToText(){};
private static class TextStripper extends ParserCallback {
private final StringBuffer buffer = new StringBuffer();
private static final HashMap<Tag, String> START_TAGS = new HashMap<Tag, String>();
private static final HashMap<Tag, String> END_TAGS = new HashMap<Tag, String>();
private static final String NEWLINE = "\r\n";
static {
START_TAGS.put(Tag.BR, NEWLINE);
START_TAGS.put(Tag.P, NEWLINE);
START_TAGS.put(Tag.LI, NEWLINE);
START_TAGS.put(Tag.DT, NEWLINE);
START_TAGS.put(Tag.DL, NEWLINE);
START_TAGS.put(Tag.OL, NEWLINE);
START_TAGS.put(Tag.UL, NEWLINE);
START_TAGS.put(Tag.TR, NEWLINE);
START_TAGS.put(Tag.TD, "\t");
START_TAGS.put(Tag.TH, "\t");
END_TAGS.put(Tag.P, NEWLINE);
END_TAGS.put(Tag.LI, NEWLINE);
END_TAGS.put(Tag.DD, NEWLINE);
END_TAGS.put(Tag.DL, NEWLINE);
END_TAGS.put(Tag.OL, NEWLINE);
END_TAGS.put(Tag.UL, NEWLINE);
END_TAGS.put(Tag.TR, NEWLINE);
}
private boolean newline = true;
@Override
public void handleText( final char[] data, final int pos ){
buffer.append(data);
newline = false;
}
@Override
public void handleStartTag( final Tag tag, final MutableAttributeSet attribute, final int pos ){
if ( !newline && START_TAGS.containsKey( tag ) )
{
final String value = START_TAGS.get( tag );
buffer.append( value );
newline = value.equals(NEWLINE);
}
}
@Override
public void handleEndTag( final Tag tag, final int pos ){
if ( !newline && END_TAGS.containsKey( tag ) )
{
final String value = END_TAGS.get( tag );
buffer.append( value );
newline = value.equals(NEWLINE);
}
}
@Override
public void handleSimpleTag( final Tag tag, final MutableAttributeSet attribute, final int pos ){
handleStartTag( tag, attribute, pos );
handleEndTag( tag, pos );
}
@Override
public void handleComment( final char[] data, final int pos ){}
@Override
public void handleError( final String errMsg, final int pos ){}
public String getText(){
return buffer.toString();
}
}
private static final ParserDelegator DELEGATOR = new ParserDelegator();
public static String extractText( final String html ) throws IOException{
TextStripper stripper = new TextStripper();
DELEGATOR.parse( new StringReader( html ), stripper, true );
return stripper.getText();
}
};
/
CREATE OR REPLACE FUNCTION HTML_TO_TEXT(
in_html IN VARCHAR2
) RETURN VARCHAR2
AS LANGUAGE JAVA NAME 'HTMLToText.extractText( java.lang.String ) return java.lang.String';
/
SHOW ERRORS;
/
Query:
SELECT HTML_TO_TEXT( '<html><body>Text<p>para<br>graph</p><table><tr><th>R1</th><td>C1</td><td>C2</td></tr><tr><th>R2</th><td>C1</td><td>C2</td></tr></table></body></html>' ) FROM DUAL;
Output:
HTML_TO_TEXT('<HTML><BODY>TEXT<P>PARA<BR>GRAPH</P><TABLE><TR><TH>R1</TH><TD>C1</
--------------------------------------------------------------------------------
Text
para
graph
R1 C1 C2
R2 C1 C2
Consider running the HTML through lynx -stdin
to render it as plain text. Lynx is a web browser for text consoles.