handling HTML data in Oracle query

2019-06-01 04:52发布

问题:

I have a table with ID,TEXT,etc columns Here TEXT is clob column which contains data in HTML FORMAT

SELECT ID,TEXT FROM TABLE WHERE ID=1000

I'm getting output for text column as below

<p>NAME:&nbsp;&nbsp;XXX<br />Company Name: &nbsp;YYYYY<br />Location:&nbsp;ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND<br />Type:&nbsp;PrePA<br />
Team:&nbsp;Team1, Dues tamble <br />Date:&nbsp;January 25 &ndash; 26, 2016<br />Rating: &nbsp;Tr 2<br />Number:&nbsp;8554342</p>
<p><u>Observ: <br /></u>There were (6) major and (2) minor .<br />&nbsp;<br />MAJOR</p><ul>    <li>Sample Text_1.</li>    <li>Sample Text_2.</li>    
<li>Sample Text_33.</li>    <li>Sample Text_4.</li>    <li>Sample Text_5.</li></ul><p>MINOR</p><ul>    <li>Sample Text_7</li>    
<li>Sample Text_8<br />    &nbsp;</li></ul><p><b> Background</b><br />&nbsp;</p>

I need output like below

NAME:  XXX
Company Name:  YYYYY
Location: ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND
Type: PrePA
Team: Team1, Dues tamble 
Date: January 25 – 26, 2016
Rating:  Tr 2
Number: 8554342

Observ: 
There were (6) major and (2) minor .

MAJOR

Sample Text_1. 
Sample Text_2. 
Sample Text_33. 
Sample Text_4. 
Sample Text_5.
MINOR

Sample Text_7 
Sample Text_8

Background

Which means the query should handle HTML data and should provide out like exact data like above.

I have a query beolw which handling somwhat okey but it's not handling the

exactly the HTML tags.
select ID,
trim(regexp_replace (trim
(regexp_replace( TEXT ,'<[^>]+/>|</[^>]+>|<p>',CHR(13)||CHR(10)))
,'<[^>]+>',''))
from TABLE where ID='1000'

Acually i need to handle all the html tags in the data not only the which i showed is sample.

回答1:

Searching for a 'perfect' query might be a lost cause; browser (including text browsers, as recommended elsewhere) have spent years working out all the kinks and edge cases.

If you really can't use an external resource you could walk the DOM via the dbms_xmldom package. Here's a demo using an anonymous block and your sample value, which prints to the screen (if you have that enabled); but you could easily adapt it in to a function that returns a CLOB, or whatever you need as an end result.

set serveroutput on
set define off

declare     
  l_element xmldom.domelement;
  l_document xmldom.domdocument;
  l_text clob;

  procedure print_node (p_node xmldom.domnode) is
    l_nodes dbms_xmldom.domnodelist;
  begin
    -- print out any plain text
    if dbms_xmldom.getnodetype(p_node) = 3
        and dbms_xmldom.getnodename(p_node) = '#text' then
      dbms_output.put_line(dbms_xmldom.getnodevalue(p_node));
    end if;
    -- just to match your expected output, add a blank line for p tags
    if dbms_xmldom.getnodetype(p_node) = 1
        and dbms_xmldom.getnodename(p_node) = 'p' then
      dbms_output.new_line;
    end if;
    -- get any child nodes
    l_nodes := dbms_xmldom.getchildnodes(p_node);
    -- process each node in turn, recursively
    for i in 0..dbms_xmldom.getlength(l_nodes) - 1 loop
      print_node(dbms_xmldom.item(l_nodes, i));
    end loop;
  end print_node;

begin           
  l_text := '<p>NAME:&nbsp;&nbsp;XXX<br />Company Name: &nbsp;YYYYY<br />Location:&nbsp;ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND<br />Type:&nbsp;PrePA<br />Team:&nbsp;Team1, Dues tamble <br />Date:&nbsp;January 25 &ndash; 26, 2016<br />Rating: &nbsp;Tr 2<br />Number:&nbsp;8554342</p>
<p><u>Observ: <br /></u>There were (6) major and (2) minor .<br />&nbsp;<br />MAJOR</p><ul>    <li>Sample Text_1.</li>    <li>Sample Text_2.</li>    
<li>Sample Text_33.</li>    <li>Sample Text_4.</li>    <li>Sample Text_5.</li></ul><p>MINOR</p><ul>    <li>Sample Text_7</li>    
<li>Sample Text_8<br />    &nbsp;</li></ul><p><b> Background</b><br />&nbsp;</p>'; 

  -- wrap the fragment in a root node so it parses, and run through
  -- utl_i18n.unescape_reference to get rid of &nbsp; etc.
  l_document := dbms_xmldom.newdomdocument('<html>'
    || utl_i18n.unescape_reference(l_text) || '</html>');
  -- get the root element (which is now the added html)
  l_element := dbms_xmldom.getdocumentelement(l_document);
  -- call the recursive procedure to process this node
  print_node(dbms_xmldom.makenode(l_element));
end;
/

Which produces:

NAME:  XXX
Company Name:  YYYYY
Location: ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND
Type: PrePA
Team: Team1, Dues tamble 
Date: January 25 – 26, 2016
Rating:  Tr 2
Number: 8554342

Observ: 
There were (6) major and (2) minor .
 
MAJOR
Sample Text_1.
Sample Text_2.
Sample Text_33.
Sample Text_4.
Sample Text_5.

MINOR
Sample Text_7
Sample Text_8
     

 Background
 


回答2:

Create a (simple) HTML parser.

Oracle Setup:

SET DEFINE OFF;

CREATE OR REPLACE AND COMPILE JAVA SOURCE NAMED HTMLTOTEXT AS
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLToText {
    private HTMLToText(){};

    private static class TextStripper extends ParserCallback {
        private final StringBuffer buffer = new StringBuffer();

        private static final HashMap<Tag, String> START_TAGS = new HashMap<Tag, String>();
        private static final HashMap<Tag, String> END_TAGS = new HashMap<Tag, String>();
        private static final String NEWLINE = "\r\n";

        static {
            START_TAGS.put(Tag.BR,    NEWLINE);
            START_TAGS.put(Tag.P,     NEWLINE);
            START_TAGS.put(Tag.LI,    NEWLINE);
            START_TAGS.put(Tag.DT,    NEWLINE);
            START_TAGS.put(Tag.DL,    NEWLINE);
            START_TAGS.put(Tag.OL,    NEWLINE);
            START_TAGS.put(Tag.UL,    NEWLINE);
            START_TAGS.put(Tag.TR,    NEWLINE);
            START_TAGS.put(Tag.TD,    "\t");
            START_TAGS.put(Tag.TH,    "\t");

            END_TAGS.put(Tag.P,     NEWLINE);
            END_TAGS.put(Tag.LI,    NEWLINE);
            END_TAGS.put(Tag.DD,    NEWLINE);
            END_TAGS.put(Tag.DL,    NEWLINE);
            END_TAGS.put(Tag.OL,    NEWLINE);
            END_TAGS.put(Tag.UL,    NEWLINE);
            END_TAGS.put(Tag.TR,    NEWLINE);
        }

        private boolean newline = true;

        @Override
        public void handleText( final char[] data, final int pos ){

            buffer.append(data);
            newline = false;
        }
        @Override
        public void handleStartTag( final Tag tag, final MutableAttributeSet attribute, final int pos ){
            if ( !newline && START_TAGS.containsKey( tag ) )
            {
                final String value = START_TAGS.get( tag );
                buffer.append( value );
                newline = value.equals(NEWLINE);
            }
        }
        @Override
        public void handleEndTag( final Tag tag, final int pos ){
            if ( !newline && END_TAGS.containsKey( tag ) )
            {
                final String value = END_TAGS.get( tag );
                buffer.append( value );
                newline = value.equals(NEWLINE);
            }
        }
        @Override
        public void handleSimpleTag( final Tag tag, final MutableAttributeSet attribute, final int pos ){
            handleStartTag( tag, attribute, pos );
            handleEndTag( tag, pos );
        }
        @Override
        public void handleComment( final char[] data, final int pos ){}
        @Override
        public void handleError( final String errMsg, final int pos ){}

        public String getText(){
            return buffer.toString();
        }
    }

    private static final ParserDelegator DELEGATOR = new ParserDelegator();

    public static String extractText( final String html ) throws IOException{
        TextStripper stripper = new TextStripper();
        DELEGATOR.parse( new StringReader( html ), stripper, true );
        return stripper.getText();
    }
};
/

CREATE OR REPLACE FUNCTION HTML_TO_TEXT(
  in_html  IN VARCHAR2
) RETURN VARCHAR2
AS LANGUAGE JAVA NAME 'HTMLToText.extractText( java.lang.String ) return java.lang.String';
/
SHOW ERRORS;
/

Query:

SELECT HTML_TO_TEXT( '<html><body>Text<p>para<br>graph</p><table><tr><th>R1</th><td>C1</td><td>C2</td></tr><tr><th>R2</th><td>C1</td><td>C2</td></tr></table></body></html>' ) FROM DUAL;

Output:

HTML_TO_TEXT('<HTML><BODY>TEXT<P>PARA<BR>GRAPH</P><TABLE><TR><TH>R1</TH><TD>C1</
--------------------------------------------------------------------------------
Text                                                                            
para                                                                            
graph                                                                           
R1  C1  C2                                                                          
R2  C1  C2


回答3:

Consider running the HTML through lynx -stdin to render it as plain text. Lynx is a web browser for text consoles.