How to download an XML file from a URL without escaped characters

Posted 2019-09-13 20:11

I am using this code to download the XML file.

    String url = "https://www.sec.gov/Archives/edgar/data/16160/000001616016000061/calm-20160528.xml";

    // The file name is everything after the last '/' in the URL.
    String fileName = url.substring(url.lastIndexOf("/") + 1);

    String completeFileLocationWithName = "/home/user/Downloads/XBRLCODE/" + fileName;

    URL surl = new URL(url);
    URLConnection con = surl.openConnection();
    // A timeout of 0 means "wait indefinitely".
    con.setConnectTimeout(0);
    con.setReadTimeout(0);
    InputStream in = con.getInputStream();
    Files.copy(in, Paths.get(completeFileLocationWithName));

I also tried String escapedInput = StringEscapeUtils.escapeXml(appNameInput);

Input: the URL.

Expected output: the downloaded XML should not contain escaped characters such as &lt;, &gt;, &amp; etc.; plain <, > and & would be fine for me.

Could anyone please share some knowledge on this?

3 answers
#2 · 2019-09-13 20:48

I think you're misunderstanding the problem slightly. Your XML here contains embedded HTML (itself with embedded CSS, as it happens).

To be included in that node, those characters have to be escaped; otherwise the overall XML would be invalid (<, >, & etc. are reserved characters in XML).

If you mean you want the results of that XML node (us-gaap:FiscalPeriod) unescaped, then you should extract its string value and then use something like StringEscapeUtils.unescapeHtml as already suggested.
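As a minimal sketch of the "extract its string value" step: a standard XML parser already resolves entities such as &lt; and &amp; when you ask for an element's text, so one level of escaping disappears without any extra library. The class name, the sample document and the element name "note" below are made up for illustration; for the real file you would look the element up by the name used in the document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class NodeTextDemo {

    // Returns the text of the first element with the given tag name,
    // or null if the document has no such element.
    static String firstElementText(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            // getTextContent() returns the parsed text, with entities already resolved.
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: an element holding escaped HTML.
        String xml = "<root><note>&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;</note></root>";
        System.out.println(firstElementText(xml, "note")); // prints <p>Tom & Jerry</p>
    }
}
```

If the text is still escaped after this (that is, the content was escaped twice), an unescape call on the returned string would take care of the second level.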

Depending on what you're trying to do, you might want to go further and strip all HTML tags from the output anyway.
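A rough sketch of that last step, assuming a crude regular expression is acceptable: it drops anything that looks like a tag and tidies the whitespace. This is not a real HTML parser (comments, scripts, or a ">" inside an attribute value will confuse it), so a proper parser such as jsoup is the safer choice for anything beyond simple markup. The class and method names are hypothetical.

```java
class TagStripper {

    // Crude approach: replace anything that looks like a tag with a space,
    // turn non-breaking spaces into ordinary ones, then collapse whitespace.
    static String stripTags(String html) {
        String noTags = html.replaceAll("<[^>]*>", " ");
        String noNbsp = noTags.replace("&nbsp;", " ").replace('\u00A0', ' ');
        return noNbsp.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical table cell, similar in shape to the embedded HTML.
        String cell = "<td valign=\"bottom\"><p style=\"margin:0pt;\">Total &nbsp;assets</p></td>";
        System.out.println(stripTags(cell)); // prints Total assets
    }
}
```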

虎瘦雄心在
#3 · 2019-09-13 20:51

The following seems to work.

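    // "xxxxx" is the downloaded file, "yyyy" is the unescaped copy.
    try (InputStream iStream = new FileInputStream(new File("xxxxx"));
         OutputStream oStream = new FileOutputStream("yyyy")) {
        StringWriter writer = new StringWriter();
        IOUtils.copy(iStream, writer, "UTF-8");
        String theString = writer.toString();
        // Write with an explicit charset so the output encoding matches the input.
        IOUtils.write(StringEscapeUtils.unescapeXml(theString), oStream, "UTF-8");
    }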
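If adding commons-lang and commons-io is not an option, the same file-to-file step can be sketched with the JDK alone. This version only handles the five predefined XML entities, so it may be narrower than StringEscapeUtils.unescapeXml (it ignores numeric references, for example); the class and method names are hypothetical. Keep in mind that unescaping a whole document like this generally leaves a file that is no longer well-formed XML.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PlainUnescape {

    // Replaces the five predefined XML entities. &amp; goes last so that
    // "&amp;lt;" becomes "&lt;" instead of being unescaped twice into "<".
    static String unescapeBasicXml(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    // Reads source as UTF-8, unescapes it, and writes the result to target.
    static void unescapeFile(Path source, Path target) {
        try {
            String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
            Files.write(target, unescapeBasicXml(text).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints <b>Tom &amp; Jerry</b> (exactly one level of escaping removed)
        System.out.println(unescapeBasicXml("&lt;b&gt;Tom &amp;amp; Jerry&lt;/b&gt;"));
    }
}
```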
走好不送
#4 · 2019-09-13 20:54

Use StringEscapeUtils from the commons-lang library.

Here is working code:

import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringEscapeUtils;

public class Test {

    public static void main(String[] args) {
        String url = "https://www.sec.gov/Archives/edgar/data/16160/000001616016000061/calm-20160528.xml";

        try {
            URL surl = new URL(url);
            URLConnection con = surl.openConnection();
            // A timeout of 0 means "wait indefinitely".
            con.setConnectTimeout(0);
            con.setReadTimeout(0);
            // try-with-resources closes the stream even if copying fails.
            try (InputStream in = con.getInputStream()) {
                StringWriter writer = new StringWriter();
                IOUtils.copy(in, writer, "UTF-8");
                // Unescape entities such as &lt;, &gt; and &amp; before printing.
                System.out.println(StringEscapeUtils.unescapeHtml(writer.toString()));
            }
        } catch (MalformedURLException ex) {
            Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
        }

    }
}

The output contains no escaped characters; here is a sample from the console:

<td valign="bottom" style="width:02.96%;border-top:1pt none #D9D9D9 ;border-left:1pt none #D9D9D9 ;border-bottom:1pt none #D9D9D9 ;border-right:1pt none #D9D9D9 ;background-color: #auto;height:1.00pt;padding:0pt;">
                    <p style="margin:0pt;font-family:Times New Roman;height:1.00pt;overflow:hidden;font-size:0pt;">
                        &nbsp;</p>
                </td>
                <td valign="bottom" style="width:02.40%;border-top:1pt none #D9D9D9 ;border-left:1pt none #D9D9D9 ;border-bottom:1pt none #D9D9D9 ;border-right:1pt none #D9D9D9 ;background-color: #auto;height:1.00pt;padding:0pt;">
                    <p style="margin:0pt;font-family:Times New Roman;height:1.00pt;overflow:hidden;font-size:0pt;">
                        &nbsp;</p>
                </td>
                <td valign="bottom" style="width:11.82%;border-top:1pt none #D9D9D9 ;border-left:1pt none #D9D9D9 ;border-bottom:1pt none #D9D9D9 ;border-right:1pt none #D9D9D9 ;background-color: #auto;height:1.00pt;padding:0pt;">
                    <p style="margin:0pt;font-family:Times New Roman;height:1.00pt;overflow:hidden;font-size:0pt;">
                        &nbsp;</p>
                </td>

Keep in mind that you need these imports:

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringEscapeUtils;