Filter (search and replace) array of bytes in an I

2019-01-11 13:53发布

I have an InputStream which takes the html file as input parameter. I have to get the bytes from the input stream .

I have a string: "XYZ". I'd like to convert this string to byte format and check if there is a match for the string in the byte sequence which I obtained from the InputStream. If there is then, I have to replace the match with the bye sequence for some other string.

Is there anyone who could help me with this? I have used regex to find and replace. however finding and replacing byte stream, I am unaware of.

Previously, I use jsoup to parse html and replace the string, however due to some utf encoding problems, the file seems to appear corrupted when I do that.

TL;DR: My question is:

Is a way to find and replace a string in byte format in a raw InputStream in Java?

6条回答
ら.Afraid
2楼-- · 2019-01-11 14:04

I needed something like this as well and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from maven central, or just copy the source code.

This is how you use it. In this case, I'm using a nested instance to replace two patterns two fix dos and mac line endings.

new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");

Here's the full source code:

/**
 * Simple FilterInputStream that can replace occurrances of bytes with something else.
 */
public class ReplacingInputStream extends FilterInputStream {

    // while matching, this is where the bytes go.
    int[] buf=null;
    int matchedIndex=0;
    int unbufferIndex=0;
    int replacedIndex=0;

    private final byte[] pattern;
    private final byte[] replacement;
    private State state=State.NOT_MATCHED;

    // simple state machine for keeping track of what we are doing
    private enum State {
        NOT_MATCHED,
        MATCHING,
        REPLACING,
        UNBUFFER
    }

    /**
     * @param is input
     * @return nested replacing stream that replaces \n\r (DOS) and \r (MAC) line endings with UNIX ones "\n".
     */
    public static InputStream newLineNormalizingInputStream(InputStream is) {
        return new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");
    }

    /**
     * Replace occurances of pattern in the input. Note: input is assumed to be UTF-8 encoded. If not the case use byte[] based pattern and replacement.
     * @param in input
     * @param pattern pattern to replace.
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, String pattern, String replacement) {
        this(in,pattern.getBytes(StandardCharsets.UTF_8), replacement==null ? null : replacement.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Replace occurances of pattern in the input.
     * @param in input
     * @param pattern pattern to replace
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
        super(in);
        Validate.notNull(pattern);
        Validate.isTrue(pattern.length>0, "pattern length should be > 0", pattern.length);
        this.pattern = pattern;
        this.replacement = replacement;
        // we will never match more than the pattern length
        buf = new int[pattern.length];
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
        if (b == null) {
            throw new NullPointerException();
        } else if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return 0;
        }

        int c = read();
        if (c == -1) {
            return -1;
        }
        b[off] = (byte)c;

        int i = 1;
        try {
            for (; i < len ; i++) {
                c = read();
                if (c == -1) {
                    break;
                }
                b[off + i] = (byte)c;
            }
        } catch (IOException ee) {
        }
        return i;

    }

    @Override
    public int read(byte[] b) throws IOException {
        // call our own read
        return read(b, 0, b.length);
    }

    @Override
    public int read() throws IOException {
        // use a simple state machine to figure out what we are doing
        int next;
        switch (state) {
        case NOT_MATCHED:
            // we are not currently matching, replacing, or unbuffering
            next=super.read();
            if(pattern[0] == next) {
                // clear whatever was there
                buf=new int[pattern.length]; // clear whatever was there
                // make sure we start at 0
                matchedIndex=0;

                buf[matchedIndex++]=next;
                if(pattern.length == 1) {
                    // edgecase when the pattern length is 1 we go straight to replacing
                    state=State.REPLACING;
                    // reset replace counter
                    replacedIndex=0;
                } else {
                    // pattern of length 1
                    state=State.MATCHING;
                }
                // recurse to continue matching
                return read();
            } else {
                return next;
            }
        case MATCHING:
            // the previous bytes matched part of the pattern
            next=super.read();
            if(pattern[matchedIndex]==next) {
                buf[matchedIndex++]=next;
                if(matchedIndex==pattern.length) {
                    // we've found a full match!
                    if(replacement==null || replacement.length==0) {
                        // the replacement is empty, go straight to NOT_MATCHED
                        state=State.NOT_MATCHED;
                        matchedIndex=0;
                    } else {
                        // start replacing
                        state=State.REPLACING;
                        replacedIndex=0;
                    }
                }
            } else {
                // mismatch -> unbuffer
                buf[matchedIndex++]=next;
                state=State.UNBUFFER;
                unbufferIndex=0;
            }
            return read();
        case REPLACING:
            // we've fully matched the pattern and are returning bytes from the replacement
            next=replacement[replacedIndex++];
            if(replacedIndex==replacement.length) {
                state=State.NOT_MATCHED;
                replacedIndex=0;
            }
            return next;
        case UNBUFFER:
            // we partially matched the pattern before encountering a non matching byte
            // we need to serve up the buffered bytes before we go back to NOT_MATCHED
            next=buf[unbufferIndex++];
            if(unbufferIndex==matchedIndex) {
                state=State.NOT_MATCHED;
                matchedIndex=0;
            }
            return next;

        default:
            throw new IllegalStateException("no such state " + state);
        }
    }

    @Override
    public String toString() {
        return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
    }

}
查看更多
▲ chillily
3楼-- · 2019-01-11 14:11

I needed a solution to this, but found the answers here incurred too much memory and/or CPU overhead. The below solution significantly outperforms the others here in these terms based on simple benchmarking.

This solution is especially memory-efficient, incurring no measurable cost even with >GB streams.

That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding/resource-sensitive scenarios, but the overhead is real and should be considered when evaluating the worthiness of employing this solution in a given context.

In my case, our max real-world file size that we are processing is about 6MB, where we see added latency of about 170ms with 44 URL replacements. This is for a Zuul-based reverse-proxy running on AWS ECS with a single CPU share (1024). For most of the files (under 100KB), the added latency is sub-millisecond. Under high-concurrency (and thus CPU contention), the added latency could increase, however we are currently able to process hundreds of the files concurrently on a single node with no humanly-noticeable latency impact.

The solution we are using:

import java.io.IOException;
import java.io.InputStream;

public class TokenReplacingStream extends InputStream {

    private final InputStream source;
    private final byte[] oldBytes;
    private final byte[] newBytes;
    private int tokenMatchIndex = 0;
    private int bytesIndex = 0;
    private boolean unwinding;
    private int mismatch;
    private int numberOfTokensReplaced = 0;

    public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
        assert oldBytes.length > 0;
        this.source = source;
        this.oldBytes = oldBytes;
        this.newBytes = newBytes;
    }

    @Override
    public int read() throws IOException {

        if (unwinding) {
            if (bytesIndex < tokenMatchIndex) {
                return oldBytes[bytesIndex++];
            } else {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                unwinding = false;
                return mismatch;
            }
        } else if (tokenMatchIndex == oldBytes.length) {
            if (bytesIndex == newBytes.length) {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                numberOfTokensReplaced++;
            } else {
                return newBytes[bytesIndex++];
            }
        }

        int b = source.read();
        if (b == oldBytes[tokenMatchIndex]) {
            tokenMatchIndex++;
        } else if (tokenMatchIndex > 0) {
            mismatch = b;
            unwinding = true;
        } else {
            return b;
        }

        return read();

    }

    @Override
    public void close() throws IOException {
        source.close();
    }

    public int getNumberOfTokensReplaced() {
        return numberOfTokensReplaced;
    }

}
查看更多
▲ chillily
4楼-- · 2019-01-11 14:14

The following approach will work but I don't how big the impact is on the performance.

  1. Wrap the InputStream with a InputStreamReader,
  2. wrap the InputStreamReader with a FilterReader that replaces the strings, then
  3. wrap the FilterReader with a ReaderInputStream.

It is crucial to choose the appropriate encoding, otherwise the content of the stream will become corrupted.

If you want to use regular expressions to replace the strings, then you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the webpage of Streamflyer. Hope this helps.

查看更多
▲ chillily
5楼-- · 2019-01-11 14:16

Not sure you have chosen the best approach to solve your problem.

That said, I don't like to (and have as policy not to) answer questions with "don't" so here goes...

Have a look at FilterInputStream.

From the documentation:

A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.


It was a fun exercise to write it up. Here's a complete example for you:

import java.io.*;
import java.util.*;

class ReplacingInputStream extends FilterInputStream {

    LinkedList<Integer> inQueue = new LinkedList<Integer>();
    LinkedList<Integer> outQueue = new LinkedList<Integer>();
    final byte[] search, replacement;

    protected ReplacingInputStream(InputStream in,
                                   byte[] search,
                                   byte[] replacement) {
        super(in);
        this.search = search;
        this.replacement = replacement;
    }

    private boolean isMatchFound() {
        Iterator<Integer> inIter = inQueue.iterator();
        for (int i = 0; i < search.length; i++)
            if (!inIter.hasNext() || search[i] != inIter.next())
                return false;
        return true;
    }

    private void readAhead() throws IOException {
        // Work up some look-ahead.
        while (inQueue.size() < search.length) {
            int next = super.read();
            inQueue.offer(next);
            if (next == -1)
                break;
        }
    }

    @Override
    public int read() throws IOException {    
        // Next byte already determined.
        if (outQueue.isEmpty()) {
            readAhead();

            if (isMatchFound()) {
                for (int i = 0; i < search.length; i++)
                    inQueue.remove();

                for (byte b : replacement)
                    outQueue.offer((int) b);
            } else
                outQueue.add(inQueue.remove());
        }

        return outQueue.remove();
    }

    // TODO: Override the other read methods.
}

Example Usage

class Test {
    public static void main(String[] args) throws Exception {

        byte[] bytes = "hello xyz world.".getBytes("UTF-8");

        ByteArrayInputStream bis = new ByteArrayInputStream(bytes);

        byte[] search = "xyz".getBytes("UTF-8");
        byte[] replacement = "abc".getBytes("UTF-8");

        InputStream ris = new ReplacingInputStream(bis, search, replacement);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        int b;
        while (-1 != (b = ris.read()))
            bos.write(b);

        System.out.println(new String(bos.toByteArray()));

    }
}

Given the bytes for the string "Hello xyz world" it prints:

Hello abc world
查看更多
我想做一个坏孩纸
6楼-- · 2019-01-11 14:26

I came up with this simple piece of code when I needed to serve a template file in a Servlet replacing a certain keyword by a value. It should be pretty fast and low on memory. Then using Piped Streams I guess you can use it for all sorts of things.

/JC

public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
{
    replaceStream(new InputStreamReader(in), new OutputStreamWriter(out), search, replace);
}

public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
{
    char[] searchChars = search.toCharArray();
    int[] buffer = new int[searchChars.length];

    int x, r, si = 0, sm = searchChars.length;
    while ((r = in.read()) > 0) {

        if (searchChars[si] == r) {
            // The char matches our pattern
            buffer[si++] = r;

            if (si == sm) {
                // We have reached a matching string
                out.write(replace);
                si = 0;
            }
        } else if (si > 0) {
            // No match and buffered char(s), empty buffer and pass the char forward
            for (x = 0; x < si; x++) {
                out.write(buffer[x]);
            }
            si = 0;
            out.write(r);
        } else {
            // No match and nothing buffered, just pass the char forward
            out.write(r);
        }
    }

    // Empty buffer
    for (x = 0; x < si; x++) {
        out.write(buffer[x]);
    }
}
查看更多
ら.Afraid
7楼-- · 2019-01-11 14:28

There isn't any built-in functionality for search-and-replace on byte streams (InputStream).

And, a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.

Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.

So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier, in the long run, to fix the right solution than it will be to jury-rig the wrong solution.

查看更多
登录 后发表回答