I have an InputStream that reads an HTML file, and I have to get the bytes from that input stream.
I have a string: "XYZ". I'd like to convert this string to bytes and check whether it matches anywhere in the byte sequence I obtained from the InputStream. If it does, I have to replace the match with the byte sequence for some other string.
Is there anyone who could help me with this? I have used regex to find and replace in text, but I don't know how to find and replace in a byte stream.
Previously, I used jsoup to parse the HTML and replace the string; however, due to some UTF encoding problems, the file appears corrupted when I do that.
TL;DR: My question is:
Is there a way to find and replace a string in byte format in a raw InputStream in Java?
I needed something like this as well and decided to roll my own solution instead of using the example by @aioobe. Have a look at the code. You can pull the library from Maven Central, or just copy the source code.
This is how you use it. In this case, I'm using a nested instance to replace two patterns to fix DOS and Mac line endings.
new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
Here's the full source code:
I needed a solution to this, but found the answers here incurred too much memory and/or CPU overhead. The below solution significantly outperforms the others here in these terms based on simple benchmarking.
This solution is especially memory-efficient, incurring no measurable memory cost even with streams larger than 1 GB.
That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding/resource-sensitive scenarios, but the overhead is real and should be considered when evaluating the worthiness of employing this solution in a given context.
In my case, the largest real-world file we process is about 6MB, where we see added latency of about 170ms with 44 URL replacements. This is for a Zuul-based reverse proxy running on AWS ECS with a single CPU share (1024). For most of the files (under 100KB), the added latency is sub-millisecond. Under high concurrency (and thus CPU contention), the added latency could increase; however, we are currently able to process hundreds of the files concurrently on a single node with no humanly noticeable latency impact.
The solution we are using:
The following approach will work, but I don't know how big the impact on performance is.
Wrap the InputStream with an InputStreamReader, wrap the InputStreamReader with a FilterReader that replaces the strings, then wrap the FilterReader with a ReaderInputStream.
It is crucial to choose the appropriate encoding, otherwise the content of the stream will become corrupted.
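The wrapping chain above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the answer: `ReplacingReader` and `readAll` are hypothetical names, UTF-8 is assumed as the file's encoding, and the last step (turning the reader back into an InputStream, for example with ReaderInputStream from Apache Commons IO) is left out so the example needs only the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

// Hypothetical FilterReader that replaces every occurrence of a literal string.
class ReplacingReader extends FilterReader {
    private final char[] search;
    private final char[] replacement;
    // Characters read from the source but not yet handed to the caller.
    private final ArrayDeque<Integer> lookahead = new ArrayDeque<>();
    // Replacement characters waiting to be handed to the caller.
    private final ArrayDeque<Integer> pending = new ArrayDeque<>();
    private boolean eof = false;

    ReplacingReader(Reader in, String search, String replacement) {
        super(in);
        if (search.isEmpty()) {
            throw new IllegalArgumentException("search string must not be empty");
        }
        this.search = search.toCharArray();
        this.replacement = replacement.toCharArray();
    }

    @Override
    public int read() throws IOException {
        while (true) {
            if (!pending.isEmpty()) {
                return pending.poll();
            }
            // Keep the window as long as the search string, if the source allows.
            while (!eof && lookahead.size() < search.length) {
                int c = in.read();
                if (c == -1) {
                    eof = true;
                } else {
                    lookahead.add(c);
                }
            }
            if (lookahead.size() == search.length && windowMatches()) {
                lookahead.clear();
                for (char c : replacement) {
                    pending.add((int) c);
                }
                continue; // emit the replacement (or keep scanning if it is empty)
            }
            return lookahead.isEmpty() ? -1 : lookahead.poll();
        }
    }

    // The bulk read must be overridden too; FilterReader would bypass read().
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            cbuf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    // mark/reset are not handled by this sketch.
    @Override
    public boolean markSupported() {
        return false;
    }

    private boolean windowMatches() {
        int i = 0;
        for (int c : lookahead) {
            if (c != search[i++]) {
                return false;
            }
        }
        return true;
    }

    // Small helper for the demo: drain a reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream(
                "Hello xyz world".getBytes(StandardCharsets.UTF_8));
        // Decode with an explicit charset (UTF-8 is an assumption here).
        Reader decoded = new InputStreamReader(is, StandardCharsets.UTF_8);
        try (Reader replaced = new ReplacingReader(decoded, "xyz", "abc")) {
            System.out.println(readAll(replaced)); // Hello abc world
        }
    }
}
```

The sketch reads one character at a time, so it favors clarity over speed; `skip` is also left at the FilterReader default and would bypass the replacement logic.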
If you want to use regular expressions to replace the strings, then you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the webpage of Streamflyer. Hope this helps.
Not sure you have chosen the best approach to solve your problem.
That said, I don't like to (and have as policy not to) answer questions with "don't" so here goes...
Have a look at FilterInputStream.
From the documentation:
It was a fun exercise to write it up. Here's a complete example for you:
Example Usage
Given the bytes for the string "Hello xyz world" it prints:
I came up with this simple piece of code when I needed to serve a template file in a Servlet, replacing a certain keyword with a value. It should be pretty fast and low on memory. Then, using piped streams, I guess you can use it for all sorts of things.
/JC
There isn't any built-in functionality for search-and-replace on byte streams (InputStream).
And a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.
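For reference, the brute-force approach mentioned above might look something like this. It is a minimal sketch and not the Boyer-Moore implementation: it assumes the whole stream fits in memory, that the search text is encoded as UTF-8, and that `InputStream.readAllBytes` (Java 9+) is available. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; brute force, whole-buffer, not streaming.
class BruteForceByteReplace {

    // True if pattern occurs in data starting exactly at pos.
    static boolean matchesAt(byte[] data, int pos, byte[] pattern) {
        if (pos + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[pos + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Tries the pattern at every position: O(data.length * search.length) worst case.
    static byte[] replaceAll(byte[] data, byte[] search, byte[] replacement) {
        if (search.length == 0) {
            throw new IllegalArgumentException("search pattern must not be empty");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
        int pos = 0;
        while (pos < data.length) {
            if (matchesAt(data, pos, search)) {
                out.write(replacement, 0, replacement.length);
                pos += search.length; // skip past the match
            } else {
                out.write(data[pos]);
                pos++;
            }
        }
        return out.toByteArray();
    }

    // Convenience wrapper: buffers the entire stream, so unsuitable for huge inputs.
    static InputStream replaceAll(InputStream in, String search, String replacement)
            throws IOException {
        byte[] data = in.readAllBytes(); // Java 9+
        byte[] result = replaceAll(
                data,
                search.getBytes(StandardCharsets.UTF_8),       // assumed encoding
                replacement.getBytes(StandardCharsets.UTF_8));
        return new ByteArrayInputStream(result);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "Hello XYZ world".getBytes(StandardCharsets.UTF_8));
        InputStream replaced = replaceAll(in, "XYZ", "abc");
        System.out.println(new String(replaced.readAllBytes(), StandardCharsets.UTF_8));
        // Hello abc world
    }
}
```

Matching on raw bytes only finds the string if the file really uses the encoding assumed for the search text, which is one more reason the encoding has to be known either way.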
Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.
So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier, in the long run, to fix the right solution than it will be to jury-rig the wrong solution.
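As a starting point for tracking down the encoding problem, here is a small sketch of how a charset mismatch corrupts text. It assumes (this is a guess, the question does not confirm it) that the file is UTF-8 and was decoded with a different charset somewhere along the way; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; shows a consistent versus a mismatched charset.
class CharsetRoundTrip {

    // Decode, replace a literal string, and encode again with the SAME charset.
    static byte[] replace(byte[] html, Charset charset, String search, String replacement) {
        String text = new String(html, charset);
        return text.replace(search, replacement).getBytes(charset);
    }

    // Decode with one charset but encode with another: non-ASCII text gets mangled.
    static byte[] replaceMismatched(byte[] html, Charset decodeWith, Charset encodeWith,
                                    String search, String replacement) {
        String text = new String(html, decodeWith);
        return text.replace(search, replacement).getBytes(encodeWith);
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains U+00E9, which is two bytes in UTF-8.
        byte[] original = "<p>caf\u00e9 XYZ</p>".getBytes(StandardCharsets.UTF_8);

        byte[] good = replace(original, StandardCharsets.UTF_8, "XYZ", "abc");
        System.out.println(new String(good, StandardCharsets.UTF_8));

        byte[] bad = replaceMismatched(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8, "XYZ", "abc");
        // The accented character now shows up as two unrelated characters.
        System.out.println(new String(bad, StandardCharsets.UTF_8));
    }
}
```

If the output of an HTML parser looks like the second line, passing the file's real charset explicitly when parsing and again when writing is usually the first thing to check.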