Reading the first part of a file using HTTP

Posted 2019-01-15 11:01

Question:

I would like to determine the type of a file (generally UTF-8) by reading the first part of the file and analysing its content. (The type is specific to my community, is not under my control, and is not covered by MIME/MediaType, which is normally TEXT_PLAIN.) I am using the 'org.restlet' library on the client to analyse the header with

Request request = new Request(Method.HEAD, url);

so I know the content-length and can (if necessary and possible) estimate how many bytes I should download for the analysis.

CLARIFICATION: I cannot use the MediaType. From answer 1 it seems that I have to GET the content. A revised question would therefore be:

"Can I GET part of a file using Restlet?"

ANSWER: The following code does what I want. I have credited @BalusC for showing the way. Please comment if I have missed anything:

public String readFirstChunk(String urlString, int byteCount) {
    String text = null;
    if (urlString != null) {
        org.restlet.Client restletClient = new org.restlet.Client(Protocol.HTTP);
        Request request = new Request(Method.GET, urlString);
        List<Range> ranges = Collections.singletonList(new Range(0, byteCount));
        request.setRanges(ranges);
        Response response = restletClient.handle(request);
        Status status = response.getStatus();
        // 206 means the server honoured the range; 200 means it ignored the
        // Range header and is sending the whole entity.
        if (Status.SUCCESS_OK.equals(status)
                || Status.SUCCESS_PARTIAL_CONTENT.equals(status)) {
            text = processSuccessfulChunkRequest(response);
        } else {
            System.err.println("FAILED "+response.getStatus());
        }
    }
    return text;
}

private String processSuccessfulChunkRequest(Response response) {
    String text = null;
    try {
        text = response.getEntity().getText();
    } catch (IOException e) {
        throw new RuntimeException("Cannot download chunk", e);
    }
    return text;
}
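
One thing to watch with the approach above: a byte range can end in the middle of a multi-byte UTF-8 character. The following is only a sketch in plain Java (not part of Restlet), assuming you have the raw bytes of the chunk, for example read from the representation's stream. The class and method names are made up for illustration. It decodes the chunk and leaves out an incomplete trailing sequence instead of turning it into a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Restlet.
final class Utf8ChunkSketch {

    /** Decodes a UTF-8 prefix, skipping a truncated character at the end. */
    static String decodePrefix(byte[] chunk) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = ByteBuffer.wrap(chunk);
        // One char per input byte is always enough room for UTF-8 input.
        CharBuffer out = CharBuffer.allocate(chunk.length);
        // endOfInput = false: an incomplete trailing sequence is left unread
        // rather than being reported as malformed input.
        decoder.decode(in, out, false);
        out.flip();
        return out.toString();
    }
}
```

Bytes that are genuinely malformed in the middle of the chunk are still replaced with U+FFFD, which may itself be a useful hint that the file is not UTF-8.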

Answer 1:

That's only possible if the server has sent the Accept-Ranges and Content-Range headers along with ETag or Last-Modified. E.g.

Accept-Ranges: bytes
Content-Range: bytes 0-1233/1234
ETag: file.ext_1234_1234567890

The Accept-Ranges: bytes header indicates that the server supports requests for partial content in a specified byte range. The Content-Range header reports the byte range being returned and the total length of the resource. The ETag and Last-Modified headers give, respectively, a unique identifier for the file and the last-modified timestamp of the resource behind the request URI.

If those headers are present in the response, then you can request a part of the resource using the If-Range and Range request headers, which carry respectively the unique file identifier (or the last-modified timestamp) and the desired byte range.
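
To make the header format concrete, here is a minimal sketch in plain Java for reading these two header values once you have them as strings. The class and method names are invented for this example, and the pattern only handles the byte-range form shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class ContentRangeSketch {

    // Matches e.g. "bytes 0-1233/1234"; the total may be "*" when unknown.
    private static final Pattern CONTENT_RANGE =
            Pattern.compile("bytes (\\d+)-(\\d+)/(\\d+|\\*)");

    /** Returns the total resource length, or -1 if unknown or unparsable. */
    static long totalLength(String contentRange) {
        if (contentRange == null) {
            return -1;
        }
        Matcher m = CONTENT_RANGE.matcher(contentRange.trim());
        if (!m.matches() || "*".equals(m.group(3))) {
            return -1;
        }
        return Long.parseLong(m.group(3));
    }

    /** True if the Accept-Ranges value advertises byte-range support. */
    static boolean acceptsByteRanges(String acceptRanges) {
        return acceptRanges != null
                && acceptRanges.trim().equalsIgnoreCase("bytes");
    }
}
```

For the example headers above, totalLength("bytes 0-1233/1234") gives 1234.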

If-Range: file.ext_1234_1234567890
Range: bytes=0-99

The above example returns the first 100 bytes of the file.
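
If you want to try this without Restlet, the request can be sketched with the JDK's HttpURLConnection. This is only an illustration under a few assumptions: the class and method names are made up, the validator argument is whatever ETag or Last-Modified value the server sent earlier (pass null to omit If-Range), and error handling is left to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class RangeRequestSketch {

    /** Builds the Range value for the first byteCount bytes, e.g. "bytes=0-99". */
    static String firstBytesRange(int byteCount) {
        if (byteCount <= 0) {
            throw new IllegalArgumentException("byteCount must be positive");
        }
        return "bytes=0-" + (byteCount - 1);
    }

    /** Fetches at most byteCount bytes from the start of the resource. */
    static byte[] fetchFirstBytes(String url, int byteCount, String validator)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Range", firstBytesRange(byteCount));
        if (validator != null) {
            conn.setRequestProperty("If-Range", validator);
        }
        try (InputStream in = conn.getInputStream()) {
            // A 200 response means the server is sending the whole entity,
            // so stop reading after byteCount bytes either way.
            byte[] buffer = new byte[byteCount];
            int total = 0;
            int read;
            while (total < byteCount
                    && (read = in.read(buffer, total, byteCount - total)) != -1) {
                total += read;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            conn.disconnect();
        }
    }
}
```

Capping the read on the client side keeps the download small even when the server ignores the Range header and answers with 200 instead of 206.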



Answer 2:

The HEAD operation, as defined by the HTTP standard, does not return any content apart from the header information. So if you send a HEAD request, you can only inspect the MIME type of the file from the HTTP response headers.

The header information can be obtained by wrapping the URL in a ClientResource, performing a HEAD request, and inspecting the returned Representation. This gives you a high-level interface to the HTTP transport, so you don't need to do custom header parsing.

ClientResource resource = new ClientResource(url);
Representation representation = resource.head();
representation.getMediaType(); // returns the Media Type

If you want to do content type guessing on the actual content of the file, you would need to download the actual content, for example with a GET request against that resource.

Or, in true REST fashion, you could model an extra query parameter for your resource that returns your custom meta information for that file, e.g.

http://server/file?contentType

In similar fashion, to retrieve the actual content, you could get a handle on the Stream and then do your encoding guessing.

Representation representation = resource.get();
InputStream stream = representation.getStream();

If the server supports ranges, you can set them on the resource before submitting your GET request.

List<Range> ranges = new ArrayList<Range>();
ranges.add(new Range(0,100)); // this would request the first 100 bytes
resource.setRanges(ranges);
Representation representation = resource.get();

Make sure you consume the response (stream) completely, before returning.

I suggest you look into other efforts that help with determining the content type, such as the question "Java charset and Windows" or http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding



Answer 3:

Since it's your content, why not just include all the data you need in the first few bytes of each file?