Store metadata into Jackrabbit repository

Posted 2020-05-18 04:00

Question:

Can anybody explain how to proceed in the following scenario?

  1. receiving documents (MS docs, ODS, PDF)

  2. Dublin Core metadata extraction via Apache Tika + content extraction via jackrabbit-content-extractors

  3. using Jackrabbit to store documents (content) in the repository together with their metadata

  4. retrieving documents + metadata

I'm interested in points 3 and 4 ...

DETAILS: The application processes documents interactively. It runs some analysis (language detection, word count, etc.) and gathers as many details as possible (Dublin Core metadata, content parsing, event handling), then returns the results to the user. Afterwards, it stores the extracted content and the metadata (both extracted and custom user metadata) in the JCR repository.

I'd appreciate any help, thank you.

Answer 1:

Uploading files is basically the same for JCR 2.0 as it is for JCR 1.0. However, JCR 2.0 adds a few additional built-in property definitions that are useful.

The "nt:file" node type is intended to represent a file and has two built-in property definitions in JCR 2.0 (both of which are auto-created by the repository when nodes are created):

  • jcr:created (DATE)
  • jcr:createdBy (STRING)

and defines a single child named "jcr:content". This "jcr:content" node can be of any node type, but generally speaking all information pertaining to the content itself is stored on this child node. The de facto standard is to use the "nt:resource" node type, which has these properties defined:

  • jcr:data (BINARY) mandatory
  • jcr:lastModified (DATE) autocreated
  • jcr:lastModifiedBy (STRING) autocreated
  • jcr:mimeType (STRING) protected?
  • jcr:encoding (STRING) protected?

Note that "jcr:mimeType" and "jcr:encoding" were added in JCR 2.0.

In particular, the purpose of the "jcr:mimeType" property was to do exactly what you're asking for - capture the "type" of the content. However, the "jcr:mimeType" and "jcr:encoding" property definitions can be defined (by the JCR implementation) as protected (meaning the JCR implementation automatically sets them) - if this is the case, you would not be allowed to manually set these properties. I believe that Jackrabbit and ModeShape do not treat these as protected.

Here is some code that shows how to upload a file into a JCR 2.0 repository using these built-in node types:

// Get an input stream for the file ...
File file = ...
InputStream stream = new BufferedInputStream(new FileInputStream(file));

// Create the nt:file node and its jcr:content child, then store the binary.
// The node variable is named fileNode so it does not clash with the File above.
Node folder = session.getNode("/absolute/path/to/folder/node");
Node fileNode = folder.addNode("Article.pdf","nt:file");
Node content = fileNode.addNode("jcr:content","nt:resource");
Binary binary = session.getValueFactory().createBinary(stream);
content.setProperty("jcr:data",binary);

And if the JCR implementation does not treat the "jcr:mimeType" property as protected (e.g., Jackrabbit and ModeShape), you have to set this property manually:

content.setProperty("jcr:mimeType","application/pdf");
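
Rather than hard-coding the MIME type, you could look it up from the file name. The following is only a minimal sketch: the `MimeTypes` class and its `forFileName` method are hypothetical helpers (not part of JCR or Jackrabbit), and the extension table only covers the document types mentioned in the question.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of JCR or Jackrabbit.
final class MimeTypes {
    // Small extension-to-MIME-type table; extend it as needed.
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "pdf", "application/pdf",
        "doc", "application/msword",
        "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "ods", "application/vnd.oasis.opendocument.spreadsheet");

    private MimeTypes() {}

    /** Returns the MIME type for a file name, or a generic binary type if unknown. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream";
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }
}
```

The result could then be passed to content.setProperty("jcr:mimeType", MimeTypes.forFileName("Article.pdf")). Since the question already uses Apache Tika, its content-based type detection would likely be more reliable than file extensions.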

Metadata can very easily be stored on the "nt:file" and "jcr:content" nodes, but out-of-the-box the "nt:file" and "nt:resource" node types don't allow for extra properties. So before you can add other properties, you first need to add a mixin (or multiple mixins) that have property definitions for the kinds of properties you want to store. You can even define a mixin that would allow any property. Here is a CND file defining such a mixin:

<custom = 'http://example.com/mydomain'>
[custom:extensible] mixin
- * (undefined) multiple 
- * (undefined) 

After registering this node type definition, you can then use this on your nodes:

content.addMixin("custom:extensible");
content.setProperty("anyProp","some value");
content.setProperty("custom:otherProp","some other value");

You could also define and use a mixin that allowed for any Dublin Core element:

<dc = 'http://purl.org/dc/elements/1.1/'>
[dc:metadata] mixin
- dc:contributor (STRING)
- dc:coverage (STRING)
- dc:creator (STRING)
- dc:date (DATE)
- dc:description (STRING)
- dc:format (STRING)
- dc:identifier (STRING)
- dc:language (STRING)
- dc:publisher (STRING)
- dc:relation (STRING)
- dc:rights (STRING)
- dc:source (STRING)
- dc:subject (STRING)
- dc:title (STRING)
- dc:type (STRING)

All of these properties are optional, and unlike "custom:extensible" this mixin does not allow properties with arbitrary names or types. I've also not really addressed with this 'dc:metadata' mixin the fact that some of these are already represented by built-in properties (e.g., "jcr:createdBy", "jcr:lastModifiedBy", "jcr:created", "jcr:lastModified", "jcr:mimeType") and that some of them may relate more to the content while others relate more to the file.
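
To connect such a mixin to the extraction step in the question, here is a minimal sketch. It assumes the extractor hands you a plain key/value map whose keys look like `title` or `dc:title` (check what your Tika version actually emits); `DublinCoreMapper` and `toDcProperties` are hypothetical names. The element list is the 15 elements of the Dublin Core Metadata Element Set 1.1.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: normalizes extracted metadata keys to dc:* property names.
final class DublinCoreMapper {
    // The 15 elements of the Dublin Core Metadata Element Set 1.1.
    private static final List<String> ELEMENTS = List.of(
        "contributor", "coverage", "creator", "date", "description",
        "format", "identifier", "language", "publisher", "relation",
        "rights", "source", "subject", "title", "type");

    private DublinCoreMapper() {}

    /** Keeps only Dublin Core entries and returns them keyed as "dc:<element>". */
    static Map<String, String> toDcProperties(Map<String, String> extracted) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            if (e.getKey() == null || e.getValue() == null || e.getValue().isEmpty()) {
                continue; // skip unusable entries
            }
            String key = e.getKey().toLowerCase(Locale.ROOT);
            if (key.startsWith("dc:")) {
                key = key.substring(3); // accept both "title" and "dc:title"
            }
            if (ELEMENTS.contains(key)) {
                result.put("dc:" + key, e.getValue());
            }
        }
        return result;
    }
}
```

After adding the mixin to the node, each entry of the returned map could be written with content.setProperty(name, value). Note that "dc:date" is declared as DATE above, so its value would need to be a Calendar or a string in the ISO 8601 format that JCR accepts for dates.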

You could of course define other mixins that better suit your metadata needs, using inheritance where needed. But be careful using inheritance with mixins: since JCR allows a node to have multiple mixins, it's often best to design your mixins to be tightly scoped and facet-oriented (e.g., "ex:taggable", "ex:describable", etc.) and then simply apply the appropriate mixins to a node as needed.

(It's even possible, though much more complicated, to define a mixin that allows more children under the "nt:file" nodes, and to store some metadata there.)

Mixins are fantastic and give a tremendous amount of flexibility and power to your JCR content.

Oh, and when you've created all of the nodes you want, be sure to save the session:

session.save();


Answer 2:

I am a bit rusty with JCR and I have never used 2.0 but this should get you started.

See this link. You'll want to open up the second comment.

You just store the file in a node and add additional metadata to the node. Here is how to store the file:

Node folder = session.getRootNode().getNode("path/to/file/uploads"); 
Node file = folder.addNode(fileName, "nt:file"); 
Node fileContent = file.addNode("jcr:content", "nt:resource"); // nt:file has no default type for jcr:content
fileContent.setProperty("jcr:data", fileStream);
// Add other metadata
session.save();

How you store meta-data is up to you. A simple way is to just store key value pairs:

fileContent.setProperty(key, value, PropertyType.STRING);

To read the data you just call getProperty().

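// getProperty() returns a Property; unwrap it to get the actual value
fileStream = fileContent.getProperty("jcr:data").getBinary().getStream();
value = fileContent.getProperty(key).getString();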


Answer 3:

I am new to Jackrabbit and am working with version 2.4.2. As for your solution, you can check the file type with plain Java logic and branch on it to handle each variation.

You don't need to worry about saving the contents of different file types such as .txt or .pdf, because the content is converted to binary before it is saved. Here is a small sample in which I upload a PDF file to a Jackrabbit repository and download it again.

        // Import the PDF file unless it has already been imported.
        // This program is a sample only, so everything is hard-coded.
        if (!root.hasNode("Alfresco_E0_Training.pdf")) {
            System.out.print("Importing PDF... ");

            // Create an nt:file node to hold the document
            Node file = root.addNode("Alfresco_E0_Training.pdf", "nt:file");

            // Store the file's bytes in the jcr:data property of the jcr:content child
            FileInputStream stream = new FileInputStream("<path of file>\\Alfresco_E0_Training.pdf");
            Node content = file.addNode("jcr:content", "nt:resource");
            Binary binary = session.getValueFactory().createBinary(stream);
            content.setProperty("jcr:data", binary);
            stream.close();
            session.save();

            System.out.println("::::::::::::::::::::Checking content of the node:::::::::::::::::::::::::");
            System.out.println("File Node Name : " + file.getName());
            System.out.println("File Node Identifier : " + file.getIdentifier());
            System.out.println("File Node child : " + Node.JCR_CHILD_NODE_DEFINITION);
            System.out.println("Content Node Name : " + content.getName());
            System.out.println("Content Node Identifier : " + content.getIdentifier());
            System.out.println("Content Node Content : " + content.getProperty("jcr:data"));
            System.out.println(":::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::");
        } else {
            // Already imported: read the binary back and write it to a local file
            Node file = root.getNode("Alfresco_E0_Training.pdf");
            Node content = file.getNode("jcr:content");
            Binary bin = content.getProperty("jcr:data").getBinary();
            InputStream stream = bin.getStream();
            File f = new File("C:<path of the output file>\\Alfresco_E0_Training.pdf");

            OutputStream out = new FileOutputStream(f);
            byte[] buf = new byte[1024];
            int len;
            while ((len = stream.read(buf)) > 0) {
                out.write(buf, 0, len);
            }
            out.close();
            stream.close();
            System.out.println("\nFile is created...................................");

            System.out.println("done.");
            System.out.println("::::::::::::::::::::Checking content of the node:::::::::::::::::::::::::");
            System.out.println("File Node Name : " + file.getName());
            System.out.println("File Node Identifier : " + file.getIdentifier());
            System.out.println("Content Node Name : " + content.getName());
            System.out.println("Content Node Identifier : " + content.getIdentifier());
            System.out.println("Content Node Content : " + content.getProperty("jcr:data"));
            System.out.println(":::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::");
        }
    } catch (IOException e) {
        System.out.println("Exception: " + e);
    } finally {
        session.logout();
    }
    }
}
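
The copy loop in the download branch leaves the streams open if a read or write fails. As a hedged alternative, the sketch below does the same copy with try-with-resources; `StreamCopy` and `copyAndClose` are hypothetical names, and the helper uses only the standard library, so it works with any InputStream, including the one returned by Binary.getStream().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for copying a binary stream to an output stream.
final class StreamCopy {
    private StreamCopy() {}

    /** Copies all bytes from in to out, closes both streams, and returns the byte count. */
    static long copyAndClose(InputStream in, OutputStream out) {
        long total = 0;
        // try-with-resources closes both streams even if read or write throws
        try (InputStream source = in; OutputStream target = out) {
            byte[] buf = new byte[8192];
            int len;
            while ((len = source.read(buf)) != -1) {
                target.write(buf, 0, len);
                total += len;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

In the sample above this would replace the manual loop with something like StreamCopy.copyAndClose(bin.getStream(), new FileOutputStream(f)).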

Hope this helps