How do we get the document file url using the Wats

2019-03-02 20:04发布

问题:

I don't see a solution to this using the available api documentation.

It is also not available on the web console.

Is it possible to get the file url using the Watson Discovery Service?

回答1:

If you need to store the original source/file URL, you can include it as a field within your documents in the Discovery service, then you will be able to query that field back out when needed.



回答2:

I also struggled with this request but ultimately got it working using Python bindings into Watson Discovery. The online documentation and API reference is very poor; here's what I used to get it working:

(Assume you have a Watson Discovery service and have a created collection):

# Programmatic upload and retrieval of documents and metadata with Watson Discovery

from watson_developer_cloud import DiscoveryV1
import os
import json

discovery = DiscoveryV1(
    version='2017-11-07',
    iam_apikey='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    url='https://gateway-syd.watsonplatform.net/discovery/api'
)

environments = discovery.list_environments().get_result()
print(json.dumps(environments, indent=2))

This gives you your environment ID. Now append to your code:

collections = discovery.list_collections('{environment-id}').get_result()
print(json.dumps(collections, indent=2))

This will show you the collection ID for uploading documents into programmatically. You should have a document to upload (in my case, an MS Word document), and its accompanying URL from your own source document system. I'll use a trivial fictitious example.

NOTE: the documentation DOES NOT tell you to append , 'rb' to the end of the open statement, but it is required when uploading a Word document, as in my example below. Raw text / HTML documents can be uploaded without the 'rb' parameter.

url = {"source_url":"http://mysite/dis030.docx"}
with open(os.path.join(os.getcwd(), '{path to your document folder with trailing / }', 'dis030.docx'), 'rb') as fileinfo:
    add_doc = discovery.add_document('{environment-id}', '{collections-id}', metadata=json.dumps(url), file=fileinfo).get_result()
    print(json.dumps(add_doc, indent=2))
    print(add_doc["document_id"])

Note the setting up of the metadata as a JSON dictionary, and then encoding it using json.dumps within the parameters. So far I've only wanted to store the original source URL but you could extend this with other parameters as your own use case requires.

This call to Discovery gives you the document ID.

You can now query the collection and extract the metadata using something like a Discovery query:

my_query = discovery.query('{environment-id}', '{collection-id}', natural_language_query="chlorine safety")
print(json.dumps(my_query.result["results"][0]["metadata"], indent=2))

Note - I'm extracting just the stored metadata here from within the overall returned results - if you instead just had: print(my_query) you'll get the full response from Discovery ... but ... there's a lot to go through to identify just your own custom metadata.