Azure blob storage to JSON in an Azure Function using the SDK

Posted 2019-06-11 11:51

Question:

I am trying to create a timer trigger azure function that takes data from blob, aggregates it, and puts the aggregates in a cosmosDB. I previously tried using the bindings in azure functions to use blob as input, which I was informed was incorrect (see this thread: Azure functions python no value for named parameter).
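(For context, a timer-triggered Python function on the Python worker has an entry point roughly like the sketch below; the cron schedule lives in the function's function.json. This is only an assumed skeleton, not my exact setup, and the blob-reading/aggregation code below would run inside main before writing to Cosmos DB.)

import azure.functions as func

# Sketch of the timer-trigger entry point (__init__.py); the schedule is
# defined in function.json, and the blob/aggregation code would go here.
def main(mytimer: func.TimerRequest) -> None:
    if mytimer.past_due:
        print("The timer is past due")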

I am now using the SDK and am running into the following problem:

import sys, os.path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'myenv/Lib/site-packages')))
import json
import pandas as pd
from azure.storage.blob import BlockBlobService 

data = BlockBlobService(account_name='accountname', account_key='accountkey')
container_name = ('container')
generator = data.list_blobs(container_name)

for blob in generator:
print("{}".format(blob.name))
json = json.loads(data.get_blob_to_text('container', open(blob.name)))


df = pd.io.json.json_normalize(json)
print(df)

This results in an error:

IOError: [Errno 2] No such file or directory: 'test.json'

I realize this might be an absolute path issue, but I'm not sure how that works with Azure Storage. Any ideas on how to circumvent this?
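For what it's worth, open(blob.name) looks for 'test.json' on the local filesystem of the function host, which is where the IOError comes from; get_blob_to_text already downloads the blob itself, so it only needs the container name and the blob name and returns an object whose text is in .content. A minimal sketch (container name is a placeholder):

# Sketch: get_blob_to_text takes (container_name, blob_name) and returns a Blob
# whose decoded text lives in .content -- no local file or open() involved.
blob_text = data.get_blob_to_text(container_name, blob.name)
parsed = json.loads(blob_text.content)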


Made it "work" by doing the following:

for blob in generator:
    loader = data.get_blob_to_text('kvaedevdystreamanablob', blob.name, if_modified_since=delta)
    json = json.loads(loader.content)

This works for ONE JSON file, i.e. when I only had one in storage, but when more are added I get this error:

ValueError: Expecting object: line 1 column 21907 (char 21906)

This happens even if I add if_modified_since so as to only take in one blob. Will update if I figure something out. Help is always welcome.


Another update: my data is coming in through Stream Analytics and then down to the blob. I had selected that the data should be written as arrays, and this is why the error occurs: when the stream is terminated, the closing ] is not immediately appended at the end of the file, so the JSON file isn't valid. I will now try line-separated output in Stream Analytics instead of arrays.
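To illustrate the difference (contents made up): an array-formatted blob that never received its closing bracket is not valid JSON as a whole, while line-separated output can be parsed one object at a time.

import json

# Array format: the stream was cut off before the closing ']' was written,
# so the blob as a whole is not valid JSON and json.loads raises ValueError.
truncated_array = '[{"id": 1, "value": 10}, {"id": 2, "value": 20}'
# json.loads(truncated_array)  # -> ValueError

# Line-separated format: each line is a complete JSON object,
# so every line can be parsed independently.
line_separated = '{"id": 1, "value": 10}\n{"id": 2, "value": 20}'
records = [json.loads(line) for line in line_separated.splitlines() if line]
print(records)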

Answer 1:

Figured it out. In the end it was quite a simple fix:

I had to make sure each JSON entry in the blob was less than 1024 characters; otherwise it would wrap onto a new line, which made reading the blob line by line problematic.

The code that iterates through each blob file, reads it, and adds the entries to a list is as follows:

import json
from azure.storage.blob import BlockBlobService

data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('collection')

dataloaded = []
for blob in generator:
    # Download each blob as text and parse it one line (one JSON object) at a time
    loader = data.get_blob_to_text('collection', blob.name)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        dataloaded.append(json.loads(trackerstatusobject))

From this you can load it into a DataFrame and do whatever you want :) Hope this helps if someone stumbles upon a similar problem.
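For example, turning the list into a DataFrame could look roughly like this (the aggregation columns depend on your JSON):

# Sketch: dataloaded is a list of dicts, which json_normalize flattens into a DataFrame.
df = pd.io.json.json_normalize(dataloaded)
print(df.head())
# From here the usual pandas aggregation applies, e.g. df.groupby(...).mean()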