How to get intermediate results in IPython parallel

Posted 2019-07-16 00:31

Question:

My work involves processing lots of XML files; to get results faster I want to use IPython's parallel processing. Below is my sample code, in which I simply count the number of elements in each XML/XSD file with the cElementTree module.

>>> from IPython.parallel import Client
>>> import os
>>> c = Client()
>>> c.ids
>>> lview = c.load_balanced_view()
>>> lview.block = True
>>> def return_len(xml_filepath):
        # import inside the function so the engines can resolve it (see question below)
        import xml.etree.cElementTree as cElementTree
        tree = cElementTree.parse(xml_filepath)
        my_count = 0
        file_result = []
        cdict = {}
        # record tag, text, attributes, child count and tail for every element
        for elem in tree.getiterator():
            cdict[my_count] = {}
            if elem.tag:
                cdict[my_count]['tag'] = elem.tag
            if elem.text:
                cdict[my_count]['text'] = elem.text.strip()
            if elem.attrib.items():
                cdict[my_count]['xmlattb'] = {}
                for key, value in elem.attrib.items():
                    cdict[my_count]['xmlattb'][key] = value
            if list(elem):
                cdict[my_count]['xmlinfo'] = len(list(elem))
            if elem.tail:
                cdict[my_count]['tail'] = elem.tail.strip()
            my_count += 1
        # return the file name and the number of elements it contains
        output = xml_filepath.split('\\')[-1], len(cdict)
        return output
        ## return cdict
>>> def get_dir_list(target_dir, *extensions):
        """
        This function will filter out the files from given dir based on their extensions
        """
        my_paths = []
        for top, dirs, files in os.walk(target_dir):
            for nm in files:
                fileStats = os.stat(os.path.join(top, nm))
                if nm.split('.')[-1] in extensions:
                    my_paths.append(top + '\\' + nm)
        return my_paths
>>> r = lview.map_async(return_len, get_dir_list('C:\\test_folder', 'xsd', 'xml'))

To get the final result I have to call >>> r.get(); this blocks and only returns the result once the whole job has completed.
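
In other words, right now I only see anything at the very end, roughly like this (results comes back as a list of (filename, element_count) tuples, one per input file):

>>> results = r.get()    # blocks until the very last file has been processed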

My question is: can I get the intermediate results while the tasks are finishing?
For example, if I run this over a folder containing 1000 XML/XSD files, can I get each result as soon as that particular file has been processed: 1st file done --> show its result, 2nd file done --> show its result, ..., 1000th file done --> show its result, instead of the current behaviour of waiting until the last file is finished and only then showing the combined result for all 1000 files?
Also, to deal with import/namespace errors I have put the import inside the return_len function; is there a better way to deal with that?

Answer 1:

Sure. AsyncMapResult (the type returned by map_async) is iterable immediately, and the items yielded by the iteration are the same as the list ultimately produced by r.get(). So after you do:

amr = lview.map_async(return_len, get_dir_list('C:\\test_folder','xsd','xml'))

You can do:

for r in amr:
    print(r)

or keep the index with enumerate:

for i, r in enumerate(amr):
    print(i, r)

or perform reductions with reduce (a builtin in Python 2; functools.reduce in Python 3):

summary_result = reduce(myfunc, amr)
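
For instance, since return_len returns a (filename, element_count) tuple, a running total of the element counts could be computed along these lines (a minimal sketch; the lambda is only illustrative):

from functools import reduce   # reduce is a builtin on Python 2, lives in functools on Python 3

# fold the results together as they arrive, summing the element counts
total_elements = reduce(lambda acc, res: acc + res[1], amr, 0)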

All of these will iterate through your results as they arrive. If you don't care about the ordering and the time per task varies significantly, you can pass map_async(..., ordered=False). If you do that, then when you iterate through the AMR you will get individual results on a first-come, first-served basis rather than in submission order.
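
A minimal sketch of the unordered variant, reusing return_len and get_dir_list from the question:

amr = lview.map_async(return_len,
                      get_dir_list('C:\\test_folder', 'xsd', 'xml'),
                      ordered=False)
for fname, n_elements in amr:
    # each result is yielded as soon as its file finishes, regardless of submission order
    print(fname, n_elements)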

There's a bit more info in the IPython docs.

Also, to deal with import/namespace errors I have put the import inside the return_len function; is there a better way to deal with that?

Yes and no. There are a few ways to set up the engine namespace, such as using modules, the @parallel.require("module") decorator, or simply performing the import explicitly with %px import xml.etree.cElementTree as cElementTree, each of which has benefits in certain scenarios. But I often find putting the imports inside the function to be the easiest approach, with the fewest surprises.
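
For illustration, the two explicit options might look roughly like this (a sketch against the IPython.parallel API used above, not tested against the question's exact setup):

# Option 1: perform the import once on every engine with the px magic;
# functions executed on the engines can then resolve cElementTree from the engine namespace.
%px import xml.etree.cElementTree as cElementTree

# Option 2: declare the dependency on the function itself; the task will only be
# scheduled on engines where the module can be imported.
from IPython.parallel import require

@require('xml.etree.cElementTree')
def return_len(xml_filepath):
    ...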