Amazon AWS - python for beginners

Published 2019-07-09 00:20

Question:

I have a computationally intensive program doing calculations that I intend to parallelise. It is written in Python and I hope to use the multiprocessing module. I would like some help understanding what I would need to do so that one program running on my laptop controls the entire process.

I have two options for which computers I can use. One is machines I can reach with ssh user@comp1.com from the terminal (I am not sure how to access them from Python) and then run the job there, although I'd like a more programmatic way of getting to them than that. It seems that running a remote-manager type application on them would work?
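For the ssh option, one simple way to drive a remote machine from Python is to shell out to the ssh client with the standard-library subprocess module; libraries such as paramiko or fabric offer a more programmatic interface. A minimal sketch (the host, script path, and argument below are placeholders):

import subprocess

# Run a command on a remote machine over ssh and capture its output.
# "user@comp1.com" and the worker script path are placeholders.
result = subprocess.run(
    ["ssh", "user@comp1.com", "python3", "/path/to/worker.py", "--param", "42"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)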

The second option I was considering is AWS EC2 instances (I think that is what I need). I also found boto, which I've never used but which seems to provide an interface for controlling AWS. I feel I would then still need something to actually distribute jobs on AWS, probably in a similar way to option 1 (?). I'm a bit in the dark here.
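For the EC2 side, a minimal sketch of what launching and listing instances looks like with boto3 (the current successor to boto); the region, AMI id, and key pair name below are placeholders:

import boto3

# Connect to EC2 in a chosen region (credentials come from ~/.aws/credentials
# or environment variables).
ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch a single small instance; the AMI id and key name are placeholders.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",
)
print("Launched:", [i.id for i in instances])

# Later: list the instances that are currently running.
for instance in ec2.instances.filter(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    print(instance.id, instance.public_ip_address)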

EDIT:

To give you an idea of how parallelisable it is:

res = []
for param in Parameters:
    res.append(FunctionA(param))      # independent calls: can run in parallel
Parameters2 = FunctionB(res)          # combine stage-1 results
res2 = []
for param in Parameters2:
    res2.append(FunctionC(param))     # independent calls: can run in parallel
return res, res2

So the two loops are basically where I can send off many param values to be run in parallel. I know how to recombine the results to create res as long as I know which param each one came from. Then I need to group them all together to get Parameters2, and the second part is again parallelisable.
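On a single multi-core machine, I think the two loops map almost directly onto multiprocessing.Pool. A minimal sketch, with trivial stand-ins for my real FunctionA/FunctionB/FunctionC:

from multiprocessing import Pool

# Stand-ins for the real FunctionA / FunctionB / FunctionC.
def FunctionA(param):
    return param * param

def FunctionB(res):
    return [r + 1 for r in res]

def FunctionC(param):
    return param * 2

if __name__ == "__main__":
    Parameters = range(100)

    with Pool() as pool:                  # defaults to one worker per CPU core
        # map() preserves order, so res[i] corresponds to Parameters[i]
        res = pool.map(FunctionA, Parameters)
        Parameters2 = FunctionB(res)      # sequential recombination step
        res2 = pool.map(FunctionC, Parameters2)

    print(res[:5], res2[:5])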

Answer 1:

You would want to use the multiprocessing module only if you want the processes to share data in memory. That is something I would recommend ONLY if you absolutely have to have shared memory due to performance considerations. Python multiprocessing applications are non-trivial to write and debug.

If you are doing something like the distributed.net or SETI@home projects, where the tasks are computationally intensive but reasonably isolated, you can follow this process:

  1. Create a master application that breaks the large task down into smaller computation chunks (assuming the task can be broken down and the results then combined centrally).
  2. Create Python code that takes a task from the server (perhaps as a file or some other one-time communication with instructions on what to do), and run multiple copies of this Python process.
  3. These Python processes will work independently of each other, process their data, and then return the results to the master process for collation.

You could run these processes on single-core AWS instances if you wanted, or use your laptop to run as many copies as you have cores to spare.

EDIT: Based on the updated question

So your master process will create files (or some other data structures) that contain the parameter info, one for each param to process. These files will be stored in a shared folder called needed-work.

Each Python worker (on the AWS instances) will watch the needed-work shared folder for available files to work on (or wait on a socket for the master process to assign a file to it).

The Python process that takes on a file that needs work will work on it and store the result in a separate shared folder, with the parameter as part of the file structure.

The master process will look at the files in the work-done folder, process them, and generate the combined response.
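A minimal sketch of that file-based flow, assuming shared folders named needed-work and work-done, JSON task files, and a rename-to-claim step so two workers do not take the same file (all names and formats here are illustrative):

import json
import time
from pathlib import Path

NEEDED_WORK = Path("needed-work")   # master drops task files here
WORK_DONE = Path("work-done")       # workers drop result files here

def master(parameters):
    """Write one task file per parameter, then wait for all the results."""
    NEEDED_WORK.mkdir(exist_ok=True)
    WORK_DONE.mkdir(exist_ok=True)
    for i, param in enumerate(parameters):
        (NEEDED_WORK / f"task-{i}.json").write_text(json.dumps({"param": param}))

    results = {}
    while len(results) < len(parameters):
        for done in WORK_DONE.glob("*.json"):
            data = json.loads(done.read_text())
            results[data["param"]] = data["result"]
        time.sleep(1)
    return results

def worker(worker_id):
    """Poll needed-work, claim a task by renaming it, write its result to work-done."""
    while True:                             # runs until the worker is killed
        for task in sorted(NEEDED_WORK.glob("task-*.json")):
            claimed = task.with_name(f"{task.stem}.claimed-{worker_id}")
            try:
                task.rename(claimed)        # only one worker can win the rename
            except OSError:
                continue                    # another worker claimed it first
            data = json.loads(claimed.read_text())
            result = data["param"] ** 2     # stand-in for the real computation
            out = WORK_DONE / f"{task.stem}-result.json"
            out.write_text(json.dumps({"param": data["param"], "result": result}))
            claimed.unlink()                # remove the finished task file
        time.sleep(1)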

This whole solution could also be implemented with sockets, where the workers listen on a socket for the master to assign them work and the master waits on a socket for the workers to submit responses.

The file-based approach would require a way for the workers to make sure the work they pick up is not also taken by another worker. This could be fixed by having a separate work folder for each worker, with the master process deciding when a worker needs more work.

Workers could delete the files they pick up from the work folder, and the master process could keep watch for when a folder is empty and add more work files to it.

Again, this is more elegant to do with sockets if you are comfortable with that.
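A minimal sketch of the socket variant, using the standard library's multiprocessing.connection; the address, authkey, and message format are illustrative. Workers repeatedly ask the master for a parameter and report the result back:

from multiprocessing.connection import Listener, Client

ADDRESS = ("localhost", 6000)   # master's host and port; placeholders
AUTHKEY = b"change-me"          # shared secret for the connection

def master(parameters):
    """Hand out parameters to workers as they ask for them, and collect results."""
    pending = list(parameters)
    total = len(pending)
    results = []
    with Listener(ADDRESS, authkey=AUTHKEY) as listener:
        while len(results) < total:
            with listener.accept() as conn:
                msg = conn.recv()
                if msg[0] == "result":          # a worker is reporting a finished task
                    results.append(msg[1])
                if pending:
                    conn.send(("work", pending.pop()))
                else:
                    conn.send(("stop", None))
    return results

def worker():
    """Ask the master for work, compute, and report back until told to stop."""
    msg = ("ready", None)
    while True:
        try:
            with Client(ADDRESS, authkey=AUTHKEY) as conn:
                conn.send(msg)
                kind, param = conn.recv()
        except ConnectionRefusedError:
            break                               # master has already shut down
        if kind == "stop":
            break
        result = param * param                  # stand-in for the real computation
        msg = ("result", (param, result))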