Which MPI implementations currently have support for fault tolerance, and what is the state of their development?
Question:
Answer 1:
This question is probably too broad to give you a good answer here, especially since the answer will change as time progresses.
In general, there is a lot of fault-tolerance work going on across various MPI implementations, at varying levels of support.
- FT-MPI is an old project that is no longer really in development, but it more or less started it all in terms of fault tolerance integrated into the MPI library itself.
- ULFM (User-Level Failure Mitigation) is a spiritual successor to FT-MPI that is currently being proposed for inclusion in a future version of the MPI Standard, which means that, if it is accepted, every MPI implementation will eventually provide it. There is currently an implementation in an old branch of Open MPI, and an implementation in MPICH is in progress for a future release.
There are lots of other MPI libraries that implement some form of fault tolerance on top of MPI or tweak the implementation itself. These are just a couple of the options.