Correct way of writing two floats into a regular text file

Posted 2019-07-08 10:42

Question:

I am running a big job, in cluster mode. However, I am only interested in two float numbers, which I want to read somehow when the job succeeds.

Here is what I am trying:

from pyspark.context import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName='foo')

    f = open('foo.txt', 'w')
    pi = 3.14
    not_pi = 2.79 
    f.write(str(pi) + "\n")
    f.write(str(not_pi) + "\n")
    f.close()

    sc.stop()

However, 'foo.txt' doesn't appear to be written anywhere (it probably gets written on an executor, or somewhere like that). I tried '/homes/gsamaras/foo.txt', which would be the pwd of the gateway. However, it says: No such file or directory: '/homes/gsamaras/myfile.txt'.

How can I do that?


import os
import socket

# Log where the driver code is actually running.
print("Current working dir : %s" % os.getcwd())
print(socket.gethostname())

These suggest that the driver is actually running on a node of the cluster, which is why I don't see the file on my gateway machine.

Maybe I should write the file to HDFS somehow?

This won't work either:

Traceback (most recent call last):
  File "computeCostAndUnbalancedFactorkMeans.py", line 15, in <module>
    f = open('hdfs://myfile.txt','w')
IOError: [Errno 2] No such file or directory: 'hdfs://myfile.txt'

Answer 1:

At first glance there is nothing particularly wrong with your code (you should use a context manager in a case like this instead of closing the file manually, but that is not the point). If this script is passed to spark-submit, the file will be written to a directory local to the driver code.
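For reference, a minimal sketch of the same write using a context manager (the values and file name are taken from the question):

pi = 3.14
not_pi = 2.79

# The with-block closes the file automatically, even if a write fails.
with open('foo.txt', 'w') as f:
    f.write(str(pi) + "\n")
    f.write(str(not_pi) + "\n")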

If you submit your code in cluster mode, that will be an arbitrary worker node in your cluster. If you are in doubt, you can always log os.getcwd() and socket.gethostname() to figure out which machine is used and what the working directory is.

Finally, you cannot use standard Python IO tools to write to HDFS. There are a few tools which can achieve that, including native dask/hdfs3.
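For example, a rough sketch with hdfs3 (this assumes the hdfs3 package is installed; the namenode host, port and HDFS path below are only placeholders for your cluster's values):

from hdfs3 import HDFileSystem

# Host and port are placeholders for your cluster's namenode.
hdfs = HDFileSystem(host='namenode-host', port=8020)

pi = 3.14
not_pi = 2.79

# hdfs3 file objects expect bytes, hence the encode().
with hdfs.open('/user/gsamaras/foo.txt', 'wb') as f:
    f.write((str(pi) + "\n").encode('utf-8'))
    f.write((str(not_pi) + "\n").encode('utf-8'))

Alternatively, since the job already has a SparkContext, you could let Spark itself write the two numbers to HDFS, e.g. sc.parallelize([pi, not_pi], 1).saveAsTextFile('hdfs:///user/gsamaras/foo') (the output path here is only an example); note that this produces a directory of part files rather than a single text file.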