I am going to use the spark-sql CLI to replace the Hive CLI shell, and I launch it with the following command (we are on a YARN Hadoop cluster, and hive-site.xml has already been copied to /conf):
./spark-sql
The shell opens and works fine. Then I execute a query such as:
spark-sql> select devicetype, count(*) from mytable group by devicetype;
The query executes successfully and the result is correct, but I notice the performance is very slow.
From the Spark job UI at http://myhost:4040, I can see that only 1 executor is marked as used, which may be the reason.
I tried modifying the spark-sql script and adding --num-executors 500 to the exec command, but it did not help.
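For reference, this is roughly how I would pass those options directly on the spark-sql command line instead of editing the script; the option names are the standard spark-submit ones, and the values here are just examples:
./bin/spark-sql \
  --master yarn \
  --num-executors 500 \
  --executor-memory 4g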
Could anyone help and explain why?
Thanks.
Refer to the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html
spark-sql is a SQL CLI tool that works only in local mode; that is why you see only one executor.
If you want a cluster version of SQL, you should start the Thrift server and connect to it via JDBC, for example with the beeline tool that ships with Spark. You can find the description in the chapter "Running the Thrift JDBC/ODBC server" of the official documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html
To start:
export HIVE_SERVER2_THRIFT_PORT=<listening-port>
export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
./sbin/start-thriftserver.sh \
  --master <master-uri> \
  ...
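As a concrete sketch, on a YARN cluster the call could look roughly like the following; the port, bind host, and executor settings are only example values that you would replace with your own:
export HIVE_SERVER2_THRIFT_PORT=10002
export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0
./sbin/start-thriftserver.sh \
  --master yarn \
  --num-executors 4 \
  --executor-memory 4g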
To connect:
./bin/beeline
beeline> !connect jdbc:hive2://<listening-host>:<listening-port>
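You can also pass the connection URL, and even a query, directly on the beeline command line instead of typing !connect at the prompt; something along these lines should work (the user name is a placeholder):
./bin/beeline \
  -u "jdbc:hive2://<listening-host>:<listening-port>/default" \
  -n <user> \
  -e "select devicetype, count(*) from mytable group by devicetype;"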
beeline> !connect jdbc:hive2://localhost:10002/default;transportMode=http;httpPath=cliservice
10002 is my port for the Spark Thrift Server; change it to yours. You can find your Thrift port in the Thrift server log.
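If you are not sure where that log lives, the Thrift server normally writes it under the Spark logs directory; a grep along these lines should surface the line that mentions the port (the exact file name pattern depends on your installation and user name, so treat it as a sketch):
# the file name usually contains the user, the HiveThriftServer2 class name, and the host
grep -i "port" $SPARK_HOME/logs/spark-*HiveThriftServer2*.out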