I'm trying to retrieve data from a standalone Hadoop (version 2.7.2, with default configuration) HDFS using Pentaho Kettle (version 6.0.1.0-386). Pentaho and Hadoop are not on the same machine, but I have access from one to the other.
I created a new "Hadoop File Input" with the following properties:
Environment | File/Folder | Wildcard | Required | Include subfolders
            | url-to-file |          | N        | N
url-to-file is built like: ${PROTOCOL}://${USER}:${PASSWORD}@${IP}:${PORT}${PATH_TO_FILE}
e.g.: hdfs://hadoop:@the_ip:50010/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt
The password is empty.
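For completeness, the variables are defined as ordinary Kettle variables. A minimal sketch of how they could be set in ~/.kettle/kettle.properties, with values that mirror the example URL above (the IP is the one that appears in the logs below; the rest are illustrative):

PROTOCOL=hdfs
USER=hadoop
PASSWORD=
IP=172.21.0.35
PORT=50010
PATH_TO_FILE=/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt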
I checked that this file exists in HDFS and downloads correctly via the web manager and the Hadoop command line.
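The command-line check was essentially the following, run on the Hadoop machine (the local target path /tmp/check.txt is just an example):

hdfs dfs -ls /user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt
hdfs dfs -get /user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt /tmp/check.txt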
Scenario A) When I use ${PROTOCOL} = hdfs and ${PORT} = 50010, I get errors in both the Pentaho and Hadoop consoles:
Pentaho:
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016/04/05 15:23:46 - FileInputList - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : org.apache.commons.vfs2.FileSystemException: Could not list the contents of folder "hdfs://hadoop@172.21.0.35:50010/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt".
2016/04/05 15:23:46 - FileInputList - at org.apache.commons.vfs2.provider.AbstractFileObject.getChildren(AbstractFileObject.java:1193)
2016/04/05 15:23:46 - FileInputList - at org.pentaho.di.core.fileinput.FileInputList.createFileList(FileInputList.java:243)
2016/04/05 15:23:46 - FileInputList - at org.pentaho.di.core.fileinput.FileInputList.createFileList(FileInputList.java:142)
2016/04/05 15:23:46 - FileInputList - at org.pentaho.di.trans.steps.textfileinput.TextFileInputMeta.getTextFileList(TextFileInputMeta.java:1580)
2016/04/05 15:23:46 - FileInputList - at org.pentaho.di.trans.steps.textfileinput.TextFileInput.init(TextFileInput.java:1513)
2016/04/05 15:23:46 - FileInputList - at org.pentaho.di.trans.step.StepInitThread.run(StepInitThread.java:69)
2016/04/05 15:23:46 - FileInputList - at java.lang.Thread.run(Thread.java:745)
2016/04/05 15:23:46 - FileInputList - Caused by: java.io.EOFException: End of File Exception between local host is: "EI001115/192.168.231.248"; destination host is: "172.21.0.35":50010; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
2016/04/05 15:23:46 - FileInputList - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2016/04/05 15:23:46 - FileInputList - at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
2016/04/05 15:23:46 - FileInputList - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2016/04/05 15:23:46 - FileInputList - at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.ipc.Client.call(Client.java:1472)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.ipc.Client.call(Client.java:1399)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
2016/04/05 15:23:46 - FileInputList - at com.sun.proxy.$Proxy70.getListing(Unknown Source)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:554)
2016/04/05 15:23:46 - FileInputList - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2016/04/05 15:23:46 - FileInputList - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
2016/04/05 15:23:46 - FileInputList - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2016/04/05 15:23:46 - FileInputList - at java.lang.reflect.Method.invoke(Method.java:606)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
2016/04/05 15:23:46 - FileInputList - at com.sun.proxy.$Proxy71.getListing(Unknown Source)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:693)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
2016/04/05 15:23:46 - FileInputList - at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl$9.call(HadoopFileSystemImpl.java:126)
2016/04/05 15:23:46 - FileInputList - at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl$9.call(HadoopFileSystemImpl.java:124)
2016/04/05 15:23:46 - FileInputList - at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl.callAndWrapExceptions(HadoopFileSystemImpl.java:200)
2016/04/05 15:23:46 - FileInputList - at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl.listStatus(HadoopFileSystemImpl.java:124)
2016/04/05 15:23:46 - FileInputList - at org.pentaho.big.data.impl.vfs.hdfs.HDFSFileObject.doListChildren(HDFSFileObject.java:115)
2016/04/05 15:23:46 - FileInputList - at org.apache.commons.vfs2.provider.AbstractFileObject.getChildren(AbstractFileObject.java:1184)
2016/04/05 15:23:46 - FileInputList - ... 6 more
2016/04/05 15:23:46 - FileInputList - Caused by: java.io.EOFException
2016/04/05 15:23:46 - FileInputList - at java.io.DataInputStream.readInt(DataInputStream.java:392)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
2016/04/05 15:23:46 - FileInputList - at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
2016/04/05 15:23:48 - cfgbuilder - Warning: The configuration parameter [org] is not supported by the default configuration builder for scheme: sftp
Hadoop:
2016-04-05 14:22:56,045 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: fiware-hadoop:50010:DataXceiver error processing unknown operation src: /192.168.231.248:62961 dst: /172.21.0.35:50010
java.io.IOException: Version Mismatch (Expected: 28, Received: 26738 )
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:60)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
at java.lang.Thread.run(Thread.java:745)
Scenario B) In other cases, using different port numbers (50070, 9000, ...), I only get an error from Pentaho; the standalone Hadoop does not seem to receive any request at all.
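To rule out a simple port mix-up on my side, I can check on the Hadoop machine which address the HDFS client is actually supposed to connect to (the commented output is only what I would expect from a default install, not something I have confirmed):

hdfs getconf -confKey fs.defaultFS
# a default single-node 2.7.x setup usually reports something like hdfs://localhost:9000
netstat -tlnp | grep -E '8020|9000|50010|50070'
# shows which of the candidate ports actually have a listener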
Reading some Pentaho documentation, it seems that the Big Data plugin is built for Hadoop 2.2.x, while I'm trying to connect to 2.7.2. Could this be the source of the problem? Is there any plugin that works with higher versions? Or is my URL to the HDFS file simply wrong?
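In case it matters, the Hadoop version the plugin targets is selected by the active shim in the Big Data plugin configuration; a sketch of where that lives (the property name comes from the Pentaho docs, the value shown is just an example shim, not necessarily what my installation is using):

# data-integration/plugins/pentaho-big-data-plugin/plugin.properties
active.hadoop.configuration=hdp22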
Thank you all for your time; any hint will be more than welcome.