FTP using Talend, get only most recent file?

2019-04-09 19:04发布

I have a Talend job that I need to pull down an XML file from an sFTP server to then be processed into an Oracle database. The date of the XML extraction is in the file name, for example "FileNameHere_Outbound_201407092215.xml", which I believe is yyyyMMddhhmm formatting. The beginning portion where "FileNameHere" is the same for all the files. I need to be able to read the date from the end of the file name and only pull that one down from the server to be processed.

I am not sure how to do this with FTP. I've previously used tFilelist to order the items by date descending, but that is not an option with FTP. I know it probably has some Java involved in how to pull the portion of the File Name out, but I'm not very Java-literate. I can manage though with a bit of assistance.

Does anyone have any insight on how to only download the most recent file from an FTP?

1条回答
男人必须洒脱
2楼-- · 2019-04-09 19:11

There's a tFTPFileList component on the palette. That should give you a list of all the files on the FTP location. From here you then want to parse out the time stamp which could be done with a regular expression or alternatively by substringing it depending on which you feel more comfortable with.

Then it's just a case of sorting by the extracted time stamp and then that gives you the newest file name so you can then go fetch that specific file.

Here's an outline of an overly laborious way to get this done but it works. You should be able to tweak this easily yourself too:

Example job layout

In the above job design I've gone for a tFileList rather than a tFTPFileList because I don't have an example FTP location to play with for testing here. The premise stays the same although this would be pointless with a real tFileList due to the ability to sort by modified date (among other options).

We start off by running the tFileList/tFTPFileList component to iterate through all the files (it's possible to file mask these too to limit what you return here) in the location. We then read this in iteratively to a tFixedFlowInput component which allows us to retrieve the values from the globalMap as the tFileList/tFTPFileList iterates through each file:

Use a tFixedFlowInput component to retrieve the values from the globalMap

I've listed everything that the tFileList provides (you can see the options by pressing ctrl+space) but you only really need the file name and potentially the file path or file directory. From here we then throw everything into a buffer with a tBufferOutput component so that we can gather every iteration of the location.

Once the tFileList/tFTPFileList has iterated through every file in the directory it then triggers the next sub job with an OnSubjobOk link where we start by reading the completed buffer back in with a tBufferInput component. At this point I've started scattering tLogRow components throughout the flow so I can better visualise the data at each step.

After this we then use a tExtractRegexFields component to extract the date time stamp from the file name:

Image showing the configuration of the tExtractRegexFields component

Here, I am using the following regex "^.+?_Outbound_([0-9]{12})\\.xml$" to capture the date time stamp. It relies on the file name being a combination of any characters, followed by the string literal _Outbound_, then followed by the date time stamp that we want to capture (which is represented by 12 numeric characters) and then finished with .xml.

We also add a column to our schema to accommodate the captured date time stamp like so:

Schema for tExtractRegexFields component

As the extra column is a date time stamp of the form yyyyMMddhhmm we can specify this directly here and use it as a date object from then on.

From here we simply sort by date descending on the extracted date time stamp column and then use a tSampleRow to take only the first row of the flow of data as per the guidelines on the component configuration.

To finish this job you would then output the target file path to the globalMap (either in a tJavaRow or using a tFlowToIterate that will automatically do this for you) and then use the globalMap stored file path in the tFTPFileGet's file mask setting:

Putting the data in the globalMap with a tFlowToIterate and tFTPGet Configuration

查看更多
登录 后发表回答