batch/offline processing design book / documentation

Posted 2019-03-16 19:04

Question:

Is there a book or any documentation available that describes the best practice for designing batch (offline) processes for sharing data between two parties?

I have found some useful information on the spring batch site, but it is quite low level: batch processing strategies and batch principles guidelines.

There are a lot of considerations for batch processing, for example:

  1. data transfer method (e.g. files)
  2. control protocol between the two parties
  3. error handling
  4. file naming conventions (if using files for transfer)
  5. synchronising cut-off times between the two parties
  6. etc.

It would be good if there were some authoritative document or checklist to ensure designs follow best practice in the field.


UPDATE:

I'll add answers to this section as I come across them.

General Batch/Offline Processing info

This section is taken from @user1813068's answer.

You can find some architectural design patterns at this link and also at this link that describe approaches for partner to partner integration and for data synchronization.

This wikipedia page also gives a high level overview of architectural patterns and includes patterns for Data Integration: architectural patterns.

The book Data Integration Blueprint and Modeling is very good too.

Data Files

Most of the content in this section has come from here: source

Using headers and footers for flat-file exchange is considered best practice. Flat files can be exchanged without headers and footers, and the file name can convey some of the same information as the header. When using a delimited file, however, the field-list header is always required.

Headers

When exchanging data between systems, it is very important for the receiving party to know exactly what type of data is being sent. One way to ensure this is to provide a header row that includes pertinent information regarding the content of the data and how it should be processed.

When working with flat files, the filename itself can also be used to inform the receiving party of the content of the file. However, a header row provides better support for all options that may be available.

When working with an API, these header fields can be provided in a similar fashion; the implementation is determined by the developer of the API service.

If the header is included, it consists of a single set of data, and must always be the first data in the file.
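As a sketch of the idea (the `HDR` marker and the header fields — business date and record count — are hypothetical conventions, not from the source), the header could be written as the first record of the file:

```python
import csv
import io
from datetime import date

def write_with_header(rows, business_date):
    """Write tab-delimited rows preceded by a single header record.

    The header layout (HDR | business date | record count) is a
    hypothetical convention; a real feed defines its own fields.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The header must be the first data in the file.
    writer.writerow(["HDR", business_date.isoformat(), len(rows)])
    writer.writerows(rows)
    return buf.getvalue()

content = write_with_header([["1001", "ACME", "250.00"]], date(2019, 3, 16))
```

The receiver can then validate the business date and record count before processing any data rows.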

Footers

A footer may be provided when using file-based formats to indicate that there is no more data left to process.

When processing, the data found after the footer row should be ignored. Also, when creating the data, be aware that any data after the footer row will be ignored.
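A minimal reader sketch of that rule (the `FTR` marker name is a hypothetical convention): stop at the footer row and ignore everything after it.

```python
def read_until_footer(lines, footer_marker="FTR"):
    """Collect tab-delimited data records, stopping at the footer row.

    Per the convention above, any lines after the footer are ignored.
    The "FTR" marker name is a hypothetical choice.
    """
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == footer_marker:
            break  # footer reached: remaining lines are ignored
        records.append(fields)
    return records

data = ["1001\tACME\t250.00\n", "FTR\t1\n", "junk after footer\n"]
records = read_until_footer(data)
```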

Data Formats

Delimited Files

The de facto industry standard is delimited files.

Comma-delimited (CSV, or comma-separated values) files usually require data encapsulation, usually with double quotes ("); the double quotes must then be escaped, either with a backslash (\) or doubled double quotes (""). Due to the inconsistencies in CSV implementations, it is recommended to use tabs as the delimiter, with no encapsulation; in this case, tab characters must be removed from the data. Delimited files are usually quicker to process than XML files.
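A sketch of the recommended tab-delimited approach: since there is no encapsulation, tab (and newline) characters are stripped from each field before writing.

```python
import io

def write_tab_delimited(rows):
    """Write rows tab-delimited with no quoting or encapsulation.

    Tabs and newlines are replaced with spaces in field values, since a
    tab-delimited file without encapsulation cannot represent them.
    """
    def clean(value):
        return str(value).replace("\t", " ").replace("\n", " ")

    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(clean(v) for v in row) + "\n")
    return buf.getvalue()

# A stray tab inside a field is replaced, keeping the column count stable.
out = write_tab_delimited([["1001", "ACME\tCorp", "250.00"]])
```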

XML Files

Some in the industry prefer XML files. XML allows a clearer representation of the information, since it supports nested data. However, many companies have limited or no support for this format, so it is not recommended.

Encoding

UTF-8 Encoding

All data should be UTF-8 encoded to ensure maximum compatibility between all systems.
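In practice this means naming the encoding explicitly rather than relying on the platform default. A small sketch (the file name is arbitrary):

```python
import tempfile
from pathlib import Path

# Always pass encoding="utf-8" explicitly; the platform default may differ.
path = Path(tempfile.gettempdir()) / "feed_example.txt"
path.write_text("Zürich\t100.00\n", encoding="utf-8")

raw = path.read_bytes()                    # the UTF-8 bytes on disk
text = path.read_text(encoding="utf-8")    # decoded back to "Zürich..."
path.unlink()
```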

Dates & Times

It is recommended to use UTC time for all date & time fields to prevent confusion.
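A sketch of this using Python's datetime module: timestamps are converted to UTC and exchanged as ISO 8601 strings, so a cut-off time means the same instant to both parties (the UTC-4 zone below is a hypothetical example).

```python
from datetime import datetime, timedelta, timezone

def to_utc_iso(dt):
    """Render a timezone-aware datetime as a UTC ISO 8601 string."""
    return dt.astimezone(timezone.utc).isoformat()

# A hypothetical 17:00 cut-off in a UTC-4 zone is unambiguous once in UTC.
cutoff = datetime(2019, 3, 16, 17, 0, tzinfo=timezone(timedelta(hours=-4)))
stamp = to_utc_iso(cutoff)
```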


Some more best practices: EDI Scheduling and File Transfer

Answer 1:

You can find some architectural design patterns at this link and also at this link that describe approaches for partner to partner integration and for data synchronization.

This wikipedia page also gives a high level overview of architectural patterns and includes patterns for Data Integration: architectural patterns.

The book Data Integration Blueprint and Modeling is very good too.



Answer 2:

Depending on your requirements, you can look at data replication systems to transfer the data as-is. There are many commercial and open-source tools around; for example, you can take a look at the source code and documentation of SymmetricDS.

If you need to do some conversion and processing, you can take a look at ETL (Extract, Transform, Load) tools. Most data warehouse books have chapters on the subject; an example is here.