Java ETL process

Posted 2020-08-04 04:38

Question:

I have a new challenge: to load ~100M rows from an Oracle database and insert them into a remote MySQL database server.

I've divided the problem in two:

  1. a server-side REST service responsible for loading data into the MySQL server;
  2. a client-side Java app responsible for loading from the Oracle data source.

On the Java side I've used plain JDBC to load paginated content and transfer it over the wire to the server. This approach works, but it makes the code cumbersome and not very scalable, since I'm doing the pagination myself with Oracle's ROWNUM (... WHERE ROWNUM > x AND ROWNUM < y).
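
For illustration, a minimal sketch of that kind of ROWNUM windowing over plain JDBC (table and column names such as src_table, id and name are placeholders; the nested form is needed because a plain WHERE ROWNUM > x never matches, so the row number has to be aliased in a subquery):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class OraclePageReader {

    // Classic Oracle pagination: order the rows, number them with ROWNUM in a
    // subquery, then keep only the window (lower, upper].
    private static final String PAGE_SQL =
        "SELECT id, name FROM ("
      + "  SELECT t.*, ROWNUM rn FROM ("
      + "    SELECT id, name FROM src_table ORDER BY id"
      + "  ) t WHERE ROWNUM <= ?"
      + ") WHERE rn > ?";

    public static List<String[]> readPage(Connection oracle, int pageIndex, int pageSize) throws Exception {
        List<String[]> rows = new ArrayList<>();
        try (PreparedStatement ps = oracle.prepareStatement(PAGE_SQL)) {
            ps.setInt(1, (pageIndex + 1) * pageSize); // upper bound (inclusive)
            ps.setInt(2, pageIndex * pageSize);       // lower bound (exclusive)
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] { rs.getString("id"), rs.getString("name") });
                }
            }
        }
        return rows;
    }
}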

I've now tried Hibernate's StatelessSession, with my entities mapped through annotations. The code is much cleaner and more readable, but the performance is worse.
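
For comparison, the StatelessSession variant typically looks roughly like the sketch below (assuming a Hibernate 5-era API and a mapped entity; SrcEntity is a placeholder name). It avoids the first-level cache and dirty checking, but still materializes one object per row:

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;

public class StatelessOracleReader {

    // Streams mapped rows without first-level cache or dirty checking,
    // keeping memory flat for very large result sets.
    public static void stream(SessionFactory sessionFactory) {
        StatelessSession session = sessionFactory.openStatelessSession();
        try {
            ScrollableResults results = session
                .createQuery("from SrcEntity order by id")  // SrcEntity is a placeholder entity
                .setFetchSize(1000)
                .scroll(ScrollMode.FORWARD_ONLY);
            while (results.next()) {
                Object row = results.get(0);
                // hand the row off to the REST transfer layer here
            }
            results.close();
        } finally {
            session.close();
        }
    }
}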

I've heard of ETL tools and Spring Batch, but I don't know them very well. Are there other approaches to this problem?

Thanks in advance.

UPDATE

Thank you for the invaluable suggestions. I've opted for Spring Batch to load data from the Oracle database, because the environment is pretty locked down and I don't have access to Oracle's toolset; Spring Batch is tried and true. For the data-writing step I opted for writing chunks of records using MySQL's LOAD DATA INFILE, as you all suggested. REST services sit in the middle, since the two databases are hidden from each other for security reasons.
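
For context, a skeletal chunk-oriented Spring Batch step in that spirit might look like the sketch below (Spring Batch 4-style builders; the SQL, the chunk size and the writer that posts each chunk to the REST service are placeholders of mine, not the actual code):

import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
@EnableBatchProcessing
public class TransferJobConfig {

    @Bean
    public JdbcCursorItemReader<Map<String, Object>> oracleReader(DataSource oracleDataSource) {
        JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
        reader.setName("oracleReader");
        reader.setDataSource(oracleDataSource);
        reader.setSql("SELECT id, name FROM src_table");   // placeholder query
        reader.setRowMapper(new ColumnMapRowMapper());
        reader.setFetchSize(1000);
        return reader;
    }

    @Bean
    public ItemWriter<Map<String, Object>> restChunkWriter() {
        // Placeholder: serialize each chunk and POST it to the REST service,
        // which stages it as a file and runs LOAD DATA INFILE on the MySQL side.
        return items -> postChunk(items);
    }

    @Bean
    public Step copyStep(StepBuilderFactory steps,
                         JdbcCursorItemReader<Map<String, Object>> oracleReader,
                         ItemWriter<Map<String, Object>> restChunkWriter) {
        return steps.get("copyStep")
                .<Map<String, Object>, Map<String, Object>>chunk(1000)
                .reader(oracleReader)
                .writer(restChunkWriter)
                .build();
    }

    @Bean
    public Job copyJob(JobBuilderFactory jobs, Step copyStep) {
        return jobs.get("copyJob").start(copyStep).build();
    }

    private void postChunk(List<? extends Map<String, Object>> items) {
        // hypothetical HTTP client call, omitted for brevity
    }
}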

Answer 1:

100M rows is quite a lot. You can design it in plenty of ways: REST servers, JDBC reading, Spring Batch, Spring Integration, Hibernate, ETL tools. But the bottom line is: time.

No matter what architecture you choose, you eventually have to perform these INSERTs into MySQL. Your mileage may vary, but just to give you an order of magnitude: at 2K inserts per second it will take around half a day to populate MySQL with 100M rows (100M / 2000 ≈ 50,000 seconds, i.e. about 14 hours) (source).
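
For reference, the plain-INSERT path those numbers describe is typically a batched PreparedStatement like the sketch below (dest_table and its columns are placeholders; with MySQL Connector/J, adding rewriteBatchedStatements=true to the JDBC URL helps, but throughput generally stays in this order of magnitude):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class MySqlBatchInsert {

    // Sends rows to MySQL in batches of 1000 inside a single transaction.
    public static void insert(Connection mysql, List<String[]> rows) throws Exception {
        mysql.setAutoCommit(false);
        try (PreparedStatement ps = mysql.prepareStatement(
                "INSERT INTO dest_table (id, name) VALUES (?, ?)")) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();   // flush a full batch
                }
            }
            ps.executeBatch();           // flush the remainder
            mysql.commit();
        }
    }
}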

According to the same source, LOAD DATA INFILE can handle around 25K inserts/second, roughly 10x more, which brings the job down to about an hour of work.

That being said, with this amount of data I would suggest:

  • dump the Oracle table using native Oracle tools that produce human-readable output (or machine-readable output, as long as you can parse it)

  • parse the dump file with the fastest tools you can get; maybe grep/sed/gawk/cut will be enough?

  • generate a target file compatible with MySQL's LOAD DATA INFILE (it is very configurable)

  • import the file into MySQL using the aforementioned command (see the JDBC sketch just after this list)
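
A minimal sketch of that last step driven from Java over JDBC (the file path, table name and CSV options are placeholders; LOAD DATA LOCAL INFILE additionally requires allowLoadLocalInfile=true on the Connector/J URL and local_infile enabled on the server):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadDataImport {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/fromdb?allowLoadLocalInfile=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement()) {
            // Bulk-loads the generated file in one pass instead of row-by-row INSERTs.
            st.execute(
                "LOAD DATA LOCAL INFILE '/tmp/dump.csv' "
              + "INTO TABLE dest_table "
              + "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
              + "LINES TERMINATED BY '\\n' "
              + "(id, name)");
        }
    }
}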

Of course you can do this in Java with nice and readable code, unit tested and versioned. But with this amount of data you need to be pragmatic.

That covers the initial load. After that, Spring Batch will probably be a good choice. If you can, try to connect your application directly to both databases - again, this will be faster. On the other hand, this might not be possible for security reasons.

If you want to stay flexible and not tie yourself to the databases directly, expose both the input (Oracle) and the output (MySQL) behind web services (REST is fine as well). Spring Integration will help you a lot here.



Answer 2:

You can use Scriptella to transfer data between databases. Here is an example of an XML transformation file:

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
    <connection id="in" url="jdbc:oracle:thin:@localhost:1521:ORCL" 
              classpath="ojdbc14.jar" user="scott" password="tiger"/>

    <connection id="out" url="jdbc:mysql://localhost:3306/fromdb" 
              classpath="mysql-connector.jar" user="user" password="password"/>
    <!-- Copy all table rows from one to another database -->
    <query connection-id="in">
        SELECT * FROM Src_Table
        <!-- For each row executes insert -->  
        <script connection-id="out"> 
            INSERT INTO Dest_Table(ID, Name) VALUES (?id,?name)
        </script>
    </query>
</etl>
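
The file can then be run from Scriptella's command-line launcher or embedded in a Java program; below is a minimal sketch of the embedded form, assuming the file above is saved as etl.xml in the working directory and using Scriptella's EtlExecutor API (scriptella.execution.EtlExecutor):

import java.io.File;

import scriptella.execution.EtlExecutor;

public class RunEtl {
    public static void main(String[] args) throws Exception {
        // Executes the <etl> file shown above; the driver jars referenced by
        // classpath= must be resolvable relative to it.
        EtlExecutor.newExecutor(new File("etl.xml")).execute();
    }
}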