What is the best way to process large CSV files?

2020-07-14 09:03发布

I have a third party system that generates a large amount of data each day (those are CSV files that are stored on FTP). There are 3 types of files that are being generated:

  • every 15 minutes (2 files). These files are pretty small (~ 2 Mb)
  • everyday at 5 PM (~ 200 - 300 Mb)
  • every midnight (this CSV file is about 1 Gb)

Overall the size of 4 CSVs is 1.5 Gb. But we should take into account that some of the files are being generated every 15 minutes. These data should be aggregated also (not so hard process but it will definitely require time). I need fast responses. I am thinking how to store these data and overall on the implementation.

We have java stack. The database is MS SQL Standard. From my measurements MS SQL Standard with other applications won't handle such load. What comes to my mind:

  • This could be an upgrade to MS SQL Enterprise with the separate server.
  • Usage of PostgreSQL on a separate server. Right now I'm working on PoC for this approach.

What would you recommend here? Probably there are better alternatives.

Edit #1

Those large files are new data for the each day.

4条回答
ゆ 、 Hurt°
2楼-- · 2020-07-14 09:42

You might consider looking into the Apache Spark project. After validating and curating the data maybe use Presto to run queries.

查看更多
老娘就宠你
3楼-- · 2020-07-14 09:52

Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.

查看更多
We Are One
4楼-- · 2020-07-14 09:53

Okay. After spending some time with this problem (it includes reading, consulting, experimenting, doing several PoC). I came up with the following solution.

Tl;dr

Database: PostgreSQL as it is good for CSV, free and open source.

Tool: Apache Spark is a good fit for such type of tasks. Good performance.

DB

Regarding database, it is an important thing to decide. What to pick and how it will work in future with such amount of data. It is definitely should be a separate server instance in order not to generate an additional load on the main database instance and not to block other applications.

NoSQL

I thought about the usage of Cassandra here, but this solution would be too complex right now. Cassandra does not have ad-hoc queries. Cassandra data storage layer is basically a key-value storage system. It means that you must "model" your data around the queries you need, rather than around the structure of the data itself.

RDBMS

I didn't want to overengineer here. And I stopped the choice here.

MS SQL Server

It is a way to go, but the big downside here is pricing. Pretty expensive. Enterprise edition costs a lot of money taking into account our hardware. Regarding pricing, you could read this policy document.

Another drawback here was the support of CSV files. This will be the main data source for us here. MS SQL Server can neither import nor export CSV.

  • MS SQL Server silently truncating a text field.

  • MS SQL Server's text encoding handling going wrong.

MS SQL Server throwing an error message because it doesn't understand quoting or escaping. More on that comparison could be found in the article PostgreSQL vs. MS SQL Server.

PostgreSQL

This database is a mature product and well battle-tested too. I heard a lot of positive feedback on it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax, good CSV support, moreover, it is open source.

It is worth to mention that SSMS is a way better than PGAdmin. SSMS has an autocomplete feature, multiple results (when you run several queries and get the several results at one, but in PGAdmin you get the last one only).

Anyway, right now I'm using DataGrip from JetBrains.

Processing Tool

I've looked through Spring Batch and Apache Spark. Spring Batch is a bit too low-level thing to use for this task and also Apache Spark provides the ability to scale easier if it will be needed in future. Anyway, Spring Batch could also do this work too.

Regarding Apache Spark example, the code could be found in learning-spark project. My choice is Apache Spark for now.

查看更多
做个烂人
5楼-- · 2020-07-14 09:59

You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library and it is is open-source and free (Apache V2 License)

Now for loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL server and PostgreSQL very quickly - from 25K to 200K rows/second, depending on the database and its config.

Here's a simple example on how the code to migrate from your CSV would look like:

public static void main(String ... args){
    //Configure CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

    //should grab column names from CSV files
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); //specific to your environment

    //Configures the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

    //Use only for postgres - their JDBC driver requires us to convert the input Strings from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

    //Creates a mapping between data stores "csv" and "database"
    DataStoreMapping mapping = engine.map(csv, database);

    // if names of CSV files and their columns match database tables an their columns
    // we can detect the mappings from one to the other automatically
    mapping.autodetectMappings();

    //loads the database.
    engine.executeCycle();

}

To improve performance, the framework allows you can manage the database schema and perform operations such as drop constraints and indexes, load the data, and recreate them. Data & schema transformations are also very well supported if you need.

Hope this helps.

查看更多
登录 后发表回答