I have a large JSON file, about 5 million records and a file size of about 32GB, that I need to get loaded into our Snowflake Data Warehouse. I need to get this file broken up into chunks of about 200k records (about 1.25GB) per file. I'd like to do this in either Node.JS or Python for deployment to an AWS Lambda function, unfortunately I haven't coded in either, yet. I have C# and a lot of SQL experience, and learning both node and python are on my to do list, so why not dive right in, right!?
My first question is "Which language would better serve this function? Python, or Node.JS?"
I know I don't want to read this entire JSON file into memory (or even the output smaller file). I need to be able to "stream" it in and out into the new file based on a record count (200k), properly close up the json objects, and continue into a new file for another 200k, and so on. I know Node can do this, but if Python can also do this, I feel like it would be easier to quickly start using for other ETL stuff I'll be doing soon.
My second question is "Based on your recommendation above, can you also recommend what modules I should require/import to help me get started? Primarily as it relates to not pulling the entire json file into memory? Maybe some tips, tricks, or 'How would you do it's? And if you're feeling really generous, some code example to help push me into the deep end on this?
I can't include a sample of the JSON data, as it contains personal information. But I can provide the JSON schema ...
{
"$schema": "http://json-schema.org/draft-04/schema#",
"items": {
"properties": {
"activities": {
"properties": {
"activity_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"frontlineorg_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"import_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"is_source": {
"items": {
"type": "boolean"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"address": {
"properties": {
"city": {
"items": {
"type": "string"
},
"type": "array"
},
"congress_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"congress_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_end_yr": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_start_yr": {
"items": {
"type": "integer"
},
"type": "array"
},
"county": {
"items": {
"type": "string"
},
"type": "array"
},
"formatted": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"latitude": {
"items": {
"type": "number"
},
"type": "array"
},
"longitude": {
"items": {
"type": "number"
},
"type": "array"
},
"number": {
"items": {
"type": "string"
},
"type": "array"
},
"observes_dst": {
"items": {
"type": "boolean"
},
"type": "array"
},
"post_directional": {
"items": {
"type": "null"
},
"type": "array"
},
"pre_directional": {
"items": {
"type": "null"
},
"type": "array"
},
"school_district": {
"items": {
"properties": {
"school_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"school_dist_type": {
"items": {
"type": "string"
},
"type": "array"
},
"school_grade_high": {
"items": {
"type": "string"
},
"type": "array"
},
"school_grade_low": {
"items": {
"type": "string"
},
"type": "array"
},
"school_lea_code": {
"items": {
"type": "integer"
},
"type": "array"
}
},
"type": "object"
},
"type": "array"
},
"secondary_number": {
"items": {
"type": "null"
},
"type": "array"
},
"secondary_unit": {
"items": {
"type": "null"
},
"type": "array"
},
"state": {
"items": {
"type": "string"
},
"type": "array"
},
"state_house_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"state_house_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"state_senate_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"state_senate_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"street": {
"items": {
"type": "string"
},
"type": "array"
},
"suffix": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"timezone": {
"items": {
"type": "string"
},
"type": "array"
},
"utc_offset": {
"items": {
"type": "integer"
},
"type": "array"
},
"zip": {
"items": {
"type": "integer"
},
"type": "array"
}
},
"type": "object"
},
"age": {
"type": "integer"
},
"anniversary": {
"properties": {
"date": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"baptism": {
"properties": {
"church_id": {
"type": "null"
},
"date": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"birth_dd": {
"type": "integer"
},
"birth_mm": {
"type": "integer"
},
"birth_yyyy": {
"type": "integer"
},
"church_attendance": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"cohabiting": {
"properties": {
"confidence": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood": {
"items": {
"type": "null"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"dating": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"divorced": {
"properties": {
"bool": {
"items": {
"type": "null"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood_considering": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"education": {
"properties": {
"est_level": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"email": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"is_work_school": {
"items": {
"type": "boolean"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"engaged": {
"properties": {
"insert_datetime_utc": {
"type": "null"
},
"likelihood": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"est_income": {
"properties": {
"est_level": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"ethnicity": {
"type": "string"
},
"first_name": {
"type": "string"
},
"formatted_birthdate": {
"type": "string"
},
"gender": {
"type": "string"
},
"head_of_household": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"home_church": {
"properties": {
"church_id": {
"type": "null"
},
"group_participant": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"is_coaching": {
"type": "null"
},
"is_giving": {
"type": "null"
},
"is_serving": {
"type": "null"
},
"membership_date": {
"type": "null"
},
"regular_attendee": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"hub_poid": {
"type": "integer"
},
"insert_datetime_utc": {
"type": "string"
},
"ip_address": {
"properties": {
"insert_datetime_utc": {
"type": "null"
},
"string": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"last_name": {
"type": "string"
},
"marriage_segment": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"married": {
"properties": {
"bool": {
"items": {
"type": "boolean"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"middle_name": {
"type": "string"
},
"miscellaneous": {
"properties": {
"attribute": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"value": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"name_suffix": {
"type": "null"
},
"name_title": {
"type": "null"
},
"newlywed": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"parent": {
"properties": {
"bool": {
"items": {
"type": "boolean"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood_expecting": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"person_id": {
"type": "integer"
},
"phone": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"number": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"type": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"property_rights": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"psychographic_cluster": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"religion": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"religious_segment": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"separated": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"significant_other": {
"properties": {
"first_name": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"last_name": {
"type": "null"
},
"middle_name": {
"type": "null"
},
"name_suffix": {
"type": "null"
},
"name_title": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"suppressed_datetime_utc": {
"type": "string"
},
"target_group": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
}
},
"type": "object"
},
"type": "array"
}
Answering the question whether Python or Node will be better for the task would be an opinion and we are not allowed to voice our opinions on Stack Overflow. You have to decide yourself what you have more experience in and what you want to work with - Python or Node.
If you go with Node, there are some modules that can help you with that task, that do streaming JSON parsing. E.g. those modules:
If you go with Python, there are streaming JSON parsers here as well:
Use this code in linux command prompt
Refer to this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
consider to use jq to preprocessing your json files
it could split and stream your large json files
see the official documentation and this questions for more.
extra: for your first questions jq is written by C, it's faster than python/node isn't it ?