How to convert JSON files stored in S3 to CSV using AWS Glue

Posted 2019-08-18 08:18

Question:

I have some JSON files stored in S3, and I need to convert them to CSV format, in the same folders where they currently are.

Currently I'm using Glue to map them into Athena, but, as I said, now I need to convert them to CSV.

Is it possible to use a Glue job to do that?

I'm trying to understand whether a Glue job can crawl through my S3 folder directories, converting every JSON file it finds to CSV (as new files).

If that's not possible, is there any AWS service that could help me do that?

EDIT1:

Here's the current code I'm trying to run:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the JSON files under the dealer-data prefix into a DynamicFrame
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data"]},
    format = "json")

# Write the frame back to the same prefix as CSV
outputGDF = glueContext.write_dynamic_frame.from_options(
    frame = inputGDF, connection_type = "s3",
    connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data"},
    format = "csv")

The job runs with no errors, but nothing seems to happen in the S3 folder. I assumed the code would pick up the JSON files from /dealer-data and convert them, in the same folder, to CSV. I'm probably wrong.
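A quick way to check whether anything is being read at all is to count the records in the DynamicFrame before writing; if the path only contains subfolders (no files directly under it), the frame comes back empty and the write produces nothing. A minimal sanity check, using the inputGDF from above:

# If dealer-data has no files at its top level, this prints 0
print("Record count:", inputGDF.count())
inputGDF.printSchema()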

EDIT2:

OK, I almost made it work the way I needed.

The thing is, create_dynamic_frame only works for folders that contain files directly, not for folders whose files sit in subfolders.

import sys
import logging
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the JSON files in one specific leaf folder
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2"]},
    format = "json")

# Write the CSV output; the path is treated as a prefix, not a file name
outputGDF = glueContext.write_dynamic_frame.from_options(
    frame = inputGDF, connection_type = "s3",
    connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2/bla.csv"},
    format = "csv")

The above works, but only for that directory (../2). Is there a way to read all files under a folder and its subfolders?
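One thing I noticed while testing: Spark writes a folder of part-* files under the output path rather than a single file named bla.csv. If one file per folder is needed, a workaround that seems to work is converting to a plain DataFrame and coalescing to one partition before writing (the csv-out prefix below is just a placeholder I made up):

# Still writes a directory, but containing a single part file
df = inputGDF.toDF()
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(
    "s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2/csv-out")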

Answer 1:

You should set the recurse option to True in the S3 connection options:

inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3", 
    connection_options = {
        "paths": ["s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2"],
        "recurse" : True
    }, 
    format = "json
)
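For completeness, a sketch of the full job with recurse set, writing to a separate prefix so the CSV output doesn't mix with the JSON input (dealer-data-csv is a hypothetical name; Spark will emit part-* files under it rather than one CSV):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# recurse = True makes Glue descend into every subfolder under the paths
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {
        "paths": ["s3://agco-sa-dfs-dv/dealer-data"],
        "recurse": True
    },
    format = "json")

# Write to a separate (hypothetical) prefix to keep input and output apart
outputGDF = glueContext.write_dynamic_frame.from_options(
    frame = inputGDF, connection_type = "s3",
    connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data-csv"},
    format = "csv")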