PySpark: Map a SchemaRDD into a SchemaRDD

Posted 2019-02-15 20:17

Question:

I am loading a file of JSON objects as a PySpark SchemaRDD. I want to change the "shape" of the objects (basically, I'm flattening them) and then insert into a Hive table.

The problem I have is that the following returns a PipelinedRDD not a SchemaRDD:

log_json.map(flatten_function)

(Where log_json is a SchemaRDD).

Is there either a way to preserve type, cast back to the desired type, or efficiently insert from the new type?

Answer 1:

More an idea than a real solution. Let's assume your data looks like this:

data = [
    {"foobar":
        {"foo": 1, "bar": 2, "fozbaz": {
            "foz": 0, "baz": {"b": -1, "a": -1, "z": -1}
        }}}]

import json

# jsonFile expects one JSON object per line, so end each record with a newline
with open("foobar.json", "w") as fw:
    for record in data:
        fw.write(json.dumps(record) + "\n")

First, let's load it and check the schema:

>>> srdd = sqlContext.jsonFile("foobar.json")
>>> srdd.printSchema()
root
 |-- foobar: struct (nullable = true)
 |    |-- bar: integer (nullable = true)
 |    |-- foo: integer (nullable = true)
 |    |-- fozbaz: struct (nullable = true)
 |    |    |-- baz: struct (nullable = true)
 |    |    |    |-- a: integer (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- z: integer (nullable = true)
 |    |    |-- foz: integer (nullable = true)

Now we register the table, as suggested by Justin Pihony, and extract the schema:

srdd.registerTempTable("srdd")
schema = srdd.schema().jsonValue()

Instead of flattening the data, we can flatten the schema using something similar to this:

def flatten_schema(schema):
    """Take schema as returned from schema().jsonValue()
    and return list of field names with full path"""
    def _flatten(schema, path="", accum=None):
        # Extract name of the current element
        name = schema.get("name")
        # If there is a name extend path
        if name is not None:
            path = "{0}.{1}".format(path, name) if path else name
        # It is some kind of struct
        if isinstance(schema.get("fields"), list):
            for field in schema.get("fields"):
                _flatten(field, path, accum)
        elif isinstance(schema.get("type"), dict):
            _flatten(schema.get("type"), path, accum)
        # It is an atomic type
        else:
            accum.append(path)
    accum = []
    _flatten(schema, "", accum)
    return accum

Add a small helper to format the query string:

def build_query(schema, df):
    select = ", ".join(
            "{0} AS {1}".format(field, field.replace(".", "_"))
            for field in flatten_schema(schema))
    return "SELECT {0} FROM {1}".format(select, df)

And finally, the results:

>>> sqlContext.sql(build_query(schema, "srdd")).printSchema()
root
 |-- foobar_bar: integer (nullable = true)
 |-- foobar_foo: integer (nullable = true)
 |-- foobar_fozbaz_baz_a: integer (nullable = true)
 |-- foobar_fozbaz_baz_b: integer (nullable = true)
 |-- foobar_fozbaz_baz_z: integer (nullable = true)
 |-- foobar_fozbaz_foz: integer (nullable = true)

Disclaimer: I did not look very deeply into the schema structure, so there are most likely cases that flatten_schema does not cover.
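
To see what the schema walk has to handle without starting Spark, the sketch below runs the same idea over a hand-written dict in the shape that schema().jsonValue() returns for the example above. The names leaf, struct, nested, iter_leaf_paths and sample_schema are illustrative assumptions, not part of the answer, and the dict is modelled on the printSchema() output rather than copied from a real session.

```python
# Hand-written stand-in for srdd.schema().jsonValue(); the exact dict is an
# assumption modelled on the printSchema() output shown earlier.
def leaf(name, type_name="integer"):
    return {"name": name, "type": type_name, "nullable": True, "metadata": {}}

def struct(*fields):
    return {"type": "struct", "fields": list(fields)}

def nested(name, *fields):
    return {"name": name, "type": struct(*fields), "nullable": True, "metadata": {}}

sample_schema = struct(
    nested("foobar",
           leaf("bar"),
           leaf("foo"),
           nested("fozbaz",
                  nested("baz", leaf("a"), leaf("b"), leaf("z")),
                  leaf("foz"))))

def iter_leaf_paths(node, path=""):
    """Yield the dotted path of every non-struct field (hypothetical helper)."""
    name = node.get("name")
    if name is not None:
        path = "{0}.{1}".format(path, name) if path else name
    node_type = node.get("type")
    if isinstance(node.get("fields"), list):
        # A struct type: descend into each of its fields
        for field in node["fields"]:
            for leaf_path in iter_leaf_paths(field, path):
                yield leaf_path
    elif isinstance(node_type, dict) and node_type.get("type") == "struct":
        # A field whose type is a nested struct: keep the path, walk the type
        for leaf_path in iter_leaf_paths(node_type, path):
            yield leaf_path
    else:
        # Atomic types, and also arrays and maps, end the walk here
        yield path

paths = list(iter_leaf_paths(sample_schema))
select = ", ".join("{0} AS {1}".format(p, p.replace(".", "_")) for p in paths)
query = "SELECT {0} FROM {1}".format(select, "srdd")
print(query)
```

Note that in this sketch an array column is reported as a single leaf and is not expanded, so arrays of structs would still need separate handling.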



Answer 2:

It looks like select is not available in Python, so you will have to call registerTempTable and write it as a SQL statement, like:

`SELECT flatten(*) FROM TABLE`

after setting up the function for use in SQL

sqlCtx.registerFunction("flatten", lambda x: flatten_function(x))

As @zero323 brought up, calling a function on * is probably not supported, so you can instead create a function that takes your columns as explicit arguments and pass them all in.



Answer 3:

The solution is applySchema:

mapped = log_json.map(flatten_function)
hive_context.applySchema(mapped, flat_schema).insertInto(name)

Where flat_schema is a StructType representing the schema in the same way as you would obtain from log_json.schema() (but flattened, obviously).
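
As a rough illustration of the two pieces this answer relies on, the sketch below flattens a nested record and derives a matching flat schema description in plain Python. The names flat_fields, flat_schema_json and flatten_record are hypothetical. Records are assumed to be plain nested dicts (Row objects would need converting first), and the schema dict is assumed to have the shape returned by jsonValue(). Converting the resulting dict into a StructType is left out so that the snippet runs without Spark.

```python
def flat_fields(schema_json, prefix=()):
    """Return [(path_tuple, type), ...] for every non-struct field."""
    out = []
    for field in schema_json["fields"]:
        path = prefix + (field["name"],)
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            # Nested struct: recurse and keep extending the path
            out.extend(flat_fields(ftype, path))
        else:
            out.append((path, ftype))
    return out

def flat_schema_json(schema_json):
    """Describe the flattened struct, joining path parts with underscores."""
    return {"type": "struct",
            "fields": [{"name": "_".join(path), "type": ftype,
                        "nullable": True, "metadata": {}}
                       for path, ftype in flat_fields(schema_json)]}

def flatten_record(record, schema_json):
    """Return the leaf values as a tuple, in the same order as the flat schema."""
    values = []
    for path, _ in flat_fields(schema_json):
        value = record
        for key in path:
            # Missing or non-dict intermediate values become None
            value = value.get(key) if isinstance(value, dict) else None
        values.append(value)
    return tuple(values)

# Small hand-written schema and record, used only for illustration
nested_schema = {"type": "struct", "fields": [
    {"name": "foobar", "nullable": True, "metadata": {},
     "type": {"type": "struct", "fields": [
         {"name": "bar", "type": "integer", "nullable": True, "metadata": {}},
         {"name": "foo", "type": "integer", "nullable": True, "metadata": {}}]}}]}
record = {"foobar": {"foo": 1, "bar": 2}}

flat_names = [f["name"] for f in flat_schema_json(nested_schema)["fields"]]
flat_row = flatten_record(record, nested_schema)
print(flat_names, flat_row)
```

Returning a tuple in schema order, rather than a dict, keeps each value aligned with its field in the flat schema.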



Answer 4:

You can try this one. It is a bit long, but it works:

def flat_table(df, table_name):
    """Register df as table_name and return a SQL query that flattens it."""
    def rec(l, in_array, name):
        for v in l:
            if isinstance(v['type'], dict):
                if 'fields' in v['type']:
                    rec(name=name + [v['name']], l=v['type']['fields'], in_array=False)
                if 'elementType' in v['type']:
                    # Assumes the array elements are structs
                    rec(name=name + [v['name']], l=v['type']['elementType']['fields'], in_array=True)
            else:  # recursion stop rule
                node = ".".join(name) + '.' if name else ''
                # If this is an array, every element has to be exploded,
                # so mark the field with a trailing '.array'
                if in_array:
                    field_list.append('{node}{subnode}.array'.format(node=node, subnode=v['name']))
                else:
                    field_list.append('{node}{subnode}'.format(node=node, subnode=v['name']))

    field_list = []
    l = df.schema.jsonValue()['fields']
    df.registerTempTable(table_name)
    rec(l, in_array=False, name=[table_name])

    # Create the select statement
    inner_fields = []
    outer_fields = []
    flag = True

    for x in field_list:
        f = x.split('.')
        if f[-1] != 'array':
            inner_fields.append('{field} as {name}'.format(field=".".join(f), name=f[-1]))
            of = ['a'] + f[-1:]
            outer_fields.append('{field} as {name}'.format(field=".".join(of), name=of[-1]))
        else:
            # Add the array to the inner query for explosion only once
            if flag:
                inner_fields.append('explode({field}) as {name}'.format(field=".".join(f[:-2]), name=f[-3]))
                flag = False
            of = ['a'] + f[-3:-1]
            outer_fields.append('{field} as {name}'.format(field=".".join(of), name=of[-1]))

    q = """select {outer_fields}
        from (select {inner_fields}
        from {table_name})      a""".format(outer_fields=',\n'.join(outer_fields),
                                            inner_fields=',\n'.join(inner_fields),
                                            table_name=table_name)
    return q