I have a DataFrame (df) with more than 50 columns of various data types, for example:
df3.printSchema()
|-- CtpJobId: string (nullable = true)
|-- TransformJobStateId: string (nullable = true)
|-- LastError: string (nullable = true)
|-- PriorityDate: string (nullable = true)
|-- QueuedTime: string (nullable = true)
|-- AccurateAsOf: string (nullable = true)
|-- SentToDevice: string (nullable = true)
|-- StartedAtDevice: string (nullable = true)
|-- ProcessStart: string (nullable = true)
|-- LastProgressAt: string (nullable = true)
|-- ProcessEnd: string (nullable = true)
|-- ClipFirstFrameNumber: string (nullable = true)
|-- ClipLastFrameNumber: double (nullable = true)
|-- SourceNamedLocation: string (nullable = true)
|-- TargetId: string (nullable = true)
|-- TargetNamedLocation: string (nullable = true)
|-- TargetDirectory: string (nullable = true)
|-- TargetFilename: string (nullable = true)
|-- Description: string (nullable = true)
|-- AssignedDeviceId: string (nullable = true)
|-- DeviceResourceId: string (nullable = true)
|-- DeviceName: string (nullable = true)
|-- srcDropFrame: string (nullable = true)
|-- srcDuration: double (nullable = true)
|-- srcFrameRate: double (nullable = true)
|-- srcHeight: double (nullable = true)
|-- srcMediaFormat: string (nullable = true)
|-- srcWidth: double (nullable = true)
Now I want to convert all columns that should share a type in one go, for example:
timestamp_type = [
'PriorityDate', 'QueuedTime', 'AccurateAsOf', 'SentToDevice',
'StartedAtDevice', 'ProcessStart', 'LastProgressAt', 'ProcessEnd'
]
integer_type = [
'ClipFirstFrameNumber', 'ClipLastFrameNumber', 'TargetId', 'srcHeight',
'srcMediaFormat', 'srcWidth'
]
I know how to do it one column at a time, which is what I'm doing now:
from pyspark.sql.types import IntegerType, TimestampType
df3 = df3.withColumn("PriorityDate", df3["PriorityDate"].cast(TimestampType()))
df3 = df3.withColumn("QueuedTime", df3["QueuedTime"].cast(TimestampType()))
df3 = df3.withColumn("AccurateAsOf", df3["AccurateAsOf"].cast(TimestampType()))
df3 = df3.withColumn("srcMediaFormat", df3["srcMediaFormat"].cast(IntegerType()))
df3 = df3.withColumn("DeviceResourceId", df3["DeviceResourceId"].cast(IntegerType()))
df3 = df3.withColumn("AssignedDeviceId", df3["AssignedDeviceId"].cast(IntegerType()))
But this looks ugly, and I can easily miss a column that I want to change. Is there a way to write a function that handles a whole list of columns of the same type, so that I can implement something like convert_data_type and just pass it the column names? Thanks in advance.
Instead of enumerating all of your values, you should use a loop:
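A minimal sketch of such a loop, assuming the target type is passed as a Spark type-name string (Column.cast accepts either a DataType instance or a string such as "timestamp" or "int"). The helper name cast_columns is hypothetical and not part of PySpark:

```python
def cast_columns(df, columns, dtype):
    """Cast every column named in `columns` to `dtype` and return the new DataFrame.

    `cast_columns` is a hypothetical helper, not a PySpark API.
    `dtype` may be a DataType instance or a type-name string such as "timestamp".
    """
    for name in columns:
        # withColumn with an existing column name replaces that column
        df = df.withColumn(name, df[name].cast(dtype))
    return df


# Intended usage with the lists from the question:
# df3 = cast_columns(df3, timestamp_type, "timestamp")
# df3 = cast_columns(df3, integer_type, "int")
```

Because DataFrames are immutable, each withColumn call returns a new DataFrame, so the loop has to reassign df on every iteration.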
Or equivalently, you can use functools.reduce:
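A sketch of the reduce form, under the same assumption that the type is given as a Spark type-name string or a DataType instance. The helper name cast_columns_reduce is hypothetical:

```python
from functools import reduce


def cast_columns_reduce(df, columns, dtype):
    """Fold over `columns`, casting each one to `dtype`.

    `cast_columns_reduce` is a hypothetical helper, not a PySpark API.
    """
    return reduce(
        # acc is the DataFrame built so far; name is the next column to cast
        lambda acc, name: acc.withColumn(name, acc[name].cast(dtype)),
        columns,
        df,  # initial value: the original DataFrame
    )


# Intended usage with the lists from the question:
# df3 = cast_columns_reduce(df3, timestamp_type, "timestamp")
# df3 = cast_columns_reduce(df3, integer_type, "int")
```

The DataFrame is passed as the initial value of the fold, so an empty column list simply returns it unchanged.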