I am trying to read from a csv from in GCP Storage, converting that into dictionaries and then write to a Bigquery table as follows:
p | ReadFromText("gs://bucket/file.csv")
| (beam.ParDo(BuildAdsRecordFn()))
| WriteToBigQuery('ads_table',dataset='dds',project='doubleclick-2',schema=ads_schema)
where: 'doubleclick-2' and 'dds' are existing project and dataset, ads_schema is defined as follows:
ads_schema='Advertiser_ID:INTEGER,Campaign_ID:INTEGER,Ad_ID:INTEGER,Ad_Name:STRING,Click_through_URL:STRING,Ad_Type:STRING'
BuildAdsRecordFn() is defined as follows:
class AdsRecord:
dict = {}
def __init__(self, line):
record = line.split(",")
self.dict['Advertiser_ID'] = record[0]
self.dict['Campaign_ID'] = record[1]
self.dict['Ad_ID'] = record[2]
self.dict['Ad_Name'] = record[3]
self.dict['Click_through_URL'] = record[4]
self.dict['Ad_Type'] = record[5]
class BuildAdsRecordFn(beam.DoFn):
def __init__(self):
super(BuildAdsRecordFn, self).__init__()
def process(self, element):
text_line = element.strip()
ads_record = AdsRecord(text_line).dict
return ads_record
However, when I run the pipeline, I got the following error:
"dataflow_job_18146703755411620105-B" failed., (6c011965a92e74fa): BigQuery job "dataflow_job_18146703755411620105-B" in project "doubleclick-2" finished with error(s): errorResult: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON parsing error in row starting at position 0: Value encountered without start of object
Here is the sample testing data I used:
100001,1000011,10000111,ut,https://bloomberg.com/aliquam/lacus/morbi.xml,Brand-neutral
100001,1000011,10000112,eu,http://weebly.com/sed/vel/enim/sit.jsp,Dynamic Click
I'm new to both Dataflow and python so could not figure out what could be wrong in the above code. Greatly appreciate any help!
I just implemented your code and it didn't work as well, but I got a different message error (something like "you can't return a
dict
as the result of aParDo
").This code worked normally for me, notice not only I'm not using the class attribute
dict
as well as now a list is returned:This is the data I simulated:
I also filtered out the header line that I created in my file (in the
Filter
operation), if you don't have a header then this is not necessary