Pyspark, error:input doesn't have expected num

2019-09-21 10:01发布

问题:

First I made two tables(RDD) to use following commands

rdd1=sc.textFile('checkouts').map(lambda line:line.split(',')).map(lambda fields:((fields[0],fields[3],fields[5]), 1) )
rdd2=sc.textFile('inventory2').map(lambda line:line.split(',')).map(lambda fields:((fields[0],fields[8],fields[10]), 1) )

The keys in first RDD are BibNum, ItemCollection and CheckoutDateTime. And when I checked the values for first RDD to use rdd1.take(2) it shows

[((u'BibNum', u'ItemCollection', u'CheckoutDateTime'), 1), ((u'1842225', u'namys', u'05/23/2005 03:20:00 PM'), 1)]

and similarly the keys in second RDD are BibNum, ItemCollection and Itemlocation. And the values are as following:

[((u'BibNum', u'ItemCollection', u'ItemLocation'), 1), ((u'3011076', u'ncrdr', u'qna'), 1)]

Once I created two RDDs, I tried to join those two to use rdd3=rdd1.join(rdd2) After it, when I checked the value of rdd3 to use rdd3.take(2). Following error happened.

IndexError: list index out of range

I do not know why it happend. Please enlighten me if you know the reason. If you have any doubts for my question or code, just let me know I will try to clarify it. Thanks

edit--- I put up my sample input data for each RDD

BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,
1982511,10039952527,jcvhs,ncvidnf,VHS J796.2 KNOW_YO 2000,08/11/2005 01:52:00 PM,,,,,,,
2026467,10040985615,accd,nacd,CD 782.421642 Y71T,10/19/2005 07:47:00 PM,,,,,,,
2174698,10047696215,jcbk,ncpic,E KROSOCZ,12/29/2005 03:42:00 PM,,,,,,,
1602768,10028318730,jcbk,ncpic,E BLACK,10/08/2005 02:15:00 PM,,,,,,,
2285195,10053424767,accd,cacd,CD 782.42166 F19R,09/30/2005 10:16:00 AM,,,,Input,BinNumber,Date,BinNumber+Month
2245955,10048392665,jcbk,ncnf,J949.73 Or77S 2004,12/05/2005 05:03:00 PM,,,,,,,
770918,10044828100,jcbk,ncpic,E HILL,07/22/2005 03:17:00 PM,,,,,,,

.

BibNum,Title,Author,ISBN,PublicationYear,Publisher,Subjects,ItemType,ItemCollection,FloatingItem,ItemLocation,ReportDate,ItemCount,,,,,,,,,,,,,
3011076,A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield| Frederick Gardner| Megan Petasky| and Allen Tam.,O'Ryan| Ellie,1481425730| 1481425749| 9781481425735| 9781481425742,2014,Simon Spotlight|,Musicians Fiction| Bullfighters Fiction| Best friends Fiction| Friendship Fiction| Adventure and adventurers Fiction,jcbk,ncrdr,Floating,qna,09/01/2017,1,,,,,,,,,,,,,
2248846,Naruto. Vol. 1| Uzumaki Naruto / story and art by Masashi Kishimoto ; [English adaptation by Jo Duffy].,Kishimoto| Masashi| 1974-,1569319006,2003| c1999.,Viz|,Ninja Japan Comic books strips etc| Comic books strips etc Japan Translations into English| Graphic novels,acbk,nycomic,NA,lcy,09/01/2017,1,,,,,,,,,,,,,
3209270,Peace| love & Wi-Fi : a ZITS treasury / by Jerry Scott and Jim Borgman.,Scott| Jerry| 1955-,144945867X| 9781449458676,2014,Andrews McMeel Publishing|,Duncan Jeremy Fictitious character Comic books strips etc| Teenagers United States Comic books strips etc| Parent and teenager Comic books strips etc| Families Comic books strips etc| Comic books strips etc| Comics Graphic works| Humorous comics,acbk,nycomic,NA,bea,09/01/2017,1,,,,,,,,,,,,,
1907265,The Paris pilgrims : a novel / Clancy Carlile.,Carlile| Clancy| 1930-,786706155,c1999.,Carroll & Graf|,Hemingway Ernest 1899 1961 Fiction| Biographical fiction| Historical fiction,acbk,cafic,NA,cen,09/01/2017,1,,,,,,,,,,,,,
1644616,Erotic by nature : a celebration of life| of love| and of our wonderful bodies / edited by David Steinberg.,,094020813X,1991| c1988.,Red Alder Books/Down There Press|,Erotic literature American| American literature 20th century,acbk,canf,NA,cen,09/01/2017,1,,,,,,,,,,,,,

edit---------------------------------------------- date_count --> DataFrame[BibNum : string, ItemCollection : string, CheckoutDateTime : string, count : BigInt ] .. shows likes this but when I checked the value of it using date_count.take(2) , it shows error like this : Input doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.

df_final schema looks like this: DataFrame[BibNum:string, ItemType:string, ItemCollection:string, ItemBarcode:string, CallNumber:string, CheckoutDateTime:string, Title:string, Author:string, ISBN:string, PublicationYear:string, Publisher:string, Subjects:string, FloatingItem:string, ItemLocation:string, ReportDate:string, ItemLocation:string, : string , : string, : string .... : string , : string ]

回答1:

So I'll try to answer your question. The solution might be syntactically haywire but I'll try to do my best (I don't have an environment to test right now). Let me know if this is what you are looking for otherwise I can help you with fine tuning the solution.

Here is the documentation for Join in Pyspark

So when you read the files :

rdd1=sc.textFile('checkouts').map(lambda line:line.split(','))
rdd2=sc.textFile('inventory2').map(lambda line:line.split(','))
# Define the headers for both the files
rdd1_header = rdd1.first()
rdd2_header = rdd2.first()

# Define the dataframe
rdd1_df = rdd1.filter(lambda line: line != rdd1_header).toDF(rdd1_header)
rdd2_df = rdd2.filter(lambda line: line != rdd2_header).toDF(rdd2_header)

common_cols = [x for x in rdd1_df.columns if x in rdd2_df.columns]

df_final = rdd1_df.join(rdd2_df, on=common_cols)
date_count = df_final.groupBy(["BibNum", "ItemCollection", "CheckoutDateTime"]).count()

EDITS :

1) Your error : "pyspark.sql.utils.AnalysisException: u"Reference 'ItemCollection' is ambiguous, could be ItemCollection#3, ItemCollection#21" is due to multiple columns being generated after the join. What you need to do is include all common columns in your join. I will mention it in the code.

2) Another issue : Some of weird parts are added into the last part of each RDD, such as -- [Row(BibNum=u'1842225', ItemBarcode=u'10035249209', ItemType=u'acbk', ItemCollection=u'namys', CallNumber=u'MYSTERY ELKINS1999', CheckoutDateTime=u'05/23/2005 03:20:00 PM', =u'', =u'', =u'', =u'', =u'', =u'', =u'')

For this, you had mentioned your CSV file as follows :

BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,

Now if you can see there are a lot of trailing commas after the date column. i.e. ',,,,,,,', which is giving those extra empty columns (after the split on commas) which you can drop.