First I made two tables(RDD) to use following commands
rdd1=sc.textFile('checkouts').map(lambda line:line.split(',')).map(lambda fields:((fields[0],fields[3],fields[5]), 1) )
rdd2=sc.textFile('inventory2').map(lambda line:line.split(',')).map(lambda fields:((fields[0],fields[8],fields[10]), 1) )
The keys in first RDD are BibNum, ItemCollection and CheckoutDateTime. And when I checked the values for first RDD to use rdd1.take(2) it shows
[((u'BibNum', u'ItemCollection', u'CheckoutDateTime'), 1), ((u'1842225', u'namys', u'05/23/2005 03:20:00 PM'), 1)]
and similarly the keys in second RDD are BibNum, ItemCollection and Itemlocation. And the values are as following:
[((u'BibNum', u'ItemCollection', u'ItemLocation'), 1), ((u'3011076', u'ncrdr', u'qna'), 1)]
Once I created two RDDs, I tried to join those two to use rdd3=rdd1.join(rdd2)
After it, when I checked the value of rdd3 to use rdd3.take(2). Following error happened.
IndexError: list index out of range
I do not know why it happend. Please enlighten me if you know the reason. If you have any doubts for my question or code, just let me know I will try to clarify it. Thanks
edit--- I put up my sample input data for each RDD
BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,
1982511,10039952527,jcvhs,ncvidnf,VHS J796.2 KNOW_YO 2000,08/11/2005 01:52:00 PM,,,,,,,
2026467,10040985615,accd,nacd,CD 782.421642 Y71T,10/19/2005 07:47:00 PM,,,,,,,
2174698,10047696215,jcbk,ncpic,E KROSOCZ,12/29/2005 03:42:00 PM,,,,,,,
1602768,10028318730,jcbk,ncpic,E BLACK,10/08/2005 02:15:00 PM,,,,,,,
2285195,10053424767,accd,cacd,CD 782.42166 F19R,09/30/2005 10:16:00 AM,,,,Input,BinNumber,Date,BinNumber+Month
2245955,10048392665,jcbk,ncnf,J949.73 Or77S 2004,12/05/2005 05:03:00 PM,,,,,,,
770918,10044828100,jcbk,ncpic,E HILL,07/22/2005 03:17:00 PM,,,,,,,
.
BibNum,Title,Author,ISBN,PublicationYear,Publisher,Subjects,ItemType,ItemCollection,FloatingItem,ItemLocation,ReportDate,ItemCount,,,,,,,,,,,,,
3011076,A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield| Frederick Gardner| Megan Petasky| and Allen Tam.,O'Ryan| Ellie,1481425730| 1481425749| 9781481425735| 9781481425742,2014,Simon Spotlight|,Musicians Fiction| Bullfighters Fiction| Best friends Fiction| Friendship Fiction| Adventure and adventurers Fiction,jcbk,ncrdr,Floating,qna,09/01/2017,1,,,,,,,,,,,,,
2248846,Naruto. Vol. 1| Uzumaki Naruto / story and art by Masashi Kishimoto ; [English adaptation by Jo Duffy].,Kishimoto| Masashi| 1974-,1569319006,2003| c1999.,Viz|,Ninja Japan Comic books strips etc| Comic books strips etc Japan Translations into English| Graphic novels,acbk,nycomic,NA,lcy,09/01/2017,1,,,,,,,,,,,,,
3209270,Peace| love & Wi-Fi : a ZITS treasury / by Jerry Scott and Jim Borgman.,Scott| Jerry| 1955-,144945867X| 9781449458676,2014,Andrews McMeel Publishing|,Duncan Jeremy Fictitious character Comic books strips etc| Teenagers United States Comic books strips etc| Parent and teenager Comic books strips etc| Families Comic books strips etc| Comic books strips etc| Comics Graphic works| Humorous comics,acbk,nycomic,NA,bea,09/01/2017,1,,,,,,,,,,,,,
1907265,The Paris pilgrims : a novel / Clancy Carlile.,Carlile| Clancy| 1930-,786706155,c1999.,Carroll & Graf|,Hemingway Ernest 1899 1961 Fiction| Biographical fiction| Historical fiction,acbk,cafic,NA,cen,09/01/2017,1,,,,,,,,,,,,,
1644616,Erotic by nature : a celebration of life| of love| and of our wonderful bodies / edited by David Steinberg.,,094020813X,1991| c1988.,Red Alder Books/Down There Press|,Erotic literature American| American literature 20th century,acbk,canf,NA,cen,09/01/2017,1,,,,,,,,,,,,,
edit----------------------------------------------
date_count --> DataFrame[BibNum : string, ItemCollection : string, CheckoutDateTime : string, count : BigInt ]
.. shows likes this
but when I checked the value of it using date_count.take(2) , it shows error like this : Input doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.
df_final schema looks like this: DataFrame[BibNum:string, ItemType:string, ItemCollection:string, ItemBarcode:string, CallNumber:string, CheckoutDateTime:string, Title:string, Author:string, ISBN:string, PublicationYear:string, Publisher:string, Subjects:string, FloatingItem:string, ItemLocation:string, ReportDate:string, ItemLocation:string, : string , : string, : string .... : string , : string
]