How can I create a multiindex data frame with the

2019-08-17 08:14发布

问题:

I have to create a multi index data frame condisering data contained in two different data frames. For each index of the second data frame (Date), for each row of the first data frame, if the value in the column Date of the first data frame is equal to the index of the second data frame then create me a multi index dataframe with each date, the number of tweets published each day and the features of each row.

This is the first data frame with Datas from Twitter:

        Date            Full text   Retweets    Likes
333     2018-04-13  RT @Tesla...    2838             0
332     2018-04-13  @timkhiggins... 7722             40733
331     2018-04-13  @TheEconomist.. 1911             18634

This is the second data frame with Datas from Tesla stock market:

                Open        High     Low         Close  Volume       Gap
Date                        
2018-04-13  283.000000  296.859985   279.519989  294.089996 8569400  11.089996
2018-04-14  303.320000  304.940002   291.619995  291.970001 7286800  -11.349999
2018-04-25  287.760010  288.000000   273.420013  275.010010 8945800  -12.750000

This is what I have tried to do:

for i in TeslaData.index:
    for row in sortedTweetsData.iterrows():
        if row[1]==i:
            NumTweetsByDay+=1
            for num in NumTweetsByDay:
                idx=pd.MultiIndex.from_product([[i],[NumTweetsBy]])
                colum=col
                df= pd.DataFrame(row,idx,column)

The output that I am looking for is the following one:

Date        Number of Tweets    Full text       Retweets    Likes

2018-04-13        1              RT @Tesla...    2838        0
                  2              @timkhiggins... 7722        40733
                  3              @TheEconomist.. 1911        18634

回答1:

If I understand correctly, you want to filter twitter data by date if there is an entry in the stock dataset for the same date.

You can do this with isin():

# convert datatypes first:
sortedTweetsData['Date'] = pd.to_datetime(sortedTweetsData['Date'])
TeslaData.index = pd.to_datetime(TeslaData.index)

# do filtering
df = sortedTweetsData[sortedTweetsData['Date'].isin(TeslaData.index.values)]

next, you can determine how many tweets each group has:

groupsizes = df.groupby(by='Date').size()

and use that to build a list of tuples, to define your multiindex (there is likely a more elegant way to do this):

tups = [(ix, gs + 1) for ix in groupsizes.index.values for gs in range(groupsizes[ix])]

finally:

df.index = pd.MultiIndex.from_tuples(tups, names=['Date', 'Number of Tweets'])