I've been asked to generate a new variable from the data in an existing one. Specifically, I need to draw values at random (using the random function) from the original
variable so that the result has at least 10x as many observations as the original, and then save the result as a new variable.
This is my dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv
The variable I want to work with is area.
This is my attempt, but it raises a "'module' object is not callable" error:
import pandas as pd
import random as rand
dataFrame = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
area = dataFrame['area']
random_area = rand(area)
print(random_area)
You can use the sample function with replace=True (df here is the question's dataFrame):
df = df.sample(n=len(df) * 10, replace=True)
Or, to sample only the area column, use:
area = df.area.sample(n=len(df) * 10, replace=True)
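To make the sampling step concrete, here is a minimal, self-contained sketch. The small DataFrame is a made-up stand-in for the forest fires data (so it runs without a download), and random_state and reset_index are optional additions that are not part of the one-liner above.

```python
import pandas as pd

# Hypothetical stand-in for the forest fires data; only `area` matters here.
df = pd.DataFrame({"area": [0.0, 0.0, 0.36, 0.43, 6.44]})

# Draw 10x as many observations as the original, with replacement.
# random_state is optional and only makes the draw repeatable.
random_area = df["area"].sample(n=len(df) * 10, replace=True, random_state=0)

# The sampled Series keeps the original (now repeated) index labels;
# reset_index(drop=True) swaps them for a fresh 0..n-1 index.
random_area = random_area.reset_index(drop=True)

print(len(random_area))
```

Without replace=True, sample raises an error when n is larger than the number of rows, so the flag is what allows the result to be bigger than the original.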
Another option would involve np.random.choice, and would look something like:
import numpy as np
df = df.iloc[np.random.choice(len(df), len(df) * 10)]
The idea is to generate random indices from 0 to len(df) - 1. The first argument specifies the (exclusive) upper bound and the second (len(df) * 10) specifies the number of indices to generate. Because np.random.choice samples with replacement by default, indices can repeat. We then use the generated indices to index into df.
If you just want the area column, this is sufficient:
area = df.iloc[np.random.choice(len(df), len(df) * 10), df.columns.get_loc('area')]
Index.get_loc converts the "area" label to its integer position, as required by iloc.
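The index-based approach can be sketched end to end as follows. The two-column DataFrame and its values are invented for illustration, and the np.random.seed call is only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the values are made up for illustration.
df = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4],
    "area": [0.0, 0.0, 0.36, 0.43, 6.44],
})

np.random.seed(0)  # optional, only for reproducibility

# Random row positions in [0, len(df)), drawn with replacement (the default).
positions = np.random.choice(len(df), size=len(df) * 10)

# iloc is purely positional, so translate the column label to a position.
col = df.columns.get_loc("area")
random_area = df.iloc[positions, col]

print(len(random_area))
```

Indexing with a single integer column position returns a Series, so random_area here has the same shape as the result of sampling the column directly.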
df = pd.DataFrame({'A': list('aab'), 'B': list('123')})
df
A B
0 a 1
1 a 2
2 b 3
# Sample 3 times the original size
df.sample(n=len(df) * 3, replace=True)
A B
2 b 3
1 a 2
1 a 2
2 b 3
1 a 2
0 a 1
0 a 1
2 b 3
2 b 3
df.iloc[np.random.choice(len(df), len(df) * 3)]
A B
0 a 1
1 a 2
1 a 2
0 a 1
2 b 3
0 a 1
0 a 1
0 a 1
2 b 3
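As for the error in the question: import random as rand binds the module itself to the name rand, so rand(area) tries to call a module, which is what raises "'module' object is not callable". If the standard library's random module has to be used, one possible sketch is random.choices, which samples with replacement. The list area_values below is a made-up stand-in for the column's values (for the real data, something like dataFrame['area'].tolist() would supply them).

```python
import random

# Hypothetical stand-in for the values of the area column.
area_values = [0.0, 0.0, 0.36, 0.43, 6.44]

random.seed(0)  # optional, only for reproducibility

# Call a function inside the module rather than the module itself.
# random.choices samples with replacement, so k may exceed the list length.
random_area = random.choices(area_values, k=len(area_values) * 10)

print(len(random_area))
```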