I have the following code from a machine learning book in Python:
copy_set.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
              s=copy_set["population"], label="population", figsize=(10, 7),
              c="median_house_value", cmap=plt.get_cmap("jet"))
median_house_value and population are two columns in the copy_set dataframe. I don't understand why I have to use copy_set['population'] for the argument s, while for the argument c it is possible to use only the column name median_house_value. When I try to use only the column name for the parameter s, I get an error message:
TypeError: ufunc 'sqrt' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Very good question. df.plot is a wrapper around several of matplotlib's plotting functions. For kind="scatter", matplotlib's scatter function will be called. The arguments to df.plot() that pandas handles itself are first converted to the data within the Series you get from the dataframe's column of the respective name.
For example,
df.plot(kind="scatter", x="lon", y="lat")
will be converted to
ax.scatter(x=df["lon"].values, y=df["lat"].values)
Remaining arguments are passed through to scatter, hence
df.plot(kind="scatter", x="lon", y="lat", some_argument_pandas_doesnt_know=True)
will result in
ax.scatter(x=df["lon"].values, y=df["lat"].values, some_argument_pandas_doesnt_know=True)
So while pandas converts the arguments x, y and c, it doesn't do so for s. Hence s is simply passed on to ax.scatter, but that matplotlib function doesn't know what a string like "population" would mean.
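The pass-through can be illustrated with a small stand-in for the wrapper. The function scatter_like_pandas below is hypothetical and far simpler than what pandas really does; it only assumes, as described above, that x, y and c are looked up as columns while every other keyword goes to ax.scatter unchanged. The column names and the random data are made up for the illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_like_pandas(df, x, y, c=None, **kwargs):
    """Hypothetical stand-in for df.plot(kind="scatter"); NOT pandas' real code."""
    fig, ax = plt.subplots()
    # x, y and c are resolved to column data by the wrapper ...
    if isinstance(c, str) and c in df.columns:
        c = df[c].values
    # ... while all remaining keywords, including s, are forwarded untouched.
    return ax.scatter(x=df[x].values, y=df[y].values, c=c, **kwargs)


rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((20, 2)), columns=["lon", "lat"])
df["pop"] = rng.integers(5, 300, size=20)
df["med"] = rng.random(20) * 1e5

# A bare string for s reaches matplotlib as-is and cannot be used as a size.
# Depending on the matplotlib version this surfaces as TypeError or ValueError.
try:
    scatter_like_pandas(df, "lon", "lat", c="med", s="pop")
    string_size_failed = False
except (TypeError, ValueError):
    string_size_failed = True

# Supplying the data itself works.
sc = scatter_like_pandas(df, "lon", "lat", c="med", s=df["pop"].values)
```

The string "med" never reaches matplotlib because the stand-in replaces it with column data first, whereas "pop" arrives as a plain string and is rejected.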
For arguments that are passed on to the matplotlib function, one needs to stick to matplotlib's signature and, in the case of s, supply the data directly.
Note, however, that matplotlib's scatter itself also accepts strings for its arguments. This requires telling it which dataset they should be taken from, which is done via the data argument. Hence the following works fine and is the matplotlib equivalent of the pandas call in the question:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np; np.random.seed(42)
df = pd.DataFrame(np.random.rand(20,2), columns=["lon", "lat"])
df["pop"] = np.random.randint(5,300,size=20)
df["med"] = np.random.rand(20)*1e5
fig, ax = plt.subplots(figsize=(10,7))
sc = ax.scatter(x="lon", y="lat", alpha=0.4,
                s="pop", label="population",
                c="med", cmap="jet", data=df)
fig.colorbar(sc, label="med")
ax.set(xlabel="longitude", ylabel="latitude")
plt.show()
Finally, you may ask whether supplying the data to matplotlib via the data argument would also be possible by passing it through the pandas wrapper. Unfortunately not, because pandas uses data as an argument internally, so it will not be passed through.
Therefore your two options are:
- Use pandas as in the question and supply the data itself via the s argument instead of the column name.
- Use matplotlib as shown here and use column names for all arguments. (Or use the data itself, which you see most often when looking at matplotlib code.)
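The first option could look like the following sketch. It uses made-up data in the style of the matplotlib example (the columns lon, lat, pop and med are placeholders, not the book's dataset) and passes the population Series itself as s, which works whether or not the installed pandas version would also resolve a column name there.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((20, 2)), columns=["lon", "lat"])
df["pop"] = rng.integers(5, 300, size=20)
df["med"] = rng.random(20) * 1e5

# x, y and c are given as column names; s is given as the data itself.
ax = df.plot(kind="scatter", x="lon", y="lat", alpha=0.4,
             s=df["pop"], label="population", figsize=(10, 7),
             c="med", cmap="jet")

# Marker areas that matplotlib actually received for the scatter collection.
sizes = ax.collections[0].get_sizes()
# In an interactive session, call plt.show() here to display the figure.
```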