pandas.plot argument c vs s

Posted 2020-07-27 05:27

Question:

I have the following code from a machine learning book in python:

copy_set.plot(kind="scatter", x="longitude",
              y="latitude", alpha=0.4,
              s=copy_set["population"],
              label="population", figsize=(10, 7),
              c="median_house_value", cmap=plt.get_cmap("jet"))

median_house_value and population are two columns in the copy_set dataframe. I don't understand why for argument s I have to use copy_set['population'], while for argument c it is enough to pass the column name median_house_value. When I try to pass only the column name for parameter s, I get an error message:

TypeError: ufunc 'sqrt' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Answer 1:

Very good question. df.plot is a wrapper around several of matplotlib's plotting functions; for kind="scatter" it calls matplotlib's scatter function. For some arguments, df.plot() first looks up the dataframe column of that name and passes the column's data on to matplotlib.

E.g.

df.plot(kind="scatter", x="lon", y="lat")

will be converted to

ax.scatter(x=df["lon"].values, y=df["lat"].values)

Remaining arguments are passed through to scatter, hence

df.plot(kind="scatter", x="lon", y="lat", some_argument_pandas_doesnt_know=True)

will result in

ax.scatter(x=df["lon"].values, y=df["lat"].values, some_argument_pandas_doesnt_know=True)
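
The pass-through can be checked with a real matplotlib keyword. The following is a minimal sketch, assuming a small random dataframe with made-up columns "lon" and "lat", and using edgecolors as an example of an argument that pandas does not interpret itself:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import numpy as np
import pandas as pd

# Hypothetical demo data, not the dataset from the question
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((20, 2)), columns=["lon", "lat"])

# x and y are column names that pandas resolves itself;
# edgecolors is not a pandas argument and is forwarded to ax.scatter.
ax = df.plot(kind="scatter", x="lon", y="lat", edgecolors="black")

# The artist created by ax.scatter carries the forwarded setting.
points = ax.collections[0]
print(points.get_edgecolor())
```

If the keyword reaches matplotlib, the printed edge color is the RGBA value for black.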

So while pandas converts the arguments x, y and c, it does not do so for s. s is therefore simply passed on to ax.scatter, and that matplotlib function does not know what a string like "population" is supposed to mean; hence the TypeError in the question.
For arguments that are passed on to the matplotlib function, you need to stick to matplotlib's signature; in the case of s, that means supplying the data directly.
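
As a minimal sketch of that rule, assuming a random stand-in dataframe with made-up columns "lon", "lat", "pop" and "med" in place of the book's data, the pandas call takes a column name for c but the column itself for s. Newer pandas releases may also accept a column name for s; passing the Series should work either way.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import numpy as np
import pandas as pd

# Hypothetical stand-in for copy_set from the question
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((20, 2)), columns=["lon", "lat"])
df["pop"] = rng.integers(5, 300, size=20)
df["med"] = rng.random(20) * 1e5

# c is given as a column name (pandas looks it up),
# s is given as the data itself (handed to matplotlib as is).
ax = df.plot(kind="scatter", x="lon", y="lat", alpha=0.4,
             s=df["pop"], label="population", figsize=(10, 7),
             c="med", cmap="jet")

# The marker sizes stored on the scatter artist match the "pop" column.
sizes = ax.collections[0].get_sizes()
```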

Note, however, that matplotlib's scatter itself also accepts strings for its arguments. For this to work, you have to tell it which dataset the strings refer to, which is done via the data argument. Hence the following works fine and is the matplotlib equivalent of the pandas call in the question:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np; np.random.seed(42)

df = pd.DataFrame(np.random.rand(20,2), columns=["lon", "lat"])
df["pop"] = np.random.randint(5,300,size=20)
df["med"] = np.random.rand(20)*1e5

fig, ax = plt.subplots(figsize=(10,7))
sc = ax.scatter(x = "lon", y = "lat", alpha = 0.4, 
                s = "pop", label = "population" , 
                c = "med" , cmap = "jet", data=df)
fig.colorbar(sc, label="med")
ax.set(xlabel="longitude", ylabel="latitude")

plt.show()

Finally, you may ask whether the data argument could simply be passed through the pandas wrapper to matplotlib. Unfortunately not: pandas uses data as an argument internally, so it is not passed through. Therefore your two options are:

  1. Use pandas as in the question and supply the data itself via the s argument instead of the column name.
  2. Use matplotlib as shown here and use column names for all arguments. (Or use the data itself, which you see most often when looking at matplotlib code.)