Plotting statistics (seaborn)

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import pandas as pd

Examination data

We again work on our physics examination dataset. We now focus on calculating some statistics and quick plots. Let's read the data again:

In [4]:
df=pd.read_csv("data/examination.csv.gz",sep=";")

Basic statistics

We can get a quick overview of the data by using the describe function.

In [5]:
df.describe()
Out[5]:
(Output: a wide, 16-column summary table of count, mean, std, min, quartiles and max for every numeric column. The transposed, much more readable version follows in the next cell.)
In [6]:
df.describe().T
Out[6]:
count mean std min 25% 50% 75% max
yeargroup 19811.0 12.737015 34.817782 1.0 12.0 12.0 12.0 2014.0
grade 21444.0 4.022617 1.074826 1.0 3.0 4.0 5.0 5.0
percentage 21444.0 66.553861 19.247675 0.0 54.0 70.0 82.0 100.0
points 21444.0 100.219269 28.865995 0.0 82.0 105.0 123.0 150.0
questions 20635.0 22.418173 7.112939 2.0 18.0 22.0 28.0 40.0
essay_content 5047.0 12.948286 3.860824 0.0 11.0 14.0 16.0 18.0
essay_style 5047.0 4.228849 1.023183 0.0 4.0 5.0 5.0 5.0
complex_exercises 5047.0 28.152170 12.347646 0.0 19.0 29.0 39.0 47.0
points_written 20635.0 55.282966 18.886694 8.0 41.0 56.0 70.0 100.0
measurement_content 5058.0 35.715896 9.203527 0.0 31.0 38.0 43.0 90.0
measurement_style 5058.0 4.014433 1.096126 0.0 3.0 4.0 5.0 10.0
points_oral 20715.0 47.302872 12.863038 0.0 40.0 50.0 58.0 120.0
year 21890.0 2013.364824 1.157214 2011.0 2012.0 2013.0 2014.0 2015.0
topic_content 15657.0 45.667369 11.646716 0.0 40.0 50.0 55.0 110.0
topic_style 15657.0 4.162866 1.100549 0.0 3.0 5.0 5.0 10.0
complex_exercises2 15588.0 28.829099 12.855951 0.0 19.0 29.0 40.0 50.0
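
If we only care about a few columns, or about different percentiles, describe also takes a percentiles argument. A minimal sketch, reusing column names from this dataset:

df[["points", "percentage"]].describe(percentiles=[0.1, 0.9]).T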

The corr function gives us the correlation table; the method argument selects Pearson, Spearman, or Kendall correlation.

In [7]:
df.corr(method="pearson")
Out[7]:
(Output: the full 16×16 Pearson correlation matrix. Entries are NaN where two columns have no overlapping non-missing rows; the matrix is plotted as a heatmap below.)
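
Switching to a rank-based correlation only requires changing the method argument; a quick sketch on two of the score columns:

df[["points_written", "points_oral"]].corr(method="spearman")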

We can easily plot it as a heatmap.

In [8]:
c=df.corr(method="pearson")

imshow(c,interpolation="none",cmap='BrBG')
xticks(range(len(c.index)),c.index,rotation=90)
yticks(range(len(c.index)),c.index)
colorbar();
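
As an alternative sketch, seaborn (introduced later in this notebook) can draw the same heatmap with tick labels and the colorbar handled automatically:

import seaborn as sns
sns.heatmap(c, cmap="BrBG", vmin=-1, vmax=1);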

Plotting

A great strength of pandas is that we can make plots from DataFrames with very concise syntax. pandas uses the matplotlib library for this, so it has to be imported (here this was done by %pylab inline).

We can set the plot parameters (title, ylabel, etc.) in the same way as with the matplotlib functions we have already seen.

First, we increase the default font size so that axis labels are easier to read:

In [9]:
rcParams["font.size"]=12
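
Other matplotlib defaults can be changed the same way; a sketch using standard rcParams keys:

rcParams["figure.figsize"] = (9, 6)  # default figure size in inches
rcParams["axes.grid"] = True         # draw a grid on every plot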

Distributions

In [10]:
points=df[['points_written',
         'complex_exercises']]
points.head()
Out[10]:
points_written complex_exercises
0 NaN NaN
1 NaN NaN
2 86.0 NaN
3 23.0 NaN
4 66.0 NaN
In [11]:
points.hist(figsize=(8,4));
In [12]:
points.plot.hist(figsize=(9,6),alpha=0.5);
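
By default plot.hist uses 10 bins; the bins keyword gives a finer (or coarser) resolution. A variant sketch:

points.plot.hist(bins=30, figsize=(9, 6), alpha=0.5);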
In [13]:
points.plot.box(figsize=(9,6));

Covariances

In [14]:
points.plot.scatter(x='points_written',y= 'complex_exercises');
In [15]:
points.plot.hexbin(x='points_written',y= 'complex_exercises',gridsize=20);

Histograms and pairwise scatter plots together in a scatter matrix.

In [17]:
pd.plotting.scatter_matrix(points,figsize=(9,9));

Further figures

The number of normal and advanced level students over time:

In [18]:
toplot=df.groupby(['level_abbrev','year'])[['level']].count()
toplot
Out[18]:
level
level_abbrev year
A 2011 50
2012 1152
2013 1457
2014 1462
2015 1357
N 2011 358
2012 4829
2013 4176
2014 3491
2015 3558

Let's use the unstack function to plot the groups easily:

In [19]:
toplot=toplot.unstack(level=0)
toplot
Out[19]:
level
level_abbrev A N
year
2011 50 358
2012 1152 4829
2013 1457 4176
2014 1462 3491
2015 1357 3558
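
The same table could also be built in a single step with pivot_table; a sketch, assuming the same column names:

toplot_alt = df.pivot_table(index="year", columns="level_abbrev",
                            values="level", aggfunc="count")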
In [20]:
toplot.plot(kind="area", figsize=(9, 6))
ylabel("Number of students")
ylim(0,7000)
xticks(range(2011,2016),map(str,range(2011,2016)));

We could have a look at some pie charts showing which school types sent students to the normal and advanced level exams between 2011 and 2015. We make two subplots for that. What does the autopct keyword do? What else could its format string be?

In [21]:
t=df[(df["attendance"]=="present")].groupby(["level_abbrev","school_type"])
t=t.size().unstack(level=1)
t.plot.pie(subplots=True,autopct="%.1f",figsize=(12,4));
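
For reference, autopct is a format string (or a function) applied to each wedge's percentage; for example, "%.1f%%" would also print a percent sign after the number. A variant sketch:

t.plot.pie(subplots=True, autopct="%.1f%%", figsize=(12, 4));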

Seaborn package

In [23]:
import seaborn as sns

Seaborn is a package that makes it easy to create nice-looking statistical plots.

Histogram and density function estimation:

In [24]:
sns.distplot(df['points_written'].dropna());

Only kernel density estimate:

In [25]:
sns.kdeplot(df['points_written'].dropna());

Estimated cumulative distribution function.

In [26]:
sns.kdeplot(df['points_written'].dropna(),cumulative=True);

More distributions in one plot:

In [29]:
points=df.loc[df['level_abbrev']=='N',['points_written',
           'points_oral',
           'school_type']]
points.head()
Out[29]:
points_written points_oral school_type
0 NaN NaN comprehensive
1 NaN NaN comprehensive
2 86.0 60.0 comprehensive
3 23.0 38.0 comprehensive
4 66.0 60.0 comprehensive

Drawing all the points:

In [31]:
sns.stripplot(x='school_type', y='points_written',
              data=points.dropna().iloc[:1000],jitter=True);

Boxplots:

In [32]:
sns.boxplot(x='school_type', y='points_written',
              data=points);

Density functions next to each other:

In [33]:
sns.violinplot(x='school_type', y='points_written',
              data=points);
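
A related option is sns.swarmplot, which draws every point but shifts them so they do not overlap (a sketch, subsampled because it is slow on many points):

sns.swarmplot(x='school_type', y='points_written',
              data=points.dropna().iloc[:1000]);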

Multidimensional distributions:

In [34]:
x='points_written'
y='points_oral'

Estimated density:

In [35]:
sns.jointplot(x=x, y=y, data=points, kind="kde");

Histograms:

In [36]:
sns.jointplot(x=x, y=y, data=points,kind="hex", color="k");

A pairplot, as seen in the previous notebooks.

In [37]:
sns.pairplot(points.dropna());
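
Passing a hue column colors the points by a categorical variable; a sketch using the school_type column already in points:

sns.pairplot(points.dropna(), hue="school_type");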

Linear models:

In [38]:
sns.lmplot(x=y, y=x, data=points.dropna().iloc[:30,:]);

With histograms:

In [39]:
sns.jointplot(x=y, y=x, data=points.dropna().iloc[:30,:],kind='reg');