Plotting statistics (seaborn)

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import pandas as pd

Examination data

We again work on our physics examination dataset. We now focus on calculating some statistics and quick plots. Let's read the data again:

In [4]:
df=pd.read_csv("data/examination.csv.gz",sep=";")

Basic statistics

We can get a quick overview of the data by using the describe function.

In [5]:
df.describe()
Out[5]:
(Output: a wide, 16-column summary table of count, mean, std, min, quartiles and max for every numeric column. The transposed, much more readable version follows in the next cell.)
In [6]:
df.describe().T
Out[6]:
count mean std min 25% 50% 75% max
yeargroup 19811.0 12.737015 34.817782 1.0 12.0 12.0 12.0 2014.0
grade 21444.0 4.022617 1.074826 1.0 3.0 4.0 5.0 5.0
percentage 21444.0 66.553861 19.247675 0.0 54.0 70.0 82.0 100.0
points 21444.0 100.219269 28.865995 0.0 82.0 105.0 123.0 150.0
questions 20635.0 22.418173 7.112939 2.0 18.0 22.0 28.0 40.0
essay_content 5047.0 12.948286 3.860824 0.0 11.0 14.0 16.0 18.0
essay_style 5047.0 4.228849 1.023183 0.0 4.0 5.0 5.0 5.0
complex_exercises 5047.0 28.152170 12.347646 0.0 19.0 29.0 39.0 47.0
points_written 20635.0 55.282966 18.886694 8.0 41.0 56.0 70.0 100.0
measurement_content 5058.0 35.715896 9.203527 0.0 31.0 38.0 43.0 90.0
measurement_style 5058.0 4.014433 1.096126 0.0 3.0 4.0 5.0 10.0
points_oral 20715.0 47.302872 12.863038 0.0 40.0 50.0 58.0 120.0
year 21890.0 2013.364824 1.157214 2011.0 2012.0 2013.0 2014.0 2015.0
topic_content 15657.0 45.667369 11.646716 0.0 40.0 50.0 55.0 110.0
topic_style 15657.0 4.162866 1.100549 0.0 3.0 5.0 5.0 10.0
complex_exercises2 15588.0 28.829099 12.855951 0.0 19.0 29.0 40.0 50.0
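
If we only care about a few columns, or about different percentiles, describe also takes a percentiles argument. A minimal sketch, reusing column names from this dataset:

df[["points", "percentage"]].describe(percentiles=[0.1, 0.9]).T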

The corr function gives us the correlation table; the method argument selects Pearson, Spearman, or Kendall correlation.

In [7]:
df.corr(method="pearson")
Out[7]:
(Output: the full 16×16 Pearson correlation matrix. Entries are NaN where two columns have no overlapping non-missing rows; the matrix is plotted as a heatmap below.)
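
Switching to a rank-based correlation only requires changing the method argument; a quick sketch on two of the score columns:

df[["points_written", "points_oral"]].corr(method="spearman")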

We can easily plot it as a heatmap.

In [8]:
c=df.corr(method="pearson")

imshow(c,interpolation="none",cmap='BrBG')
xticks(range(len(c.index)),c.index,rotation=90)
yticks(range(len(c.index)),c.index)
colorbar();
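
As an alternative sketch, seaborn (introduced later in this notebook) can draw the same heatmap with tick labels and the colorbar handled automatically:

import seaborn as sns
sns.heatmap(c, cmap="BrBG", vmin=-1, vmax=1);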

Plotting

A great strength of pandas is that we can make plots from DataFrames with very concise syntax. pandas uses the matplotlib library for this, so it has to be imported (here this was done by %pylab inline).

We can set the plot parameters (title, ylabel, etc.) in the same way as with the matplotlib functions we have already seen.

First, we increase the default font size so that axis labels are easier to read:

In [9]:
rcParams["font.size"]=12
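
Other matplotlib defaults can be changed the same way; a sketch using standard rcParams keys:

rcParams["figure.figsize"] = (9, 6)  # default figure size in inches
rcParams["axes.grid"] = True         # draw a grid on every plot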

Distributions

In [10]:
points=df[['points_written',
         'complex_exercises']]
points.head()
Out[10]:
points_written complex_exercises
0 NaN NaN
1 NaN NaN
2 86.0 NaN
3 23.0 NaN
4 66.0 NaN
In [11]:
points.hist(figsize=(8,4));
In [12]:
points.plot.hist(figsize=(9,6),alpha=0.5);
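
By default plot.hist uses 10 bins; the bins keyword gives a finer (or coarser) resolution. A variant sketch:

points.plot.hist(bins=30, figsize=(9, 6), alpha=0.5);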
In [13]:
points.plot.box(figsize=(9,6));

Covariances

In [14]:
points.plot.scatter(x='points_written',y= 'complex_exercises');
In [15]:
points.plot.hexbin(x='points_written',y= 'complex_exercises',gridsize=20);

Histograms and pairwise scatter plots together in a scatter matrix.

In [17]:
pd.plotting.scatter_matrix(points,figsize=(9,9));

Further figures

The number of normal and advanced level students over time:

In [18]:
toplot=df.groupby(['level_abbrev','year'])[['level']].count()
toplot
Out[18]:
level
level_abbrev year
A 2011 50
2012 1152
2013 1457
2014 1462
2015 1357
N 2011 358
2012 4829
2013 4176
2014 3491
2015 3558

Let's use the unstack function to plot the groups easily:

In [19]:
toplot=toplot.unstack(level=0)
toplot
Out[19]:
level
level_abbrev A N
year
2011 50 358
2012 1152 4829
2013 1457 4176
2014 1462 3491
2015 1357 3558
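
The same table could also be built in a single step with pivot_table; a sketch, assuming the same column names:

toplot_alt = df.pivot_table(index="year", columns="level_abbrev",
                            values="level", aggfunc="count")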
In [20]:
toplot.plot(kind="area", figsize=(9, 6))
ylabel("Number of students")
ylim(0,7000)
xticks(range(2011,2016),map(str,range(2011,2016)));

We could have a look at some pie charts showing which school types sent students to the normal and advanced level exams between 2011 and 2015. We make two subplots for that. What does the autopct keyword do? What else could its format string be?

In [21]:
t=df[(df["attendance"]=="present")].groupby(["level_abbrev","school_type"])
t=t.size().unstack(level=1)
t.plot.pie(subplots=True,autopct="%.1f",figsize=(12,4));
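
For reference, autopct is a format string (or a function) applied to each wedge's percentage; for example, "%.1f%%" would also print a percent sign after the number. A variant sketch:

t.plot.pie(subplots=True, autopct="%.1f%%", figsize=(12, 4));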

Seaborn package

In [23]:
import seaborn as sns

Seaborn is a package that makes it easy to create nice-looking statistical plots.

Histogram and density function estimation:

In [24]:
sns.distplot(df['points_written'].dropna());

Only kernel density estimate:

In [25]:
sns.kdeplot(df['points_written'].dropna());

Estimated cumulative distribution function.

In [26]:
sns.kdeplot(df['points_written'].dropna(),cumulative=True);

More distributions in one plot:

In [29]:
points=df.loc[df['level_abbrev']=='N',['points_written',
           'points_oral',
           'school_type']]
points.head()
Out[29]:
points_written points_oral school_type
0 NaN NaN comprehensive
1 NaN NaN comprehensive
2 86.0 60.0 comprehensive
3 23.0 38.0 comprehensive
4 66.0 60.0 comprehensive

Drawing all the points:

In [31]:
sns.stripplot(x='school_type', y='points_written',
              data=points.dropna().iloc[:1000],jitter=True);

Boxplots:

In [32]:
sns.boxplot(x='school_type', y='points_written',
              data=points);

Density functions next to each other:

In [33]:
sns.violinplot(x='school_type', y='points_written',
              data=points);
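
A related option is sns.swarmplot, which draws every point but shifts them so they do not overlap (a sketch, subsampled because it is slow on many points):

sns.swarmplot(x='school_type', y='points_written',
              data=points.dropna().iloc[:1000]);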

Multidimensional distributions:

In [34]:
x='points_written'
y='points_oral'

Estimated density:

In [35]:
sns.jointplot(x=x, y=y, data=points, kind="kde");

Histograms:

In [36]:
sns.jointplot(x=x, y=y, data=points,kind="hex", color="k");

A pairplot, as seen in the previous notebooks.

In [37]:
sns.pairplot(points.dropna());
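
Passing a hue column colors the points by a categorical variable; a sketch using the school_type column already in points:

sns.pairplot(points.dropna(), hue="school_type");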

Linear models:

In [38]:
sns.lmplot(x=y, y=x, data=points.dropna().iloc[:30,:]);

With histograms:

In [39]:
sns.jointplot(x=y, y=x, data=points.dropna().iloc[:30,:],kind='reg');