%pylab inline
import pandas as pd
We again work on our physics examination dataset. We now focus on calculating some statistics and making quick plots. Let's read the data again:
df=pd.read_csv("data/examination.csv.gz",sep=";")
We can get a quick overview of the data by using the describe function.
df.describe()
df.describe().T
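To see what describe reports, here is a minimal sketch on a small hypothetical frame (the column names mimic the examination data, the values are made up):

```python
import pandas as pd

# A tiny hypothetical frame standing in for the examination data
toy = pd.DataFrame({"points_written": [10, 20, 30, 40],
                    "points_oral":    [5, 15, 25, 35]})

stats = toy.describe()   # count, mean, std, min, quartiles, max per numeric column
print(stats)             # statistics as rows, columns of the frame as columns
print(stats.T)           # transposed: one row per column, often easier to scan
```

Transposing with `.T` is purely cosmetic; with many columns the one-row-per-column layout is much easier to read.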
The corr function gives us the Pearson, Spearman, or Kendall correlation table.
df.corr(method="pearson", numeric_only=True)
We can easily plot it.
c=df.corr(method="pearson", numeric_only=True)
imshow(c,interpolation="none",cmap='BrBG')
xticks(range(len(c.index)),c.index,rotation=90)
yticks(range(len(c.index)),c.index)
colorbar();
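Each cell of this table is the Pearson coefficient of one column pair; a quick sanity check with numpy on hypothetical columns (names and values invented for the sketch):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with a
                    "c": [4.0, 3.0, 2.0, 1.0]})  # perfectly anti-correlated with a

c = toy.corr(method="pearson")
# pandas agrees with numpy's corrcoef for any pair of columns
assert np.isclose(c.loc["a", "b"], np.corrcoef(toy["a"], toy["b"])[0, 1])
print(c)
```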
A great strength of pandas is that we can make plots from DataFrames with a very concise syntax. pandas uses the matplotlib library for this, therefore we have to import it.
We can set the parameters of the plot (title, ylabel etc.) similarly to the methods we've already seen in matplotlib.
First, we increase the font size:
rcParams["font.size"]=12
Distributions
points=df[['points_written',
'complex_exercises']]
points.head()
points.hist(figsize=(8,4));
points.plot.hist(figsize=(9,6),alpha=0.5);
points.plot.box(figsize=(9,6));
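The boxes in plot.box are drawn from the column quantiles, which we can compute directly. A sketch on hypothetical values:

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], name="points_written")
# The box spans the lower and upper quartiles; the line inside is the median
q = toy.quantile([0.25, 0.5, 0.75])
print(q)
```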
Covariances
points.plot.scatter(x='points_written',y= 'complex_exercises');
points.plot.hexbin(x='points_written',y= 'complex_exercises',gridsize=20);
Histograms and covariances together.
pd.plotting.scatter_matrix(points,figsize=(9,9));
The number of normal and advanced level students over time:
toplot=df.groupby(['level_abbrev','year'])[['level']].count()
toplot
Let's use the unstack function to plot the groups easily:
toplot=toplot.unstack(level=0)
toplot
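unstack pivots one index level of a grouped result into columns, which is exactly the wide shape the plotting functions want. A minimal sketch of the same pattern on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({"level_abbrev": ["N", "N", "A", "A"],
                    "year": [2011, 2012, 2011, 2012],
                    "level": ["normal", "normal", "advanced", "advanced"]})

counts = toy.groupby(["level_abbrev", "year"])[["level"]].count()
wide = counts.unstack(level=0)   # level_abbrev values become columns
print(wide)                      # one row per year, one column per level
```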
toplot.plot(kind="area", figsize=(9, 6))
ylabel("Number of students")
ylim(0,7000)
xticks(range(2011,2016),map(str,range(2011,2016)));
We could have a look at some pie charts to count which school types sent students to normal or advanced level exams between 2011 and 2015. We make two subplots for that. What did the autopct keyword do? What else could its format string be?
t=df[(df["attendance"]=="present")].groupby(["level_abbrev","school_type"])
t=t.size().unstack(level=1)
t.plot.pie(subplots=True,autopct="%.1f",figsize=(12,4));
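To answer the autopct question: it is an old-style %-format string (a callable also works) that matplotlib applies to each wedge's percentage of the total. A sketch with an invented wedge:

```python
# autopct="%.1f" formats each wedge's share with one decimal place
share = 100 * 250 / 750     # a hypothetical wedge holding 250 of 750 students
label = "%.1f" % share
print(label)                # 33.3
# "%.1f%%" would append a percent sign, "%d" would round to whole percents
```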
import seaborn as sns
This package makes it easy to create nice-looking statistical plots.
Histogram and density function estimation:
sns.distplot(df['points_written'].dropna());
Only kernel density estimate:
sns.kdeplot(df['points_written']);
Estimated cumulative distribution function.
sns.kdeplot(df['points_written'].dropna(),cumulative=True);
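The cumulative curve estimates the CDF; an empirical version can be computed directly with numpy. A sketch on hypothetical scores:

```python
import numpy as np

scores = np.array([40.0, 55.0, 55.0, 70.0, 90.0])   # invented written points
x = np.sort(scores)
ecdf = np.arange(1, len(x) + 1) / len(x)            # fraction of scores <= x
print(list(zip(x, ecdf)))
```

The kde-based curve is a smoothed version of this staircase.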
More distributions in one plot:
points=df.loc[df['level_abbrev']=='N',['points_written',
'points_oral',
'school_type']]
points.head()
Drawing all the points:
sns.stripplot(x='school_type', y='points_written',
data=points.dropna().iloc[:1000],jitter=True);
Boxplots:
sns.boxplot(x='school_type', y='points_written',
data=points);
Density functions next to each other:
sns.violinplot(x='school_type', y='points_written',
data=points);
Multidimensional distributions:
x='points_written'
y='points_oral'
Estimated density:
sns.jointplot(x=x, y=y, data=points, kind="kde");
Histograms:
sns.jointplot(x=x, y=y, data=points,kind="hex", color="k");
Pairplot from the previous notebooks.
sns.pairplot(points.dropna());
Linear models:
sns.lmplot(x=y, y=x, data=points.dropna().iloc[:30,:]);
With histograms:
sns.jointplot(x=y, y=x, data=points.dropna().iloc[:30,:],kind='reg');
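lmplot fits an ordinary least-squares line behind the scenes; the same fit can be reproduced with numpy.polyfit. A sketch on invented, exactly linear points:

```python
import numpy as np

oral = np.array([10.0, 20.0, 30.0, 40.0])
written = 2.0 * oral + 5.0                 # exact linear relation for the sketch
slope, intercept = np.polyfit(oral, written, deg=1)
print(slope, intercept)                    # recovers ~2.0 and ~5.0
```

On real, noisy data the fitted line minimizes the squared vertical distances, which is what the shaded confidence band in lmplot is built around.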