Database operations 1. (groupby)¶

import pandas as pd # usual imports

Physics high school examination ('erettsegi') results¶

Next, we are going to work with the last years' "érettségi" results in physics. This is going to raise many issues related to the analysis of databases/tables. For example, how to manage missing or incompatible data. The website containing the data provides a simple csv textfile, where the column separator is the ; sign.

df=pd.read_csv("data/examination.csv.gz",sep=";")
df.head()

The year and level columns give the year and the level of the examinations. We can also determine whether the student had the examination in the fall or in the spring term (term), and there is information on the school and the type of education of the student. Moreover, the detailed written and oral part points, the overall points and the overall percentage achieved is also there in the data.

It's worth printing the first few lines out to have a look at the data. We write it transposed, because it fits on the screen this way.

df.head().T

Groupby¶

First, we make a smaller table to get the point more easily.

subdf=df[['level','school_type','points_written']]
subdf.head()

We can group the values in a DataFrame according to the values in a column, and then we can use operations on these groups.

g=subdf.groupby('level')
g

<pandas.core.groupby.DataFrameGroupBy object at 0x7f30353b5a20>

Let us now take the maximum of the groups!

g.max()

Or the mean! (Observe that the mean is only computed from the numerical columns.)

g.mean()

We can also select one column.

g['points_written'].mean()

level
advanced    63.410343
normal      52.651527
Name: points_written, dtype: float64

If we select it between two square brackets, we get a DataFrame object, not a pandas.Series, and the printed output is nicer.

g[['points_written']].mean()

We can also group the data based on multiple columns. Then, we have to give a list to groupby.

g=subdf.groupby(['level','school_type'])
g.mean()

We can also have the grouped columns left as normal columns instead of an index.

g=subdf.groupby(['level','school_type']
           ,as_index=False)
g.mean()

If we have a hierarchically indexed table, we can rearrange it by using unstack, if needed.

g=subdf.groupby(['level','school_type'])
g.mean().unstack(level=0)

g.mean().unstack(level=1)

In the following, we ask some sample questions and answer them.¶

What was the average of the grades of the advanced level examinations in each year?

First, we select the rows corresponding to advances level examinations.

advanced=df[df["level_abbrev"]=="A"]

Then, we group these rows according to years.

group=advanced.groupby("year")

We select the grade column, of which we compute the average. Because of the grouping, the average is going to be created according to years.

group[["grade"]].mean()

All in one.

df[df["level_abbrev"]=="A"].groupby("year")[["grade"]].mean()

Who did better in the 2015 normal level examinations? Boys or girls?

First, we select the rows corresponding to normal level 2015 data. We have to concatenate the two conditions by using the & operator, that can make the and operation elementwise on lists. We have to put brackets around the conditions to make the interpreter read them correctly.

After that, we group by the gender of the students, and then we take the average of the overall points.

n_2015=df[(df["level_abbrev"]=="N") & (df["year"]==2015)]
n_2015.groupby("gender")[["points"]].mean()

Let's count the number of attending/missing participants according to each type of schools!

Here, we grouped using two columns at once, therefore, we gave the column names as a list to groupby. After that, we take an arbitrary column such as year, and we count the nonzero rows by using count().

df.groupby(["school_type",
            "attendance"])[["year"]].count()

	county	city	yeargroup	school_type	gender	education_type	subject	level	exam_type	language	...	points_written	measurement_content	measurement_style	points_oral	year	level_abbrev	term	topic_content	topic_style	complex_exercises2
0	Budapest	Budapest	12.0	comprehensive	male	day	physics	normal	forwarded	hungarian	...	NaN	NaN	NaN	NaN	2013	N	spring	NaN	NaN	NaN
1	Budapest	Budapest	12.0	comprehensive	female	day	physics	normal	forwarded	hungarian	...	NaN	NaN	NaN	NaN	2013	N	spring	NaN	NaN	NaN
2	Budapest	Budapest	12.0	comprehensive	male	day	physics	normal	forwarded	hungarian	...	86.0	NaN	NaN	60.0	2013	N	spring	55.0	5.0	48.0
3	Budapest	Budapest	13.0	comprehensive	female	day	physics	normal	forwarded	hungarian	...	23.0	NaN	NaN	38.0	2013	N	spring	36.0	2.0	9.0
4	Budapest	Budapest	12.0	comprehensive	male	day	physics	normal	forwarded	hungarian	...	66.0	NaN	NaN	60.0	2013	N	spring	55.0	5.0	42.0

	0	1	2	3	4
county	Budapest	Budapest	Budapest	Budapest	Budapest
city	Budapest	Budapest	Budapest	Budapest	Budapest
yeargroup	12	12	12	13	12
school_type	comprehensive	comprehensive	comprehensive	comprehensive	comprehensive
gender	male	female	male	female	male
education_type	day	day	day	day	day
subject	physics	physics	physics	physics	physics
level	normal	normal	normal	normal	normal
exam_type	forwarded	forwarded	forwarded	forwarded	forwarded
language	hungarian	hungarian	hungarian	hungarian	hungarian
grade	1	1	5	3	5
percentage	26	19	97	40	84
points	40	29	146	61	126
attendance	absent	absent	present	present	present
questions	NaN	NaN	38	14	24
essay_content	NaN	NaN	NaN	NaN	NaN
essay_style	NaN	NaN	NaN	NaN	NaN
complex_exercises	NaN	NaN	NaN	NaN	NaN
points_written	NaN	NaN	86	23	66
measurement_content	NaN	NaN	NaN	NaN	NaN
measurement_style	NaN	NaN	NaN	NaN	NaN
points_oral	NaN	NaN	60	38	60
year	2013	2013	2013	2013	2013
level_abbrev	N	N	N	N	N
term	spring	spring	spring	spring	spring
topic_content	NaN	NaN	55	36	55
topic_style	NaN	NaN	5	2	5
complex_exercises2	NaN	NaN	48	9	42

	level	school_type	points_written
0	normal	comprehensive	NaN
1	normal	comprehensive	NaN
2	normal	comprehensive	86.0
3	normal	comprehensive	23.0
4	normal	comprehensive	66.0

		points_written
level	school_type
advanced	-	51.651575
	comprehensive	65.804943
	vocational	53.922330
normal	-	51.004021
	comprehensive	58.023469
	vocational	40.857205

	level	school_type	points_written
0	advanced	-	51.651575
1	advanced	comprehensive	65.804943
2	advanced	vocational	53.922330
3	normal	-	51.004021
4	normal	comprehensive	58.023469
5	normal	vocational	40.857205

	grade
year
2011	4.125000
2012	4.429338
2013	4.232607
2014	4.404090
2015	4.432165