Database operations 1. (groupby)

In [1]:
import pandas as pd # usual imports

Physics high school examination ('erettsegi') results

Next, we are going to work with the last years' "érettségi" results in physics. This is going to raise many issues related to the analysis of databases/tables. For example, how to manage missing or incompatible data. The website containing the data provides a simple csv textfile, where the column separator is the ; sign.

In [13]:
df=pd.read_csv("data/examination.csv.gz",sep=";")
df.head()
Out[13]:
county city yeargroup school_type gender education_type subject level exam_type language ... points_written measurement_content measurement_style points_oral year level_abbrev term topic_content topic_style complex_exercises2
0 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... NaN NaN NaN NaN 2013 N spring NaN NaN NaN
1 Budapest Budapest 12.0 comprehensive female day physics normal forwarded hungarian ... NaN NaN NaN NaN 2013 N spring NaN NaN NaN
2 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... 86.0 NaN NaN 60.0 2013 N spring 55.0 5.0 48.0
3 Budapest Budapest 13.0 comprehensive female day physics normal forwarded hungarian ... 23.0 NaN NaN 38.0 2013 N spring 36.0 2.0 9.0
4 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... 66.0 NaN NaN 60.0 2013 N spring 55.0 5.0 42.0

5 rows × 28 columns

The year and level columns give the year and the level of the examinations. We can also determine whether the student had the examination in the fall or in the spring term (term), and there is information on the school and the type of education of the student. Moreover, the detailed written and oral part points, the overall points and the overall percentage achieved is also there in the data.

It's worth printing the first few lines out to have a look at the data. We write it transposed, because it fits on the screen this way.

In [14]:
df.head().T
Out[14]:
0 1 2 3 4
county Budapest Budapest Budapest Budapest Budapest
city Budapest Budapest Budapest Budapest Budapest
yeargroup 12 12 12 13 12
school_type comprehensive comprehensive comprehensive comprehensive comprehensive
gender male female male female male
education_type day day day day day
subject physics physics physics physics physics
level normal normal normal normal normal
exam_type forwarded forwarded forwarded forwarded forwarded
language hungarian hungarian hungarian hungarian hungarian
grade 1 1 5 3 5
percentage 26 19 97 40 84
points 40 29 146 61 126
attendance absent absent present present present
questions NaN NaN 38 14 24
essay_content NaN NaN NaN NaN NaN
essay_style NaN NaN NaN NaN NaN
complex_exercises NaN NaN NaN NaN NaN
points_written NaN NaN 86 23 66
measurement_content NaN NaN NaN NaN NaN
measurement_style NaN NaN NaN NaN NaN
points_oral NaN NaN 60 38 60
year 2013 2013 2013 2013 2013
level_abbrev N N N N N
term spring spring spring spring spring
topic_content NaN NaN 55 36 55
topic_style NaN NaN 5 2 5
complex_exercises2 NaN NaN 48 9 42

Groupby

First, we make a smaller table to get the point more easily.

In [15]:
subdf=df[['level','school_type','points_written']]
subdf.head()
Out[15]:
level school_type points_written
0 normal comprehensive NaN
1 normal comprehensive NaN
2 normal comprehensive 86.0
3 normal comprehensive 23.0
4 normal comprehensive 66.0

We can group the values in a DataFrame according to the values in a column, and then we can use operations on these groups.

In [16]:
g=subdf.groupby('level')
g
Out[16]:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f30353b5a20>

Let us now take the maximum of the groups!

In [17]:
g.max()
Out[17]:
school_type points_written
level
advanced vocational 100.0
normal vocational 90.0

Or the mean! (Observe that the mean is only computed from the numerical columns.)

In [18]:
g.mean()
Out[18]:
points_written
level
advanced 63.410343
normal 52.651527

We can also select one column.

In [19]:
g['points_written'].mean()
Out[19]:
level
advanced    63.410343
normal      52.651527
Name: points_written, dtype: float64

If we select it between two square brackets, we get a DataFrame object, not a pandas.Series, and the printed output is nicer.

In [20]:
g[['points_written']].mean()
Out[20]:
points_written
level
advanced 63.410343
normal 52.651527

We can also group the data based on multiple columns. Then, we have to give a list to groupby.

In [21]:
g=subdf.groupby(['level','school_type'])
g.mean()
Out[21]:
points_written
level school_type
advanced - 51.651575
comprehensive 65.804943
vocational 53.922330
normal - 51.004021
comprehensive 58.023469
vocational 40.857205

We can also have the grouped columns left as normal columns instead of an index.

In [22]:
g=subdf.groupby(['level','school_type']
           ,as_index=False)
g.mean()
Out[22]:
level school_type points_written
0 advanced - 51.651575
1 advanced comprehensive 65.804943
2 advanced vocational 53.922330
3 normal - 51.004021
4 normal comprehensive 58.023469
5 normal vocational 40.857205

If we have a hierarchically indexed table, we can rearrange it by using unstack, if needed.

In [23]:
g=subdf.groupby(['level','school_type'])
g.mean().unstack(level=0)
Out[23]:
points_written
level advanced normal
school_type
- 51.651575 51.004021
comprehensive 65.804943 58.023469
vocational 53.922330 40.857205
In [24]:
g.mean().unstack(level=1)
Out[24]:
points_written
school_type - comprehensive vocational
level
advanced 51.651575 65.804943 53.922330
normal 51.004021 58.023469 40.857205

In the following, we ask some sample questions and answer them.

  • What was the average of the grades of the advanced level examinations in each year?

First, we select the rows corresponding to advances level examinations.

In [25]:
advanced=df[df["level_abbrev"]=="A"]

Then, we group these rows according to years.

In [26]:
group=advanced.groupby("year")

We select the grade column, of which we compute the average. Because of the grouping, the average is going to be created according to years.

In [27]:
group[["grade"]].mean()
Out[27]:
grade
year
2011 4.125000
2012 4.429338
2013 4.232607
2014 4.404090
2015 4.432165

All in one.

In [29]:
df[df["level_abbrev"]=="A"].groupby("year")[["grade"]].mean()
Out[29]:
grade
year
2011 4.125000
2012 4.429338
2013 4.232607
2014 4.404090
2015 4.432165
  • Who did better in the 2015 normal level examinations? Boys or girls?

First, we select the rows corresponding to normal level 2015 data. We have to concatenate the two conditions by using the & operator, that can make the and operation elementwise on lists. We have to put brackets around the conditions to make the interpreter read them correctly.

After that, we group by the gender of the students, and then we take the average of the overall points.

In [32]:
n_2015=df[(df["level_abbrev"]=="N") & (df["year"]==2015)]
n_2015.groupby("gender")[["points"]].mean()
Out[32]:
points
gender
female 97.180775
male 100.270135
  • Let's count the number of attending/missing participants according to each type of schools!

Here, we grouped using two columns at once, therefore, we gave the column names as a list to groupby. After that, we take an arbitrary column such as year, and we count the nonzero rows by using count().

In [33]:
df.groupby(["school_type",
            "attendance"])[["year"]].count()
Out[33]:
year
school_type attendance
- absent 490
present 1260
comprehensive absent 551
present 14448
vocational absent 134
present 5007