import pandas as pd # usual imports
Next, we are going to work with the last years' "érettségi" results in physics. This is going to raise many issues related to the analysis of databases/tables. For example, how to manage missing or incompatible data. The website containing the data provides a simple csv textfile, where the column separator is the ;
sign.
df=pd.read_csv("data/examination.csv.gz",sep=";")
df.head()
The year
and level
columns give the year and the level of the examinations. We can also determine whether the student had the examination in the fall or in the spring term (term
), and there is information on the school and the type of education of the student. Moreover, the detailed written and oral part points, the overall points and the overall percentage achieved is also there in the data.
It's worth printing the first few lines out to have a look at the data. We write it transposed, because it fits on the screen this way.
df.head().T
First, we make a smaller table to get the point more easily.
subdf=df[['level','school_type','points_written']]
subdf.head()
We can group the values in a DataFrame
according to the values in a column, and then we can use operations on these groups.
g=subdf.groupby('level')
g
Let us now take the maximum of the groups!
g.max()
Or the mean! (Observe that the mean is only computed from the numerical columns.)
g.mean()
We can also select one column.
g['points_written'].mean()
If we select it between two square brackets, we get a DataFrame
object, not a pandas.Series
, and the printed output is nicer.
g[['points_written']].mean()
We can also group the data based on multiple columns. Then, we have to give a list to groupby
.
g=subdf.groupby(['level','school_type'])
g.mean()
We can also have the grouped columns left as normal columns instead of an index.
g=subdf.groupby(['level','school_type']
,as_index=False)
g.mean()
If we have a hierarchically indexed table, we can rearrange it by using unstack
, if needed.
g=subdf.groupby(['level','school_type'])
g.mean().unstack(level=0)
g.mean().unstack(level=1)
First, we select the rows corresponding to advances level examinations.
advanced=df[df["level_abbrev"]=="A"]
Then, we group these rows according to years.
group=advanced.groupby("year")
We select the grade
column, of which we compute the average. Because of the grouping, the average is going to be created according to years.
group[["grade"]].mean()
All in one.
df[df["level_abbrev"]=="A"].groupby("year")[["grade"]].mean()
First, we select the rows corresponding to normal level 2015 data. We have to concatenate the two conditions by using the &
operator, that can make the and
operation elementwise on lists. We have to put brackets around the conditions to make the interpreter read them correctly.
After that, we group by the gender of the students, and then we take the average of the overall points.
n_2015=df[(df["level_abbrev"]=="N") & (df["year"]==2015)]
n_2015.groupby("gender")[["points"]].mean()
Here, we grouped using two columns at once, therefore, we gave the column names as a list to groupby
. After that, we take an arbitrary column such as year, and we count the nonzero rows by using count()
.
df.groupby(["school_type",
"attendance"])[["year"]].count()