Database operations 2. (concat, join, apply, pivot)

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import pandas as pd

Reading the examination data again:

In [4]:
df=pd.read_csv("data/examination.csv.gz",sep=";")

We create a smaller table that is easier to read for demonstration purposes.

In [5]:
subdf=df[['city','school_type',
          'gender','level','points_written']].head(2)

Concatenation

We can concatenate two DataFrames by passing a list of DataFrames to the pd.concat() function; the columns are aligned by name, so the frames should share the same columns.

In our example, we write the same DataFrame two times under each other.

In [6]:
cdf=pd.concat([subdf,subdf])
cdf
Out[6]:
city school_type gender level points_written
0 Budapest comprehensive male normal NaN
1 Budapest comprehensive female normal NaN
0 Budapest comprehensive male normal NaN
1 Budapest comprehensive female normal NaN

The indices of the newly created table still come from the original DataFrames.

In [7]:
cdf.loc[0]
Out[7]:
city school_type gender level points_written
0 Budapest comprehensive male normal NaN
0 Budapest comprehensive male normal NaN

We can create a new index while also keeping the old one.

In [8]:
cdf=pd.concat([subdf,subdf]).reset_index()
cdf
Out[8]:
index city school_type gender level points_written
0 0 Budapest comprehensive male normal NaN
1 1 Budapest comprehensive female normal NaN
2 0 Budapest comprehensive male normal NaN
3 1 Budapest comprehensive female normal NaN

Or we could throw them away.

In [9]:
cdf=pd.concat([subdf,subdf]).reset_index(drop=True)
cdf
Out[9]:
city school_type gender level points_written
0 Budapest comprehensive male normal NaN
1 Budapest comprehensive female normal NaN
2 Budapest comprehensive male normal NaN
3 Budapest comprehensive female normal NaN
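A shorthand achieves the same renumbering in one step: pd.concat accepts an ignore_index=True keyword. A minimal sketch, using a toy stand-in for subdf:

```python
import pandas as pd

# a tiny stand-in for subdf from above (two of its columns only)
subdf = pd.DataFrame({'city': ['Budapest', 'Budapest'],
                      'gender': ['male', 'female']})

# ignore_index=True renumbers the rows 0..n-1 during concatenation,
# equivalent to calling reset_index(drop=True) afterwards
cdf = pd.concat([subdf, subdf], ignore_index=True)
```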

We can concatenate tables along the other axis as well.

In [10]:
df1=pd.DataFrame(np.ones((2,2)))
df2=pd.DataFrame(2*np.ones((2,2)))

pd.concat([df1,df2],axis=1)
Out[10]:
0 1 0 1
0 1.0 1.0 2.0 2.0
1 1.0 1.0 2.0 2.0

When concatenating along axis=1, concat aligns the rows by index; labels missing from one of the frames are filled with NaN.

In [11]:
df1=pd.DataFrame(np.ones((2,2)))
df2=pd.DataFrame(2*np.ones((2,2)),index=[3,4])

pd.concat([df1,df2],axis=1)
Out[11]:
0 1 0 1
0 1.0 1.0 NaN NaN
1 1.0 1.0 NaN NaN
3 NaN NaN 2.0 2.0
4 NaN NaN 2.0 2.0
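If we want to avoid those NaN rows, concat also takes a join keyword: join='inner' keeps only the index labels present in every frame. A sketch with the same kind of toy frames as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)))                    # index 0, 1
df2 = pd.DataFrame(2 * np.ones((2, 2)), index=[1, 2])  # index 1, 2

# join='inner' keeps only index labels present in both frames,
# so no NaN-padded rows appear in the result
inner = pd.concat([df1, df2], axis=1, join='inner')
```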

Joins

If the two tables share a common identifier, or columns with common values, we can merge them: we pick a key column from each table and place those rows next to each other where the key values agree.
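Before turning to the real data, a minimal sketch of how merge pairs rows on a key (the cities and numbers here are made up for illustration):

```python
import pandas as pd

# toy tables with illustrative values only
left = pd.DataFrame({'city': ['Pécs', 'Baja', 'Zirc'],
                     'points': [55, 60, 48]})
right = pd.DataFrame({'city': ['Baja', 'Zirc', 'Vác'],
                      'population': [36000, 7000, 33000]})

# rows are paired where the 'city' key agrees; unmatched rows are dropped
m = left.merge(right, on='city', how='inner')
```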

Inner Join

We omit those rows that do not have a matching row in the other table.

We read a dataset about settlements downloaded from KSH (the Hungarian Central Statistical Office).

In [12]:
h=pd.read_csv('data/settlements.csv')
h.head()
Out[12]:
Name KSH code Type County LAU1 code LAU1 name LAU1 center Common municipality code Common municipality center Area (ha) Population Number of living units
0 Aba 17376 város Fejér 850.0 Székesfehérvári Székesfehérvár 1 8805 4424 1546
1 Abádszalók 12441 város Jász-Nagykun-Szolnok 1640.0 Kunhegyesi Kunhegyes 2 Helyben – Locally 13223 4160 2038
2 Abaliget 12548 község Baranya 280.0 Pécsi Pécs 5 Orfű 1609 649 241
3 Abasár 24554 község Heves 1060.0 Gyöngyösi Gyöngyös 1 2082 2493 1188
4 Abaújalpár 15662 község Borsod-Abaúj-Zemplén 570.0 Gönci Gönc 6 Boldogkőváralja 848 59 58

Let's plot the points achieved as a function of the city population.

We merge the two tables: each examination row gets the matching row from the KSH data, joined on the name of the city.

In [13]:
mdf=df.merge(h,how='inner',
             left_on='city',
             right_on='Name')
mdf.head()
Out[13]:
county city yeargroup school_type gender education_type subject level exam_type language ... Type County LAU1 code LAU1 name LAU1 center Common municipality code Common municipality center Area (ha) Population Number of living units
0 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502
1 Budapest Budapest 12.0 comprehensive female day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502
2 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502
3 Budapest Budapest 13.0 comprehensive female day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502
4 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502

5 rows × 40 columns

We exclude Budapest, because it is too big.

In [14]:
mdf=mdf[mdf['city']!='Budapest']

We load seaborn for this plot.

In [15]:
import seaborn as sns

In the written examinations, bigger cities perform significantly better.

In [16]:
x,y='points_written','Population'
sns.jointplot(x=x, y=y, data=mdf,kind="kde", color="k")
Out[16]:
<seaborn.axisgrid.JointGrid at 0x7ff31d67c438>

But in the oral examination part, smaller places do better...

In [17]:
import seaborn as sns
x,y='points_oral','Population'
sns.jointplot(x=x, y=y, data=mdf,kind="kde", color="k",xlim=(0,60))
Out[17]:
<seaborn.axisgrid.JointGrid at 0x7ff31d67c160>

Left join

We can also merge two tables keeping all rows from one of them: where the second table has no matching row, the original row still appears in the result, with the missing values filled in as NaN.

For example, we can look for cities where there was no physics exam in 2014 (or where it is not in our data).

First, we look for cities where there was an examination.

In [19]:
df14=df[df['year']==2014]
fizcit=df14[['city']].drop_duplicates().reset_index(drop=True)
fizcit.head()
Out[19]:
city
0 Budapest
1 Pécs
2 Kiskunhalas
3 Kecskemét
4 Kalocsa

Next, we look for all possible cities.

In [21]:
# város means city
allcit=h[h['Type']=='város'][
    ['Name']].reset_index(drop=True)
allcit.head()
Out[21]:
Name
0 Aba
1 Abádszalók
2 Abaújszántó
3 Abony
4 Ács

Then, we merge them such that we keep all existing cities.

In [22]:
lj=allcit.merge(fizcit,how='left',right_on='city',
                left_on='Name')
lj
Out[22]:
Name city
0 Aba NaN
1 Abádszalók NaN
2 Abaújszántó NaN
3 Abony NaN
4 Ács NaN
5 Adony NaN
6 Ajak NaN
7 Ajka Ajka
8 Albertirsa NaN
9 Alsózsolca NaN
10 Aszód Aszód
11 Bábolna NaN
12 Bácsalmás Bácsalmás
13 Badacsonytomaj NaN
14 Baja Baja
15 Baktalórántháza NaN
16 Balassagyarmat Balassagyarmat
17 Balatonalmádi Balatonalmádi
18 Balatonboglár Balatonboglár
19 Balatonföldvár NaN
20 Balatonfüred Balatonfüred
21 Balatonfűzfő NaN
22 Balatonkenese NaN
23 Balatonlelle NaN
24 Balkány NaN
25 Balmazújváros NaN
26 Barcs Barcs
27 Bátaszék Bátaszék
28 Bátonyterenye NaN
29 Battonya NaN
... ... ...
292 Törökbálint Törökbálint
293 Törökszentmiklós Törökszentmiklós
294 Tura NaN
295 Túrkeve NaN
296 Újfehértó NaN
297 Újhartyán NaN
298 Újkígyós NaN
299 Újszász Újszász
300 Üllő NaN
301 Vác Vác
302 Vaja NaN
303 Vámospércs NaN
304 Várpalota Várpalota
305 Vásárosnamény Vásárosnamény
306 Vasvár NaN
307 Vecsés NaN
308 Velence NaN
309 Vép NaN
310 Veresegyház NaN
311 Verpelét NaN
312 Vésztő NaN
313 Villány NaN
314 Visegrád NaN
315 Záhony NaN
316 Zalakaros NaN
317 Zalalövő NaN
318 Zalaszentgrót Zalaszentgrót
319 Zamárdi NaN
320 Zirc Zirc
321 Zsámbék NaN

322 rows × 2 columns

We look for the cities where the column coming from the examination table is missing (NaN).

In [23]:
lj[pd.isnull(lj['city'])]
Out[23]:
Name city
0 Aba NaN
1 Abádszalók NaN
2 Abaújszántó NaN
3 Abony NaN
4 Ács NaN
5 Adony NaN
6 Ajak NaN
8 Albertirsa NaN
9 Alsózsolca NaN
11 Bábolna NaN
13 Badacsonytomaj NaN
15 Baktalórántháza NaN
19 Balatonföldvár NaN
21 Balatonfűzfő NaN
22 Balatonkenese NaN
23 Balatonlelle NaN
24 Balkány NaN
25 Balmazújváros NaN
28 Bátonyterenye NaN
29 Battonya NaN
31 Bélapátfalva NaN
32 Beled NaN
34 Berhida NaN
35 Besenyszög NaN
36 Biatorbágy NaN
37 Bicske NaN
38 Biharkeresztes NaN
40 Bóly NaN
42 Borsodnádasd NaN
43 Budakalász NaN
... ... ...
271 Szob NaN
274 Tápiószele NaN
276 Tát NaN
278 Téglás NaN
279 Tét NaN
280 Tiszacsege NaN
289 Tompa NaN
291 Tököl NaN
294 Tura NaN
295 Túrkeve NaN
296 Újfehértó NaN
297 Újhartyán NaN
298 Újkígyós NaN
300 Üllő NaN
302 Vaja NaN
303 Vámospércs NaN
306 Vasvár NaN
307 Vecsés NaN
308 Velence NaN
309 Vép NaN
310 Veresegyház NaN
311 Verpelét NaN
312 Vésztő NaN
313 Villány NaN
314 Visegrád NaN
315 Záhony NaN
316 Zalakaros NaN
317 Zalalövő NaN
319 Zamárdi NaN
321 Zsámbék NaN

191 rows × 2 columns
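merge also offers a shortcut for this kind of anti-join: with indicator=True, it adds a _merge column recording where each row came from. A sketch on toy versions of the two tables:

```python
import pandas as pd

# toy stand-ins for allcit and fizcit above
allcit = pd.DataFrame({'Name': ['Aba', 'Ajka', 'Zirc']})
fizcit = pd.DataFrame({'city': ['Ajka', 'Zirc']})

# indicator=True adds a _merge column with values
# 'left_only', 'right_only' or 'both'
lj = allcit.merge(fizcit, how='left', left_on='Name', right_on='city',
                  indicator=True)
missing = lj.loc[lj['_merge'] == 'left_only', 'Name']
```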

Further join possibilities

  • Right outer (keeping all the rows from the right table)
  • Full outer (keeping all the rows from both tables)
  • Cross join, Cartesian product (for a long time only workarounds existed in pandas; versions 1.2 and later also accept how='cross')
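The variants above differ only in the how= keyword; the classic cross-join workaround merges on a temporary constant key. A sketch on toy frames:

```python
import pandas as pd

a = pd.DataFrame({'k': [1, 2], 'x': ['a', 'b']})
b = pd.DataFrame({'k': [2, 3], 'y': ['c', 'd']})

right = a.merge(b, on='k', how='right')  # all rows of b are kept
outer = a.merge(b, on='k', how='outer')  # all rows of both tables

# cross join hack: merge on a temporary constant key, then drop it
cross = (a.assign(_tmp=1)
          .merge(b.assign(_tmp=1), on='_tmp')
          .drop(columns='_tmp'))
```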

Weaknesses

We can only join on exact values; we cannot join on a computed condition such as "give me the rows where the absolute value of the x column in table A minus the absolute value of the y column of table B is less than 5". Such an operation is trivial in SQL, so if we need it, we can turn to the built-in sqlite3 package.
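A sketch of such a non-equi join via the standard-library sqlite3 module: we push both frames into an in-memory database and express the condition in SQL (the frames and numbers here are made up for illustration).

```python
import sqlite3
import pandas as pd

A = pd.DataFrame({'x': [1, -10, 20]})
B = pd.DataFrame({'y': [3, 100]})

con = sqlite3.connect(':memory:')
A.to_sql('A', con, index=False)
B.to_sql('B', con, index=False)

# the non-equi join condition from the text, expressed in SQL
q = """SELECT A.x, B.y FROM A JOIN B
       ON ABS(A.x) - ABS(B.y) < 5"""
res = pd.read_sql_query(q, con)
con.close()
```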

Contains

If we want to check whether a character string occurs in the elements of a string column:

In [24]:
df[df["city"].str.contains("Buda")].tail()['city']
Out[24]:
21602     Budaörs
21631     Budaörs
21632    Budapest
21642     Budaörs
21643     Budaörs
Name: city, dtype: object
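Two keywords of str.contains are worth knowing: case=False for case-insensitive matching, and na=False so that missing values count as "no match" instead of producing NaN, which would break boolean indexing. A sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['Budapest', 'Budaörs', 'Pécs', None])

# case=False ignores letter case; na=False treats missing values as
# "no match", so the mask can safely be used for indexing
mask = s.str.contains('buda', case=False, na=False)
```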

Apply

If we want to apply an arbitrary function to every element of a column, we can do so with apply. We pass the function we want to use in the parentheses after apply.

First we define a function that adds 1 to an arbitrary number:

In [26]:
def hozzaad(x):
    return x+1

Then, we build houses (adding one living unit to every settlement).

In [25]:
h['Number of living units'].head()
Out[25]:
0    1546
1    2038
2     241
3    1188
4      58
Name: Number of living units, dtype: int64
In [27]:
h['Number of living units'].apply(hozzaad).head()
Out[27]:
0    1547
1    2039
2     242
3    1189
4      59
Name: Number of living units, dtype: int64
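For simple arithmetic like this, apply is unnecessary: the same result comes from vectorized arithmetic, which is much faster because it avoids calling a Python function per element. A sketch on a toy Series:

```python
import pandas as pd

units = pd.Series([1546, 2038, 241])

via_apply = units.apply(lambda x: x + 1)  # one Python call per element
vectorized = units + 1                    # a single vectorized operation
```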

We can use apply row-wise with the axis=1 keyword. For example, let us write a normalization function by hand that divides each row by the sum of its elements. This will be very slow!

In [28]:
def norm(sor):
    return sor/sor.sum()
In [29]:
df[['points_written','points_oral']].apply(norm,axis=1).head()
Out[29]:
points_written points_oral
0 NaN NaN
1 NaN NaN
2 0.589041 0.410959
3 0.377049 0.622951
4 0.523810 0.476190
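The same normalization can be done without apply, using div with the vector of row sums, which is far faster on a large table. A sketch on toy points data:

```python
import pandas as pd

# made-up points, for illustration only
points = pd.DataFrame({'points_written': [43.0, 23.0],
                       'points_oral': [30.0, 38.0]})

# divide each row by its sum; axis=0 aligns the sum vector with the rows
normed = points.div(points.sum(axis=1), axis=0)
```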

Pivot

We can easily rearrange our table into a cross-tabulated form from plain columns. Let's look at the median of the written-examination points for each county in each year.

In [30]:
df.pivot_table(values='points_written',
               columns='year',
               index='county',
               aggfunc=np.median)
Out[30]:
year 2011 2012 2013 2014 2015
county
Baranya 83.0 54.0 57.0 55.0 53.0
Borsod-Abaúj-Zemplén 67.5 51.0 58.0 54.0 52.0
Budapest 74.0 59.0 60.0 61.0 61.0
Bács-Kiskun 77.5 49.5 55.0 56.5 54.0
Békés 78.0 42.0 50.0 47.0 50.0
Csongrád 78.0 53.0 59.0 59.0 56.0
Fejér 79.5 47.0 57.0 51.5 48.0
Győr-Moson-Sopron 81.0 52.0 59.0 63.0 56.0
Hajdú-Bihar 68.5 42.0 55.0 54.0 57.0
Heves 43.0 42.0 48.0 48.0 50.0
Jász-Nagykun-Szolnok 40.0 39.0 54.0 52.0 51.0
Komárom-Esztergom 75.0 56.0 60.0 57.0 55.0
Nógrád 67.5 49.0 63.0 55.0 46.0
Pest 67.0 55.0 63.0 60.0 55.0
Somogy 76.0 60.0 56.5 53.5 56.0
Szabolcs-Szatmár-Bereg 53.0 40.0 45.0 43.0 49.0
Tolna 66.0 46.0 51.0 49.0 54.0
Vas 82.5 42.0 60.0 53.5 53.0
Veszprém 72.0 54.5 56.0 58.0 58.0
Zala 84.0 54.0 64.0 59.0 57.0
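A pivot_table like the one above is equivalent to a groupby on both keys followed by unstack. A sketch on toy data with made-up values:

```python
import pandas as pd

# illustrative values only
toy = pd.DataFrame({'county': ['Pest', 'Pest', 'Zala', 'Zala'],
                    'year':   [2014, 2015, 2014, 2015],
                    'points_written': [60.0, 55.0, 59.0, 57.0]})

pv = toy.pivot_table(values='points_written', index='county',
                     columns='year', aggfunc='median')

# the same cross-tabulation built by hand
gb = toy.groupby(['county', 'year'])['points_written'].median().unstack()
```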