Database operations 2. (concat, join, apply, pivot)

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import pandas as pd

Reading the examination data again:

In [4]:
df=pd.read_csv("data/examination.csv.gz",sep=";")

We create a smaller table that is easier to read for demonstration purposes.

In [5]:
subdf=df[['city','school_type',
          'gender','level','points_written']].head(2)

Concatenation

We can concatenate two DataFrames by passing a list of DataFrames to the pd.concat() function; the columns are aligned by name, so the frames should share the same columns.

In our example, we write the same DataFrame two times under each other.

In [6]:
cdf=pd.concat([subdf,subdf])
cdf
Out[6]:
city school_type gender level points_written
0 Budapest comprehensive male normal NaN
1 Budapest comprehensive female normal NaN
0 Budapest comprehensive male normal NaN
1 Budapest comprehensive female normal NaN

The indices of the newly created table still come from the original DataFrames.

In [7]:
cdf.loc[0]
Out[7]:
city school_type gender level points_written
0 Budapest comprehensive male normal NaN
0 Budapest comprehensive male normal NaN

We can create a new index while also keeping the old one.

In [8]:
cdf=pd.concat([subdf,subdf]).reset_index()
cdf
Out[8]:
index city school_type gender level points_written
0 0 Budapest comprehensive male normal NaN
1 1 Budapest comprehensive female normal NaN
2 0 Budapest comprehensive male normal NaN
3 1 Budapest comprehensive female normal NaN

Or we could throw them away.

In [9]:
cdf=pd.concat([subdf,subdf]).reset_index(drop=True)
cdf
Out[9]:
city school_type gender level points_written
0 Budapest comprehensive male normal NaN
1 Budapest comprehensive female normal NaN
2 Budapest comprehensive male normal NaN
3 Budapest comprehensive female normal NaN
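A shorthand achieves the same renumbering in one step: pd.concat accepts an ignore_index=True keyword. A minimal sketch, using a toy stand-in for subdf:

```python
import pandas as pd

# a tiny stand-in for subdf from above (two of its columns only)
subdf = pd.DataFrame({'city': ['Budapest', 'Budapest'],
                      'gender': ['male', 'female']})

# ignore_index=True renumbers the rows 0..n-1 during concatenation,
# equivalent to calling reset_index(drop=True) afterwards
cdf = pd.concat([subdf, subdf], ignore_index=True)
```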

We can concatenate tables along the other axis as well.

In [10]:
df1=pd.DataFrame(np.ones((2,2)))
df2=pd.DataFrame(2*np.ones((2,2)))

pd.concat([df1,df2],axis=1)
Out[10]:
0 1 0 1
0 1.0 1.0 2.0 2.0
1 1.0 1.0 2.0 2.0

When concatenating along axis=1, concat aligns the rows by index; labels missing from one of the frames are filled with NaN.

In [11]:
df1=pd.DataFrame(np.ones((2,2)))
df2=pd.DataFrame(2*np.ones((2,2)),index=[3,4])

pd.concat([df1,df2],axis=1)
Out[11]:
0 1 0 1
0 1.0 1.0 NaN NaN
1 1.0 1.0 NaN NaN
3 NaN NaN 2.0 2.0
4 NaN NaN 2.0 2.0
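If we want to avoid those NaN rows, concat also takes a join keyword: join='inner' keeps only the index labels present in every frame. A sketch with the same kind of toy frames as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)))                    # index 0, 1
df2 = pd.DataFrame(2 * np.ones((2, 2)), index=[1, 2])  # index 1, 2

# join='inner' keeps only index labels present in both frames,
# so no NaN-padded rows appear in the result
inner = pd.concat([df1, df2], axis=1, join='inner')
```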

Joins

If the two tables share a common identifier, or columns with common values, we can merge them: we pick a key column from each table and place those rows next to each other where the key values agree.
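Before turning to the real data, a minimal sketch of how merge pairs rows on a key (the cities and numbers here are made up for illustration):

```python
import pandas as pd

# toy tables with illustrative values only
left = pd.DataFrame({'city': ['Pécs', 'Baja', 'Zirc'],
                     'points': [55, 60, 48]})
right = pd.DataFrame({'city': ['Baja', 'Zirc', 'Vác'],
                      'population': [36000, 7000, 33000]})

# rows are paired where the 'city' key agrees; unmatched rows are dropped
m = left.merge(right, on='city', how='inner')
```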

Inner Join

We omit those rows that do not have a matching row in the other table.

We read a dataset about settlements downloaded from KSH (the Hungarian Central Statistical Office).

In [12]:
h=pd.read_csv('data/settlements.csv')
h.head()
Out[12]:
Name KSH code Type County LAU1 code LAU1 name LAU1 center Common municipality code Common municipality center Area (ha) Population Number of living units
0 Aba 17376 város Fejér 850.0 Székesfehérvári Székesfehérvár 1 8805 4424 1546
1 Abádszalók 12441 város Jász-Nagykun-Szolnok 1640.0 Kunhegyesi Kunhegyes 2 Helyben – Locally 13223 4160 2038
2 Abaliget 12548 község Baranya 280.0 Pécsi Pécs 5 Orfű 1609 649 241
3 Abasár 24554 község Heves 1060.0 Gyöngyösi Gyöngyös 1 2082 2493 1188
4 Abaújalpár 15662 község Borsod-Abaúj-Zemplén 570.0 Gönci Gönc 6 Boldogkőváralja 848 59 58

Let's plot the points achieved as a function of the city population.

We merge the two tables: each examination row gets the matching row from the KSH data, joined on the name of the city.

In [13]:
mdf=df.merge(h,how='inner',
             left_on='city',
             right_on='Name')
mdf.head()
Out[13]:
county city yeargroup school_type gender education_type subject level exam_type language ... Type County LAU1 code LAU1 name LAU1 center Common municipality code Common municipality center Area (ha) Population Number of living units
0 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502
1 Budapest Budapest 12.0 comprehensive female day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502
2 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502
3 Budapest Budapest 13.0 comprehensive female day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502
4 Budapest Budapest 12.0 comprehensive male day physics normal forwarded hungarian ... főváros Pest NaN NaN NaN 1 52513 1759407 911502

5 rows × 40 columns

We exclude Budapest, because it is too big.

In [14]:
mdf=mdf[mdf['city']!='Budapest']

We load seaborn for this plot.

In [15]:
import seaborn as sns

In the written examinations, bigger cities perform significantly better.

In [16]:
x,y='points_written','Population'
sns.jointplot(x=x, y=y, data=mdf,kind="kde", color="k")
Out[16]:
<seaborn.axisgrid.JointGrid at 0x7ff31d67c438>

But in the oral examination part, smaller places do better...

In [17]:
import seaborn as sns
x,y='points_oral','Population'
sns.jointplot(x=x, y=y, data=mdf,kind="kde", color="k",xlim=(0,60))
Out[17]:
<seaborn.axisgrid.JointGrid at 0x7ff31d67c160>

Left join

We can also merge two tables keeping all rows from one of them: where the second table has no matching row, the original row still appears in the result, with the missing values filled in as NaN.

For example, we can look for cities where there was no physics exam in 2014 (or where it is not in our data).

First, we look for cities where there was an examination.

In [19]:
df14=df[df['year']==2014]
fizcit=df14[['city']].drop_duplicates().reset_index(drop=True)
fizcit.head()
Out[19]:
city
0 Budapest
1 Pécs
2 Kiskunhalas
3 Kecskemét
4 Kalocsa

Next, we look for all possible cities.

In [21]:
# város means city
allcit=h[h['Type']=='város'][
    ['Name']].reset_index(drop=True)
allcit.head()
Out[21]:
Name
0 Aba
1 Abádszalók
2 Abaújszántó
3 Abony
4 Ács

Then, we merge them such that we keep all existing cities.

In [22]:
lj=allcit.merge(fizcit,how='left',right_on='city',
                left_on='Name')
lj
Out[22]:
Name city
0 Aba NaN
1 Abádszalók NaN
2 Abaújszántó NaN
3 Abony NaN
4 Ács NaN
5 Adony NaN
6 Ajak NaN
7 Ajka Ajka
8 Albertirsa NaN
9 Alsózsolca NaN
10 Aszód Aszód
11 Bábolna NaN
12 Bácsalmás Bácsalmás
13 Badacsonytomaj NaN
14 Baja Baja
15 Baktalórántháza NaN
16 Balassagyarmat Balassagyarmat
17 Balatonalmádi Balatonalmádi
18 Balatonboglár Balatonboglár
19 Balatonföldvár NaN
20 Balatonfüred Balatonfüred
21 Balatonfűzfő NaN
22 Balatonkenese NaN
23 Balatonlelle NaN
24 Balkány NaN
25 Balmazújváros NaN
26 Barcs Barcs
27 Bátaszék Bátaszék
28 Bátonyterenye NaN
29 Battonya NaN
... ... ...
292 Törökbálint Törökbálint
293 Törökszentmiklós Törökszentmiklós
294 Tura NaN
295 Túrkeve NaN
296 Újfehértó NaN
297 Újhartyán NaN
298 Újkígyós NaN
299 Újszász Újszász
300 Üllő NaN
301 Vác Vác
302 Vaja NaN
303 Vámospércs NaN
304 Várpalota Várpalota
305 Vásárosnamény Vásárosnamény
306 Vasvár NaN
307 Vecsés NaN
308 Velence NaN
309 Vép NaN
310 Veresegyház NaN
311 Verpelét NaN
312 Vésztő NaN
313 Villány NaN
314 Visegrád NaN
315 Záhony NaN
316 Zalakaros NaN
317 Zalalövő NaN
318 Zalaszentgrót Zalaszentgrót
319 Zamárdi NaN
320 Zirc Zirc
321 Zsámbék NaN

322 rows × 2 columns

We look for the cities where the column coming from the examination table is missing (NaN).

In [23]:
lj[pd.isnull(lj['city'])]
Out[23]:
Name city
0 Aba NaN
1 Abádszalók NaN
2 Abaújszántó NaN
3 Abony NaN
4 Ács NaN
5 Adony NaN
6 Ajak NaN
8 Albertirsa NaN
9 Alsózsolca NaN
11 Bábolna NaN
13 Badacsonytomaj NaN
15 Baktalórántháza NaN
19 Balatonföldvár NaN
21 Balatonfűzfő NaN
22 Balatonkenese NaN
23 Balatonlelle NaN
24 Balkány NaN
25 Balmazújváros NaN
28 Bátonyterenye NaN
29 Battonya NaN
31 Bélapátfalva NaN
32 Beled NaN
34 Berhida NaN
35 Besenyszög NaN
36 Biatorbágy NaN
37 Bicske NaN
38 Biharkeresztes NaN
40 Bóly NaN
42 Borsodnádasd NaN
43 Budakalász NaN
... ... ...
271 Szob NaN
274 Tápiószele NaN
276 Tát NaN
278 Téglás NaN
279 Tét NaN
280 Tiszacsege NaN
289 Tompa NaN
291 Tököl NaN
294 Tura NaN
295 Túrkeve NaN
296 Újfehértó NaN
297 Újhartyán NaN
298 Újkígyós NaN
300 Üllő NaN
302 Vaja NaN
303 Vámospércs NaN
306 Vasvár NaN
307 Vecsés NaN
308 Velence NaN
309 Vép NaN
310 Veresegyház NaN
311 Verpelét NaN
312 Vésztő NaN
313 Villány NaN
314 Visegrád NaN
315 Záhony NaN
316 Zalakaros NaN
317 Zalalövő NaN
319 Zamárdi NaN
321 Zsámbék NaN

191 rows × 2 columns
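merge also offers a shortcut for this kind of anti-join: with indicator=True, it adds a _merge column recording where each row came from. A sketch on toy versions of the two tables:

```python
import pandas as pd

# toy stand-ins for allcit and fizcit above
allcit = pd.DataFrame({'Name': ['Aba', 'Ajka', 'Zirc']})
fizcit = pd.DataFrame({'city': ['Ajka', 'Zirc']})

# indicator=True adds a _merge column with values
# 'left_only', 'right_only' or 'both'
lj = allcit.merge(fizcit, how='left', left_on='Name', right_on='city',
                  indicator=True)
missing = lj.loc[lj['_merge'] == 'left_only', 'Name']
```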

Further join possibilities

  • Right outer (keeping all the rows from the right table)
  • Full outer (keeping all the rows from both tables)
  • Cross join, Cartesian product (for a long time only workarounds existed in pandas; versions 1.2 and later also accept how='cross')
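The variants above differ only in the how= keyword; the classic cross-join workaround merges on a temporary constant key. A sketch on toy frames:

```python
import pandas as pd

a = pd.DataFrame({'k': [1, 2], 'x': ['a', 'b']})
b = pd.DataFrame({'k': [2, 3], 'y': ['c', 'd']})

right = a.merge(b, on='k', how='right')  # all rows of b are kept
outer = a.merge(b, on='k', how='outer')  # all rows of both tables

# cross join hack: merge on a temporary constant key, then drop it
cross = (a.assign(_tmp=1)
          .merge(b.assign(_tmp=1), on='_tmp')
          .drop(columns='_tmp'))
```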

Weaknesses

We can only join on exact values; we cannot join on a computed condition such as "give me the rows where the absolute value of the x column in table A minus the absolute value of the y column of table B is less than 5". Such an operation is trivial in SQL, so if we need it, we can turn to the built-in sqlite3 package.
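A sketch of such a non-equi join via the standard-library sqlite3 module: we push both frames into an in-memory database and express the condition in SQL (the frames and numbers here are made up for illustration).

```python
import sqlite3
import pandas as pd

A = pd.DataFrame({'x': [1, -10, 20]})
B = pd.DataFrame({'y': [3, 100]})

con = sqlite3.connect(':memory:')
A.to_sql('A', con, index=False)
B.to_sql('B', con, index=False)

# the non-equi join condition from the text, expressed in SQL
q = """SELECT A.x, B.y FROM A JOIN B
       ON ABS(A.x) - ABS(B.y) < 5"""
res = pd.read_sql_query(q, con)
con.close()
```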

Contains

If we want to check whether a character string occurs in the elements of a string column:

In [24]:
df[df["city"].str.contains("Buda")].tail()['city']
Out[24]:
21602     Budaörs
21631     Budaörs
21632    Budapest
21642     Budaörs
21643     Budaörs
Name: city, dtype: object
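Two keywords of str.contains are worth knowing: case=False for case-insensitive matching, and na=False so that missing values count as "no match" instead of producing NaN, which would break boolean indexing. A sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['Budapest', 'Budaörs', 'Pécs', None])

# case=False ignores letter case; na=False treats missing values as
# "no match", so the mask can safely be used for indexing
mask = s.str.contains('buda', case=False, na=False)
```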

Apply

If we want to apply an arbitrary function to every element of a column, we can do so with apply. We pass the function we want to use in the parentheses after apply.

First we define a function that adds 1 to an arbitrary number:

In [26]:
def hozzaad(x):
    return x+1

Then, we build houses (adding one living unit to every settlement).

In [25]:
h['Number of living units'].head()
Out[25]:
0    1546
1    2038
2     241
3    1188
4      58
Name: Number of living units, dtype: int64
In [27]:
h['Number of living units'].apply(hozzaad).head()
Out[27]:
0    1547
1    2039
2     242
3    1189
4      59
Name: Number of living units, dtype: int64
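For simple arithmetic like this, apply is unnecessary: the same result comes from vectorized arithmetic, which is much faster because it avoids calling a Python function per element. A sketch on a toy Series:

```python
import pandas as pd

units = pd.Series([1546, 2038, 241])

via_apply = units.apply(lambda x: x + 1)  # one Python call per element
vectorized = units + 1                    # a single vectorized operation
```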

We can use apply row-wise with the axis=1 keyword. For example, let us write a normalization function by hand that divides each row by the sum of its elements. This will be very slow!

In [28]:
def norm(sor):
    return sor/sor.sum()
In [29]:
df[['points_written','points_oral']].apply(norm,axis=1).head()
Out[29]:
points_written points_oral
0 NaN NaN
1 NaN NaN
2 0.589041 0.410959
3 0.377049 0.622951
4 0.523810 0.476190
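The same normalization can be done without apply, using div with the vector of row sums, which is far faster on a large table. A sketch on toy points data:

```python
import pandas as pd

# made-up points, for illustration only
points = pd.DataFrame({'points_written': [43.0, 23.0],
                       'points_oral': [30.0, 38.0]})

# divide each row by its sum; axis=0 aligns the sum vector with the rows
normed = points.div(points.sum(axis=1), axis=0)
```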

Pivot

We can easily rearrange our table into a cross-tabulated form from plain columns. Let's look at the median of the written-examination points for each county in each year.

In [30]:
df.pivot_table(values='points_written',
               columns='year',
               index='county',
               aggfunc=np.median)
Out[30]:
year 2011 2012 2013 2014 2015
county
Baranya 83.0 54.0 57.0 55.0 53.0
Borsod-Abaúj-Zemplén 67.5 51.0 58.0 54.0 52.0
Budapest 74.0 59.0 60.0 61.0 61.0
Bács-Kiskun 77.5 49.5 55.0 56.5 54.0
Békés 78.0 42.0 50.0 47.0 50.0
Csongrád 78.0 53.0 59.0 59.0 56.0
Fejér 79.5 47.0 57.0 51.5 48.0
Győr-Moson-Sopron 81.0 52.0 59.0 63.0 56.0
Hajdú-Bihar 68.5 42.0 55.0 54.0 57.0
Heves 43.0 42.0 48.0 48.0 50.0
Jász-Nagykun-Szolnok 40.0 39.0 54.0 52.0 51.0
Komárom-Esztergom 75.0 56.0 60.0 57.0 55.0
Nógrád 67.5 49.0 63.0 55.0 46.0
Pest 67.0 55.0 63.0 60.0 55.0
Somogy 76.0 60.0 56.5 53.5 56.0
Szabolcs-Szatmár-Bereg 53.0 40.0 45.0 43.0 49.0
Tolna 66.0 46.0 51.0 49.0 54.0
Vas 82.5 42.0 60.0 53.5 53.0
Veszprém 72.0 54.5 56.0 58.0 58.0
Zala 84.0 54.0 64.0 59.0 57.0
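A pivot_table like the one above is equivalent to a groupby on both keys followed by unstack. A sketch on toy data with made-up values:

```python
import pandas as pd

# illustrative values only
toy = pd.DataFrame({'county': ['Pest', 'Pest', 'Zala', 'Zala'],
                    'year':   [2014, 2015, 2014, 2015],
                    'points_written': [60.0, 55.0, 59.0, 57.0]})

pv = toy.pivot_table(values='points_written', index='county',
                     columns='year', aggfunc='median')

# the same cross-tabulation built by hand
gb = toy.groupby(['county', 'year'])['points_written'].median().unstack()
```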