%pylab inline
import pandas as pd
Reading the examination data again:
df=pd.read_csv("data/examination.csv.gz",sep=";")
We create a smaller table that is easier to rewad for demonstration purposes.
subdf=df[['city','school_type',
'gender','level','points_written']].head(2)
We can concatenate two DataFrame
s, if we give a list containing DataFrame
s with the same number of columns to the pd.concat()
function.
In our example, we write the same DataFrame
two times under each other.
cdf=pd.concat([subdf,subdf])
cdf
The indices of the newly created table sill come form the original DataFrame
s.
cdf.loc[0]
We can make a new index, while also keeping the old ones.
cdf=pd.concat([subdf,subdf]).reset_index()
cdf
Or we could throw them away.
cdf=pd.concat([subdf,subdf]).reset_index(drop=True)
cdf
We can concatenate tables along the other axis as well.
df1=pd.DataFrame(np.ones((2,2)))
df2=pd.DataFrame(2*np.ones((2,2)))
pd.concat([df1,df2],axis=1)
Then, concat
uses the indices to copy the DataFrame
s next to each other.
df1=pd.DataFrame(np.ones((2,2)))
df2=pd.DataFrame(2*np.ones((2,2)),index=[3,4])
pd.concat([df1,df2],axis=1)
If we have a common identifier or columns with common values in two tables, it is possible to merge them. For example, we can take two columns from both tables, and we put those rows next to each other, where the value of these two columns is the same.
We omit those lines that do not have a pair in the other table.
We read a data series about cities downloaded from KSH.
h=pd.read_csv('data/settlements.csv')
h.head()
Let's plot the points achieved as a function of the city population.
We merge the two tables by assigning the row from the KSH data to each examination based on the name of the city.
mdf=df.merge(h,how='inner',
left_on='city',
right_on='Name')
mdf.head()
We exclude Budapest, beacuse it is too big.
mdf=mdf[mdf['city']!='Budapest']
We load seaborn for this plot.
import seaborn as sns
In the written examinations, bigger cities perform significantly better.
x,y='points_written','Population'
sns.jointplot(x=x, y=y, data=mdf,kind="kde", color="k")
But in the oral examination part, smaller places do better...
import seaborn as sns
x,y='points_oral','Population'
sns.jointplot(x=x, y=y, data=mdf,kind="kde", color="k",xlim=(0,60))
We can also merge two tables by using all of the rows from one table, and where the second table has no corresponding rows, we still leave the original row in the resulting table.
For example, we can look for cities where there has been no Physics exam in 2014 (or where it is not in our data).
First, we look for cities where there was an examination.
df14=df[df['year']==2014]
fizcit=df14[['city']].drop_duplicates().reset_index(drop=True)
fizcit.head()
Next, we look for all possible cities.
# város means city
allcit=h[h['Type']=='város'][
['Name']].reset_index(drop=True)
allcit.head()
Then, we merge them such that we keep all existing cities.
lj=allcit.merge(fizcit,how='left',right_on='city',
left_on='Name')
lj
We look for cities, where the value of the examination table is lacking.
lj[pd.isnull(lj['city'])]
We can only join on concrete values, we cannot use any transformation functions such as 'give me the rows where the absolute value of the x column in table A minus the absolute value of the y column of table B is less than 5'. Though this operation would be trivial in SQL. If we need such a thing, we have to turn to the package sqlite
.
If we have to took whether a character string is in the elements of a string columns:
df[df["city"].str.contains("Buda")].tail()['city']
If we want to use an arbitrary function to each and every element of a column, we can do it by using apply
. We write the function that we wany to use into the brackets after apply.
First we define a function that adds 1 to an arbitrary number:
def hozzaad(x):
return x+1
Then, we build houses.
h['Number of living units'].head()
h['Number of living units'].apply(hozzaad).head()
We can use apply
rowwise if we use the axis=1
keyword. For example, let us write a normalization function by hand, that divides a row by the sum of its elements. This function will be very slow!
def norm(sor):
return sor/sor.sum()
df[['points_written','points_oral']].apply(norm,axis=1).head()
We can easily rearrange our table even when we had only column-wise values. Now we want to have a look at the median of the points of the written examination for each county in each year.
df.pivot_table(values='points_written',
columns='year',
index='county',
aggfunc=np.median)