Let us begin the notebook with the usual imports.
%pylab inline
import pandas as pd
DataFrame
We can already read in a DataFrame, but what if we wanted to create one ourselves?
It is easiest to create one from columns stored in a dict, where each key holds a list or an array.
df=pd.DataFrame({'random1':random.random(4),
'zeroes':[0 for i in range(4)]})
df
We can also give an index by hand.
df=pd.DataFrame({'random1':random.random(4),
'zeroes':[0 for i in range(4)]},
index=['a','b','c','d'])
df
If we want to create a DataFrame from rows, we can make it from a list of rows. The column names can be given in the columns keyword argument.
sor1=random.random(4)
sor2=[0 for i in range(4)]
df=pd.DataFrame([sor1,sor2],columns=['a','b','c','d'],
index=['random','zeroes'])
df
It is possible to create a DataFrame from a 2D numpy array.
df=pd.DataFrame(random.random((2,4)),columns=['a','b','c','d'],
index=['random','random2'])
df
df=pd.read_csv("data/smallpeople.csv",index_col=0)
The values stored in a pandas.DataFrame are mostly accessed by the column headers and row names.
We have already seen that if we write a column name as a string in square brackets after the variable that stores a DataFrame, we can retrieve the column.
df["Grade_John"]
If we want to retrieve several columns, we have to give them as a list of strings inside the square brackets after the DataFrame.
df[["Grade_John","Gender","Age"]]
We've already seen that we can access a row with the .loc[] construction.
df.loc['Valentine']
We can also ask for multiple rows similarly to multiple columns.
df.loc[['Valentine','Sarah']]
If one wanted to index a DataFrame by numbers, like an array, then it is possible using .iloc[]. Let's redo the previous operations with .iloc[]!
df.iloc[:,0] # the first (0th) column
df.iloc[0,0] # first element of the first column
We can use all the indexing structures from numpy such as slicing.
df.iloc[::-1,3:5]
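To make the correspondence between .loc[] and .iloc[] concrete, here is a minimal sketch. The small table is made up for illustration (it only imitates the file read above, and the name Emma is hypothetical), so that the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the table read from data/smallpeople.csv
people = pd.DataFrame(
    {
        "Grade_John": [4, 5, 3],
        "Gender": ["male", "female", "female"],
        "Age": [21, 19, 20],
    },
    index=["Valentine", "Sarah", "Emma"],
)

by_label = people.loc["Sarah", "Age"]   # row and column selected by name
by_position = people.iloc[1, 2]         # the same cell: second row, third column
last_two_rows = people.iloc[-2:, :2]    # numpy-style slicing on positions
```

Both lookups return the same cell; only the way of addressing it differs.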
We can even convert the contents of a DataFrame to a numpy.array, and use the methods from the previous notebooks on it.
df.values
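One detail worth keeping in mind is the dtype of the resulting array. The sketch below uses a made-up two-column table to show that mixed columns are converted to a generic object array, while a purely numeric selection keeps a numeric dtype. The to_numpy() method is used here as an alternative spelling of .values.

```python
import pandas as pd

# Hypothetical table with one numeric and one text column
people = pd.DataFrame(
    {"Grade_Paul": [4, 5], "Gender": ["male", "female"]},
    index=["Valentine", "Sarah"],
)

mixed = people.values                        # numbers and strings together
numeric = people[["Grade_Paul"]].to_numpy()  # only the numeric column

mixed_dtype = mixed.dtype      # object, because the column types differ
numeric_dtype = numeric.dtype  # an integer dtype
```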
Asking for columns based on their names is much safer than accessing them by indices, because it reduces the chance of using the wrong column.
If we want to add a new row to our table, we have to assign a list with as many elements as there are columns to .loc['index_of_new_row'].
df.loc["David"]=[5,5,"male",20,'12:32']
df
If we create a new column, we use a similar notation, but without the .loc, because that indexes the rows.
df["Advanced"]=[0,0,1,1,0,0]
df
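The list assigned to a new column has to contain exactly one value per row. A minimal sketch with a made-up table (the column name Too_short is hypothetical) shows what happens otherwise:

```python
import pandas as pd

# Hypothetical table with three rows
people = pd.DataFrame(
    {"Age": [21, 19, 20]},
    index=["Valentine", "Sarah", "David"],
)

people["Advanced"] = [0, 1, 0]  # one value per row: works

try:
    people["Too_short"] = [0, 1]  # only two values for three rows
except ValueError as err:
    message = str(err)  # pandas reports the length mismatch
```

The failed assignment leaves the table unchanged, so only the Advanced column is added.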
If we want to delete a row, we can do it with the drop function. Here, we can use the inplace option, which controls whether the function returns a new DataFrame or overwrites the existing one.
df.drop("Valentine",inplace=True)
df
If we want to delete a column, we can do it similarly by using drop, we just have to use another axis. Note that here, without the inplace option, we get a new DataFrame as the return value.
df.drop("Advanced",axis=1)
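The difference between the two modes of drop can be checked directly. This is a small sketch on a made-up table; the columns= keyword is shown as an equivalent way of writing axis=1.

```python
import pandas as pd

# Hypothetical table with two rows and two columns
people = pd.DataFrame(
    {"Age": [21, 19], "Advanced": [0, 1]},
    index=["Valentine", "Sarah"],
)

without_column = people.drop("Advanced", axis=1)  # returns a new DataFrame
same_result = people.drop(columns="Advanced")     # equivalent keyword form

# With inplace=True the original table is modified and nothing is returned
result = people.drop("Valentine", inplace=True)
```

After running this, people has lost the row Valentine but still contains the Advanced column, because the first two calls did not touch it.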
We write out the column names as follows.
df.columns
We write out the row names as follows.
df.index
Sometimes, we might need the above lists as list type objects.
df.columns.tolist()
list(df.columns)
We can apply operations to a whole DataFrame, as long as they make sense for its contents.
sub_df=df[["Grade_Paul","Age"]]
sub_df+1
Columns can be used just like numpy.arrays.
(df['Grade_Paul']+2)/3
df['Grade_Paul']/=2
df
df['Grade_Paul']*df['Age']
String operations work as well.
df['Gender']+' person'
These hold for rows, too.
sub_df.loc["David"]+3
Some built-in functions work for aggregating values in a DataFrame.
This is, for example, the columnwise sum:
df.sum()
What to do if we want to calculate it row by row? We can modify the axis of the aggregation. The previous case was the default axis=0, which does columnwise calculations. We only add up the columns containing grades.
df[["Grade_Paul","Grade_John"]].sum(axis=1)
Let's count how many elements there are in the columns or in the rows.
df.count()
df.count(axis=1)
We could also have asked for the size of the table in an array-like fashion.
df.shape
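The two approaches agree only if the table has no missing values, because count skips them while shape reports the full size. A minimal sketch with a made-up table that contains one missing grade:

```python
import numpy as np
import pandas as pd

# Hypothetical table; Sarah has no grade from Paul
people = pd.DataFrame(
    {"Grade_Paul": [4.0, np.nan, 3.0], "Age": [21, 19, 20]},
    index=["Valentine", "Sarah", "David"],
)

per_column = people.count()        # non-missing values in each column
per_row = people.count(axis=1)     # non-missing values in each row
n_rows, n_cols = people.shape      # full size, missing values included
```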
Further ideas for built-in functions are mean, median, min, max, std.
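These aggregations follow the same axis convention as sum. The sketch below applies a few of them to a made-up table of grades; the agg method is used as one possible way to compute several aggregations at once.

```python
import pandas as pd

# Hypothetical grades from two graders
grades = pd.DataFrame(
    {"Grade_Paul": [4, 2, 3], "Grade_John": [5, 3, 4]},
    index=["Valentine", "Sarah", "David"],
)

column_means = grades.mean()       # one mean per column (axis=0 is the default)
row_max = grades.max(axis=1)       # best grade of each person
column_std = grades.std()          # sample standard deviation per column

# Several aggregations at once; the result has one row per function name
summary = grades.agg(["min", "median", "max"])
```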
It is very common that we only want to look at the rows of our table that fulfill a certain condition. If we give a True/False list inside the square brackets after a DataFrame, the command returns only the rows corresponding to the True values.
First, let's look at what happens if we test whether a column is equal to a value.
df
df["Gender"]=="female"
We see that we got a True/False value for each row. Now we put the above expression into the square brackets.
df[df["Gender"]=="female"]
We can test for other conditions as well, for example, who received a grade better than 2 in the Grade_Paul column.
df[df["Grade_Paul"]>2]
We can also combine conditions. We have to use the & and | operators instead of and and or, because the latter cannot compare two lists element by element. We have to put each condition into parentheses, otherwise we get an error message.
Those people who are older than 19 and who got a grade better than 2 in the Grade_Paul column:
df[(df["Grade_Paul"]>2) & (df["Age"]>19)]
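The following sketch, on a made-up table, stores the two conditions in variables first, which avoids the need for parentheses, and also shows the error raised when and is used instead of &.

```python
import pandas as pd

# Hypothetical table imitating the one used above
people = pd.DataFrame(
    {"Grade_Paul": [4, 5, 2], "Age": [21, 19, 20]},
    index=["Valentine", "Sarah", "David"],
)

older = people["Age"] > 19        # True/False value for each row
good = people["Grade_Paul"] > 2

both = people[older & good]       # rows where both conditions hold
either = people[older | good]     # rows where at least one holds

try:
    people[older and good]        # plain `and` cannot work element by element
except ValueError as err:
    message = str(err)            # "The truth value of a Series is ambiguous..."
```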
We might need to order our table according to one of the columns. We can use .sort_values(by="column_name") for this operation, where we can specify whether we want ascending (ascending=True) or descending (ascending=False) order.
The return value of the function is the ordered table:
df.sort_values(by="Age",ascending=False)
We can also order based on multiple columns:
df.sort_values(by=["Grade_Paul","Age"],ascending=True)
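When sorting by several columns, the ascending argument can also be a list with one entry per column. A minimal sketch on a made-up table, sorting by grade in descending order and breaking ties by age in ascending order:

```python
import pandas as pd

# Hypothetical table; Valentine and David share the same grade
people = pd.DataFrame(
    {"Grade_Paul": [3, 5, 3], "Age": [19, 21, 22]},
    index=["Valentine", "Sarah", "David"],
)

# Best grade first; among equal grades, the youngest first
ordered = people.sort_values(by=["Grade_Paul", "Age"], ascending=[False, True])

ordered_names = ordered.index.tolist()
original_names = people.index.tolist()  # the original table keeps its order
```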
If we want to store the ordered rows in the original DataFrame, we have to add the inplace=True keyword argument, which makes the function overwrite the DataFrame after the ordering.
df.sort_values(by="Age",ascending=False,inplace=True)
Of course, we could have achieved the same with a plain assignment.
df=df.sort_values(by="Age",ascending=False)
If we want to sort by the DataFrame index, the sort_index() function helps. Here, it is again an option to set inplace=True to sort the original DataFrame.
df.sort_index(inplace=True)
df