Simple operations, masking

Let us begin the notebook with the usual imports.

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import pandas as pd

Creating a DataFrame

We already know how to read in a DataFrame, but what if we want to create one ourselves?

It is easiest to create one from columns stored in a dict, where each key maps to a list or an array.

In [3]:
df=pd.DataFrame({'random1':random.random(4),
                 'zeroes':[0 for i in range(4)]})
df
Out[3]:
random1 zeroes
0 0.413749 0
1 0.102535 0
2 0.284745 0
3 0.620163 0

We can also give an index by hand.

In [4]:
df=pd.DataFrame({'random1':random.random(4),
                 'zeroes':[0 for i in range(4)]},
                 index=['a','b','c','d'])
df
Out[4]:
random1 zeroes
a 0.422387 0
b 0.030502 0
c 0.221540 0
d 0.551693 0

If we want to build a DataFrame from rows instead, we can pass a list of rows. The column names can be given in the columns keyword argument.

In [5]:
sor1=random.random(4) 
sor2=[0 for i in range(4)]

df=pd.DataFrame([sor1,sor2],columns=['a','b','c','d'],
                index=['random','zeroes'])
df
Out[5]:
a b c d
random 0.638204 0.767368 0.167672 0.83644
zeroes 0.000000 0.000000 0.000000 0.00000

It is possible to create a DataFrame from a 2D numpy array.

In [6]:
df=pd.DataFrame(random.random((2,4)),columns=['a','b','c','d'],
               index=['random','random2'])
df
Out[6]:
a b c d
random 0.369865 0.398327 0.054619 0.657612
random2 0.002765 0.296481 0.770153 0.586725

Getting the data

In [7]:
df=pd.read_csv("data/smallpeople.csv",index_col=0)

The values stored in a pandas.DataFrame are mostly accessed by the column headers and row names.

We have already seen that writing a column name as a string in square brackets after the DataFrame variable retrieves that column.

In [8]:
df["Grade_John"]
Out[8]:
Name
Valentine    .
Sarah        4
Stephen      4
Jane         5
Charles      .
Name: Grade_John, dtype: object

If we want to retrieve several columns, we pass their names as a list of strings inside the square brackets.

In [9]:
df[["Grade_John","Gender","Age"]]
Out[9]:
Grade_John Gender Age
Name
Valentine . male 20
Sarah 4 female 22
Stephen 4 male 19
Jane 5 female 20
Charles . male 21

We've already seen that we can access a row with the .loc[] construction.

In [10]:
df.loc['Valentine']
Out[10]:
Grade_Paul        2
Grade_John        .
Gender         male
Age              20
Date          12:31
Name: Valentine, dtype: object

We can also ask for multiple rows similarly to multiple columns.

In [11]:
df.loc[['Valentine','Sarah']]
Out[11]:
Grade_Paul Grade_John Gender Age Date
Name
Valentine 2 . male 20 12:31
Sarah 4 4 female 22 13:20

If we want to index a DataFrame by position, like an array, it is possible using .iloc[]. Let's redo the previous operations with .iloc[]!

In [12]:
df.iloc[:,0] # the first (0th) column
Out[12]:
Name
Valentine    2
Sarah        4
Stephen      5
Jane         3
Charles      4
Name: Grade_Paul, dtype: int64
In [13]:
df.iloc[0,0]  # first element of the first column
Out[13]:
2

We can use all the indexing structures from numpy such as slicing.

In [14]:
df.iloc[::-1,3:5] 
Out[14]:
Age Date
Name
Charles 21 14:55
Jane 20 14:50
Stephen 19 12:35
Sarah 22 13:20
Valentine 20 12:31

We can even extract the contents of a DataFrame as a numpy.array and use the methods from the previous notebooks on it.

In [15]:
df.values
Out[15]:
array([[2, '.', 'male', 20, '12:31'],
       [4, '4', 'female', 22, '13:20'],
       [5, '4', 'male', 19, '12:35'],
       [3, '5', 'female', 20, '14:50'],
       [4, '.', 'male', 21, '14:55']], dtype=object)

Selecting columns by name is much safer than accessing them by position, because it reduces the chance of using the wrong column.
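
As a small sketch (using the column layout shown above), both lines below pick out the Age column, but only the first one stays correct if the column order ever changes:

df["Age"]       # label-based: always the Age column
df.iloc[:, 3]   # position-based: whichever column happens to be fourth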

Adding new row/column, deletion

If we want to add a new row to our table, we assign a list whose length matches the number of columns to .loc['index_of_new_row'].

In [16]:
df.loc["David"]=[5,5,"male",20,'12:32']
df
Out[16]:
Grade_Paul Grade_John Gender Age Date
Name
Valentine 2 . male 20 12:31
Sarah 4 4 female 22 13:20
Stephen 5 4 male 19 12:35
Jane 3 5 female 20 14:50
Charles 4 . male 21 14:55
David 5 5 male 20 12:32

If we create a new column, we use a similar notation, but without the .loc, because that indexes the rows.

In [17]:
df["Advanced"]=[0,0,1,1,0,0]
df
Out[17]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Valentine 2 . male 20 12:31 0
Sarah 4 4 female 22 13:20 0
Stephen 5 4 male 19 12:35 1
Jane 3 5 female 20 14:50 1
Charles 4 . male 21 14:55 0
David 5 5 male 20 12:32 0

If we want to delete a row, we can use the drop function. Here we can use the inplace option, which controls whether the function returns a new DataFrame or overwrites the existing one.

In [18]:
df.drop("Valentine",inplace=True)
df
Out[18]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 4 4 female 22 13:20 0
Stephen 5 4 male 19 12:35 1
Jane 3 5 female 20 14:50 1
Charles 4 . male 21 14:55 0
David 5 5 male 20 12:32 0

If we want to delete a column, we can do it similarly with drop, we just have to use the other axis. Note that here, without the inplace option, we get a new DataFrame as the return value.

In [19]:
df.drop("Advanced",axis=1)
Out[19]:
Grade_Paul Grade_John Gender Age Date
Name
Sarah 4 4 female 22 13:20
Stephen 5 4 male 19 12:35
Jane 3 5 female 20 14:50
Charles 4 . male 21 14:55
David 5 5 male 20 12:32

Accessing row and column names

We write out the column names as follows.

In [20]:
df.columns
Out[20]:
Index(['Grade_Paul', 'Grade_John', 'Gender', 'Age', 'Date', 'Advanced'], dtype='object')

We write out the row names as follows.

In [21]:
df.index
Out[21]:
Index(['Sarah', 'Stephen', 'Jane', 'Charles', 'David'], dtype='object', name='Name')

Sometimes we need the above as plain Python lists.

In [22]:
df.columns.tolist()
Out[22]:
['Grade_Paul', 'Grade_John', 'Gender', 'Age', 'Date', 'Advanced']
In [23]:
list(df.columns)
Out[23]:
['Grade_Paul', 'Grade_John', 'Gender', 'Age', 'Date', 'Advanced']
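
The row names can be converted to a list in the same way:

df.index.tolist()   # ['Sarah', 'Stephen', 'Jane', 'Charles', 'David']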

Simple operations

We can perform operations on the whole DataFrame, as long as they make sense.

In [25]:
sub_df=df[["Grade_Paul","Age"]]
sub_df+1
Out[25]:
Grade_Paul Age
Name
Sarah 5 23
Stephen 6 20
Jane 4 21
Charles 5 22
David 6 21

Columns can be used just like numpy.arrays.

In [26]:
(df['Grade_Paul']+2)/3
Out[26]:
Name
Sarah      2.000000
Stephen    2.333333
Jane       1.666667
Charles    2.000000
David      2.333333
Name: Grade_Paul, dtype: float64
In [27]:
df['Grade_Paul']/=2
df
Out[27]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 2.0 4 female 22 13:20 0
Stephen 2.5 4 male 19 12:35 1
Jane 1.5 5 female 20 14:50 1
Charles 2.0 . male 21 14:55 0
David 2.5 5 male 20 12:32 0
In [28]:
df['Grade_Paul']*df['Age']
Out[28]:
Name
Sarah      44.0
Stephen    47.5
Jane       30.0
Charles    42.0
David      50.0
dtype: float64

String operations work as well.

In [31]:
df['Gender']+' person'
Out[31]:
Name
Sarah      female person
Stephen      male person
Jane       female person
Charles      male person
David        male person
Name: Gender, dtype: object
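
Beyond concatenation, the .str accessor provides further vectorized string methods; a small sketch on the same columns:

df["Gender"].str.upper()    # 'FEMALE', 'MALE', ...
df["Date"].str.split(":")   # e.g. ['13', '20'] for Sarah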

These hold for rows, too.

In [32]:
sub_df.loc["David"]+3
Out[32]:
Grade_Paul     8
Age           23
Name: David, dtype: int64

Built-in aggregating functions

Some built-in functions work for aggregating values in a DataFrame.

One example is the column-wise sum:

In [33]:
df.sum()
Out[33]:
Grade_Paul                         10.5
Gender         femalemalefemalemalemale
Age                                 102
Date          13:2012:3514:5014:5512:32
Advanced                              2
dtype: object

What if we want to calculate it row by row? We can change the axis of the aggregation. The previous case used the default axis=0, which does column-wise calculations. Here we only add up the columns containing grades.

In [34]:
df[["Grade_Paul","Grade_John"]].sum(axis=1)
Out[34]:
Name
Sarah      2.0
Stephen    2.5
Jane       1.5
Charles    2.0
David      2.5
dtype: float64

Let's count how many elements there are in each column or row.

In [35]:
df.count()
Out[35]:
Grade_Paul    5
Grade_John    5
Gender        5
Age           5
Date          5
Advanced      5
dtype: int64
In [36]:
df.count(axis=1)
Out[36]:
Name
Sarah      6
Stephen    6
Jane       6
Charles    6
David      6
dtype: int64

We could also have obtained these sizes in an array-like fashion.

In [34]:
df.shape
Out[34]:
(5, 6)

Other useful built-in functions include mean, median, min, max and std.
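
A quick sketch of how these can be called on the table above (like sum and count, the DataFrame versions also accept an axis argument):

df["Age"].mean()                      # average age
df["Age"].median()                    # median age
df[["Grade_Paul", "Advanced"]].max()  # column-wise maxima
df["Age"].std()                       # standard deviation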

Boolean indexing

It is very common that we only want to look at the rows of our table that fulfill a certain condition. If we put a True/False list inside the square brackets after a DataFrame, only the rows corresponding to the True values are returned.

First, let's have a look at what happens if we test whether a column is equal to a given value.

In [37]:
df
Out[37]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 2.0 4 female 22 13:20 0
Stephen 2.5 4 male 19 12:35 1
Jane 1.5 5 female 20 14:50 1
Charles 2.0 . male 21 14:55 0
David 2.5 5 male 20 12:32 0
In [38]:
df["Gender"]=="female"
Out[38]:
Name
Sarah       True
Stephen    False
Jane        True
Charles    False
David      False
Name: Gender, dtype: bool

We see that we got a True/False value for each row. Now we put the above expression into the square brackets.

In [39]:
df[df["Gender"]=="female"]
Out[39]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 2.0 4 female 22 13:20 0
Jane 1.5 5 female 20 14:50 1

But we can test other conditions as well, for example, who got a grade better than 2 from Grade_Paul.

In [40]:
df[df["Grade_Paul"]>2]
Out[40]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Stephen 2.5 4 male 19 12:35 1
David 2.5 5 male 20 12:32 0

We can also combine conditions. We have to use the & and | operators instead of and and or, because the latter cannot compare two lists element by element. We also have to put each condition in parentheses, otherwise we get an error message.

Those people who are older than 19 and who got a grade better than 2 from Grade_Paul:

In [41]:
df[(df["Grade_Paul"]>2) & (df["Age"]>19)]
Out[41]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
David 2.5 5 male 20 12:32 0
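
The | operator works the same way; as a sketch, those who are female or younger than 20:

df[(df["Gender"]=="female") | (df["Age"]<20)]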

Ordering

We might need to order our table according to one of the columns. We can use .sort_values(by="column_name") for this operation, and we can specify whether we want an ascending (ascending=True) or a descending order (ascending=False).

The function returns the ordered table:

In [42]:
df.sort_values(by="Age",ascending=False)
Out[42]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 2.0 4 female 22 13:20 0
Charles 2.0 . male 21 14:55 0
Jane 1.5 5 female 20 14:50 1
David 2.5 5 male 20 12:32 0
Stephen 2.5 4 male 19 12:35 1

We can also order based on multiple columns:

In [44]:
df.sort_values(by=["Grade_Paul","Age"],ascending=True)
Out[44]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Jane 1.5 5 female 20 14:50 1
Charles 2.0 . male 21 14:55 0
Sarah 2.0 4 female 22 13:20 0
Stephen 2.5 4 male 19 12:35 1
David 2.5 5 male 20 12:32 0

If we want to store the ordered rows in the original DataFrame, we have to add the inplace=True keyword argument, which makes the function overwrite the DataFrame after the ordering.

In [45]:
df.sort_values(by="Age",ascending=False,inplace=True)

Of course, we could have achieved the same with the usual assignment.

In [46]:
df=df.sort_values(by="Age",ascending=False)

If we want to sort by the DataFrame index, the sort_index() function helps. Here again we have the option to set inplace=True to sort the original DataFrame in place.

In [47]:
df.sort_index(inplace=True)
In [48]:
df
Out[48]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Charles 2.0 . male 21 14:55 0
David 2.5 5 male 20 12:32 0
Jane 1.5 5 female 20 14:50 1
Sarah 2.0 4 female 22 13:20 0
Stephen 2.5 4 male 19 12:35 1