Simple operations, masking

Let us begin the notebook with the usual imports.

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import pandas as pd

Creating a DataFrame

We already know how to read in a DataFrame, but what if we want to create one ourselves?

It is easiest to create one from columns stored in a dict, where each key maps to a list or an array.

In [3]:
df=pd.DataFrame({'random1':random.random(4),
                 'zeroes':[0 for i in range(4)]})
df
Out[3]:
random1 zeroes
0 0.413749 0
1 0.102535 0
2 0.284745 0
3 0.620163 0

We can also give an index by hand.

In [4]:
df=pd.DataFrame({'random1':random.random(4),
                 'zeroes':[0 for i in range(4)]},
                 index=['a','b','c','d'])
df
Out[4]:
random1 zeroes
a 0.422387 0
b 0.030502 0
c 0.221540 0
d 0.551693 0

If we want to build a DataFrame from rows instead, we can pass a list of rows. The column names can be given in the columns keyword argument.

In [5]:
sor1=random.random(4) 
sor2=[0 for i in range(4)]

df=pd.DataFrame([sor1,sor2],columns=['a','b','c','d'],
                index=['random','zeroes'])
df
Out[5]:
a b c d
random 0.638204 0.767368 0.167672 0.83644
zeroes 0.000000 0.000000 0.000000 0.00000

It is possible to create a DataFrame from a 2D numpy array.

In [6]:
df=pd.DataFrame(random.random((2,4)),columns=['a','b','c','d'],
               index=['random','random2'])
df
Out[6]:
a b c d
random 0.369865 0.398327 0.054619 0.657612
random2 0.002765 0.296481 0.770153 0.586725

Getting the data

In [7]:
df=pd.read_csv("data/smallpeople.csv",index_col=0)

The values stored in a pandas.DataFrame are mostly accessed by the column headers and row names.

We have already seen that writing a column name as a string in square brackets after the DataFrame variable retrieves that column.

In [8]:
df["Grade_John"]
Out[8]:
Name
Valentine    .
Sarah        4
Stephen      4
Jane         5
Charles      .
Name: Grade_John, dtype: object

If we want to retrieve several columns, we pass their names as a list of strings inside the square brackets.

In [9]:
df[["Grade_John","Gender","Age"]]
Out[9]:
Grade_John Gender Age
Name
Valentine . male 20
Sarah 4 female 22
Stephen 4 male 19
Jane 5 female 20
Charles . male 21

We've already seen that we can access a row with the .loc[] construction.

In [10]:
df.loc['Valentine']
Out[10]:
Grade_Paul        2
Grade_John        .
Gender         male
Age              20
Date          12:31
Name: Valentine, dtype: object

We can also ask for multiple rows similarly to multiple columns.

In [11]:
df.loc[['Valentine','Sarah']]
Out[11]:
Grade_Paul Grade_John Gender Age Date
Name
Valentine 2 . male 20 12:31
Sarah 4 4 female 22 13:20

If we want to index a DataFrame by position, like an array, it is possible using .iloc[]. Let's redo the previous operations with .iloc[]!

In [12]:
df.iloc[:,0] # the first (0th) column
Out[12]:
Name
Valentine    2
Sarah        4
Stephen      5
Jane         3
Charles      4
Name: Grade_Paul, dtype: int64
In [13]:
df.iloc[0,0]  # first element of the first column
Out[13]:
2

We can use all the indexing structures from numpy such as slicing.

In [14]:
df.iloc[::-1,3:5] 
Out[14]:
Age Date
Name
Charles 21 14:55
Jane 20 14:50
Stephen 19 12:35
Sarah 22 13:20
Valentine 20 12:31

We can even extract the contents of a DataFrame as a numpy.array and use the methods from the previous notebooks on it.

In [15]:
df.values
Out[15]:
array([[2, '.', 'male', 20, '12:31'],
       [4, '4', 'female', 22, '13:20'],
       [5, '4', 'male', 19, '12:35'],
       [3, '5', 'female', 20, '14:50'],
       [4, '.', 'male', 21, '14:55']], dtype=object)

Selecting columns by name is much safer than accessing them by position, because it reduces the chance of using the wrong column.
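
As a small sketch (using the column layout shown above), both lines below pick out the Age column, but only the first one stays correct if the column order ever changes:

df["Age"]       # label-based: always the Age column
df.iloc[:, 3]   # position-based: whichever column happens to be fourth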

Adding new row/column, deletion

If we want to add a new row to our table, we assign a list whose length matches the number of columns to .loc['index_of_new_row'].

In [16]:
df.loc["David"]=[5,5,"male",20,'12:32']
df
Out[16]:
Grade_Paul Grade_John Gender Age Date
Name
Valentine 2 . male 20 12:31
Sarah 4 4 female 22 13:20
Stephen 5 4 male 19 12:35
Jane 3 5 female 20 14:50
Charles 4 . male 21 14:55
David 5 5 male 20 12:32

If we create a new column, we use a similar notation, but without the .loc, because that indexes the rows.

In [17]:
df["Advanced"]=[0,0,1,1,0,0]
df
Out[17]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Valentine 2 . male 20 12:31 0
Sarah 4 4 female 22 13:20 0
Stephen 5 4 male 19 12:35 1
Jane 3 5 female 20 14:50 1
Charles 4 . male 21 14:55 0
David 5 5 male 20 12:32 0

If we want to delete a row, we can use the drop function. Here we can use the inplace option, which controls whether the function returns a new DataFrame or overwrites the existing one.

In [18]:
df.drop("Valentine",inplace=True)
df
Out[18]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 4 4 female 22 13:20 0
Stephen 5 4 male 19 12:35 1
Jane 3 5 female 20 14:50 1
Charles 4 . male 21 14:55 0
David 5 5 male 20 12:32 0

If we want to delete a column, we can do it similarly with drop, we just have to use the other axis. Note that here, without the inplace option, we get a new DataFrame as the return value.

In [19]:
df.drop("Advanced",axis=1)
Out[19]:
Grade_Paul Grade_John Gender Age Date
Name
Sarah 4 4 female 22 13:20
Stephen 5 4 male 19 12:35
Jane 3 5 female 20 14:50
Charles 4 . male 21 14:55
David 5 5 male 20 12:32

Accessing row and column names

We write out the column names as follows.

In [20]:
df.columns
Out[20]:
Index(['Grade_Paul', 'Grade_John', 'Gender', 'Age', 'Date', 'Advanced'], dtype='object')

We write out the row names as follows.

In [21]:
df.index
Out[21]:
Index(['Sarah', 'Stephen', 'Jane', 'Charles', 'David'], dtype='object', name='Name')

Sometimes we need the above as plain Python lists.

In [22]:
df.columns.tolist()
Out[22]:
['Grade_Paul', 'Grade_John', 'Gender', 'Age', 'Date', 'Advanced']
In [23]:
list(df.columns)
Out[23]:
['Grade_Paul', 'Grade_John', 'Gender', 'Age', 'Date', 'Advanced']
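
The row names can be converted to a list in the same way:

df.index.tolist()   # ['Sarah', 'Stephen', 'Jane', 'Charles', 'David']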

Simple operations

We can perform operations on the whole DataFrame, as long as they make sense.

In [25]:
sub_df=df[["Grade_Paul","Age"]]
sub_df+1
Out[25]:
Grade_Paul Age
Name
Sarah 5 23
Stephen 6 20
Jane 4 21
Charles 5 22
David 6 21

Columns can be used just like numpy.arrays.

In [26]:
(df['Grade_Paul']+2)/3
Out[26]:
Name
Sarah      2.000000
Stephen    2.333333
Jane       1.666667
Charles    2.000000
David      2.333333
Name: Grade_Paul, dtype: float64
In [27]:
df['Grade_Paul']/=2
df
Out[27]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 2.0 4 female 22 13:20 0
Stephen 2.5 4 male 19 12:35 1
Jane 1.5 5 female 20 14:50 1
Charles 2.0 . male 21 14:55 0
David 2.5 5 male 20 12:32 0
In [28]:
df['Grade_Paul']*df['Age']
Out[28]:
Name
Sarah      44.0
Stephen    47.5
Jane       30.0
Charles    42.0
David      50.0
dtype: float64

String operations work as well.

In [31]:
df['Gender']+' person'
Out[31]:
Name
Sarah      female person
Stephen      male person
Jane       female person
Charles      male person
David        male person
Name: Gender, dtype: object
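
Beyond concatenation, the .str accessor provides further vectorized string methods; a small sketch on the same columns:

df["Gender"].str.upper()    # 'FEMALE', 'MALE', ...
df["Date"].str.split(":")   # e.g. ['13', '20'] for Sarah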

These hold for rows, too.

In [32]:
sub_df.loc["David"]+3
Out[32]:
Grade_Paul     8
Age           23
Name: David, dtype: int64

Built-in aggregating functions

Some built-in functions work for aggregating values in a DataFrame.

One example is the column-wise sum:

In [33]:
df.sum()
Out[33]:
Grade_Paul                         10.5
Gender         femalemalefemalemalemale
Age                                 102
Date          13:2012:3514:5014:5512:32
Advanced                              2
dtype: object

What if we want to calculate it row by row? We can change the axis of the aggregation. The previous case used the default axis=0, which does column-wise calculations. Here we only add up the columns containing grades.

In [34]:
df[["Grade_Paul","Grade_John"]].sum(axis=1)
Out[34]:
Name
Sarah      2.0
Stephen    2.5
Jane       1.5
Charles    2.0
David      2.5
dtype: float64

Let's count how many elements there are in each column or row.

In [35]:
df.count()
Out[35]:
Grade_Paul    5
Grade_John    5
Gender        5
Age           5
Date          5
Advanced      5
dtype: int64
In [36]:
df.count(axis=1)
Out[36]:
Name
Sarah      6
Stephen    6
Jane       6
Charles    6
David      6
dtype: int64

We could also have obtained these sizes in an array-like fashion.

In [34]:
df.shape
Out[34]:
(5, 6)

Other useful built-in functions include mean, median, min, max and std.
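
A quick sketch of how these can be called on the table above (like sum and count, the DataFrame versions also accept an axis argument):

df["Age"].mean()                      # average age
df["Age"].median()                    # median age
df[["Grade_Paul", "Advanced"]].max()  # column-wise maxima
df["Age"].std()                       # standard deviation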

Boolean indexing

It is very common that we only want to look at the rows of our table that fulfill a certain condition. If we put a True/False list inside the square brackets after a DataFrame, only the rows corresponding to the True values are returned.

First, let's have a look at what happens if we test whether a column is equal to a given value.

In [37]:
df
Out[37]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 2.0 4 female 22 13:20 0
Stephen 2.5 4 male 19 12:35 1
Jane 1.5 5 female 20 14:50 1
Charles 2.0 . male 21 14:55 0
David 2.5 5 male 20 12:32 0
In [38]:
df["Gender"]=="female"
Out[38]:
Name
Sarah       True
Stephen    False
Jane        True
Charles    False
David      False
Name: Gender, dtype: bool

We see that we got a True/False value for each row. Now we put the above expression into the square brackets.

In [39]:
df[df["Gender"]=="female"]
Out[39]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 2.0 4 female 22 13:20 0
Jane 1.5 5 female 20 14:50 1

But we can test other conditions as well, for example, who got a grade better than 2 from Grade_Paul.

In [40]:
df[df["Grade_Paul"]>2]
Out[40]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Stephen 2.5 4 male 19 12:35 1
David 2.5 5 male 20 12:32 0

We can also combine conditions. We have to use the & and | operators instead of and and or, because the latter cannot compare two lists element by element. We also have to put each condition in parentheses, otherwise we get an error message.

Those people who are older than 19 and who got a grade better than 2 from Grade_Paul:

In [41]:
df[(df["Grade_Paul"]>2) & (df["Age"]>19)]
Out[41]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
David 2.5 5 male 20 12:32 0
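
The | operator works the same way; as a sketch, those who are female or younger than 20:

df[(df["Gender"]=="female") | (df["Age"]<20)]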

Ordering

We might need to order our table according to one of the columns. We can use .sort_values(by="column_name") for this operation, and we can specify whether we want an ascending (ascending=True) or a descending order (ascending=False).

The function returns the ordered table:

In [42]:
df.sort_values(by="Age",ascending=False)
Out[42]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Sarah 2.0 4 female 22 13:20 0
Charles 2.0 . male 21 14:55 0
Jane 1.5 5 female 20 14:50 1
David 2.5 5 male 20 12:32 0
Stephen 2.5 4 male 19 12:35 1

We can also order based on multiple columns:

In [44]:
df.sort_values(by=["Grade_Paul","Age"],ascending=True)
Out[44]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Jane 1.5 5 female 20 14:50 1
Charles 2.0 . male 21 14:55 0
Sarah 2.0 4 female 22 13:20 0
Stephen 2.5 4 male 19 12:35 1
David 2.5 5 male 20 12:32 0

If we want to store the ordered rows in the original DataFrame, we have to add the inplace=True keyword argument, which makes the function overwrite the DataFrame after the ordering.

In [45]:
df.sort_values(by="Age",ascending=False,inplace=True)

Of course, we could have achieved the same with the usual assignment.

In [46]:
df=df.sort_values(by="Age",ascending=False)

If we want to sort by the DataFrame index, the sort_index() function helps. Here again we have the option to set inplace=True to sort the original DataFrame in place.

In [47]:
df.sort_index(inplace=True)
In [48]:
df
Out[48]:
Grade_Paul Grade_John Gender Age Date Advanced
Name
Charles 2.0 . male 21 14:55 0
David 2.5 5 male 20 12:32 0
Jane 1.5 5 female 20 14:50 1
Sarah 2.0 4 female 22 13:20 0
Stephen 2.5 4 male 19 12:35 1