If we want to handle a large amount of data, we have to use the files stored on the external storages (HDD, other devices) of the computer and not only the memory. In the following notebook, we are going to familiarize ourselves with the file handling functions of Pyhton, then we get to know the routines of numpy
, and then we are going to introduce some formats used in scientific research (hdf5, fits). We begin our notebook with the usual line:
%pylab inline
In this chapter we are going to have a look at Python's function for file operations that are independent of the modules. These functions handle the content of files mostly as character strings, therefore, using, reading and writing formatted data might be painful. Usually, it is done through format-specific functions. For the examples, we are going to use a file that contains historic observations of sunspots.
Reading a whole file into one character string happens according to the following method.
First of all, we open the file with the open()
function. The open()
function prepares the system for reading the file from the external storage. In its easiest form, we just give the name of the datafile to this function.
file1 = open('data/SN_m_tot_V2.0.txt')
file1
The above line enables us to reach the content of the file through the file1
variable. The complete content of the file might be read into a character string variable called wholefile
.
wholefile=file1.read()
Attention, we should not mix the file1
and wholefile
variables. As we have alread mentioned, the wholefile
variable is a character string, and file1
enables the reading and/or writing of the file, that we also call "stream".
If we are finished with the reading/writing, we can close our stream with the close()
command, that interrupts the communication with the external storage.
print(file1.closed)
file1.close()
print(file1.closed)
Above, we have already loaded the content of the data file into the string wholefile
, for example, let's have a look at the first 100 characters of the file:
wholefile[:100]
If we have a look at the file itself, we can see that indeed, this is the text at its beginning. Observe the \n
character instead of the new line!
read()
can read only a specific number of characters. Let us open the previous file again.
file1 = open('data/SN_m_tot_V2.0.txt')
Let us read 10 characters in the following way:
chars=file1.read(10)
The variable chars1
contains the first 10 characters of the file.
chars
If we call again the read()
function of the file1
variable , then it is going to read the next 10 characters:
chars1=file1.read(10)
chars2
Thus, read makes us go through the characters of the file: if we once read something, we cannot read that same part again. Fortunately, this is not the case, since tell()
and seek()
can ask and/or set the position of an open filestream.
The tell()
function gives us the actual stream position in the file1
variable, which is now 20, because we used the .read(10)
method two times.
file1.tell()
The seek()
method can jump to an arbitrary place in the file. seek(n)
jumps to the n
th character of the file. If we apply the read function after this, it is going to read the file from this position.
file1.seek(5)
The next command reads 10 characters beginning from the 5th.
chars3=file1.read(10)
chars3
Let us close the file again!
file1.close()
In order to never forget closing a file, that might cause a hassle, Python has a very convenient construction, the with
. This calls the given objects __enter__()
and __exit()__
functions automatically. Variables stay there after the with
block, but files close automatically after it.
Advantages of with
:
with open('data/SN_m_tot_V2.0.txt') as f:
print(f.closed) # here the file is open
print(f.closed) # here the file is closed
In many cases, the information in our datafile is organized as a table. Each row of the file has the same marks for grouping our data. We can get all lines of an open file with the readlines()
function.
with open('data/SN_m_tot_V2.0.txt') as f:
lines = f.readlines()
The elements of the list lines
are lines of the file, for example here are the first ten lines:
lines[0:10]
As we see, the first six lines begin with #
, and each of these lines contain the description of the columns. After the 7th line, tough, there are numbers separated by spaces, which is the data itself, but unfortunately, still in the form of a string.
The first valuable line is the 7th:
lines[6]
A character string may be split with the .split()
method along given characters into a list. The .split()
method defaults to splitting on spaces, therefore, the first valuable line might be split by:
lines[6].split()
The result is thus a list that contains the valuable data, but they are still small strings! We have to convert these strings into float numbers using the float()
function. Here is the conversion of the fourth column of the first data line:
float(lines[6].split()[3])
If we can convert one element to a number, then using a for
cycle, we can arrange our data into lists using the known way:
num_sunspot=[]
measurement_time=[]
for sor in sorok[6:]:
num_sunspot.append( float(sor.split()[3]) )
measurement_time.append( float(sor.split()[2]) )
Let us plot the number of sunspots as a function of time:
plot(measurement_time,num_sunspot)
xlabel('T [year]')
ylabel('Number of sunspots')
We can see that the number of sunspots oscillates periodically in the last 250 years.
If we read very big files, it is not worth to, and it is also not possible to read the whole file into the memory. Python has a solution to this problem as well, we can iterate through the open file in a for cycle:
with open('data/SN_m_tot_V2.0.txt') as f:
num_sunspot,measurement_time=[],[]
for sor in f: # iterating through a data stream or file
if sor[0]!='#':
num_sunspot.append( float(sor.split()[3]) )
measurement_time.append( float(sor.split()[2]) )
plot(measurement_time,num_sunspot) # this plot is the same as the previous one
xlabel('T [year]')
ylabel('Number of sunspots')
Next to reading the data, we might need to write out the processed information. Python has the write()
function for this task for data streams. Let us observe writing on a simple example!
We open a file, in which we would like to write, we can signal this through the mode
parameter of the open()
function by setting mode='w'
. Here, 'w'
means writing.
We can write character strings now to the file2
stream using the write()
function.
with open('data/out_mentes1.dat',mode='w') as f:
print(f)
with open('data/out_mentes1.dat',mode='w') as f:
print(f.write('#Ez az elso kiirt fileom!\n'))
If we would like to append characters to an existing file, then we can set the mode
parameter of the open()
function to mode='a'
. Here, 'a'
meand append, as for example in the case of lists.
with open('data/out_mentes1.dat',mode='a') as f:
print(f)
Let us write the measurement time and number of sunspots data into the already existing data/out_save1.dat
file! We are going to do this using a for cycle.
with open('data/out_mentes1.dat',mode='a') as f:
for i in range(len(measurement_time)):
f.write(str(measurement_time[i])+' '+str(num_sunspot[i])+'\n')
The character string in the argument of the write()
function contains the elements we wanted with a space in between, and a linebreak given by '\n'
.
Similarly to reading, we can write out whole lines when writing. Not surprisingly, the function is called writelines()
. This function writes lists of character strings into a file through an open filestream.
If every element in the list ends by '\n'
, then after writing, each list element is going to be a single line in the file.
sorok=['This is the first line\n','This is the second line\n']
with open('data/out_mentes1.dat',mode='a') as f:
f.writelines(sorok)
What if there is no '\n'
?
sorok=['This is the third line','Where is it?\n']
with open('data/out_mentes2.dat',mode='a') as f:
f.writelines(sorok)
Let us observe the file itself!
If we want to write lines, and we don't want to add '\n'
each time, we can use the print()
function.
sorok=['This is the third line','Where is it?\n']
with open('data/out_mentes3.dat',mode='w') as f:
for s in sorok:
print(s,file=f)
character | meaning |
---|---|
\n | newline |
\r | carriage return |
\t | horizontal space (TAB) |
\v | vertical space (TAB) |
\xhh.. | hexadecimal character with hh value |
The effect of '\r'
, '\t'
, and '\n'
using the print()
function:
print("THESE ARE CAPITALSthese are lowercase")
print("THESE ARE CAPITALS\rthese are lowercase")
print("THESE ARE CAPITALS\tthese are lowercase")
print("THESE ARE CAPITALS\nthese are lowercase")
These characters behave the same way when writing to a file.
As we have already seen, the array
class of the numpy
package has many advantages compared to a simple list
variable. The numpy
package provides some very useful file handling routines, that can be used to read array
class variables from a file or write them to a file. Below, we are going to have a look at two interesting examples.
First, we are going to analyze the data from the jump of Felix Baumgartner. The distance vs time data can be found in the data/h_vs_t
file. Have a look at the file itself! It contains two columns of numbers. The first column is the time in seconds, the second is the height at that given timepoint in meters. Such simple filestructures are easy to read and load into array
s with the loadtxt
function of numpy
.
baum_data=loadtxt('data/h_vs_t.txt')
We store the first column of the baum_data
array in the variable t
, and the second column in the variable h
.
t=baum_data[:,0] # time
h=baum_data[:,1] # height
Let us plot the data! We pay attention to not leaving the labels from the axes.
plot(t,h)
xlabel('Time [s]')
ylabel('Height [m]')
It was an important question concerning the jump whether sound velocity will be achieved. Let's investigate from the data, whether it happened. First of all, we need the time dependence of velocity. We can achieve this through numberically derivating the height-time function. If we samply an $y(x)$ function with a finite number of $x_i,y_i$ pairs, then the numerical derivative is roughly the following difference:
$$\left . \frac{\mathrm{d}y}{\mathrm{d}x}\right|_{x_i} =\frac{y_{i+1}-y_i}{x_{i+1}-x_i} $$
Let us now define a function, that creates the numerical derivative from two arrays x
and y
.We have to take care of the first and the last elements of the arrays differently.
# numerical derivative function
def nderiv(y,x):
"Frist neighbor differetial"
n = len(y) # number of datapoints
d = zeros(n) # initializing an array with zeros
# handling endpoints
for i in range(1,n-1):
d[i] = (y[i+1]-y[i])/(x[i+1]-x[i]) # derivative of a general point
d[0] = (y[1]-y[0])/(x[1]-x[0]) # derivative of the first point
d[n-1] = (y[n-1]-y[n-2])/(x[n-1]-x[n-2]) # derivative of the last point
return d
We can now determine the velocity with the help of the nderiv
function.
v=nderiv(h,t) # The first variable is 'h' and the second is 't'!!!
How does the velocity-time function look like?
plot(t,v)
xlabel('Time [s]')
ylabel('Velocity [m/s]')
Because the speed of sound depends on height, it is reasonable to plot the velocity as a function of height in order to determine whether Felix Baumgartner has crossed the threshold.
plot(h,abs(v))
xlabel('Height [m]')
ylabel('Velocity [m/s]')
According to Wikipedia, around 25 km of height, the velocity of sound is slightly below 300 m/s. At this height, the velocity of Felix Baumgartner has been around 350 m/s, therefore he has succeeded in setting the record according to the measurements.
We can save the velocity-time data with the savetxt
function of numpy
.
savetxt('data/out_tv.txt',[t,v])
Observe the file!
Writing big arrays into textfiles might be very slow. If no human is ever going to read our files, only computers, we can choose to use the binary load/save functions of numpy
, that enables very fast reading/writing.
Let us now make a big array.
# big array
a=random.random((2048,2048))
Let us write it out as a text, and let's measure how long it takes.
%time savetxt('data/out_tmp.txt',a)
Let's save it in binary mode. This is simply done by using the save
function of the numpy
module. This function enables binary mode writing for variables.
%time save('data/out_tmp.npy',a)
Let's read it as text.
%time a1=loadtxt('data/out_tmp.txt')
Let's read it as a binary.
%time a2=load('data/out_tmp.npy')
Let's compare the two arrays.
all(a1==a2)
Let's compare the file sizes!
! du -h data/out_tmp.txt
! du -h data/out_tmp.npy
There can even be one order of magnitude difference in the writing when we use the binary mode. The size of binary files is also smaller.
It might happen that we want to save a simple table or anything else that can eb easily represented in a file. In such cases, we don't have to create difficult read/write functions, because we can use the pickle
package of Python. This can convert any python objects to bytes, and save it to a file.
Assume that we would like to save a complex number that we've already worked on, and we'd like to use it later.
class Complex:
"""
Our own complex class instead of that of Python.
We could still work on it...
"""
def __init__(self,re,im):
self.re,self.im=re,im
def __add__(self,other):
return Complex(self.re+other.re,self.im+other.im)
def __mul__(self,other):
re=self.re*other.re - self.im*other.im
im=self.re*other.im + self.im*other.re
return Complex(re,im)
def __abs__(self):
return (self.re**2 + self.im**2)**0.5
def __str__(self):
return str(self.re)+' + '+str(self.im)+'i'
__repr__=__str__
z=Complex(3,4)
'test:',z,abs(z),z+z,z*z
import pickle # loading the picke module as usual
We open a file and we can write into it. We use the function dump
from the pickle
module.
with open("data/out_complex.pkl","wb") as f:
pickle.dump(z,f)
Let's load it back. This is done by pickle
module's load
function:
with open("data/out_complex.pkl","rb") as f:
uj_z = pickle.load(f)
uj_z
It is very easy to manipulate images in Python. Now, we only have a look at the most common matplotlib
functions, but there are many other packages for handling image files, e.g. opencv,scikit-image,pillow.
Loading an image is easiest with the function matplotlib.image.imread
.
im=imread('data/Photo_51_x-ray_diffraction_image.jpg')
The loaded image has numpy.array
type, that we already know.
type(im)
It is possible to have a look at the image with other colorscales or at another scale.
imshow(im[:,:,2],cmap='Blues') # the blue channel with a colormap 'Blues'
colorbar() # colorscale
When manipulating images, we have to pay attention to the fact that they are mostly composed of uint8
type numbers. The values of this type are integer values represented on 8 bits, and therefore, they are in the [0-255] interval.
im.dtype
After we determine the spatial structure of the DNA from the above picture, we can save the image. The extension of the filename automatically determines the format of the picture.
imsave('data/out_saved_image.jpg',im) # saving to jpg
imsave('data/out_saved_image.png',im) # saving to png
imsave('data/out_saved_image.tiff',im) # saving to tiff
Hierarchical Data Format, or HDF is a widespread format for storing or sharing large amounts of scientific data. It has two main versions, that are not really compatible with each other. In this part, we are going to have a look at the newer HDF5 files.
The data stored in HDF5 files can be arranged into groups and subgroups, between which we can make references. We can imagine it as a directory structure that contains data everywhere.
There are two main parts of an HDF5 file, containing the descriptive metadata, and the data itself. The data is stored in multidimensional tables. We can refer to the subgroups of an HDF5 file as in a traditional directory structure by giving names to the subgroups. There is a "root" group, that is denoted by "/", and then a certain data array can be refferred to as "/groupname/data".
In Python, the package h5py
is the most common way of handling HDF5 files.
import h5py
We can open HDF5 files with the h5py.File
funciton. First, we have to decide whether we would like to read or write the given file through the 'r' or 'w' options that are similar to that of the built-in open()
function.
f = h5py.File("data/data.h5", "r")
The names of the elements of the data directory can be retrieved as dictionary keys with the keys()
method.
list(f.keys()) # if we don't transform it into a list, it won't be printed nicely
One step down the hierarchy, there are the following keys:
list(f["data"].keys())
Let's have a look at the data/data
group contanining the most information.
f["data/data"] # this contains a data table
f["/data/data"][:] # we get an array now
These files can store very complex data structures. The groups and subgroups can encode different relations. Of course, reading such files is more difficult than simple text files, and to be able to interpret a certain file, we have to have some information on its structure.
Next, we are going to show how to visualize data from HDF5 files.
f = h5py.File("data/h5ex_t_enum.h5", "r")
By applying the values()
method, we can see that in this file there is only one data group with a name DS1. Moreover, the system also shows that we have to read a 4x7 matrix.
list(f.values())
As we have already seen, the imshow
function of Python can represent an array as an image.
imshow(f["DS1"])
More information on HDF5 files:
h5py
FITS is the most common image format of modern astronomy. Apart from storing visual information, these files keep metadata that is indispensable for further processing, such as exact time or setting of the device the image has been taken with.
FITS file have two main parts. First is the header that stores the metadata, and the second is the observed data series. Mostly, these data series are 2D arrays corrsesponding to images captures by the CCD sensors of telescopes, but it is possible to store more dimensions as well, with a maximum of 999.
Out of these two main parts, multiple segments can be built and concatenated into one FITS file. These segments are called HDUs (Header/Data Units). One FITS may contain multiple HDUs. For example, one file can store images taken with different filters.
Sometimes, the first HDU is called Primary, and the further HDUs are called Extend. This meand that by default, the first HDU is used.
Three types of data can be stored in FITS images:
We can work with FITS files by loading the astropy
package. The module is gigantic, and its astropy.io
submodule handles the FITS images. The astropy.wcs
submodule can convert between different coordinate systems that can also be used for the representation of the images.
#from astropy.io import fits
import astropy.io.fits as fits
#from astropy import wcs as wcs
import astropy.wcs as wcs
If we have a look at the example datafile data/HorseHead.fits
, we can see that it cannot be loaded as a textfile with the cat
command of bash
, because it contains mostly unreadable binary data.
! cat data/HorseHead.fits | head -n 2 | tail -1
We can load the file with the fits.open
command of the astropy
package.
hdu=fits.open('data/HorseHead.fits')
The loaded data has several different properties. One of them is that we can have a look at the HDUs that are contained in it. In this case, out picture has two segments.
hdu.info()
After choosing a segment, it is possible to have a look at the content of the header with the header
option. We can see that there is much information encoded into the header.
len(hdu[0].header)
hdu[0].header[:10] # Ez csak az első 10 sor a header-ből
The visual information is stored in the data
option of the segment. This has the type of the already familiar numpy.array
.
image=hdu[0].data
image.shape
image
The advantage of the fits files used in anstronomy is that along the information encoded in the picture, the header contains important metadata such as the properties of the method the picture was made with. There is also information on the celestial coordinates of the picture. The coordinate information can be listed with the wcs.WCS
command.
kord=wcs.WCS(hdu[0].header)
kord
Knowing the coordinates enables us to make a figure whose axes are the celestial coordinate axes, therefore, we are able to read the celestial position of the given part of the picture. First, we make the box in which we are going to put our picture analogous to the drawing of 3D plots:
subplot(111,projection=kord) # telling matplotlib to use celestial coordinates
Let's load the box with the information from the image.
subplot(111,projection=kord)
plt.imshow(image, cmap="magma", origin='lower')
plt.xlabel('RA')
plt.ylabel('Dec')