|
|
...making Linux just a little more fun! Python for scientific use. Part I: Data Visualization
Motivation and outlineA first step towards qualitative understanding and interpretation of scientific data is visualization of the data. Also, in order to reach a quantitative understanding, the data needs to be analyzed, e.g. by fitting a physical model to the data. The raw data may also require some initial processing in order to become useful, e.g. filtering, scaling, calibration etc. Several open source programs for data analysis and visualization exist: gnuplot, grace, octave, R, and scigraphica. Each of these has its own pros and cons. However, it seems like you always end up using more than one program to cover all the different needs mentioned above, at least if you don't have the programming abilities to write your own custom programs using e.g., Fortran or C. Recently, I came across Python and found it to be a very powerful tool. In this article, I would like to share my experience and illustrate that even with basic (or less) programming skills it is still possible to create some very useful applications for data analysis and visualization using this language. The article is centered around a few illustrative examples and covers the visualization part — data analysis will be covered in a future article. Python: a brief reviewPython was originally created by Guido van Rossum and is an interpreted programming language (like e.g. Perl) with a clear and easy-to-read syntax. You can write stand-alone applications with Python, but one of it's strengths is its ability to act as glue between different kinds of programs. The standard introduction to any programming language is the
Hello world! program. In Python this is generated by first
opening the Python interpreter by typing Python 2.3.4 (#1, Jan 21 2005, 11:24:24) [GCC 3.3.3 20040412 (Gentoo Linux 3.3.3-r6, ssp-3.3.2-2, pie-8.7.6)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>>Then the following code is typed: print "Hello world!" Python code can also be stored in a file e.g. named
#! /usr/bin/python (assuming that the python executable or a symlink to it exists)
and giving the file executable mode with Python comes with many modules, either built-in or available for separate download and installation. In this article we will use SciPy, which is a very powerful library of modules for data visualization, manipulation and analysis. It adds functionality to Python making it comparable to e.g. octave and matlab. Plotting 2-D dataExample 1: Plotting x,y dataThe first example illustrates plotting a 2-D dataset. The data to be plotted is included in the file tgdata.dat and represents weight loss (in wt. %) as a function of time. The plotting routine is in the file tgdata.py and the python code is listed below. Line numbers have been added for readability.
1 from scipy import *
2
3 data=io.array_import.read_array('tgdata.dat')
4 plotfile='tgdata.png'
5
6 gplt.plot(data[:,0],data[:,1],'title "Weight vs. time" with points')
7 gplt.xtitle('Time [h]')
8 gplt.ytitle('Hydrogen release [wt. %]')
9 gplt.grid("off")
10 gplt.output(plotfile,'png medium transparent picsize 600 400')
To run the code, download the tgdata.py.txt file, rename it
to
In line 1, everything from the SciPy module is imported. In order
to make use of the various functions of a module, the module needs
to be imported by adding an The gnuplot png option In order to plot a file with a different file name, we have to open the python source in a text editor and manually change the name of the file to be imported. We also need to change the name of the file copy if we do not want to overwrite the previous plot. This is a little tedious. We can easily add this functionality to our python script by allowing filenames to be passed as command line arguments. The modified script is called tgdata1.py and is shown below.
1 import sys, glob
2 from scipy import *
3
4 plotfile = 'plot.png'
5
6 if len(sys.argv) > 2:
7 plotfile = sys.argv[2]
8 if len(sys.argv) > 1:
9 datafile = sys.argv[1]
10 else:
11 print "No data file name given. Please enter"
12 datafile = raw_input("-> ")
13 if len(sys.argv) <= 2:
14 print "No output file specified using default (plot.png)"
15 if len(glob.glob(datafile))==0:
16 print "Data file %s not found. Exiting" % datafile
17 sys.exit()
18
19 data=io.array_import.read_array(datafile)
20
21 gplt.plot(data[:,0],data[:,1],'title "Weight vs. time" with points')
22 gplt.xtitle('Time [h]')
23 gplt.ytitle('Hydrogen release [wt. %]')
24 gplt.grid("off")
25 gplt.output(plotfile,'png medium transparent picsize 600 400')
In the first line we have imported two new modules —
The script works as follows. A default file name for the hard
copy of the plot is stored in the variable Example 2: Plotting x,y data with error barsSo far we have shown that the The code below (also available in xrddata.py.txt, the data file
is available as xrddata.dat) more or less shows
how example 1 is reproduced using
1 import os
2
3 DATAFILE='xrddata.dat'
4 PLOTFILE='xrddata.png'
5 LOWER=35
6 UPPER=36.5
7
8 f=os.popen('gnuplot' ,'w')
9 print >>f, "set xrange [%f:%f]" % (LOWER,UPPER)
10 print >>f, "set xlabel 'Diffraction angle'; set ylabel 'Counts [a.u.]'"
11 print >>f, "plot '%s' using 1:2:(sqrt($2)) with errorbars title 'XRPD data' lw 3" % DATAFILE
12 print >>f, "set terminal png large transparent size 600,400; set out '%s'" % PLOTFILE
13 print >>f, "pause 2; replot"
14 f.flush()
The code of example 2 produces the output shown below. The error
bar plot is created with
Plotting 3D-dataExample 3:Now we look at how 3-D data can be represented by using a combination of Python/gnuplot. In order for gnuplot to represent 3-D data it requires that the data is either given by a mathematical expression or stored in a data file. The data file should have either a matrix-format where z-values are given as a matrix with x and y equal to the row and column number, respectively, corresponding to each z-value or a 3 column format where each row represents a data triple with x, y, and z given by the 1., 2., and 3. column respectively. See the gnuplot manual for further details. The data to be represented in a 3-D fashion in this example is actually a collection of 2-D data files (like the one shown in example 2). Each data file corresponds to an experiment at a different time (with 150 s intervals between experiments) so we have two independent variables and one dependent variable: x(file number/time), y (diffraction angle),and z(counts) distributed across several files. This makes the 3-D data not suitable for plotting — yet. The script 3ddata_1.py shown below finds
all files with a given extension (*.x_y in this case) in the
current working directory and creates a list containing their file
names (line 5, In line 14-22 the data file is written in the (x,y,z) format
with x corresponding to time (found by multiplying the file number
with the time step), y corresponding to diffraction angle as given
by
1 import os, glob
2 from scipy import *
3
4 EXT='*.x_y'
5 FILELIST=glob.glob(EXT)
6 SIZEX = len(io.array_import.read_array(FILELIST[0]))
7 DATAMATRIX = zeros((SIZEX,len(FILELIST)), Float)
8 TWOTHETA=io.array_import.read_array(FILELIST[0])[:,0]
9 TIMESTEP=150
10
11 for y in range(len(FILELIST)):
12 DATAMATRIX[:,y]=sqrt(io.array_import.read_array(FILELIST[y])[:,1])
13
14 file = open("3ddata.dat", "w")
15
16 for y in range(len(FILELIST)):
17 for x in range(1126,1968):
18 file.write(repr(TIMESTEP*y)+" "\
19 +repr(TWOTHETA[x])+" "+repr(DATAMATRIX[x,y]))
20 file.write("\n")
21 file.write("\n")
22 file.close()
23
24 f=os.popen('gnuplot' ,'w')
25 print >>f, "set ticslevel 0.0 ;set xlabel 'Time [s]'; set ylabel 'Diffraction angle'"
26 print >>f, "set pm3d; unset surface; set view 60,75; splot '3ddata.dat' notitle"
27 print >>f, "set terminal png large transparent size 600,400; set out '3ddata_1.png'"
28 print >>f, "replot"
29 f.flush()
If you wanted to write a 3-D data file in the matrix-format, lines 14 through 22 can be replaced with the following code.
file = open("3ddata_matrix.dat", "w")
for x in range(SIZEX):
for y in range(len(FILELIST)):
file.write(repr(DATAMATRIX[x,y])+" ")
file.write("\n")
file.close()
The 3-D plot produced by the above script is shown below.
The plot above shows that it can be difficult to produce a 3-D plot of non-monotonic data that shows all of the details of the data — some of the smaller peaks are hidden behind larger peaks. It is also difficult to see changes in peak positions as a function of time. In order to bring out these details it is sometimes better to create a 2-D contour plot by projecting the z-values down into the x,y plane. This is achieved by replacing lines 24-29 of 3ddata_1.py with the code below (3ddata_2.py).
f=os.popen('gnuplot' ,'w')
print >>f, "set pm3d map; set palette rgbformulae 30,31,32; set xrange[0:4500]"
print >>f, "set xlabel 'Time [s]'; set ylabel 'Diffraction angle'"
print >>f, "splot '3ddata.dat' notitle"
print >>f, "set terminal png large transparent size 600,500; set out '3ddata.png'"
print >>f, "replot"
f.flush()
The contour plot is shown below.
The 3-D example plots were made with 39 data files, each containing 4096
data rows. The data files are available at this offsite
link, and can be unpacked with SummaryIn this article a few examples have been given in order to
illustrate that Python is indeed a powerful tool for visualization
of scientific data by combining the plotting power of gnuplot with
the power of a real programming language. It should be noted that
all the examples given here could probably have been solved with a
combination of e.g. Suggestions for further readingManuals, Tutorials, Books etc:
See also previous articles about Python published in LG. Other graphical libraries for PythonGnuplot version 4.0 has been used in the examples throughout this article. Python also works with other plotting engines/libraries:
Anders has been using Linux for about 6 years. He started out with RH
6.2, moved on to RH 7.0, 7.1, 8.0, Knoppix, has been experimenting a little
with Mandrake, Slackware, and FreeBSD, and is now running Gentoo on his
workstation (no dual boot :-) at work and Debian Sarge on his laptop at
home. Anders has (a little) programming experience in C, Pascal, Bash,
HTML, LaTeX, Python, and Matlab/Octave.
Anders has a Masters degree in Chemical Engineering and is currently
employed as a Ph.D. student at the Materials Research Department, Risö
National Laborary in Denmark. Anders is also the webmaster of Hydrogen storage at Risö.
|