Basic Data Analysis using Python: Pt I
In this series, I am going to outline a basic data analysis exercise using a real-world data set. The example comes directly from a relatively simple physics experiment I was a part of, where we needed this analysis to determine several parameters before we could move forward with the rest of the experiment. This exercise is not simply an example; it uses real data captured in a real lab, and the results were required to proceed. We will start from the data set and use only Python to turn that data into presentation-quality graphical results. This series will not demonstrate using Python to clean large data sets; that capability is already well established. In this first part, I discuss the experimental setup and the tools and software used.
The experiment measures how much optically active compounds, such as fructose and other sugars, "rotate" plane-polarized light. The setup, shown below, takes a 532 nm (green) laser and shoots it through a polarizing filter to plane-polarize the beam. The beam then passes through the compound in the sample cell, which rotates the plane of polarization by a certain angle. Next, the beam passes through a second polarizing filter, to remove interference, and the detector measures the intensity. This intensity can be linked to the rotation angle from the sample.
Before a sample measurement is taken, the polarizing filters must be adjusted to maximize the intensity of the light. To do this, we measure the intensity without a sample and plot these intensities against the filter angle. A non-linear curve is then fit to the data points in order to find the maximum of the curve. The angle at which the fitted intensity peaks is the angle the filters need to be set to. This is the basis for this data exercise.
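As a rough idea of what such a fit can look like, here is a minimal sketch. It assumes the no-sample intensity follows Malus's law with a constant background, I = I0 cos^2(theta - theta0) + offset. That model is my assumption for illustration and not necessarily the one used later in this series. The data are synthetic stand-ins, not the lab data, and the names malus, angles, and intensity are made up for the sketch. The fit itself uses curve_fit from SciPy.

```python
import numpy as np
from scipy.optimize import curve_fit


def malus(theta_deg, i0, theta0_deg, offset):
    """Assumed model: Malus's law plus a constant background (angles in degrees)."""
    return i0 * np.cos(np.radians(theta_deg - theta0_deg)) ** 2 + offset


# Synthetic stand-in for the measurements (NOT the real lab data).
rng = np.random.default_rng(0)
angles = np.arange(0.0, 181.0, 10.0)              # filter angle, degrees
intensity = malus(angles, 5.0, 62.0, 0.3)         # made-up "true" parameters
intensity += rng.normal(0.0, 0.05, angles.size)   # simulated detector noise

# Initial guess: amplitude from the data range, peak at the brightest point.
p0 = [intensity.max() - intensity.min(),
      angles[np.argmax(intensity)],
      intensity.min()]
popt, pcov = curve_fit(malus, angles, intensity, p0=p0)

i0_fit, theta0_fit, offset_fit = popt
theta_max = theta0_fit % 180.0    # cos^2 repeats every 180 degrees
theta_err = np.sqrt(pcov[1, 1])   # 1-sigma uncertainty from the covariance matrix

print(f"maximum intensity at {theta_max:.1f} +/- {theta_err:.1f} degrees")
```

The fitted peak angle, not the brightest individual data point, is what you would set the filters to, since the fit uses every measurement and also gives an uncertainty estimate.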
The tools used to perform the analysis are based entirely on Python. I have also used Atom, a highly configurable text editor. If you have used Microsoft Visual Studio Code, then you'll find Atom very similar in layout. Atom allows for configuration down to the base code and is open source (and free). Alongside Atom, I run several add-ons. The first of these is platformio-ide-terminal. This places a Linux terminal window at the bottom of the editor window, and the use for this will become apparent a little later. The next set of packages is a new concept for Atom, and it brings IDE functionality to the editor. The base package is called atom-ide-ui, which includes language support for C#, Java, PHP, Flow, JSON and TypeScript. However, several community packages exist to extend language support. Among these is support for Haskell, Rust, and, of course, Python. C/C++ has a package, but it is in its infancy. The Python package, ide-python, requires that you install the Python language server from pip.
The language package helps considerably with formatting the code, following language conventions and, of course, catching errors. This is an invaluable feature for anyone programming in Python. Another package that I have used is called Hydrogen, which brings Jupyter Notebook functionality to Atom. However, I have since dropped such notebooks from my programming routine. I find that, while the notebooks are novel, there are several conflicts that make them hard to use for serious analysis. These mostly center around how rerunning the code multiple times can corrupt variable definitions, requiring a restart of the kernel. This is where Atom shines. With the aforementioned packages, I can write the code to output text and graphics to files that I open in Atom, and use the terminal to run the code via Python itself. With these output files open in Atom, rerunning the code updates them automatically, as if in a Jupyter notebook. The only difference is that Atom is simply showing the new file with the same name. This avoids many of the conflicts seen in Jupyter. Below is a screenshot of my Atom setup; click the photo for a full-screen view.
It is from the terminal window shown in the screenshot that I run the Python code and read any print statements it produces. This helps considerably when writing the code, debugging, and doing the usual programming tasks. The terminal is also a fully functional Linux terminal. If I were coding in C, I could run my compiler commands just the same, or anything else I might be inclined to do from a Linux terminal. The editor can be found at atom.io along with the corresponding packages, themes, etc.
But I must confess, I do not routinely use Atom anymore. It is a very nice program that many people use, and the functionality from the main program and its plugins can be very useful. However, my projects are never big enough to really need a full IDE. I'm not managing huge programs with tons of linkages to other packages and libraries. I might as well use the Linux terminal and Emacs (what I do use), GEdit, or something similar. In fact, as an example, Richard Stallman [launched the GNU Project and founded the Free Software Foundation] and Linus Torvalds [creator and long-time principal developer of the Linux kernel, and creator of Git] reportedly use Emacs, or a version thereof, as their principal editor in their extensive work.
This is something of a lesson I read from a programmer and author named Zed Shaw about using IDEs when starting out as a programmer: don't. The basic principle is that you need to hunt out errors (syntax, procedural, etc.) yourself. His philosophy, stated in his The Hard Way series, is that stronger programmers don't use IDEs (see Stallman and Torvalds above), have no trouble producing code at the same speed IDE users do, and have a better fundamental grasp of the language. Those who use a text editor like Emacs or GEdit have a leg up on learning new programming languages, for which lesser coders would need to wait for an IDE to be developed. I find it somewhat simpler just to avoid the bells and whistles of an IDE and learn the language, and if that philosophy holds true, it will make me a better programmer in the long run.
Much discussion takes place in the corners of the web about which programming software package is the best, and there are just as many opinions as there are programs. Emacs tends to pop up a great deal in these discussions because it's old, very old, and still has a large, active development community. It's also a deceptively deep program, with everything from a built-in e-mail client to games. It's also probably the hardest to pick up right away; the learning curve is massive, but also worth it. As a parting comment, what you use is your choice. I just think the simpler approach to my editor gives me a greater understanding of what I'm doing, and my tools are quickly available on just about any OS still being developed.
Here I have set up the initial experiment that is the basis of our analysis, which is why we are doing this in the first place. I have also spoken at length about the tools used to write the code for the analysis itself. That is, perhaps, an important step for anyone looking into data analysis to consider. After all, you are becoming a programmer, maybe not a software developer, but you are writing code to perform a job that produces results, perhaps very important results. Someone else might need to see your work, understand it, and change it. Therefore, you should learn to program properly, and the choice of tools is not arbitrary, at least to me, because you also need to be comfortable with that workflow. One parting comment on not using Jupyter notebooks for this analysis, a traditional use for them: as I stated, the definitions seem to get corrupted occasionally as I rerun the code, and I don't know exactly why. Maybe it's because the notebook kernel keeps its state between runs, so cells executed repeatedly or out of order can leave stale definitions behind. Regardless, I can simply "revert" to a more FORTRAN-like approach and write the code to read input files and write output files directly, which is more universal than sticking them inside a notebook. Besides, you're very likely to need the output files anyway, such as the numerical result of an OLS regression or the trend line fit graph.
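To make the read-input-files, write-output-files idea concrete, here is a minimal sketch using only the standard library. The file names input.csv and results.txt, the x and y column names, and the helper names ols_line and analyze are all hypothetical, and the four data points are made up so the script can run on its own. It is not the analysis code from this series, just an illustration of the workflow.

```python
import csv
from pathlib import Path


def ols_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def analyze(in_path, out_path):
    """Read x,y pairs from a CSV file and write the fit result to a text file."""
    with open(in_path, newline="") as f:
        rows = [(float(r["x"]), float(r["y"])) for r in csv.DictReader(f)]
    xs, ys = zip(*rows)
    slope, intercept = ols_line(xs, ys)
    Path(out_path).write_text(
        f"slope = {slope:.6g}\nintercept = {intercept:.6g}\n"
    )
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical file names; the tiny made-up data set keeps the sketch runnable.
    Path("input.csv").write_text("x,y\n0,1\n1,3\n2,5\n3,7\n")
    analyze("input.csv", "results.txt")
```

Run from the terminal, every execution starts from a clean interpreter and overwrites results.txt, so an editor that has the file open just shows the latest result and no state is left over from a previous run.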
In the next part, we will get into the actual analysis itself.