Basic Data Analysis using Python: Pt II


Introduction

     In this series, I am going to outline a basic data analysis exercise using a real-world data set. This example comes directly from a relatively simple physics experiment I was a part of, where we needed this analysis to determine several parameters before the rest of the experiment could move forward. It is not a contrived example: it uses real data captured in a real lab, and the results were required in order to proceed. We will start from the data set and demonstrate, using only Python, how to turn that data into presentation-quality graphical results. This series will not demonstrate using Python to clean large data sets; that capability is already well established elsewhere. This article will discuss the data generated in the experiment and walk through the code needed to perform the analysis.

 

The Data

     As was stated in part I, the purpose of this analysis is to find the angle of maximum intensity so that we can set the angle of the filters. Several measurements were therefore taken, producing a data set of the angle difference between the filters versus the intensity of the light. The actual measurements are shown below.

theta_degrees intensity_mV
-23 90
-14 186
-11 207
-8 220
-5 253
-2 264
1 237
4 218
7 214
10 199
13 172
16 153
19 115
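
If you want to follow along without the original lab4_data_par.csv file, a minimal sketch like the following recreates the same table as a pandas DataFrame and, optionally, writes it out as a CSV with the same column names the later code expects (save it wherever your path variable will point):

import pandas as pd

# Recreate the measured data set from the table above
data = pd.DataFrame({
    'theta_degrees': [-23, -14, -11, -8, -5, -2, 1, 4, 7, 10, 13, 16, 19],
    'intensity_mV':  [90, 186, 207, 220, 253, 264, 237, 218, 214, 199, 172, 153, 115],
})

# Optionally save it so the rest of the code can read it back with pd.read_csv
data.to_csv('lab4_data_par.csv', index=False)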

     Now, fortunately for us, this is a very simple data set. By and large, someone performing analysis on a set of data doesn't spend long on the actual analysis. We spend the vast majority of our time doing one of two things: either (a) cleaning and parsing the data to make it usable, or (b) generating presentation-quality graphs and preparing the report itself. Comparatively little time is spent on the mathematical analysis. The first thing we should do is a simple scatter plot to see what the data looks like!

Simple Scatter Plot

     Hmm, well we can see some significant variation in the data. From the experiment we should have seen a smoother curve as the intensity changes. It would seem that the error in our detector setup is showing up. We won't know how bad it is until the analysis is complete. The next question is what model do we use? At first glance it might look like a standard parabola given by the quadratic equation, y = ax² + bx + c. It could also be Gaussian. But these are actually polarizing filters, which are governed by Malus's Law. This takes the form I = I₀ cos²(x - x₀). Funnily enough, a Gaussian distribution and Malus's Law look extremely similar when plotted near the peak; it is only away from the maximum that the errors propagate significantly. It is useful to note at this point that we may not always have a physical phenomenon, in this case Malus's Law, on which to base a model. Many times we are trying to find the best model outright. Under that assumption, let's compare all three models we have come up with, quadratic, Gaussian, and Malus's Law, in our analysis.
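
To see just how close the two shapes are near the peak, here is a quick sketch that overlays a Gaussian and Malus's Law; the peak height, width, and center below are made-up illustration values, not fit results:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi / 2, np.pi / 2, 500)

# Illustrative parameters only: same peak height and location for both curves
I0, x0 = 1.0, 0.0
malus = I0 * np.cos(x - x0)**2
# A Gaussian whose curvature matches cos^2 at the peak (both behave like 1 - x^2 there)
gauss = I0 * np.exp(-(x - x0)**2)

plt.plot(x, malus, label="Malus's Law: I0 cos^2(x - x0)")
plt.plot(x, gauss, label='Gaussian: I0 exp(-(x - x0)^2)')
plt.legend()
plt.show()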

 

The Code

     For this analysis, we're going to use the SciPy stack and an additional package called LMFit. SciPy is perfectly capable of doing every one of these calculations, but LMFit is a package designed to simplify non-linear least-squares minimization and curve fitting. It also lets us quickly spit out statistics for each curve fit to help find the best one. It is not part of the default SciPy stack, but by its own description it "builds upon and extends scipy.optimize". Essentially, it's going to make our job much easier and reduce our coding a great deal; we won't have to code up the statistical analysis itself.

     We start in the usual fashion, importing the packages we want to use, which in this case are numpy, matplotlib, pandas, and lmfit. Pandas is a wonderful package that makes interfacing with spreadsheets and CSV files a breeze. We don't have to concern ourselves with writing parsers to move our data into lists or arrays; Pandas does that for us, and we can simply reference the column headers in the spreadsheet or CSV file. There is one thing to be aware of when using Pandas, however: massive data sets (>100 MB to multi-gigabyte) require us to be mindful of memory allocation, but when working with Big Data, dealing with such problems is part of the job. Eventually Pandas won't be sufficient and we'd be better off moving to another tool for extremely large data sets. Not here, though; ours is simple. We need numpy because we are calling mathematical functions, and we also have a data type issue to deal with! Matplotlib is used for our plotting, and we have already stated that LMFit will perform the actual analysis.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lmfit import Model
from lmfit.models import GaussianModel

     Obviously, the first thing we need to do is actually import our data so we can work with it. I prefer to code in Python 3, partly because I'm used to it, and one very nice feature of it is f-strings. An f-string is a string literal prefixed with the character f, into which we can embed variables directly. This allows us to, for example, define a file path once as a variable and simply call that variable any time we need it. I've also used it to change file names, plot names, and the like when I'm running the same piece of code over several data sets: I can change one variable definition rather than many. Time saved. We do actually have a data type issue to deal with, which echoes the usual problems an analyst faces. In this case, we need our data to be in floating point form (decimals) rather than integers. We could attack this in two ways. I could just as easily go back to the data set itself and attach a '.0' or format the cells to show decimals, which is fine if you don't have hundreds or thousands of cells to change. That method is often impractical, and we would be forced to deal with it in the code. As it happens, Pandas makes this easy: we just tell it to convert the data to float type.

     One last thing to mention about the data: the measurements were taken in degrees, but you almost never want to do calculations in degrees. In particular, the trigonometric functions in Python expect radians rather than degrees. We will therefore convert the degree measurements into radians, which simply means multiplying by pi/180 (numpy's deg2rad does this for us).

path = '/home/zgwolfe/ownCloud/Conda/Python'

data = pd.read_csv(f'{path}/lab4_data_par.csv')
# Must convert the integers to floats! That's what to_numeric and astype do
xdata = pd.to_numeric(data.theta_degrees).astype(float)
xdatarad = np.deg2rad(xdata)
ydata = pd.to_numeric(data.intensity_mV).astype(float)

     At this point we can verify that our data types were converted correctly and also take a look at our data by making a simple scatter plot, the same one shown above.

plt.scatter(xdata, ydata)
plt.show()
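
If you also want to confirm the type conversion behaved, a quick check of the reported dtypes does the job (a minimal sketch; both columns should now be float64):

# Both should now report float64 rather than int64
print(xdata.dtype, ydata.dtype)
print(data.dtypes)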

      Now... let the fun begin. We are finally ready to perform the actual analysis and find which model is best suited to our data set! We stated that we're going to compare three different models in order to find out which angle gives us the maximum intensity. Two of the models we'll need to code up ourselves, but the Gaussian model is built into LMFit and we can call it directly. So, we define the first two models as Python functions that return the value of the mathematical function; this is what LMFit needs.

def func(x, a, b, c):
    # Quadratic model (note the minus sign on the x**2 term, so a > 0 opens downward)
    return a * -x**2 + b * x + c

def func2(x, a, b, c):
    # Malus's Law model: I = a*cos^2(x - b), plus a constant offset c
    return a * np.cos(x - b)**2 + c

     The next part is basically the same deal. We tell LMFit to build the models and perform the fit, meaning it finds the coefficients that fit each model to the data best, then we print the statistical analysis of those coefficients and plot the curves too. More details can be found in LMFit's documentation here. One thing to note: curve fitting can sometimes be tricky. We're required to give a starting value for each coefficient, and any curve-fitting software can get "stuck" around a value that doesn't produce good results but is the best for the local region it's exploring. So if your models aren't turning out correctly, you may need to set the initial values closer to the actual ones. This generally means basing them on a real phenomenon or the tried and true method of guessing (see the sketch after the fitting code below). Since ours works outright, we'll just define them all as 1.

lmmodel1 = Model(func)
result1 = lmmodel1.fit(ydata, x = xdatarad, a = 1, b = 1, c = 1)
print(result1.fit_report())

plt.plot(xdata, result1.best_fit, 'b:', label = 'LMFit x**2 Model')

lmmodel2 = Model(func2)
result2 = lmmodel2.fit(ydata, x = xdatarad, a = 1, b = 1, c = 1)
print(result2.fit_report())

plt.plot(xdata, result2.best_fit, 'r-.', label = 'LMFit a Cos(x - b)^2')
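
As promised above, if a fit had gotten stuck in a poor local minimum, one way to seed more physical starting values would be to base them on the data itself. The choices below are illustrative guesses of my own, not part of the original analysis:

# Physically motivated starting values for the Malus's Law model:
#   a ~ peak intensity, b ~ angle (in radians) where the peak sits, c ~ background level
a0 = ydata.max()
b0 = xdatarad[ydata.idxmax()]
c0 = ydata.min()

result2_guess = lmmodel2.fit(ydata, x = xdatarad, a = a0, b = b0, c = c0)
print(result2_guess.fit_report())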

     For the final model, the Gaussian, LMFit has a model built in. The general method is the same, call the model and perform the fit, but the built-in model requires slightly different declarations. Since we're not defining the coefficients ourselves, we basically tell it to guess the starting values for us, but it's otherwise mostly the same. After the third model is coded up we can tell matplotlib to show the legend and give us our graph!

lmmodel3 = GaussianModel()
pars = lmmodel3.guess(ydata, x=xdatarad)
result3 = lmmodel3.fit(ydata, pars, x=xdatarad)
print(result3.fit_report())
    
plt.plot(xdata, result3.best_fit, 'g-', label = 'LMFit Gaussian Model')

plt.legend()
plt.show()

     Now it's time to tell Python to run the script we just wrote. We'll want to verify that our path variable actually points to where the data set is located, and then call the script. In Linux, the terminal is used for this, and our results are displayed both in the terminal window and in a pop-out window for the graph itself. I simply have to run

$ cd /path/to/script
$ python3 python_script.py

This returns to us the following results.

Python Curve Fitting

Results

Full Size Image: CLICK HERE

      OK, that has given us quite a bit of information! Ah, I see one mistake we made. It would have been better to name those functions after the mathematical models themselves rather than func and func2; now we have to recall which model we coded in each. In this case, func is the quadratic function and func2 is Malus's Law. It's worth mentioning that if you run this in a terminal, the fit reports are simply printed one after the other, not laid out as I've shown above, but it needs to be more readable here so I've taken the liberty. If you're reading this on a smaller screen you're not likely to be able to read those numbers. Find the full size image of the fit report above HERE.
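
For what it's worth, fixing that naming slip is a small change; something like the following (hypothetical names, identical behaviour) would have made the fit reports self-explanatory, since LMFit prints the function name in each report:

def quadratic(x, a, b, c):
    return a * -x**2 + b * x + c

def malus(x, a, b, c):
    return a * np.cos(x - b)**2 + c

# The function names now appear in each fit report automatically
lmmodel1 = Model(quadratic)
lmmodel2 = Model(malus)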

     Now we have our raw analysis! You might wonder why the x-axis is still in degrees. Well, it's just a label. It is sometimes easier to visualize an angle in degrees, but it is almost never useful to calculate in them. We could just as well have plotted the x-axis using the xdatarad variable rather than the original xdata variable that is in degrees. The main point is that the statistical analysis is done in radians, and we need to remember that! In the next part, we'll dig through the results and try to make some sense of them. After we do that, we would usually be required to prepare some kind of report in a real-world scenario. While we won't actually write a report, we do need to produce some presentation-quality graphics. We shall discuss both items in part III.
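
As a small preview of keeping the radians issue straight, each fit result stores its coefficients in radians; a minimal sketch, assuming the coefficient names used above (b is the peak angle of the Malus's Law model), converts the fitted peak back to degrees for reporting:

# The Malus's Law fit stores its peak angle in the 'b' parameter, in radians
peak_rad = result2.params['b'].value
peak_deg = np.rad2deg(peak_rad)
print(f'Maximum intensity at roughly {peak_deg:.1f} degrees')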

© 2016 Zachary G Wolfe -- Remember to turn your brain off for a reboot sometimes...