# Basic Data Analysis using Python: Pt III

## Introduction

In this series, I outline a basic data analysis exercise using a real-world data set. The example comes directly from a relatively simple physics experiment I was part of; we needed this analysis to determine several parameters before moving forward with the rest of the experiment. This exercise is not *simply* an example: it uses real data captured in a real lab, and the results were required to proceed. We start from the data set and demonstrate using only Python to turn that data into presentation-quality graphics. This series will *not* demonstrate using Python to clean large data sets; that capability has already been well established. In part III, we discuss the results of the analysis itself.

## The Results

Last time, we finished part II by generating the reports of the curve fits, called fit reports, and the initial graph of our models superimposed over the scatter plot, shown below. As a reminder, a full-sized image of the fit reports is linked just under the image itself for readability.

Full Size Image: CLICK HERE

Now that we have our results, we need to determine which model actually *fits* the best. The first thing we notice from the graph is that the quadratic and Malus's Law models overlap almost perfectly, while the Gaussian model diverges somewhat. All three, however, seem to have roughly the same maximum, somewhere near zero. Remember that the purpose of this exercise is to figure out what angle will give us the maximum intensity. It would seem that all three models are capable of giving us this answer, and we'll compare each of them in turn. But first, let's look at some statistics.

One measure of the goodness of fit of each model is the chi-squared statistic. Interestingly enough, the values are all very high, indicating that none of the models fit that well, which we can also see on the graph. But they're also all about the same, so beyond telling us that the models don't fit the data well, the statistic doesn't help us choose between them. Obviously, we need a better detector setup and a more stable laser light source. The next thing we might want to focus on is the error in the coefficients. The quadratic model's coefficients have no physical significance, so their actual values aren't important. However, in each model the standard deviation of each coefficient tells us how much that figure can vary. Excessively high standard deviations might indicate that, on top of the models not fitting well overall, we may not be able to trust the individual figures either. For both the quadratic model and the Gaussian model we find two coefficients with standard deviations in the 30% range, which is not good at all.
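
As a reminder of what that statistic measures, here is a minimal sketch. It assumes the common definition, the sum of squared residuals (each optionally divided by its measurement uncertainty), with the reduced form dividing by the degrees of freedom. The function names and the numbers are made up for illustration; they are not the lab data or the values from the fit reports.

```python
def chi_squared(observed, predicted, sigma=None):
    """Sum of squared residuals, each optionally divided by its uncertainty."""
    if sigma is None:
        sigma = [1.0] * len(observed)  # unweighted: every point counts equally
    return sum(((o - p) / s) ** 2 for o, p, s in zip(observed, predicted, sigma))


def reduced_chi_squared(observed, predicted, n_params, sigma=None):
    """Chi-squared divided by the degrees of freedom (points minus fitted parameters)."""
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    return chi_squared(observed, predicted, sigma) / dof


# Made-up intensities, purely illustrative (NOT the lab data).
observed = [240.0, 243.0, 238.0, 231.0, 225.0]
predicted = [241.0, 242.0, 239.0, 233.0, 224.0]

chi2 = chi_squared(observed, predicted)
red_chi2 = reduced_chi_squared(observed, predicted, n_params=3)
print(f"chi-squared = {chi2:.2f}, reduced chi-squared = {red_chi2:.2f}")
```

Because the raw statistic grows with the number of points and the scale of the data, comparing it across models is only meaningful when they are fit to the same data set, as they are here.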

This is particularly devastating for the Gaussian model because one of those coefficients is the center of the maximum (damn!), which is the very figure we need. The Malus's Law model, by contrast, has no excessively high standard deviation for any coefficient. We might say that, in this case, our experimental setup has across-the-board error that propagates through the results but that, as far as the models go, Malus's Law is probably our best choice overall. That makes sense, since the experiment is based on that law. Remember, however, that in general we might not have the luxury of that comparison.

We can get a sense of the behavior of these models if we compare them to an idealized graph of each function. This way, we should be able to see how the functions diverge from each other as they move off the maximum. It's fairly simple to code that up using the methods we've already used.
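
A rough sketch of that comparison, using only the standard library, might look like the following. Everything here is an assumption for illustration: unit amplitude, a peak at zero, no offset, and in particular the Gaussian width (`sigma = 0.5` rad), which is an arbitrary choice. The separation you see depends on that width, so substitute the value from your own fit.

```python
import math


def ideal_malus(theta):
    """cos^2(theta): unit amplitude, peak at theta = 0, no offset."""
    return math.cos(theta) ** 2


def ideal_quadratic(theta):
    """1 - theta^2: the parabola with the same peak and curvature as cos^2 at 0."""
    return 1.0 - theta ** 2


def ideal_gaussian(theta, sigma=0.5):
    """Unit-height Gaussian centered at 0; sigma is an arbitrary illustrative width."""
    return math.exp(-(theta ** 2) / (2.0 * sigma ** 2))


def max_gap(f, g, thetas):
    """Largest absolute difference between two curves over the sampled angles."""
    return max(abs(f(t) - g(t)) for t in thetas)


# Sample angles from -0.3 to 0.3 rad in steps of 0.01 (range chosen for illustration).
thetas = [i / 100.0 for i in range(-30, 31)]

gap_malus_quadratic = max_gap(ideal_malus, ideal_quadratic, thetas)
gap_gaussian_quadratic = max_gap(ideal_gaussian, ideal_quadratic, thetas)
gap_gaussian_malus = max_gap(ideal_gaussian, ideal_malus, thetas)

print(f"Malus vs quadratic:    {gap_malus_quadratic:.4f}")
print(f"Gaussian vs quadratic: {gap_gaussian_quadratic:.4f}")
print(f"Gaussian vs Malus:     {gap_gaussian_malus:.4f}")
```

The sampled values can be handed to whatever plotting routine you used in part II to draw the curves on one set of axes.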

So what does this tell us? Under ideal conditions, we would expect the Gaussian function to diverge from the other functions quite quickly; it would be the first to separate. If we look at the fringes of the actual fits, that is what we see. We also notice that the quadratic and Malus's Law models diverge from each other considerably more slowly. With our data, it is safe to assume that we are still close enough to the maximum over our data range that we don't see any significant divergence. This also indicates that those two models likely have the same maximum! We could pick either one to find our result.

## Finding the Maxima

In our Gaussian model, the maximum is found by simply looking at the variable named "center" in the fit report. That is the center of the fit, which corresponds to the maximum. We should remember, however, that this statistic has a 39.77% standard deviation. For the Malus's Law model, the maximum is given by the initial angle parameter, b, in the fit report. But wait: -113 radians? That is not even close to what the graph indicates! The statistics also show a high number of function evaluations, meaning it took some time to find the fitting parameters. So we need to rerun this model and see if we can zero in on a closer number. We have a trick here: the quadratic model agrees very well with the Malus's Law model, so let's find its maximum and use that as the initial b parameter for Malus's Law. For this we need some basic calculus. Take the quadratic model with its fitted parameters, take its derivative, set it equal to zero, and solve for x! That gives us a setting of 0.0212368 radians for the quadratic model's maximum.
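
The calculus step can be sketched in a few lines. This assumes the quadratic model has the form `a*x**2 + b*x + c`, so the derivative `2*a*x + b` vanishes at `x = -b / (2*a)`. The coefficients below are hypothetical placeholders, not the values from the fit report; substitute your own.

```python
def quadratic_vertex(a, b):
    """Return x where a*x**2 + b*x + c has zero slope, i.e. 2*a*x + b = 0."""
    if a == 0:
        raise ValueError("not a quadratic: a must be nonzero")
    return -b / (2.0 * a)


# Hypothetical coefficients for illustration only (NOT the values from the fit report).
a_fit, b_fit = -50.0, 2.0

x_max = quadratic_vertex(a_fit, b_fit)
is_maximum = a_fit < 0  # a downward-opening parabola has its maximum at the vertex

print(f"vertex at x = {x_max:.4f} rad, maximum: {is_maximum}")
```

The constant term `c` drops out of the derivative, which is why it does not appear in the function.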

If we set b = 0.02 as the initial parameter in the Malus's Law model, it returns nearly the same information, except that the b parameter is now -0.02080533. The astute reader might notice that the model for Malus's Law isn't quite right: the parameter c doesn't usually appear in any texts, and you're right. But what are we to make of this? If you try eliminating the parameter, the model will never fit. If you subtract parameters a and c in the Malus's Law model, you get a result of about 243, which is very close to the indicated final intensity in the graph. Playing with the function itself, if you vary only c, you just move the function up and down on the y-axis, whereas varying a, which is supposed to be the initial intensity, not only moves the curve along the y-axis but also changes its size. What we observe here is a correcting term. It is experimental, not theoretical: the polarizers are not ideal, and the experiment was not performed in a perfectly dark room. The parameter c represents yet another error, background and reflected light, or at least that is my hypothesis.
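
To see the different roles of a and c, here is a small sketch. It assumes the model has the form `a * cos(theta - b)**2 + c`; the sign conventions in the model from part II may differ, and the parameter values below are hypothetical, chosen only to make the effect visible.

```python
import math


def malus_with_offset(theta, a, b, c):
    """Assumed form of the fitted model: a * cos^2(theta - b) + c."""
    return a * math.cos(theta - b) ** 2 + c


def peak_and_trough(a, b, c):
    """Model values at the peak (theta = b) and the trough (theta = b + pi/2)."""
    peak = malus_with_offset(b, a, b, c)
    trough = malus_with_offset(b + math.pi / 2, a, b, c)
    return peak, trough


# Hypothetical parameter values, for illustration only.
base = peak_and_trough(a=200.0, b=0.0, c=40.0)
shifted = peak_and_trough(a=200.0, b=0.0, c=60.0)  # only c changed
scaled = peak_and_trough(a=300.0, b=0.0, c=40.0)   # only a changed

for label, (peak, trough) in [("base", base), ("c + 20", shifted), ("a + 100", scaled)]:
    print(f"{label:>8}: peak = {peak:.1f}, trough = {trough:.1f}, height = {peak - trough:.1f}")
```

Under this assumed form, changing c moves the peak and trough by the same amount and leaves the height of the curve alone, while changing a stretches the curve and moves the peak with it.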

## Conclusions!

So in conclusion, what the data ends up telling us is that our experiment was rather poorly performed. Errors are popping out all over the place. Our data points are very inconsistent and the model fits are crap. For an experiment where the results mattered, we'd go back and do everything over again. But we were at least able to extract the angle that gives us the maximum. For the three models, the angle differences we need to set the polarizers to are as follows:

| Model | Radians | Degrees |
|---|---|---|
| Quadratic | 0.0212368 | 1.21678 |
| Malus's Law | -0.0208053 | -1.19206 |
| Gaussian | -0.0189252 | -1.08433 |
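
The degree values are just the radian values run through a unit conversion. A minimal sketch using the standard library follows; the dictionary name is mine, and the numbers are the ones reported in the table.

```python
import math

# Angles of maximum intensity, in radians, as reported in the table.
maxima_rad = {
    "Quadratic": 0.0212368,
    "Malus's Law": -0.0208053,
    "Gaussian": -0.0189252,
}

# math.degrees multiplies by 180 / pi.
maxima_deg = {name: math.degrees(value) for name, value in maxima_rad.items()}

for name, deg in maxima_deg.items():
    print(f"{name}: {maxima_rad[name]:.7f} rad = {deg:.5f} deg")
```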

For due diligence, after repeating the experiment to tighten up the errors, we'd report the standard deviations alongside those values as well. This would be somewhat of a pain with the quadratic model, because we'd have to propagate the errors given by the fit report through our derivation of its maximum; we can't extract it outright. We know our Malus's Law model has about the same goodness of fit as the others and the lowest standard deviations in its parameters, at least at first. With each model at least agreeing on about where the maximum is, we shall officially report the angle at which the maximum occurs from our model of choice, Malus's Law, as follows.
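
For the curious, the propagation for the quadratic model could be sketched as below. It assumes the form `a*x**2 + b*x + c` with the maximum at `x = -b / (2*a)`, and uses standard first-order error propagation. The covariance between a and b defaults to zero here, which is a simplifying assumption; a real fit usually reports a nonzero value. All numbers are hypothetical, not taken from the fit report.

```python
import math


def vertex_uncertainty(a, b, sigma_a, sigma_b, cov_ab=0.0):
    """First-order standard deviation of x* = -b / (2a)."""
    dx_da = b / (2.0 * a ** 2)   # partial derivative of x* with respect to a
    dx_db = -1.0 / (2.0 * a)     # partial derivative of x* with respect to b
    variance = (
        (dx_da * sigma_a) ** 2
        + (dx_db * sigma_b) ** 2
        + 2.0 * dx_da * dx_db * cov_ab
    )
    return math.sqrt(variance)


# Hypothetical coefficients and standard deviations (NOT from the fit report).
a_fit, b_fit = -50.0, 2.0
sigma_a, sigma_b = 5.0, 0.6

sigma_x = vertex_uncertainty(a_fit, b_fit, sigma_a, sigma_b)
print(f"x* = {-b_fit / (2.0 * a_fit):.4f} +/- {sigma_x:.4f} rad")
```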

| Model | Radians | σ (rad) | Degrees | σ (deg) | % σ |
|---|---|---|---|---|---|
| Malus's Law | -0.021 | 0.006 | -1.2 | 0.3 | 28.57% |
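
The percentage column is the standard deviation divided by the magnitude of the value. A quick sketch of how the row can be reproduced from the two radian figures (variable names are mine):

```python
import math

# Reported value and standard deviation of b, in radians, from the table.
b_rad, sigma_rad = -0.021, 0.006

b_deg = math.degrees(b_rad)
sigma_deg = math.degrees(sigma_rad)
percent_sigma = abs(sigma_rad / b_rad) * 100.0  # relative standard deviation

print(
    f"{b_rad:.3f} +/- {sigma_rad:.3f} rad = "
    f"{b_deg:.1f} +/- {sigma_deg:.1f} deg ({percent_sigma:.2f}%)"
)
```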

You might notice that the standard deviation on that parameter is now quite high, up there with the others. The position of that peak is simply subject to high error in these models. The standard deviation shown at first, when the b parameter was far from the correct value, was an aberration of the curve fit; now, with the correct value, it joins the others. It is not a coincidence that the parameters with the highest error are the ones directly related to the position of the maximum. The only thing left to do now is to tackle making those graphs look professional for our imaginary report.