The EPA's UCMR3 data set, an example of why Environmental Professionals (or any researcher) can benefit from a little Python Scripting.


The public and private sectors alike invest huge sums of money into software infrastructure to handle large databases for the average person to use effectively. However, if you're in a position where part of your job is research and understanding scientific or technical information, you likely find that dealing with large data sets pops up from time to time and maybe someone hasn't parsed that data for you yet.

If you're the person that's required to keep on top of industry trends or follow issues to help influence and make policy, you might run into data sets like the EPA's UCMR occurrence data. This stands for Unregulated Contaminate Monitoring Rule, and every so often the EPA requires every water system to measure a list of contaminates that are, as of yet, unregulated for data collection purposes. Some of these may eventually end up in the regulatory cross hairs.

What if you need to know values or a certain subset of facilities in this case? Maybe those along the Yadkin River? Or maybe you want to see an entire state's results by method or compound. The problem in this case is that, even though they subset the data by state or method, its just sorted that way. Its a gigantic national data set spread over a handful of text files but if you open these, you'll discover they're are hundreds of thousands of entries! Actually, there are exactly 1,075,414 entries in the UCMR3 Occurrence data. How the heck are you supposed to make heads or tails of that when all you have are these .txt files?! If you want to see what I mean, you can download the occurrence data from the EPA's website. Its all public information.

Actually, this problem could be solved quickly with just a few lines of python code, six to be exact. Even with a data set this large, there are tons of field names such as facility ID, method ID, compound name, state, etc. In my case, I was simply curious to see if my old facility participated in the UCMR3 and if so what did they test for. I wrote the code below to extract, by facility ID, all of the associated Facility's entries from the EPA's UCMR Occurrence data set.

Its useful to note at this point that simple sorting and search jobs could be performed in a spreadsheet program itself - if the data set is small enough. The entire UCMR3 occurrence data doesn't fit in either LibreOffice Calc 6 nor Excel 2016. Both programs have a max row size of 1,048,567 (2^20) and the number of entries is 1,075,414. Also, even if a slightly smaller data set does fit by number of rows, trying to sort and search in a data set that size will be very slow unless you're running this on a massive machine. Otherwise, that data will need to be imported in some fashion, to an SQL database for example which is not for the faint of heart or time-strapped person. You could also import it into Microsoft Access and then link to excel... or write 6 lines of Python to extract the info you want in minute. There are software options for this too that don't require coding, but honestly this is much faster!

Believe it or not, that short bit of code will parse a database with over a million entries (literally) and return in a new csv file with all rows where the field "PWSID" contains "NC0229025", in this case, it returned 140 entries. I am also happy to report that the UCMR3 listed per-fluro-alkyl compounds were all below the EPA's minimum reporting limits (something similar to a non-detect) in the system's water source.

I could have reduced it to four lines by not using path and name variables but it makes it easier to edit for multiple files. As you might imagine, such scripting is extremely flexible for whatever problems you run into or need to know and limited only by your skill and access to StackExchange. If you're running to data sets like this often (even if they're not as big) with no way to sort them, you can benefit from this basic understanding of Python and get the data you want much faster. There are a number of good sites out there to learn Python, Linkdin has a some videos particularly for data analysis, amazon has some books, find one you like and dig in.

Don't be afraid to hit up StackExchange.com for any coding issues you run into. Whatever problems you find, someone else has first and it's probably there. Every coder, even veteran programmers use StackExchange for that reason. As a parting comment, if you're parcing data sets like that often, that script can be easily extended to run as an command line program with user inputs.

© 2016 Zachary G Wolfe -- Remember to turn your brain off for a reboot sometimes...