Data Analysis Workflow with Python

Author: Austin Pursley
Date: 9/13/2022

My goal here is to document my current understanding of a data analysis workflow with Python. This demo reflects a generic, made-up scenario, similar to device-testing scenarios I'm familiar with. Nothing too substantive, just enough to illustrate how this goes. In this workflow, the data is given in the form of CSV files or similar, and will be collected, analyzed, and reported using pandas, matplotlib, Jinja, etc.

Step 1: Collect Data

We first want to organize the data as well as we can for analysis. This can be a challenge because 'raw' data often comes from disparate sources. I set up the dummy data to reflect this; in the made-up scenario, there are (1) three types of sensors, (2) three units of each sensor, (3) three trials of data collection per unit, and (4) three data points per trial. Each sensor has produced its own CSV file, so data from those files will have to be gathered and combined into one whole.

File list

The first step is to get a list of all of the data files. The CSV files all live under a "data" folder, each within its own subfolder. A recursive glob can find every CSV file within this folder hierarchy.
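A minimal sketch of that step, assuming the folder is literally named "data":

```python
from pathlib import Path

# Recursively collect every CSV under the "data" folder and its subfolders.
csv_files = sorted(Path("data").glob("**/*.csv"))
print(csv_files)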

Read Files

Then I can read the CSVs into pandas dataframes.
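Something like the following, reading each file into its own dataframe:

```python
import pandas as pd

# One dataframe per CSV file (one file per sensor in this scenario).
dfs = [pd.read_csv(f) for f in csv_files]
```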

Combine

Next I will combine the data into one dataframe (this could have been done in one step, but I like breaking it down like this).
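The combination itself is a single concat:

```python
# Stack all the per-file dataframes into a single dataframe.
df = pd.concat(dfs)
```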

Let's examine the combined dataframe to make sure the data looks okay.
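In a notebook this would be a quick sanity check, something like:

```python
# Quick sanity check: row/column counts and the first few rows.
print(df.shape)
print(df.head())
```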

Clean

Looks pretty good. One thing does need fixing: the index should be reset.
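```python
# concat kept each file's original row numbers; replace them with a
# single clean 0..n-1 index.
df = df.reset_index(drop=True)
```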

There is also a typo, "Trail" should be "Trial" (whoops). Instead of editing the CSV files, I will just make the change here.
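The typo could be in a column header or in the values themselves; assuming it's a column header, the fix is one rename:

```python
# Assuming the typo is in a column header: rename "Trail" to "Trial".
df = df.rename(columns={"Trail": "Trial"})
```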

Now let's look at the data types that were automatically inferred when the CSVs were read.
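```python
# Inspect the dtypes pandas inferred when reading the CSVs.
print(df.dtypes)
```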

The 'Time' column does not have the correct type; it should be 'datetime'. It's not critical here, but incorrect date/time data seems to be a common issue, so I'll make a point to correct it.
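```python
# Parse the 'Time' column into a proper datetime64 dtype.
df["Time"] = pd.to_datetime(df["Time"])
```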

There we go, we've got our data neatly collected in one place and cleaned up a bit.

This dummy data is uniform, so it's straightforward to combine. But 'real' data can have more variation and quirks: different datetime formats, CSVs with extra columns, data that needs many more corrections, etc. In all cases, care should be taken to make sure the data is congruent and correct.

Step 2: Analysis

This will be a fairly straightforward analysis of the collected data using Python's pandas library. I've tried to highlight some tricks / capabilities of pandas to demonstrate its power.

Grouping

With pandas, it's easy to group data for analysis. As an example, let's look at descriptive statistics. As seen in the tables below, I can either get statistics for each individual sensor unit or for each sensor type.
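A sketch of both groupings; the column names "Sensor", "Unit", and "Value" are my assumptions about the dummy data:

```python
# Descriptive statistics per sensor unit and per sensor type.
by_unit = df.groupby(["Sensor", "Unit"])["Value"].describe()
by_sensor = df.groupby("Sensor")["Value"].describe()
print(by_unit)
print(by_sensor)
```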

Scoring

Assigning a pass/fail or similar scoring metric may be part of the analysis.

I've made up a scoring system for this scenario; a hypothetical stand-in for it appears in the sketch below.

With the ability to group data, I can build out the scoring analysis throughout the different "levels" of data, starting with "by trial" and ending with "by sensor".
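The actual scoring rules aren't reproduced here, so this sketch substitutes a made-up threshold rule; the THRESHOLD value and the "Value" column are both assumptions. What it does show is the level-by-level rollup via grouping:

```python
# Hypothetical scoring rule: a trial passes if its mean value is below
# a made-up threshold; a unit passes if all its trials pass, and a
# sensor passes if all its units pass.
THRESHOLD = 100.0

trial_mean = df.groupby(["Sensor", "Unit", "Trial"])["Value"].mean()
trial_pass = trial_mean < THRESHOLD

# Roll the pass/fail result up through the levels.
unit_pass = trial_pass.groupby(level=["Sensor", "Unit"]).all()
sensor_pass = unit_pass.groupby(level="Sensor").all()
```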

Plots

Pandas has built-in methods to plot data, which use the matplotlib library by default. However, these methods can be cumbersome. I almost always skip them and use matplotlib "as is".

Below I've made two plots to demo matplotlib. The first came from my initial instinct to plot every trial; it's a good example of how to get pandas grouping and matplotlib plotting to work together. However, I realized it could be improved by plotting by sensor unit instead, which is shown in the second plot.
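A sketch of the by-unit version, using matplotlib directly rather than the pandas plot methods (same column-name assumptions as above):

```python
import matplotlib.pyplot as plt

# One line per sensor unit, driven by a pandas groupby.
fig, ax = plt.subplots()
for (sensor, unit), grp in df.groupby(["Sensor", "Unit"]):
    ax.plot(grp["Time"], grp["Value"], marker="o", label=f"{sensor} unit {unit}")
ax.set_xlabel("Time")
ax.set_ylabel("Value")
ax.legend()
plt.show()
```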

Step 3: Report

Now that I've done the analysis with Python, I want to present those results in a LaTeX document. I just use LaTeX because I'm familiar with it through work; I think a Jupyter notebook like this one would work just as well.

Tables

Pandas has a 'to_latex' method for converting dataframes to LaTeX table code. It is decent, but not as good as its HTML/CSS counterpart: it lacks some features and has a few bugs. I've managed to make it work with some workarounds. In particular, I've found it works best to aim for a "half-way" solution that can be tailored and polished later in a LaTeX editor (e.g. Overleaf).

I'm going to show how two different styles of LaTeX tables can be achieved. One is the booktabs style, which is used in a lot of scientific and technical publications. The other is what might be considered the default "lines" table.

Booktabs style

The code below generates LaTeX table code for each set of data (i.e., dataframe) I've selected.
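A sketch for one of them, reusing the by_sensor table from the grouping step; the caption and label text are placeholders:

```python
# booktabs-style LaTeX: to_latex emits \toprule/\midrule/\bottomrule
# by default, which is exactly the booktabs style.
booktabs_tex = by_sensor.to_latex(
    caption="Statistics by sensor type",
    label="tab:by-sensor",
    float_format="%.2f",
)
print(booktabs_tex)
```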

Here is a screenshot of what one of these booktabs-styled tables looks like after being compiled:

Example of booktabs styled table

"Lines" Style

A screenshot of one of the "lines" styled tables:

Example of 'lines' styled table

Figures

The plots can be added through figures within LaTeX. Even though the code is simple, I've gone ahead and created a separate template for them.
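A minimal sketch of such a template; the \VAR{...} delimiter is a common trick to keep Jinja syntax from colliding with LaTeX braces, and the field names here are my own invention:

```python
import jinja2

# Minimal LaTeX figure template with custom Jinja delimiters.
FIGURE_TEMPLATE = r"""
\begin{figure}[htb]
    \centering
    \includegraphics[width=0.8\textwidth]{\VAR{path}}
    \caption{\VAR{caption}}
    \label{fig:\VAR{label}}
\end{figure}
"""

env = jinja2.Environment(
    variable_start_string=r"\VAR{",
    variable_end_string="}",
)
figure_tex = env.from_string(FIGURE_TEMPLATE).render(
    path="plots/by_unit.png", caption="Sensor data by unit", label="by-unit"
)
print(figure_tex)
```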

Report Template

I've used Jinja templates to create LaTeX tables and figures, and now I'm going to use another template to generate a full LaTeX document with those tables and figures. The resulting document is very basic, good for testing that elements compile correctly, but it could serve as a nearly formed report. Generating the report from a template is especially useful when there are a large number of tables and/or plots, although of course that's not the case here. For actual documents that are simple, it may make more sense to skip the report template.
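A made-up minimal version of that document template, reusing the Jinja environment and the pre-rendered table and figure strings from the earlier sketches:

```python
# Skeleton report template that drops the rendered pieces into place.
REPORT_TEMPLATE = r"""
\documentclass{article}
\usepackage{booktabs}
\usepackage{graphicx}
\begin{document}
\section{Results}
\VAR{tables}
\VAR{figures}
\end{document}
"""

report_tex = env.from_string(REPORT_TEMPLATE).render(
    tables=booktabs_tex, figures=figure_tex
)
with open("report.tex", "w") as f:
    f.write(report_tex)
```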

The generated report is printed below, and the resulting PDF (compiled in Overleaf) can be seen here. Looking at the report, some tables are not set exactly how I'd want them. But, again, the goal was not to generate perfect tables and plots, just to get them mostly complete and ready for refinement.