import os
import sys
import pathlib
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider
# Print versions for reproducibility
print("Python version:", sys.version.split()[0])
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("matplotlib version:", plt.matplotlib.__version__)
%load_ext autoreload
%autoreload 2
Python version: 3.12.3
pandas version: 2.3.3
numpy version: 2.3.4
matplotlib version: 3.10.7
Survey Analysis Pipeline (MS Forms → Clean Data → Plots)
Date: Oct 2025
This project documents an end-to-end pipeline for analyzing survey results: ingesting raw exports, cleaning and restructuring responses, and producing report-ready visualizations (including diverging stacked bar charts for Likert-scale questions). It was inspired by a scenario where I had to analyze survey results exported from MS Forms. I wanted the workflow to be (1) repeatable, (2) auditable from raw data → final charts, and (3) easy to extend as questions, groups, and response options evolve. Through this project I've also tried to mature my project and code structure, and to understand how to "collaborate" with LLMs like ChatGPT (in a sensible, non-vibe-coding sense).
Project highlights:
- Practical data engineering for messy human-entered survey data (multi-select parsing, metadata, tidy formats)
- Visualization design choices for categorical and Likert-scale responses
- Reproducible project structure (clean separation of raw/, interim/, processed/, and figures/)
- “Human-readable analysis”: notebooks as documentation, with helper functions to keep code maintainable
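Part 1 covers the multi-select parsing in detail; as a minimal sketch of the idea (the column names and responses here are invented, not from the real export), MS Forms delivers multi-select answers as one delimited string per respondent, which can be exploded into a tidy one-row-per-option format:

```python
import pandas as pd

# Hypothetical MS Forms multi-select column: all chosen options arrive
# as a single semicolon-delimited string per respondent.
df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "tools_used": ["Python;Excel", "Excel", "Python;R;Excel"],
})

# Explode into one row per (respondent, option) -- a tidy format that
# makes value counts and grouped plots straightforward.
tidy = (
    df.assign(option=df["tools_used"].str.split(";"))
      .explode("option")
      .drop(columns="tools_used")
)

counts = tidy["option"].value_counts()
print(counts)  # Excel: 3, Python: 2, R: 1
```

The tidy shape is what makes later steps cheap: once each selection is its own row, a `value_counts()` or a `groupby` is all a chart needs.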
Note: this notebook uses a dummy scenario + dummy data.
Series navigation: Part 0 (Notes) → Part 1 (Pre-Processing) → Part 2 (Multiple Choice Analysis) → Part 3 (Likert-Scaled Visualizations)
Part 0 — Notes, References, and Design Decisions
This notebook is the “project notebook”: a place to capture the setup, references, and visualization ideas that influenced the implementation in later parts. I keep this separate on purpose so the processing/analysis notebooks stay focused and runnable, while this one holds the reasoning and breadcrumbs I’d want if I revisited the work months later.
In particular, it includes:
- Repro / environment notes (versions, repo structure, and how to run the notebooks)
- Visualization references for Likert-scale chart design (diverging stacked bars)
- A small interactive plotting scratchpad (used to sanity-check interactivity and layout mechanics before applying them to survey charts)
Project Structure, Repro, Environment
- Python/pandas/numpy/matplotlib versions printed above.
- Install & run:
make venv && uv pip sync, or pip install -e ., then open the notebook
.
├── data
│ ├── 0_raw
│ │ ├── msforms_dummy_survey_data_dictionary.csv
│ │ ├── msforms_dummy_survey.xlsx
│ │ └── survey_plot_color_dictionary.csv
│ ├── 1_interim
│ │ ├── 0_survey_raw_clean.csv
│ │ └── 1_survey_data_cleaned.csv
│ └── 2_processed
│ ├── 0_survey_responses_daily.csv
│ ├── 1_survey_time.csv
│ └── 2_survey_multi_vc.csv
├── figures
├── Makefile
├── notebooks
│ ├── 2025-survey-pipeline-part0.ipynb
│ ├── 2025-survey-pipeline-part1.ipynb
│ ├── 2025-survey-pipeline-part2.ipynb
│ └── 2025-survey-pipeline-part3.ipynb
├── pyproject.toml
├── README.md
├── requirements.txt
├── src
│ └── survey_pipeline
│ └── analysis_utils.py
├── tests
│ └── test_utils.py
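A small sketch of how the notebooks can resolve these folders with pathlib (assuming the notebook runs from notebooks/; the constant names are my own, not from the repo):

```python
from pathlib import Path

# Resolve the data folders relative to the notebooks/ directory,
# mirroring the tree above.
PROJECT_ROOT = Path("..").resolve()
DATA_RAW = PROJECT_ROOT / "data" / "0_raw"
DATA_INTERIM = PROJECT_ROOT / "data" / "1_interim"
DATA_PROCESSED = PROJECT_ROOT / "data" / "2_processed"
FIGURES = PROJECT_ROOT / "figures"

# Create any missing folders; exist_ok makes this safe to re-run.
for p in (DATA_RAW, DATA_INTERIM, DATA_PROCESSED, FIGURES):
    p.mkdir(parents=True, exist_ok=True)
```

Keeping these paths in one place (e.g., in analysis_utils.py) means every notebook reads and writes to the same raw → interim → processed stages, which is what makes the pipeline auditable.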
from pathlib import Path
import mermaidian as mm
# Mermaid code (use <br/> for line breaks in labels)
diagram_code = r"""
flowchart LR
%% Inputs
RawXLSX["Survey export<br/>(raw_xlsx .xlsx)"]
DDict["Data dictionary<br/>(ddict)"]
Colors["Colors dictionary<br/>(colors)"]
%% Notebooks
P0["Part 0<br/>Notes / Design"]
P1["Part 1<br/>Pre-Processing"]
P2["Part 2<br/>Multiple Choice"]
P3["Part 3<br/>Likert"]
%% Outputs
Clean["interim/1_survey_data_cleaned.csv"]
Timing["processed/0_survey_responses_daily.csv<br/>processed/1_survey_time.csv"]
Fig1["figures/<br/>QC timing plots"]
MultiVC["processed/2_survey_multi_vc.csv"]
Fig2["figures/<br/>bar charts"]
Fig3["figures/<br/>Likert plots"]
%% Flow
RawXLSX --> P1
DDict --> P1
Colors --> P1
P1 --> Clean
P1 --> Timing
P1 --> Fig1
Clean --> P2
DDict --> P2
Colors --> P2
P2 --> MultiVC
P2 --> Fig2
MultiVC --> P3
DDict --> P3
Colors --> P3
P3 --> Fig3
%% Notes influence
P0 -.-> P1
P0 -.-> P2
P0 -.-> P3
"""
png = mm.get_mermaid_diagram(
"png",
diagram_code,
theme="default",
)
# Display + save
mm.show_image_ipython_centered(png, margin_top=10, margin_bottom=10)
outdir = Path("..")
outdir.mkdir(parents=True, exist_ok=True)
mm.save_diagram_as_image(outdir / "notebook_pipeline_io.png", png)
print("Saved:", outdir / "notebook_pipeline_io.png")
Saved: ../notebook_pipeline_io.png
Qualitative Research
Codes
Some context on codes, since I ran into the term codebook (below). While I'm not actually coding any qualitative data here, the concept is interesting enough to document.
From: Saldaña, J. (2016). The Coding Manual for Qualitative Researchers (3rd ed.). London, UK: Sage.
A code in qualitative inquiry is most often a word or short phrase that symbolically assigns a summative, salient, essence-capturing, and/or evocative attribute for a portion of language-based or visual data. The data can consist of interview transcripts, participant observation field notes, journals, documents, open-ended survey responses, drawings, artifacts, photographs, video, Internet sites, e-mail correspondence, academic and fictional literature, and so on. The portion of data coded during first cycle coding processes can range in magnitude from a single word to a full paragraph, an entire page of text or a stream of moving images. In second cycle coding processes, the portions coded can be the exact same units, longer passages of text, analytic memos about the data, and even a reconfiguration of the codes themselves developed thus far.
In qualitative data analysis, a code is a researcher-generated construct that symbolizes or “translates” data (Vogt, Vogt, Gardner, & Haeffele, 2014, p. 13) and thus attributes interpreted meaning to each individual datum for later purposes of pattern detection, categorization, assertion or proposition development, theory building, and other analytic processes.
Data Dictionaries / Codebooks
A data dictionary is useful for this project, especially for adding metadata to raw survey export from MS Forms.
From: https://guides.library.upenn.edu/c.php?g=564157&p=9554907
Data dictionaries and codebooks are essential documentation of the variables, structure, content, and layout of your datasets. A good dictionary/codebook has enough information about each variable for it to be self explanatory and interpreted properly by someone outside of your research group. The terms are often used interchangeably, but codebooks tend to [be] for survey data and allow the reader to follow the structured format of the survey and possible response value[s].
From: https://dataworks.faseb.org/helpdesk/kb/creating-codebook
According to The Encyclopedia of Survey Research Methods, “Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file…. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean.” A codebook is analogous to a data dictionary, but for qualitative data instead of quantitative. However, you will sometimes see the terms used interchangeably.
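As a toy illustration of why the dictionary earns its keep here (the headers and variable names below are invented, not from the real files): MS Forms exports use the full question text as column headers, and a data dictionary lets you map those to short, stable variable names plus per-question metadata.

```python
import pandas as pd

# Hypothetical slice of a data dictionary: one row per survey column,
# mapping the raw MS Forms header to a short name and question type.
ddict = pd.DataFrame({
    "raw_column": ["How satisfied are you with the tool?",
                   "Which tools do you use?"],
    "var_name": ["satisfaction", "tools_used"],
    "question_type": ["likert", "multi_select"],
})

# Hypothetical raw export: verbose question text as column headers.
raw = pd.DataFrame({
    "How satisfied are you with the tool?": ["Agree", "Disagree"],
    "Which tools do you use?": ["Python;Excel", "Excel"],
})

# Rename export columns to the dictionary's short variable names.
clean = raw.rename(columns=dict(zip(ddict["raw_column"], ddict["var_name"])))
print(clean.columns.tolist())  # ['satisfaction', 'tools_used']
```

The question_type column then lets downstream notebooks decide per-column handling (Likert recoding vs. multi-select explosion) without hard-coding question text anywhere.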
Analysis (Python, Matplotlib, Jupyter)
I have an "analysis_utils.py" module containing helper functions that are useful across notebooks.
Repo Management
For this project I've tried to level up my repo management. I'm using something similar to Cookiecutter Data Science.
See: https://www.reddit.com/r/datascience/comments/1i9shbm/seeking_advice_on_organizing_a_sprawling_jupyter/ https://cookiecutter-data-science.drivendata.org/#with-pipx-recommended
Likert-Scaled Plotting
The paper below was useful for learning how to design diverging stacked bar charts for Likert-scaled questions.
Heiberger, R., & Robbins, N. (2014). Design of Diverging Stacked Bar Charts for Likert Scales and Other Applications. Journal of Statistical Software, 57(5), 1-32. https://doi.org/10.18637/jss.v057.i05
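A minimal matplotlib sketch of the paper's core idea, using made-up counts (not this project's data): stack the response segments horizontally, but shift each row left so the bar diverges around zero at the midpoint of the neutral category, with disagreement extending left and agreement extending right.

```python
import numpy as np
import matplotlib.pyplot as plt

# Dummy Likert counts (rows: questions; columns: SD, D, N, A, SA).
labels = ["Q1", "Q2"]
cats = ["Strongly disagree", "Disagree", "Neutral",
        "Agree", "Strongly agree"]
counts = np.array([[5, 10, 8, 12, 5],
                   [2, 6, 10, 15, 7]])
pct = counts / counts.sum(axis=1, keepdims=True) * 100

# Left extent per question: all negative responses plus half of
# Neutral, so the bar is centered on the middle of Neutral.
neg = pct[:, 0] + pct[:, 1] + pct[:, 2] / 2

fig, ax = plt.subplots()
colors = ["#ca0020", "#f4a582", "#cccccc", "#92c5de", "#0571b0"]
left = -neg.copy()  # running left edge of the next segment
for j, (cat, color) in enumerate(zip(cats, colors)):
    ax.barh(labels, pct[:, j], left=left, color=color, label=cat)
    left = left + pct[:, j]
ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("Percent of responses")
ax.legend(loc="lower right", fontsize="small")
```

Part 3 builds on this mechanic with the project's real color dictionary and question metadata; the diverging red/blue palette here is a placeholder.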
Interactive Plots
Playing around with interactive plots for the first time in this project (see Part 3 for how I used this with Likert-scaled question plots). Below is a sample.
%matplotlib widget
x = np.linspace(0, 2 * np.pi, 800)
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x))
ax.set_ylim(-1.3, 1.3)
ax.set_title("Live sine wave")
def update(freq=1.0, phase=0.0, amp=1.0):
line.set_ydata(amp * np.sin(freq * x + phase))
fig.canvas.draw_idle() # efficient redraw
interact(
update,
freq=FloatSlider(1.0, min=0.1, max=5.0, step=0.1, description="freq"),
phase=FloatSlider(0.0, min=0.0, max=2 * np.pi, step=0.1, description="phase"),
amp=FloatSlider(1.0, min=0.1, max=2.0, step=0.1, description="amp"),
)