Survey Analysis Pipeline (MS Forms → Clean Data → Plots)

27 Dec 2025

This project documents a pipeline for analyzing survey results end to end: ingesting raw exports, cleaning and restructuring responses, and producing report-ready visualizations (including diverging stacked bar charts for Likert-scale questions). It was inspired by a scenario in which I had to analyze survey results exported from MS Forms. I wanted the workflow to be (1) repeatable, (2) auditable from raw data → final charts, and (3) easy to extend as questions, groups, and response options evolve. Through this project I’ve also tried to mature my project and code structure, and to understand how to “collaborate” with LLMs like ChatGPT (in a sensible, non-vibe-coding sense).

Project highlights:

Note: this notebook uses a dummy scenario + dummy data.

Series navigation: Part 0 (Notes) → Part 1 (Pre-Processing) → Part 2 (Multiple Choice Analysis & Bar Charts) → Part 3 (Likert-Scaled Visualizations)

Part 0 — Notes, References, and Design Decisions

Link to Notebook (HTML)

This notebook is the “project notebook”: a place to capture the setup, references, and visualization ideas that influenced the implementation in later parts. I keep this separate on purpose so the processing/analysis notebooks stay focused and runnable, while this one holds the reasoning and breadcrumbs I’d want if I revisited the work months later.

In particular, it includes:

├── 2025-survey-pipeline-10-03.zip
├── data
│   ├── 0_raw
│   │   ├── msforms_dummy_survey_data_dictionary.csv
│   │   ├── msforms_dummy_survey.xlsx
│   │   └── survey_plot_color_dictionary.csv
│   ├── 1_interim
│   │   ├── 0_survey_raw_clean.csv
│   │   └── 1_survey_data_cleaned.csv
│   └── 2_processed
│       ├── 0_survey_responses_daily.csv
│       ├── 1_survey_time.csv
│       └── 2_survey_multi_vc.csv
├── figures
│   ├── [lots of figures]
├── Makefile
├── notebooks
│   ├── 2025-survey-pipeline-part0.ipynb
│   ├── 2025-survey-pipeline-part1.ipynb
│   ├── 2025-survey-pipeline-part2.ipynb
│   └── 2025-survey-pipeline-part3.ipynb
├── pyproject.toml
├── README.md
├── requirements.txt
├── src
│   └── survey_pipeline
│       ├── analysis_utils.py
│       └── __init__.py
├── tests
│   └── test_utils.py
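
As a rough illustration of how the staged data folders in this layout could be referenced from code, here is a minimal sketch using `pathlib`. The `data_paths` helper and the idea of passing the project root explicitly are assumptions for illustration; the notebooks may resolve paths differently.

```python
from pathlib import Path


def data_paths(project_root: Path) -> dict[str, Path]:
    """Map each pipeline stage to its folder (hypothetical helper, not from the project)."""
    data = project_root / "data"
    return {
        "raw": data / "0_raw",              # untouched exports and dictionaries
        "interim": data / "1_interim",      # cleaned, restructured tables
        "processed": data / "2_processed",  # analysis-ready summaries
        "figures": project_root / "figures",
    }


# Assumption: the code runs from the project root.
paths = data_paths(Path("."))
raw_export = paths["raw"] / "msforms_dummy_survey.xlsx"
cleaned = paths["interim"] / "1_survey_data_cleaned.csv"
```

Keeping the stage folders in one place means a later notebook only needs the stage name, not a hard-coded path.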

Survey notebook pipeline: inputs → outputs

Part 1 — Pre-Processing (Ingest + Clean + Restructure)

Link to Notebook (HTML)

This notebook turns raw MS Forms exports into clean, analysis-ready datasets and runs some lightweight QC summaries. The main idea is traceability: keep the original export intact, apply minimal/explicit transforms, and write out interim artifacts that downstream analysis can rely on.
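
The traceability idea can be sketched with the standard library alone. Everything below is an assumption for illustration: the `normalize_header`, `clean_rows`, and `write_csv` helpers, the specific transforms (trimming whitespace, snake-casing headers, dropping empty rows), and the made-up sample rows. The actual notebook may apply different steps.

```python
import csv
import io
import re


def normalize_header(name: str) -> str:
    """Lower-case a header and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_rows(rows):
    """Return new, cleaned rows; the input rows are never modified."""
    cleaned = []
    for row in rows:
        new_row = {normalize_header(k): (v or "").strip() for k, v in row.items()}
        if any(new_row.values()):  # drop rows that are entirely empty
            cleaned.append(new_row)
    return cleaned


def write_csv(rows, handle):
    """Write an interim artifact; the handle could be an open file under data/1_interim."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Made-up rows standing in for a raw export (illustrative only).
raw_rows = [
    {"ID": "1", "Start time": " 2025-10-01 09:00 ", "What is your role?": "Analyst "},
    {"ID": "", "Start time": "", "What is your role?": ""},
    {"ID": "2", "Start time": "2025-10-02 14:30", "What is your role?": "Engineer"},
]
interim_rows = clean_rows(raw_rows)

buffer = io.StringIO()  # stands in for an interim CSV file
write_csv(interim_rows, buffer)
```

Because `clean_rows` builds new dictionaries, the raw rows stay available for comparison, which is what makes each transform auditable.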

Overview:

Inputs:

Primary outputs:

Part 2 — Multiple Choice Questions (Value Counts + Bar Charts)

Link to Notebook (HTML)

This notebook focuses on categorical questions, especially multiple choice / multi-select responses. It generates value counts and bar charts that make it easy to compare patterns across groups (e.g., org, role, experience level).
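
A minimal sketch of value counts for a multi-select question, overall and per group, is shown here. It assumes multi-select answers arrive as one semicolon-delimited string per respondent; the `split_multi` and `multi_value_counts` helpers, the column names, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def split_multi(answer: str, sep: str = ";") -> list[str]:
    """Split one multi-select cell into its selected options, ignoring empty pieces."""
    return [part.strip() for part in answer.split(sep) if part.strip()]


def multi_value_counts(rows, question, group_by=None):
    """Count selected options for `question`, optionally split by a grouping column."""
    counts = defaultdict(Counter)
    for row in rows:
        key = row[group_by] if group_by else "all"
        counts[key].update(split_multi(row[question]))
    return {group: dict(counter) for group, counter in counts.items()}


# Made-up responses (illustrative only).
rows = [
    {"role": "Analyst", "tools": "Python;R;"},
    {"role": "Analyst", "tools": "Python;"},
    {"role": "Engineer", "tools": "Python;SQL;"},
]
overall = multi_value_counts(rows, "tools")
by_role = multi_value_counts(rows, "tools", group_by="role")
```

Note that multi-select counts sum to more than the number of respondents, so percentages for a bar chart are usually taken against respondents rather than against total selections.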

Overview:

Inputs:

Primary outputs:

Example Plots

Below are example bar charts for questions 2 (expertise) and 6 (tool usage). The unique color for each response is defined in the color dictionary (survey_plot_color_dictionary.csv).

[Figures: example bar charts for questions 2 (expertise) and 6 (tool usage)]

The plots above reflect all responses. Plots are also generated for various groupings; the one below reflects only respondents with bioinformatics expertise.

[Figure: bar chart restricted to respondents with bioinformatics expertise]
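
To illustrate how a color dictionary can keep response colors stable across plots, here is a small sketch. The column names (`question`, `response`, `color`), the hex values, the `load_color_map` and `colors_for` helpers, and the gray fallback are all assumptions; the real survey_plot_color_dictionary.csv may be laid out differently.

```python
import csv
import io

DEFAULT_COLOR = "#999999"  # assumed fallback for responses missing from the dictionary


def load_color_map(handle, question):
    """Build a response -> color mapping for a single question."""
    reader = csv.DictReader(handle)
    return {row["response"]: row["color"] for row in reader if row["question"] == question}


def colors_for(responses, color_map):
    """Return one color per response, in plotting order."""
    return [color_map.get(response, DEFAULT_COLOR) for response in responses]


# Made-up dictionary contents (illustrative only).
sample = io.StringIO(
    "question,response,color\n"
    "q6,Python,#1f77b4\n"
    "q6,R,#ff7f0e\n"
    "q2,Bioinformatics,#2ca02c\n"
)
q6_colors = load_color_map(sample, "q6")
bar_colors = colors_for(["Python", "R", "SQL"], q6_colors)
```

Looking colors up by response label rather than by bar position keeps a response the same color when a subgroup plot drops or reorders categories.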

Part 3 — Likert-Scaled Questions (Diverging Stacked Bars)

Link to Notebook (HTML)

This notebook produces the “headline” Likert visuals: diverging stacked bar charts that show the balance of agreement vs disagreement at a glance and make it easy to compare groups. It also captures how to set up an interactive plot that allows tuning figure layout, ordering, and labeling before finalization.
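
The geometry behind a diverging stacked bar can be sketched without any plotting library. The version below assumes a five-point scale whose neutral category straddles zero, which is one common convention and not necessarily the one used in the notebook. The `diverging_segments` helper, the level labels, and the sample counts are hypothetical.

```python
LIKERT_ORDER = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]


def diverging_segments(counts, order=LIKERT_ORDER):
    """Return (level, left, width) tuples in percent, with the middle level centered on 0.

    Assumes an odd number of levels so that a single middle category exists.
    """
    total = sum(counts.get(level, 0) for level in order)
    if total == 0:
        return []
    pct = [100.0 * counts.get(level, 0) / total for level in order]
    middle = len(order) // 2
    # Start far enough left that half of the neutral segment sits on each side of zero.
    left = -(sum(pct[:middle]) + pct[middle] / 2.0)
    segments = []
    for level, width in zip(order, pct):
        segments.append((level, left, width))
        left += width
    return segments


# Made-up counts for one question (illustrative only).
counts = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 2, "Agree": 3, "Strongly agree": 2}
segments = diverging_segments(counts)
```

Each `(left, width)` pair can then be drawn as one horizontal bar segment, for example with matplotlib's `Axes.barh(y, width, left=left)`, using one call per Likert level so each level gets its own color.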

Overview:

Inputs:

Primary outputs:

End of this pipeline. From here, figures and processed tables are ready for reporting or publishing.

Example Plots

Below is the plot for all Likert-scaled responses. Questions are grouped by category.

[Figure: plot of all Likert-scaled responses, with questions grouped by category]

Plots are also generated by question. The figure below focuses only on question 7.

[Figure: Likert-scale plot for question 7]