Survey Analysis Pipeline (MS Forms → Clean Data → Plots)

27 Dec 2025

This project is to document a pipeline for analyzing survey results end-to-end: ingesting raw exports, cleaning and restructuring responses, and producing report-ready visualizations (including diverging stacked bar charts for Likert-scale questions). This was inspired by a scenario where I had to analyze survey results exported from MS Forms. I wanted the workflow to be (1) repeatable, (2) auditable from raw data → final charts, and (3) easy to extend as questions, groups, and response options evolve. Through this project I’ve also tried to mature my project and code structure, as well as understand how to “collaborate” with LLMs like ChatGPT (in a sensible, non-vibe-coding sense).

Project highlights:

Data analysis for survey data (multi-select parsing, metadata, tidy formats)
Visualization design choices for categorical and Likert-scale responses
Reproducible project structure (clean separation of raw/, interim/, processed/, and figures/)
“Human-readable analysis”: notebooks as documentation, with helper functions to keep code maintainable

Note: this notebook uses a dummy scenario + dummy data.

Series navigation: Part 0 (Notes) → Part 1 (Pre-Processing) → Part 2 (Multiple Choice Analysis & Bar Charts) → Part 3 (Likert-Scaled Visualizations)

Part 0 — Notes, References, and Design Decisions

Link to Notebook (HTML)

This notebook is the “project notebook”: a place to capture the setup, references, and visualization ideas that influenced the implementation in later parts. I keep this separate on purpose so the processing/analysis notebooks stay focused and runnable, while this one holds the reasoning and breadcrumbs I’d want if I revisited the work months later.

In particular, it includes:

Repro / environment notes (versions, repo structure, and how to run the notebooks)
Visualization references for Likert-scale chart design (diverging stacked bars)
A small interactive plotting scratchpad (used to sanity-check interactivity and layout mechanics before applying them to survey charts)

├── 2025-survey-pipeline-10-03.zip
├── data
│   ├── 0_raw
│   │   ├── msforms_dummy_survey_data_dictionary.csv
│   │   ├── msforms_dummy_survey.xlsx
│   │   └── survey_plot_color_dictionary.csv
│   ├── 1_interim
│   │   ├── 0_survey_raw_clean.csv
│   │   └── 1_survey_data_cleaned.csv
│   └── 2_processed
│       ├── 0_survey_responses_daily.csv
│       ├── 1_survey_time.csv
│       └── 2_survey_multi_vc.csv
├── figures
│   ├── \[lots of figures\]
├── Makefile
├── notebooks
│   ├── 2025-survey-pipeline-part0.ipynb
│   ├── 2025-survey-pipeline-part1.ipynb
│   ├── 2025-survey-pipeline-part2.ipynb
│   └── 2025-survey-pipeline-part3.ipynb
├── pyproject.toml
├── README.md
├── requirements.txt
├── src
│   └── survey_pipeline
│       ├── analysis_utils.py
│       └── __init__.py
├── tests
│   └── test_utils.py

Survey notebook pipeline: inputs → outputs

Part 1 — Pre-Processing (Ingest + Clean + Restructure)

Link to Notebook (HTML)

This notebook turns raw MS Forms exports into clean, analysis-ready datasets and runs some lightweight QC summaries. The main idea is traceability: keep the original export intact, apply minimal/explicit transforms, and write out interim artifacts that downstream analysis can rely on.

Overview:

Load raw survey exports and the data dictionary (question metadata, response mappings, etc.)
Normalize/clean a few common survey quirks (e.g., multi-select delimiter cleanup)
Produce basic survey metadata plots (response timing, completion time) for quick QA
Merge responses + metadata into a single cleaned dataset for analysis

Inputs

Survey Data Export (raw_xlsx): Mimics an MS Forms export (.xlsx) from a survey with mixed response types, including Likert-scaled questions.
Data dictionary (ddict): Maps questions and responses to additional metadata
Colors dictionary (colors): Maps question responses to colors for plotting

Primary outputs:

Main cleaned dataset used downstream (interim/1_survey_data_cleaned.csv)
Timing metadata (processed/0_survey_responses_daily.csv and processed/1_survey_time.csv)
Figures saved under figures/ (e.g., responses-per-day, completion-time)

Part 2 — Multiple Choice Questions (Value Counts + Bar Charts)

Link to Notebook (HTML)

This notebook focuses on categorical questions, especially multiple choice / multi-select responses. It generates results and bar charts that should make it easy to compare patterns across groups (e.g., org, role, experience level).

Overview:

Load the cleaned survey dataset from Part 1
Identify and normalize multi-select response fields (split, trim, standardize)
Build value-count tables (overall + grouped) in a consistent long-form structure
Generate bar charts / stacked bar charts for quick comparison across groups
Write processed tables that downstream notebooks can reuse (especially Likert plotting)

Inputs:

Cleaned survey dataset (interim/1_survey_data_cleaned.csv): Output from Part 1
Data dictionary (ddict): Question metadata (types, labels, ordering, groupings)
Colors dictionary (colors): Response → color mapping used for consistent plotting

Primary outputs:

Multiple choice value-count table (processed/2_survey_multi_vc.csv)
Figures saved under figures/ (bar charts / stacked bars by question and group)

Example Plots

Below are example bar charts for questions 2 (expertise) and 6 (tool usage). The unique colors for each response are defined in the colors dictionary.

Survey notebook pipeline: inputs → outputs

The above plots reflect all responses. Plots are also generated for various groupings. Below is a plot that reflect only those with bioinformatics expertise.

Survey notebook pipeline: inputs → outputs

Part 3 — Likert-Scaled Questions (Diverging Stacked Bars)

Link to Notebook (HTML)

This notebook produces the “headline” Likert visuals: diverging stacked bar charts that show the balance of agreement vs disagreement at a glance, and make it easy to compare groups. This notebook is also captures how to set up an interactive plot which allows for tuning figure layout, ordering, and labeling before finalization.

Overview:

Load processed response counts from Part 2
Filter/select the Likert-scaled question set and enforce consistent response ordering
Convert counts → percentages (and optionally include sample sizes for context)
Generate diverging stacked bar charts:
- “All questions” views for scanning patterns across a category
- Per-question plots for reporting, captions, and embedding on the website
Save out final figures in a consistent naming scheme for reuse

Inputs:

Multiple choice / Likert value-count table (processed/2_survey_multi_vc.csv): Output from Part 2
Data dictionary (ddict): Defines which questions are Likert, response ordering, labels, and any grouping metadata
Colors dictionary (colors): Response → color mapping for consistent Likert segment colors

Primary outputs:

Likert figures saved under figures/ (all-question dashboards + per-question plots)
(Optional) additional processed summaries for reuse (e.g., percent tables / grouped Likert tables), depending on what you choose to persist

End of this pipeline. From here, figures and processed tables are ready for reporting or publishing.

Example Plots

Below is the plot for all likert-scaled respones. Quesitons are grouped according to a category.

Survey notebook pipeline: inputs → outputs

Plots are also generated by question. The figure below focuses only on question 7.

Survey notebook pipeline: inputs → outputs