import survey_pipeline.analysis_utils as utils
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as pltcolors
import numpy as np
from textwrap import wrap
import os
from mpl_toolkits.axes_grid1 import Divider, Size
import ipywidgets as w
from ipywidgets import interact, FloatSlider, Checkbox, fixed
print("Python version:", sys.version.split()[0])
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("matplotlib version:", plt.matplotlib.__version__)
%load_ext autoreload
%autoreload 2
Python version: 3.12.3
pandas version: 2.3.3
numpy version: 2.3.4
matplotlib version: 3.10.7
Survey Analysis Pipeline (MS Forms → Clean Data → Plots)
Date: Oct 2025
This project documents a pipeline for analyzing survey results end-to-end: ingesting raw exports, cleaning and restructuring responses, and producing report-ready visualizations (including diverging stacked bar charts for Likert-scale questions). It was inspired by a scenario in which I had to analyze survey results exported from MS Forms. I wanted the workflow to be (1) repeatable, (2) auditable from raw data → final charts, and (3) easy to extend as questions, groups, and response options evolve. Through this project I have also tried to mature my project and code structure, and to understand how to "collaborate" with LLMs like ChatGPT (in a sensible, non-vibe-coding sense).
Project highlights:
- Practical data engineering for messy human-entered survey data (multi-select parsing, metadata, tidy formats)
- Visualization design choices for categorical and Likert-scale responses
- Reproducible project structure (clean separation of raw/, interim/, processed/, and figures/)
- “Human-readable analysis”: notebooks as documentation, with helper functions to keep code maintainable
Note: this notebook uses a dummy scenario + dummy data.
Series navigation: Part 0 (Notes) → Part 1 (Pre-Processing) → Part 2 (Multiple Choice Analysis & Bar Charts) → Part 3 (Likert-Scaled Visualizations)
Part 3: Likert-Scaled Questions (Diverging Stacked Bars)
This notebook produces the “headline” Likert visuals: diverging stacked bar charts that show the balance of agreement vs. disagreement at a glance and make it easy to compare groups. It also shows how to set up an interactive plot for tuning figure layout, ordering, and labeling before finalization.
Overview:
- Load processed response counts from Part 2
- Filter/select the Likert-scaled question set and enforce consistent response ordering
- Convert counts → percentages (and optionally include sample sizes for context)
- Generate diverging stacked bar charts:
  - “All questions” views for scanning patterns across a category
  - Per-question plots for reporting, captions, and embedding on the website
- Save out final figures in a consistent naming scheme for reuse
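The ordering and percentage steps in the overview can be sketched roughly as follows. This is a minimal illustration, not the code inside `analysis_utils`: the `question_index` and `group-value` columns come from this notebook, while the `response` and `count` column names, the five scale labels, and the `to_percentages` helper are assumptions made for the example.

```python
import pandas as pd

# Assumed five-point scale, ordered from most negative to most positive.
LIKERT_ORDER = [
    "Strongly disagree",
    "Disagree",
    "Neutral",
    "Agree",
    "Strongly agree",
]

# Toy value-count table; `response` and `count` are hypothetical column names.
toy = pd.DataFrame(
    {
        "question_index": ["Q7"] * 5,
        "group-value": ["all"] * 5,
        "response": ["Agree", "Neutral", "Strongly agree", "Disagree", "Strongly disagree"],
        "count": [8, 4, 5, 2, 1],
    }
)


def to_percentages(vc: pd.DataFrame) -> pd.DataFrame:
    """Add sample size `n` and `percent` per (question, group), in scale order."""
    out = vc.copy()
    # An ordered categorical makes sorting follow the scale, not the alphabet.
    out["response"] = pd.Categorical(out["response"], categories=LIKERT_ORDER, ordered=True)
    keys = ["question_index", "group-value"]
    # Sample size per question/group, kept alongside the percentages for context.
    out["n"] = out.groupby(keys)["count"].transform("sum")
    out["percent"] = 100 * out["count"] / out["n"]
    return out.sort_values(keys + ["response"]).reset_index(drop=True)


pct = to_percentages(toy)
print(pct[["response", "count", "n", "percent"]])
```

Keeping `n` next to `percent` lets bar labels show both values, which is presumably what `barlabel_type="both"` does in the plotting helpers further down.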
Inputs:
- Multiple choice / Likert value-count table (processed/2_survey_multi_vc.csv): Output from Part 2
- Data dictionary (ddict): Defines which questions are Likert, response ordering, labels, and any grouping metadata
- Colors dictionary (colors): Response → color mapping for consistent Likert segment colors
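To show how a response ordering and a response → color mapping feed a diverging stacked bar, here is a small standalone sketch. It is not the implementation behind `group_likert_plot_v1`. The scale labels, hex colors, percentages, and the `diverging_lefts` helper are all made up for illustration, and it assumes an odd number of response options so that the middle one can straddle zero.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumed response order (negative -> positive) and an illustrative color mapping.
order = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
colors = {
    "Strongly disagree": "#ca0020",
    "Disagree": "#f4a582",
    "Neutral": "#d9d9d9",
    "Agree": "#92c5de",
    "Strongly agree": "#0571b0",
}

# Made-up percentages per question, listed in the same order as `order`.
percents = {
    "Q7": [5, 10, 20, 40, 25],
    "Q8": [20, 30, 10, 30, 10],
}


def diverging_lefts(row):
    """Left edge of each segment so the middle (neutral) segment is centered on zero."""
    row = np.asarray(row, dtype=float)
    mid = len(row) // 2  # index of the neutral option (odd-length scale assumed)
    start = -(row[:mid].sum() + row[mid] / 2)
    return start + np.concatenate(([0.0], np.cumsum(row)[:-1]))


fig, ax = plt.subplots(figsize=(8, 3))
for y, (question, row) in enumerate(percents.items()):
    lefts = diverging_lefts(row)
    for response, width, left in zip(order, row, lefts):
        # Label only the first row so each response appears once in the legend.
        ax.barh(y, width, left=left, color=colors[response],
                label=response if y == 0 else "_nolegend_")
ax.axvline(0, color="black", linewidth=0.8)
ax.set_yticks(range(len(percents)), labels=list(percents))
ax.set_xlim(-75, 75)  # symmetric limits chosen for this toy data
ax.set_xlabel("Percent of responses")
ax.legend(ncol=5, loc="lower center", bbox_to_anchor=(0.5, 1.0), frameon=False)
```

Centering the neutral segment on zero is one common convention. Another is to plot the neutral responses in a separate column, so that the zero line separates disagreement from agreement exactly.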
Primary outputs:
- Likert figures saved under figures/ (all-question dashboards + per-question plots)
- (Optional) additional processed summaries for reuse (e.g., percent tables / grouped Likert tables), depending on what you choose to persist
Next: end of this pipeline. From here, figures and processed tables are ready for reporting or publishing.
Input Data
paths = utils.get_paths()
df = pd.read_csv(paths.processed / "2_survey_multi_vc.csv")
likerts_qs = ["Q" + str(x) for x in range(7, 19)]
df_likert_vc = df.loc[df["question_index"].isin(likerts_qs)].copy()
df_likert_vc["question_number"] = df_likert_vc["question_index"].apply(lambda x: int(x[1:]))
df_likert_vc = df_likert_vc.sort_values(by=["question_number"], ascending=True)
print("Check value counts per question number:")
display(df_likert_vc["question_number"].value_counts())
Check value counts per question number:
question_number
7     175
8     175
9     175
10    175
11    175
12    175
13    175
14    175
15    175
16    175
17    175
18    175
Name: count, dtype: int64
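The visual check above (every question has the same number of rows) could also be enforced in code, so that a malformed export fails early. This is a small sketch on toy data; `check_balanced` is a hypothetical helper, not part of `analysis_utils`.

```python
import pandas as pd

# Toy stand-in for df_likert_vc: 2 questions x 3 rows each.
toy = pd.DataFrame({"question_number": [7, 7, 7, 8, 8, 8]})


def check_balanced(df: pd.DataFrame, column: str = "question_number") -> int:
    """Return the shared row count per question, or raise if questions differ."""
    counts = df[column].value_counts()
    if counts.nunique() != 1:
        raise ValueError(f"Unequal row counts per question:\n{counts.sort_index()}")
    return int(counts.iloc[0])


rows_per_question = check_balanced(toy)
print(rows_per_question)
```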
Likert-Scaled Question Plotting
Questions Grouped by Organization, Role, etc.
%matplotlib widget
utils._plot_v1_interact(
    df_likert_vc,
    fig_width=13.0,
    fig_height=6.0,
    wratio_1=0.05,
    wratio_2=0.15,
    wratio_3=0.8,
    xlim=75,
    title_height=1.15,
    legend_height=1.15,
)
interactive(children=(Dropdown(description='group-by', options=('all', 'expertise', 'organization', 'role'), v…
We'll loop through all the groups to automatically create plots. Plots produced in bulk like this may not be as nicely formatted or readable as ones tuned by hand, but they are helpful for exploratory analysis. The interactive plot above can be used to manually inspect and adjust any particular plot.
%matplotlib inline
plt.ioff()
grouped = df_likert_vc.groupby(by=["group-by", "group-value"], sort=False)
# One figure per (group-by, group-value) pair, covering all Likert questions.
# The loop variable is named df_group so it does not shadow the loaded `df`.
for (groupby, groupval), df_group in grouped:
    utils.group_likert_plot_v1(
        df_group,
        title="Survey Likert-Scaled Questions - " + groupval,
        group=groupby,
        fig_width=14,
        fig_height=8,
        xlim=75,
        barlabel_type="both",
        save=True,
    )
Plot Per Question
%matplotlib widget
# Interactive v2: pick question + group-by, tweak layout
utils._plot_v2_interact(
    df_likert_vc,
    fig_width=10.0,
    fig_height=8.0,
    wratio_1=0.05,
    wratio_2=0.25,
    wratio_3=0.75,
    xlim=75,
    title_height=1.15,
    legend_height=1.15,
)
interactive(children=(Dropdown(description='question', layout=Layout(width='70%'), options=('Q7: The documenta…
%matplotlib inline
plt.ioff()
df_likert_vc_dropall = df_likert_vc.loc[df_likert_vc['group-by'] != 'all']
grouped = df_likert_vc_dropall.groupby(by=["group-by", "question_index", "question"], sort=False)
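# One figure per (group-by, question) pair, comparing group values within a question.
# The loop variable is named df_group so it does not shadow the loaded `df`.
for (groupby, question_index, question), df_group in grouped:
    utils.group_likert_plot_v2(
        df_group,
        title=f"{question_index} {question}",
        group=groupby,
        fig_width=14,
        fig_height=8,
        xlim=75,
        barlabel_type="both",
        save=True,
    )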
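The loops above rely on `save=True` to write figures using whatever naming scheme the helpers implement, which this notebook does not show. As one possible illustration of a consistent scheme, the sketch below builds file names from the group-by key and question index. The `figure_filename` helper, the `likert_` prefix, and the `figures` folder path are all assumptions for the example.

```python
import re
from pathlib import Path


def figure_filename(group_by: str, question_index: str, ext: str = "png") -> str:
    """Build a lowercase, filesystem-safe file name such as 'likert_role_q7.png'."""
    parts = ["likert", group_by, question_index]
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = "_".join(re.sub(r"[^a-z0-9]+", "-", p.lower()).strip("-") for p in parts)
    return f"{slug}.{ext}"


figures_dir = Path("figures")  # assumed output folder
print(figures_dir / figure_filename("role", "Q7"))
```

Deriving names from the same keys used in `groupby` means rerunning the notebook overwrites earlier figures instead of accumulating near-duplicates.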