In [1]:
import survey_pipeline.analysis_utils as utils
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as pltcolors
import numpy as np
from textwrap import wrap
import os

from mpl_toolkits.axes_grid1 import Divider, Size
import ipywidgets as w
from ipywidgets import interact, FloatSlider, Checkbox, fixed

print("Python version:", sys.version.split()[0])
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("matplotlib version:", plt.matplotlib.__version__)

%load_ext autoreload
%autoreload 2
Python version: 3.12.3
pandas version: 2.3.3
numpy version: 2.3.4
matplotlib version: 3.10.7

Survey Analysis Pipeline (MS Forms → Clean Data → Plots)

Date: Oct 2025

This project documents an end-to-end pipeline for analyzing survey results: ingesting raw exports, cleaning and restructuring responses, and producing report-ready visualizations (including diverging stacked bar charts for Likert-scale questions). It was inspired by a scenario in which I had to analyze survey results exported from MS Forms. I wanted the workflow to be (1) repeatable, (2) auditable from raw data → final charts, and (3) easy to extend as questions, groups, and response options evolve. Through this project I've also tried to mature my project and code structure, and to understand how to "collaborate" with LLMs like ChatGPT (in a sensible, non-vibe-coding sense).

Project highlights:

  • Practical data engineering for messy human-entered survey data (multi-select parsing, metadata, tidy formats)
  • Visualization design choices for categorical and Likert-scale responses
  • Reproducible project structure (clean separation of raw/, interim/, processed/, and figures/)
  • “Human-readable analysis”: notebooks as documentation, with helper functions to keep code maintainable

Note: this notebook uses a dummy scenario + dummy data.

Series navigation: Part 0 (Notes) → Part 1 (Pre-Processing) → Part 2 (Multiple Choice Analysis & Bar Charts) → Part 3 (Likert-Scaled Visualizations)

Part 3: Likert-Scaled Questions (Diverging Stacked Bars)

This notebook produces the “headline” Likert visuals: diverging stacked bar charts that show the balance of agreement vs disagreement at a glance and make it easy to compare groups. It also shows how to set up an interactive plot for tuning figure layout, ordering, and labeling before finalization.

Overview:

  • Load processed response counts from Part 2
  • Filter/select the Likert-scaled question set and enforce consistent response ordering
  • Convert counts → percentages (and optionally include sample sizes for context)
  • Generate diverging stacked bar charts:
    • “All questions” views for scanning patterns across a category
    • Per-question plots for reporting, captions, and embedding on the website
  • Save out final figures in a consistent naming scheme for reuse

Inputs:

  • Multiple choice / Likert value-count table (processed/2_survey_multi_vc.csv): Output from Part 2
  • Data dictionary (ddict): Defines which questions are Likert, response ordering, labels, and any grouping metadata
  • Colors dictionary (colors): Response → color mapping for consistent Likert segment colors

Primary outputs:

  • Likert figures saved under figures/ (all-question dashboards + per-question plots)
  • (Optional) additional processed summaries for reuse (e.g., percent tables / grouped Likert tables), depending on what you choose to persist

Next: end of this pipeline. From here, figures and processed tables are ready for reporting or publishing.

Input Data

In [2]:
paths = utils.get_paths()
df = pd.read_csv(paths.processed / "2_survey_multi_vc.csv")

# Likert-scaled questions are Q7 through Q18
likert_qs = [f"Q{x}" for x in range(7, 19)]
df_likert_vc = df.loc[df["question_index"].isin(likert_qs)].copy()

# Numeric question number so that Q10 sorts after Q9 rather than after Q1
df_likert_vc["question_number"] = df_likert_vc["question_index"].str[1:].astype(int)
df_likert_vc = df_likert_vc.sort_values(by="question_number")
print("Check value counts per question number:")
display(df_likert_vc["question_number"].value_counts())
Check value counts per question number:
question_number
7     175
8     175
9     175
10    175
11    175
12    175
13    175
14    175
15    175
16    175
17    175
18    175
Name: count, dtype: int64
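
The overview mentions enforcing a consistent response order and converting counts to percentages. The columns that hold the response label and its count are not shown in this notebook, so the sketch below assumes hypothetical `response` and `count` columns and a hypothetical five-point label order, and runs on a small made-up table rather than the real data. In the actual pipeline the order would presumably come from the data dictionary.

```python
import pandas as pd

# Hypothetical five-point scale (assumption; not taken from the data dictionary).
LIKERT_ORDER = [
    "Strongly disagree",
    "Disagree",
    "Neutral",
    "Agree",
    "Strongly agree",
]


def to_percent_table(vc: pd.DataFrame) -> pd.DataFrame:
    """Return one row per question and one column per response, in percent."""
    counts = (
        vc.groupby(["question_index", "response"])["count"]
        .sum()
        .unstack("response", fill_value=0)
        # Fix the column order and add any response nobody selected as 0.
        .reindex(columns=LIKERT_ORDER, fill_value=0)
    )
    # Divide each row by its own total so every question sums to 100.
    return counts.div(counts.sum(axis=1), axis=0) * 100


# Made-up counts, using the assumed column names.
demo = pd.DataFrame(
    {
        "question_index": ["Q7", "Q7", "Q7", "Q8", "Q8"],
        "response": ["Agree", "Disagree", "Neutral", "Strongly agree", "Agree"],
        "count": [5, 3, 2, 6, 4],
    }
)
pct = to_percent_table(demo)
print(pct.round(1))
```

Reindexing the columns against a fixed list is what keeps segment order identical across questions, even when a response option was never chosen for one of them.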

Likert-Scaled Question Plotting

Questions Grouped by Organization, Role, etc.

Plots with All Questions

All Responses

First, we'll plot all Likert-scaled responses. We'll make the plot interactive so we can quickly adjust the variables and tune the plot for the data.

In [3]:
%matplotlib widget

utils._plot_v1_interact(
    df_likert_vc,
    fig_width=13.0,
    fig_height=6.0,
    wratio_1=0.05,
    wratio_2=0.15,
    wratio_3=0.8,
    xlim=75,
    title_height=1.15,
    legend_height=1.15,
)
interactive(children=(Dropdown(description='group-by', options=('all', 'expertise', 'organization', 'role'), v…
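
The plotting helpers in `utils` are not shown in this notebook. As a rough sketch of how a diverging stacked bar can be laid out (an assumption about the general technique, not the helper's actual implementation), each row's stack can start far enough left of zero that the midpoint of the neutral segment lands on the zero line. The percentages, labels, and colors below are made up, and the symmetric axis limits only mirror what `xlim=75` appears to mean.

```python
import numpy as np
import matplotlib.pyplot as plt


def diverging_lefts(pct) -> np.ndarray:
    """Left edge of every segment for rows ordered [SD, D, N, A, SA]."""
    pct = np.asarray(pct, dtype=float)
    # Width left of zero: both negative responses plus half of the neutral one.
    start = -(pct[:, 0] + pct[:, 1] + pct[:, 2] / 2)
    # Each segment begins where the previous one ended.
    widths_before = np.cumsum(pct, axis=1) - pct
    return start[:, None] + widths_before


# Made-up percentages for two questions (each row sums to 100).
questions = ["Q7", "Q8"]
labels = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
colors = ["#ca0020", "#f4a582", "#d9d9d9", "#92c5de", "#0571b0"]
pct = np.array([[10, 20, 20, 30, 20], [5, 15, 40, 25, 15]], dtype=float)
lefts = diverging_lefts(pct)

fig, ax = plt.subplots(figsize=(8, 2.5))
for j, (label, color) in enumerate(zip(labels, colors)):
    ax.barh(questions, pct[:, j], left=lefts[:, j], color=color, label=label)
ax.axvline(0, color="black", linewidth=0.8)  # visual split between the two sides
ax.set_xlim(-75, 75)  # assumed meaning of xlim=75: symmetric limits in percent
ax.set_xlabel("Percent of responses")
ax.legend(ncol=5, loc="lower center", bbox_to_anchor=(0.5, 1.0), fontsize="small")
```

Centering on the neutral midpoint is one common convention; another is to drop the neutral segment from the bar and report it separately, which changes only how `start` is computed.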

We'll loop through all the groups to automatically create plots. Plots generated in bulk like this may not be as nicely formatted or readable as ones produced manually, but they are helpful for exploratory analysis. The interactive plot above can be used to manually inspect and adjust any particular plot.

In [4]:
%matplotlib inline
plt.ioff()
grouped = df_likert_vc.groupby(by=["group-by", "group-value"], sort=False)

# Use a separate loop variable so the full `df` loaded earlier is not overwritten
for (groupby, groupval), df_group in grouped:
    utils.group_likert_plot_v1(
        df_group,
        title=f"Survey Likert-Scaled Questions - {groupval}",
        group=groupby,
        fig_width=14,
        fig_height=8,
        xlim=75,
        barlabel_type="both",
        save=True,
    )

Plot Per Question

In [5]:
%matplotlib widget
# Interactive v2: pick question + group-by, tweak layout
utils._plot_v2_interact(
    df_likert_vc,
    fig_width=10.0,
    fig_height=8.0,
    wratio_1=0.05,
    wratio_2=0.25,
    wratio_3=0.75,
    xlim=75,
    title_height=1.15,
    legend_height=1.15,
)
interactive(children=(Dropdown(description='question', layout=Layout(width='70%'), options=('Q7: The documenta…
In [6]:
%matplotlib inline
plt.ioff()
# Per-question plots compare groups, so skip the ungrouped "all" rows
df_likert_vc_dropall = df_likert_vc.loc[df_likert_vc["group-by"] != "all"]
grouped = df_likert_vc_dropall.groupby(by=["group-by", "question_index", "question"], sort=False)

for (groupby, question_index, question), df_group in grouped:
    utils.group_likert_plot_v2(
        df_group,
        title=f"{question_index} {question}",
        group=groupby,
        fig_width=14,
        fig_height=8,
        xlim=75,
        barlabel_type="both",
        save=True,
    )
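
Both loops pass `save=True`, but how the helpers name the saved files is not shown in this notebook. One way to keep names consistent is sketched below, assuming a hypothetical `likert_<question>_<group-by>.png` pattern. The `slugify` and `figure_path` helpers are illustrative and are not part of `utils`.

```python
import re
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase text and collapse runs of non-alphanumerics into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def figure_path(figures_dir: Path, question_index: str, groupby: str, ext: str = "png") -> Path:
    """Build a predictable file path for one Likert figure (hypothetical pattern)."""
    parts = ["likert", slugify(question_index), slugify(groupby)]
    return figures_dir / ("_".join(parts) + "." + ext)


example = figure_path(Path("figures"), "Q7", "Organization")
print(example.as_posix())  # figures/likert_q7_organization.png
```

Deriving the name from the same keys used in the `groupby` call means rerunning a loop overwrites the matching figure instead of creating a near-duplicate.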