InĀ [1]:
import sys
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import survey_pipeline.analysis_utils as utils

print("Python version:", sys.version.split()[0])
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("matplotlib version:", plt.matplotlib.__version__)

%load_ext autoreload
%autoreload 2
Python version: 3.12.3
pandas version: 2.3.3
numpy version: 2.3.4
matplotlib version: 3.10.7

Survey Analysis Pipeline (MS Forms → Clean Data → Plots)¶

Date: Oct 2025

This project is to document a pipeline for analyzing survey results end-to-end: ingesting raw exports, cleaning and restructuring responses, and producing report-ready visualizations (including diverging stacked bar charts for Likert-scale questions). This was inspired by a scenario where I had to analyze survey results exported from MS Forms. I wanted the workflow to be (1) repeatable, (2) auditable from raw data → final charts, and (3) easy to extend as questions, groups, and response options evolve. Through this project I've also tried to mature my project and code structure, as well as understanding how to "collaborate" with LLMs like ChatGPT (in a sensible, non-vibe-coding sense).

Project highlights:

  • Practical data engineering for messy human-entered survey data (multi-select parsing, metadata, tidy formats)
  • Visualization design choices for categorical and Likert-scale responses
  • Reproducible project structure (clean separation of raw/, interim/, processed/, and figures/)
  • ā€œHuman-readable analysisā€: notebooks as documentation, with helper functions to keep code maintainable

Note: this notebook uses a dummy scenario + dummy data.

Series navigation: Part 0 (Notes) → Part 1 (Pre-Processing) → Part 2 (Multiple Choice Analysis) → Part 3 (Likert-Scaled Visualizations)

Part 2 — Multiple Choice Questions (Value Counts + Bar Charts)¶

This notebook focuses on categorical questions, especially multiple choice / multi-select responses. It generates results and bar charts that should make it easy to compare patterns across groups (e.g., org, role, experience level).

Overview:

  • Load the cleaned survey dataset from Part 1
  • Identify and normalize multi-select response fields (split, trim, standardize)
  • Build value-count tables (overall + grouped) in a consistent long-form structure
  • Generate bar charts / stacked bar charts for quick comparison across groups
  • Write processed tables that downstream notebooks can reuse (especially Likert plotting)

Inputs:

  • Cleaned survey dataset (interim/1_survey_data_cleaned.csv): Output from Part 1
  • Data dictionary (ddict): Question metadata (types, labels, ordering, groupings)
  • Colors dictionary (colors): Response → color mapping used for consistent plotting

Primary outputs:

  • Multiple choice value-count table (processed/2_survey_multi_vc.csv)
  • Figures saved under figures/ (bar charts / stacked bars by question and group)

Next: Part 3 — Likert-Scaled Questions and Diverging Stacked Bar Charts

Data Input¶

InĀ [2]:
paths = utils.get_paths()
df_clrs = pd.read_csv(paths.colors)
df_ddict = pd.read_csv(paths.ddict)
df = pd.read_csv(paths.interim / "1_survey_data_cleaned.csv")
multi_options = utils.get_multi_options(paths.ddict)
df_multi = utils.get_multi_responses(df)
display(df_multi.reset_index()["question"].value_counts())
question
Which tools do you use regularly? (Select all that apply)              385
What is your organization?                                             150
What is your area of expertise?                                        150
How many years of experience do you have in your area of expertise?    150
What is your current role level?                                       150
What is your geographic region?                                        150
The documentation is clear and sufficient.                             150
The tool meets my day-to-day needs.                                    150
The UI is intuitive and easy to learn.                                 150
Customer support is responsive.                                        150
The feature set covers my key workflows.                               150
Integration with my existing tools is seamless.                        150
System performance meets my expectations.                              150
I trust the product’s security and privacy.                            150
Training materials helped me get started quickly.                      150
The roadmap aligns with my needs over the next year.                   150
The product remains responsive during peak usage.                      150
Overall, I am satisfied with the product.                              150
Name: count, dtype: int64

Bar Charts (Regular)¶

Analysis¶

Gets the response value counts and proportion for all relevant questions. Analyses are done for all responses and according to groups.

InĀ [3]:
base_cols = ["question_index", "question", "question_text_or_description", "category"]
group_cols = ["organization", "expertise", "role"]
df_multi_vc_list = []
df_multi_vc = utils.vc_for_multi_groups(df_multi, group_cols, base_cols, multi_options)
display(df_multi_vc.head())
display(df_multi_vc[["group-by", "group-value"]].value_counts())
df_multi_vc.to_csv(paths.processed / "2_survey_multi_vc.csv", index=False)
question_index question question_text_or_description category response count proportion group-by group-value
0 Q1 What is your organization? What is your organization? demographics Acme Corp 14.0 0.093333 all all
1 Q1 What is your organization? What is your organization? demographics Beta Analytics 6.0 0.040000 all all
2 Q1 What is your organization? What is your organization? demographics Blue River Labs 10.0 0.066667 all all
3 Q1 What is your organization? What is your organization? demographics Contoso 11.0 0.073333 all all
4 Q1 What is your organization? What is your organization? demographics Fabrikam 16.0 0.106667 all all
group-by      group-value                 
all           all                             114
expertise     Bioinformatics                  114
              Data Science / Analytics        114
              Electrical Engineering          114
              Mechanical Engineering          114
              Operations / DevOps             114
              Other                           114
              Prefer not to say               114
              Product Management              114
              Program / Project Management    114
              Research (Academic)             114
              Software Engineering            114
              Systems Engineering             114
              UX / Human Factors              114
organization  Acme Corp                       114
              Beta Analytics                  114
              Blue River Labs                 114
              Contoso                         114
              Fabrikam                        114
              Globex                          114
              Initech                         114
              Northwind Traders               114
              Other                           114
              Prefer not to say               114
              Stark Industries                114
              Umbrella                        114
              Vandelay Industries             114
              Wayne Enterprises               114
role          C-suite                         114
              Director                        114
              Individual Contributor          114
              Manager                         114
              Senior IC                       114
              Senior Manager                  114
              VP                              114
Name: count, dtype: int64

Plotting¶

Plots regular bar charts. Plots will be made according to the group by value. Plots with all responses will be displayed in this notebook, the rest will be saved.

InĀ [9]:
%matplotlib inline
grouped = df_multi_vc.groupby(base_cols + ["group-value"])
df_vcs = [group for _, group in grouped]
for df_vc in df_vcs:
    fig, ax = utils.plot_multi_response_barh(df_vc, save_fig=True)
    if df_vc['group-value'].iloc[0] == 'all':
        plt.show(fig)
    plt.close(fig)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Stacked Bar Charts¶

Creates stacked bar charts for a subset of questions: organization, expertise, role, and tool selection. Unlike above, all stacked plots will reflect all responses.

InĀ [10]:
all_groups_meta = {
    "organization": {"qidx": "Q1", "multi_options": multi_options["Q1"]},
    "expertise": {"qidx": "Q2", "multi_options": multi_options["Q2"]},
    "role": {"qidx": "Q4", "multi_options": multi_options["Q4"]},
    "response": {"qidx": "Q6", "multi_options": multi_options["Q6"]},
}
groups = all_groups_meta.keys()
group_pairs = ordered_combinations = list(itertools.permutations(groups, 2))
display(df_multi.head())
display(base_cols)
response
index ID organization expertise role category question_index question question_text_or_description
4 R-00001 Globex Bioinformatics Senior IC demographics Q1 What is your organization? What is your organization? Globex
5 R-00001 Globex Bioinformatics Senior IC demographics Q2 What is your area of expertise? What is your area of expertise? Bioinformatics
6 R-00001 Globex Bioinformatics Senior IC demographics Q3 How many years of experience do you have in your area of expertise? How many years of experience do you have in your area of expertise? 2–5
7 R-00001 Globex Bioinformatics Senior IC demographics Q4 What is your current role level? What is your current role level? Senior IC
8 R-00001 Globex Bioinformatics Senior IC demographics Q5 What is your geographic region? What is your geographic region? North America
['question_index', 'question', 'question_text_or_description', 'category']

Analysis¶

InĀ [6]:
df_vcs = utils.vc_for_multi_groups_pair(df_multi, base_cols, group_pairs, multi_options)
['organization', 'expertise']
['organization', 'role']
['organization']
['expertise', 'organization']
['expertise', 'role']
['expertise']
['role', 'organization']
['role', 'expertise']
['role']
['organization']
['expertise']
['role']

Plotting¶

InĀ [7]:
utils.plot_all_stacked_bars_charts(df_vcs, group_pairs, all_groups_meta)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
InĀ [8]:
# review of file structure (very long)
# utils.print_tree("../figures", exts={".png"})

Proceed to Part 3 - Likert-Scaled Questions

The next notebbook will plot more complex diverging bar charts for the Likert-scaled questions.