import sys
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import survey_pipeline.analysis_utils as utils
print("Python version:", sys.version.split()[0])
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("matplotlib version:", plt.matplotlib.__version__)
%load_ext autoreload
%autoreload 2
Python version: 3.12.3 pandas version: 2.3.3 numpy version: 2.3.4 matplotlib version: 3.10.7
Survey Analysis Pipeline (MS Forms ā Clean Data ā Plots)¶
Date: Oct 2025
This project is to document a pipeline for analyzing survey results end-to-end: ingesting raw exports, cleaning and restructuring responses, and producing report-ready visualizations (including diverging stacked bar charts for Likert-scale questions). This was inspired by a scenario where I had to analyze survey results exported from MS Forms. I wanted the workflow to be (1) repeatable, (2) auditable from raw data ā final charts, and (3) easy to extend as questions, groups, and response options evolve. Through this project I've also tried to mature my project and code structure, as well as understanding how to "collaborate" with LLMs like ChatGPT (in a sensible, non-vibe-coding sense).
Project highlights:
- Practical data engineering for messy human-entered survey data (multi-select parsing, metadata, tidy formats)
- Visualization design choices for categorical and Likert-scale responses
- Reproducible project structure (clean separation of
raw/,interim/,processed/, andfigures/) - āHuman-readable analysisā: notebooks as documentation, with helper functions to keep code maintainable
Note: this notebook uses a dummy scenario + dummy data.
Series navigation: Part 0 (Notes) ā Part 1 (Pre-Processing) ā Part 2 (Multiple Choice Analysis) ā Part 3 (Likert-Scaled Visualizations)
Part 2 ā Multiple Choice Questions (Value Counts + Bar Charts)¶
This notebook focuses on categorical questions, especially multiple choice / multi-select responses. It generates results and bar charts that should make it easy to compare patterns across groups (e.g., org, role, experience level).
Overview:
- Load the cleaned survey dataset from Part 1
- Identify and normalize multi-select response fields (split, trim, standardize)
- Build value-count tables (overall + grouped) in a consistent long-form structure
- Generate bar charts / stacked bar charts for quick comparison across groups
- Write processed tables that downstream notebooks can reuse (especially Likert plotting)
Inputs:
- Cleaned survey dataset (
interim/1_survey_data_cleaned.csv): Output from Part 1 - Data dictionary (
ddict): Question metadata (types, labels, ordering, groupings) - Colors dictionary (
colors): Response ā color mapping used for consistent plotting
Primary outputs:
- Multiple choice value-count table (
processed/2_survey_multi_vc.csv) - Figures saved under
figures/(bar charts / stacked bars by question and group)
Next: Part 3 ā Likert-Scaled Questions and Diverging Stacked Bar Charts
Data Input¶
paths = utils.get_paths()
df_clrs = pd.read_csv(paths.colors)
df_ddict = pd.read_csv(paths.ddict)
df = pd.read_csv(paths.interim / "1_survey_data_cleaned.csv")
multi_options = utils.get_multi_options(paths.ddict)
df_multi = utils.get_multi_responses(df)
display(df_multi.reset_index()["question"].value_counts())
question Which tools do you use regularly? (Select all that apply) 385 What is your organization? 150 What is your area of expertise? 150 How many years of experience do you have in your area of expertise? 150 What is your current role level? 150 What is your geographic region? 150 The documentation is clear and sufficient. 150 The tool meets my day-to-day needs. 150 The UI is intuitive and easy to learn. 150 Customer support is responsive. 150 The feature set covers my key workflows. 150 Integration with my existing tools is seamless. 150 System performance meets my expectations. 150 I trust the productās security and privacy. 150 Training materials helped me get started quickly. 150 The roadmap aligns with my needs over the next year. 150 The product remains responsive during peak usage. 150 Overall, I am satisfied with the product. 150 Name: count, dtype: int64
Bar Charts (Regular)¶
Analysis¶
Gets the response value counts and proportion for all relevant questions. Analyses are done for all responses and according to groups.
base_cols = ["question_index", "question", "question_text_or_description", "category"]
group_cols = ["organization", "expertise", "role"]
df_multi_vc_list = []
df_multi_vc = utils.vc_for_multi_groups(df_multi, group_cols, base_cols, multi_options)
display(df_multi_vc.head())
display(df_multi_vc[["group-by", "group-value"]].value_counts())
df_multi_vc.to_csv(paths.processed / "2_survey_multi_vc.csv", index=False)
| question_index | question | question_text_or_description | category | response | count | proportion | group-by | group-value | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Q1 | What is your organization? | What is your organization? | demographics | Acme Corp | 14.0 | 0.093333 | all | all |
| 1 | Q1 | What is your organization? | What is your organization? | demographics | Beta Analytics | 6.0 | 0.040000 | all | all |
| 2 | Q1 | What is your organization? | What is your organization? | demographics | Blue River Labs | 10.0 | 0.066667 | all | all |
| 3 | Q1 | What is your organization? | What is your organization? | demographics | Contoso | 11.0 | 0.073333 | all | all |
| 4 | Q1 | What is your organization? | What is your organization? | demographics | Fabrikam | 16.0 | 0.106667 | all | all |
group-by group-value
all all 114
expertise Bioinformatics 114
Data Science / Analytics 114
Electrical Engineering 114
Mechanical Engineering 114
Operations / DevOps 114
Other 114
Prefer not to say 114
Product Management 114
Program / Project Management 114
Research (Academic) 114
Software Engineering 114
Systems Engineering 114
UX / Human Factors 114
organization Acme Corp 114
Beta Analytics 114
Blue River Labs 114
Contoso 114
Fabrikam 114
Globex 114
Initech 114
Northwind Traders 114
Other 114
Prefer not to say 114
Stark Industries 114
Umbrella 114
Vandelay Industries 114
Wayne Enterprises 114
role C-suite 114
Director 114
Individual Contributor 114
Manager 114
Senior IC 114
Senior Manager 114
VP 114
Name: count, dtype: int64
Plotting¶
Plots regular bar charts. Plots will be made according to the group by value. Plots with all responses will be displayed in this notebook, the rest will be saved.
%matplotlib inline
grouped = df_multi_vc.groupby(base_cols + ["group-value"])
df_vcs = [group for _, group in grouped]
for df_vc in df_vcs:
fig, ax = utils.plot_multi_response_barh(df_vc, save_fig=True)
if df_vc['group-value'].iloc[0] == 'all':
plt.show(fig)
plt.close(fig)
Stacked Bar Charts¶
Creates stacked bar charts for a subset of questions: organization, expertise, role, and tool selection. Unlike above, all stacked plots will reflect all responses.
all_groups_meta = {
"organization": {"qidx": "Q1", "multi_options": multi_options["Q1"]},
"expertise": {"qidx": "Q2", "multi_options": multi_options["Q2"]},
"role": {"qidx": "Q4", "multi_options": multi_options["Q4"]},
"response": {"qidx": "Q6", "multi_options": multi_options["Q6"]},
}
groups = all_groups_meta.keys()
group_pairs = ordered_combinations = list(itertools.permutations(groups, 2))
display(df_multi.head())
display(base_cols)
| response | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| index | ID | organization | expertise | role | category | question_index | question | question_text_or_description | |
| 4 | R-00001 | Globex | Bioinformatics | Senior IC | demographics | Q1 | What is your organization? | What is your organization? | Globex |
| 5 | R-00001 | Globex | Bioinformatics | Senior IC | demographics | Q2 | What is your area of expertise? | What is your area of expertise? | Bioinformatics |
| 6 | R-00001 | Globex | Bioinformatics | Senior IC | demographics | Q3 | How many years of experience do you have in your area of expertise? | How many years of experience do you have in your area of expertise? | 2ā5 |
| 7 | R-00001 | Globex | Bioinformatics | Senior IC | demographics | Q4 | What is your current role level? | What is your current role level? | Senior IC |
| 8 | R-00001 | Globex | Bioinformatics | Senior IC | demographics | Q5 | What is your geographic region? | What is your geographic region? | North America |
['question_index', 'question', 'question_text_or_description', 'category']
Analysis¶
df_vcs = utils.vc_for_multi_groups_pair(df_multi, base_cols, group_pairs, multi_options)
['organization', 'expertise'] ['organization', 'role'] ['organization'] ['expertise', 'organization'] ['expertise', 'role'] ['expertise'] ['role', 'organization'] ['role', 'expertise'] ['role'] ['organization'] ['expertise'] ['role']
Plotting¶
utils.plot_all_stacked_bars_charts(df_vcs, group_pairs, all_groups_meta)
# review of file structure (very long)
# utils.print_tree("../figures", exts={".png"})
Proceed to Part 3 - Likert-Scaled Questions
The next notebbook will plot more complex diverging bar charts for the Likert-scaled questions.