import os
import sys
import pathlib
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import survey_pipeline.analysis_utils as utils
print("Python version:", sys.version.split()[0])
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("matplotlib version:", plt.matplotlib.__version__)
%load_ext autoreload
%autoreload 2
Python version: 3.12.3
pandas version: 2.3.3
numpy version: 2.3.4
matplotlib version: 3.10.7
Survey Analysis Pipeline (MS Forms → Clean Data → Plots)¶
Date: Oct 2025
This project documents a pipeline for analyzing survey results end-to-end: ingesting raw exports, cleaning and restructuring responses, and producing report-ready visualizations (including diverging stacked bar charts for Likert-scale questions). It was inspired by a scenario where I had to analyze survey results exported from MS Forms. I wanted the workflow to be (1) repeatable, (2) auditable from raw data → final charts, and (3) easy to extend as questions, groups, and response options evolve. Through this project I've also tried to mature my project and code structure, and to learn how to "collaborate" with LLMs like ChatGPT (in a sensible, non-vibe-coding sense).
Project highlights:
- Practical data engineering for messy human-entered survey data (multi-select parsing, metadata, tidy formats)
- Visualization design choices for categorical and Likert-scale responses
- Reproducible project structure (clean separation of `raw/`, `interim/`, `processed/`, and `figures/`)
- "Human-readable analysis": notebooks as documentation, with helper functions to keep code maintainable
Note: this notebook uses a dummy scenario + dummy data.
Series navigation: Part 0 (Notes) → Part 1 (Pre-Processing) → Part 2 (Multiple Choice Analysis) → Part 3 (Likert-Scaled Visualizations)
Part 1 — Pre-Processing (Ingest + Clean + Restructure)¶
This notebook turns raw MS Forms exports into clean, analysis-ready datasets and runs some lightweight QC summaries. The main idea is traceability: keep the original export intact, apply minimal/explicit transforms, and write out interim artifacts that downstream analysis can rely on.
Overview:
- Load raw survey exports and the data dictionary (question metadata, response mappings, etc.)
- Normalize/clean a few common survey quirks (e.g., multi-select delimiter cleanup)
- Produce basic survey metadata plots (response timing, completion time) for quick QA
- Merge responses + metadata into a single cleaned dataset for analysis
Inputs:
- Survey Data Export (`raw_xlsx`): Mimics an MS Forms export (.xlsx) from a survey with mixed response types, including Likert-scaled questions.
- Data dictionary (`ddict`): Maps questions and responses to additional metadata
- Colors dictionary (`colors`): Maps question responses to colors for plotting
Primary outputs:
- Main cleaned dataset used downstream (`interim/1_survey_data_cleaned.csv`)
- Timing metadata (`processed/0_survey_responses_daily.csv` and `processed/1_survey_time.csv`)
- Figures saved under `figures/` (e.g., responses-per-day, completion-time)
Next: Part 2 — Multiple Choice Question Analysis and Bar Charts
paths = utils.get_paths()
print("Review Paths: ")
print(paths.raw_xlsx)
print(paths.ddict)
print(paths.interim)
print(paths.colors)
print(paths.processed)
Review Paths: 
/home/autrpy/projects/2025-survey-analysis-pipeline/data/0_raw/msforms_dummy_survey.xlsx
/home/autrpy/projects/2025-survey-analysis-pipeline/data/0_raw/msforms_dummy_survey_data_dictionary.csv
/home/autrpy/projects/2025-survey-analysis-pipeline/data/1_interim
/home/autrpy/projects/2025-survey-analysis-pipeline/data/0_raw/survey_plot_color_dictionary.csv
/home/autrpy/projects/2025-survey-analysis-pipeline/data/2_processed
Cleaning and Quality Checks¶
Run basic shape and sanity checks.
Checks
- Column presence: Start time, Completion time, ID, key question fields.
- Duplicates
- Multi-select delimiter normalization (";" vs "; ").
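The column-presence and duplicate checks can be sketched on toy data (all values here are hypothetical, not from the real export):

```python
import pandas as pd

# Toy responses with a deliberately duplicated ID (hypothetical values)
toy = pd.DataFrame({
    "ID": ["R-00001", "R-00002", "R-00002"],
    "Start time": pd.to_datetime(["2025-10-01 09:00", "2025-10-02 10:00", "2025-10-02 10:00"]),
})

# Column presence: list any required fields that are absent
required = ["ID", "Start time", "Completion time"]
missing = [c for c in required if c not in toy.columns]

# Duplicate check: respondent IDs should be unique
dupes = toy[toy["ID"].duplicated(keep=False)]
```

Here `missing` flags the absent `Completion time` column and `dupes` surfaces both rows sharing `R-00002`, so either condition can fail loudly before any downstream cleaning runs.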
df_raw = pd.read_excel(paths.raw_xlsx, parse_dates=["Start time", "Completion time"])
df_ddict = pd.read_csv(paths.ddict)
display(df_raw.head())
display(df_ddict.head())
|  | ID | Start time | Completion time | Email | Name | What is your organization? | What is your area of expertise? | How many years of experience do you have in your area of expertise? | What is your current role level? | What is your geographic region? | ... | Integration with my existing tools is seamless. | System performance meets my expectations. | I trust the product’s security and privacy. | Training materials helped me get started quickly. | The roadmap aligns with my needs over the next year. | The product remains responsive during peak usage. | Overall, I am satisfied with the product. | Would you recommend this to a colleague? | Consent to contact for follow-up? | Additional comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R-00001 | 2025-10-27 12:20:21 | 2025-10-27 12:42:01 | reese.patel@globex.com | Reese Patel | Globex | Bioinformatics | 2–5 | Senior IC | North America | ... | Agree | Agree | Strongly agree | Agree | Agree | Neutral | Agree | No | Yes | Small biotech startup |
| 1 | R-00002 | 2025-10-15 03:34:21 | 2025-10-15 03:48:45 | parker.patel@starkindustries.com | Parker Patel | Stark Industries | Mechanical Engineering | 0–1 | Senior IC | Europe | ... | Agree | Strongly agree | Agree | Strongly disagree | Strongly agree | Agree | Strongly agree | Yes | No | Great overall—UI feels snappy. |
| 2 | R-00003 | 2025-10-03 11:51:26 | 2025-10-03 12:07:36 | drew.turnipshire@acme.com | Drew Turnipshire | Acme Corp | Data Science / Analytics | 2–5 | Manager | Middle East & Africa | ... | Neutral | Disagree | Agree | Agree | Agree | Disagree | Agree | Not sure | No | Small biotech startup |
| 3 | R-00004 | 2025-10-22 15:05:01 | 2025-10-22 15:17:15 | avery.gonzalez@betaanalytics.io | Avery Gonzalez | Beta Analytics | Research (Academic) | 11–15 | Senior IC | Europe | ... | Strongly agree | Neutral | Strongly agree | Disagree | Strongly agree | Strongly agree | Strongly agree | Not sure | Yes | Docs could be deeper on integration examples. |
| 4 | R-00005 | 2025-10-08 03:14:59 | 2025-10-08 03:31:14 | blake.martin@initech.com | Blake Martin | Initech | Software Engineering | 16+ | Senior IC | Latin America | ... | Agree | Strongly agree | Strongly disagree | Neutral | Agree | Agree | Strongly agree | Not sure | Yes | Great overall—UI feels snappy. |
5 rows × 26 columns
|  | variable_name | question_index | category | question_text_or_description | type | allowed_values_or_codes | notes | example_value |
|---|---|---|---|---|---|---|---|---|
| 0 | ID | NaN | metadata | Record identifier | identifier (string) | Pattern: R-00001 … | Unique per response. | R-00001 |
| 1 | Start time | NaN | metadata | Survey start timestamp | datetime (YYYY-MM-DD HH:MM:SS) | Format: YYYY-MM-DD HH:MM:SS (24h) | Local timezone when captured; ensure consisten... | 2025-10-11 19:25:56 |
| 2 | Completion time | NaN | metadata | Survey completion timestamp | datetime (YYYY-MM-DD HH:MM:SS) | Format: YYYY-MM-DD HH:MM:SS (24h) | Local timezone when captured; ensure consisten... | 2025-10-11 19:35:17 |
| 3 | Email | NaN | metadata | Respondent email address | email (string) | name.surname@organization-domain | NaN | morgan.turnipshire@vandelay-industries.example |
| 4 | Name | NaN | metadata | Respondent full name | name (string) | NaN | NaN | Morgan Turnipshire |
Value Counts¶
Inspect responses and data types. Looking for typos/strays, appropriate “Other” responses, singletons, and unexpected responses.
df_raw_c = df_raw.copy()
# cleaning here... if there were some
display(df_raw_c.info())
print("\n\n")
display(df_raw_c["What is your organization?"].value_counts())
print("\n\n")
display(df_raw_c["What is your area of expertise?"].value_counts())
print("\n\n")
display(df_raw_c["What is your current role level?"].value_counts())
print("\n\n")
tool_q = ["Which tools do you use regularly? (Select all that apply)"]
df_raw_c.loc[:, tool_q] = df_raw_c.loc[:, tool_q].replace("; ", ";", regex=True)
display(df_raw_c.loc[:, tool_q].stack().str.split(";").explode().value_counts().to_frame())
print("\n\n")
df_raw_c.to_csv(paths.interim / "0_survey_raw_clean.csv")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 26 columns):
 #   Column                                                                Non-Null Count  Dtype
---  ------                                                                --------------  -----
 0   ID                                                                    150 non-null    object
 1   Start time                                                            150 non-null    datetime64[ns]
 2   Completion time                                                       150 non-null    datetime64[ns]
 3   Email                                                                 150 non-null    object
 4   Name                                                                  150 non-null    object
 5   What is your organization?                                            150 non-null    object
 6   What is your area of expertise?                                       150 non-null    object
 7   How many years of experience do you have in your area of expertise?   150 non-null    object
 8   What is your current role level?                                      150 non-null    object
 9   What is your geographic region?                                       150 non-null    object
 10  Which tools do you use regularly? (Select all that apply)             150 non-null    object
 11  The documentation is clear and sufficient.                            150 non-null    object
 12  The tool meets my day-to-day needs.                                   150 non-null    object
 13  The UI is intuitive and easy to learn.                                150 non-null    object
 14  Customer support is responsive.                                       150 non-null    object
 15  The feature set covers my key workflows.                              150 non-null    object
 16  Integration with my existing tools is seamless.                       150 non-null    object
 17  System performance meets my expectations.                             150 non-null    object
 18  I trust the product’s security and privacy.                           150 non-null    object
 19  Training materials helped me get started quickly.                     150 non-null    object
 20  The roadmap aligns with my needs over the next year.                  150 non-null    object
 21  The product remains responsive during peak usage.                     150 non-null    object
 22  Overall, I am satisfied with the product.                             150 non-null    object
 23  Would you recommend this to a colleague?                              150 non-null    object
 24  Consent to contact for follow-up?                                     150 non-null    object
 25  Additional comments                                                   127 non-null    object
dtypes: datetime64[ns](2), object(24)
memory usage: 30.6+ KB
None
What is your organization?
Globex                 18
Fabrikam               16
Other                  14
Acme Corp              14
Wayne Enterprises      12
Contoso                11
Vandelay Industries    10
Northwind Traders      10
Blue River Labs        10
Prefer not to say       9
Stark Industries        7
Initech                 7
Beta Analytics          6
Umbrella                6
Name: count, dtype: int64
What is your area of expertise?
Software Engineering            16
Other                           16
Research (Academic)             15
Program / Project Management    15
Bioinformatics                  12
UX / Human Factors              12
Mechanical Engineering          12
Product Management              12
Prefer not to say               10
Data Science / Analytics         9
Systems Engineering              8
Electrical Engineering           8
Operations / DevOps              5
Name: count, dtype: int64
What is your current role level?
Individual Contributor    23
Senior Manager            22
Senior IC                 22
VP                        22
Director                  21
Manager                   20
C-suite                   20
Name: count, dtype: int64
| count | |
|---|---|
| RStudio | 42 |
| Docker | 42 |
| Power BI | 42 |
| Git | 41 |
| MATLAB | 40 |
| VS Code | 40 |
| JupyterLab | 40 |
| Tableau | 33 |
| Excel | 33 |
| Other | 32 |
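The delimiter normalization and explode-count above can be illustrated on a toy multi-select column (values hypothetical):

```python
import pandas as pd

# Toy multi-select answers mixing "; " and ";" delimiters (hypothetical values)
tools = pd.Series(["Git; Excel", "Git;Docker", "Excel"])

# Normalize "; " to ";" so splitting yields clean tokens without stray spaces
tools = tools.str.replace("; ", ";", regex=False)

# One row per selected tool, then count selections per tool
counts = tools.str.split(";").explode().value_counts()
```

Without the normalization step, `"Git"` and `" Excel"` would count separately from `"Git"` and `"Excel"`, silently splitting tallies.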
Timing Metadata¶
%matplotlib inline
# Survey Responses per day
daily = df_raw_c["Completion time"]
fig, axs = utils.plot_responses_over_time(daily)
daily.to_csv(paths.processed / "0_survey_responses_daily.csv")
fig.savefig(paths.figures / "survey-meta-responses-per-day.png")
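The responses-per-day aggregation inside `utils.plot_responses_over_time` isn't shown here; one plausible way to bucket timestamps into daily counts (toy values, hypothetical) is:

```python
import pandas as pd

# Toy completion timestamps (hypothetical values)
completion = pd.Series(pd.to_datetime([
    "2025-10-01 09:15", "2025-10-01 17:40", "2025-10-02 08:05",
]))

# Collapse each timestamp to its calendar day, then count responses per day
per_day = completion.dt.normalize().value_counts().sort_index()
```

`dt.normalize()` keeps the datetime dtype (midnight of each day), which makes the resulting index plot cleanly on a date axis.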
# Average Completion Time
%matplotlib inline
df_t = df_raw_c[["ID", "Start time", "Completion time"]].set_index("ID")
df_t["survey time (min)"] = df_t["Completion time"] - df_t["Start time"]
df_t["survey time (min)"] = df_t["survey time (min)"].dt.total_seconds() / 60.0
utils.plot_completion_time(df_t["survey time (min)"])
fig = plt.gcf()  # grab the figure just drawn so savefig below doesn't reuse the previous cell's figure
df_t.to_csv(paths.processed / "1_survey_time.csv")
fig.savefig(paths.figures / "survey-meta-completion-time.png")
Plot Colors¶
The colors dictionary maps question responses to colors. Here we plot that mapping for review.
response_plot_colors = utils.get_plot_colors(paths.colors)
utils.visualize_colors(list(response_plot_colors.values()), list(response_plot_colors.keys()))
(<Figure size 400x2950 with 1 Axes>,
<Axes: title={'center': 'Visualizing Colors'}>)
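`utils.get_plot_colors` isn't shown here; a minimal sketch of loading such a mapping, assuming a two-column CSV with hypothetical `response` and `color` headers, could look like:

```python
import io
import pandas as pd

# Hypothetical colors CSV: one row per response label
colors_csv = io.StringIO(
    "response,color\n"
    "Strongly agree,#1a9641\n"
    "Agree,#a6d96a\n"
    "Neutral,#ffffbf\n"
)
df_colors = pd.read_csv(colors_csv)

# Build a response-label -> hex-color lookup for plotting
color_map = dict(zip(df_colors["response"], df_colors["color"]))
```

Keeping the mapping in a CSV (rather than hard-coding it) means colors stay consistent across every figure and can be reviewed alongside the other data-dictionary files.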
Restructuring and Combining¶
We merge cleaned responses with the data dictionary to produce a single dataset for analysis.
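The internals of `utils.survey_ddict_combine` aren't shown in this notebook; one plausible melt-and-merge sketch (toy data, hypothetical column names) is:

```python
import pandas as pd

# Toy wide-format responses and a matching data dictionary (hypothetical values)
responses = pd.DataFrame({
    "ID": ["R-00001"],
    "What is your organization?": ["Globex"],
})
ddict = pd.DataFrame({
    "variable_name": ["What is your organization?"],
    "question_index": ["Q1"],
    "category": ["demographics"],
})

# Wide -> long: one row per (respondent, question)
long_df = responses.melt(id_vars="ID", var_name="question", value_name="response")

# Attach per-question metadata from the data dictionary
combined = long_df.merge(ddict, left_on="question", right_on="variable_name", how="left")
```

The long format is what makes the downstream parts easy: grouping by `category` or `question_index` becomes a plain `groupby` instead of column-name bookkeeping.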
plt.close()
%matplotlib inline
df = utils.survey_ddict_combine(df_raw_c, df_ddict)
df.to_csv(paths.interim / "1_survey_data_cleaned.csv")
display(df.head(), df.describe())
|  |  |  |  |  |  |  |  | response |
|---|---|---|---|---|---|---|---|---|
| ID | organization | expertise | role | category | question_index | question | question_text_or_description |  |
| R-00001 | Globex | Bioinformatics | Senior IC | metadata | NaN | Start time | Survey start timestamp | 2025-10-27 12:20:21 |
|  |  |  |  |  |  | Completion time | Survey completion timestamp | 2025-10-27 12:42:01 |
|  |  |  |  |  |  | Email | Respondent email address | reese.patel@globex.com |
|  |  |  |  |  |  | Name | Respondent full name | Reese Patel |
|  |  |  |  | demographics | Q1 | What is your organization? | What is your organization? | Globex |
|  | response |
|---|---|
| count | 3962 |
| unique | 644 |
| top | Agree |
| freq | 710 |
Next¶
Proceed to Part 2 - Multiple Choice Question Analysis and Bar Charts