Organizing and Combining LaTeX Acronym/Glossary Entries

27 Sep 2023

Author: Austin Pursley

Note: this post was generated with JupyterLab. A better looking version is here. GitHub here.

Introduction

Problem:

You use LaTeX to write your documents and the glossaries package to define acronym and glossary entries. You organize the entries into .tex files for each project, e.g., “Acronyms.tex”, “Glossary.tex”. However, you’ve ended up with multiple versions of these .tex files from multiple projects, and now you need ALL the unique entries in one file for a new project. How do you go about doing that? You could do it manually, but that gets tedious if there are a lot of differences between the files. Instead, you could use Python to automate the task. In addition, the entries can be organized along the way.

Set-up

I’m going to use acronyms in this project, but they could just as well be glossary entries, since the glossaries package handles both in nearly the same way.

First I’ll set up some example Acronyms.tex files. Note how they are unsorted, something we can improve on later.

acro1 = """
\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}
\\newabbreviation{unsc}{UNSC}{United Nations Space Command}
\\newabbreviation{odst}{ODST}{Orbital Drop Shock Trooper}
\\newabbreviation{fish}{FISH}{F' It, Stuff Happens}
\\newabbreviation{gps}{GPS}{Go Pound Sound}
\\newabbreviation{evil}{EVIL}{Every Villian is Lemon}
\\newabbreviation{otr}{OTR}{over the rainbow}
\\newabbreviation{sc}{SC}{Snack Club}
\\newabbreviation{mo}{MO}{modus operandi}
\\newabbreviation{ul}{UL}{ultralight}
\\newabbreviation{blt}{BLT}{bacon lettuce tomato}
"""

acro2 = """
\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}
\\newabbreviation{otr}{OTR}{Optimal Test Ruminant}
\\newabbreviation{pre}{PRE}{Prototype Ruminant Evaluation}
\\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}
\\newabbreviation{EVIL}{EVIL}{Every Villian is Lemon}
\\newabbreviation{irbh}{IRBH}{I'd Rather Be Hiking!}
\\newabbreviation{hh}{HH}{hobbit head}
\\newabbreviation{gps}{GPS}{Go Pound Sound}
\\newabbreviation{crud}{CRUD}{create, read, update, and delete}
"""

with open('Acronyms1.tex','w') as f:
    f.write(acro1)

with open('Acronyms2.tex','w') as f:
    f.write(acro2)

# Pretending we didn't just create these files.
files = ['Acronyms1.tex','Acronyms2.tex']
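
In a real project the list of files would not be typed by hand. Here is a minimal sketch of collecting them with the standard library, assuming the files sit in one folder and follow an `Acronyms*.tex` naming scheme. Both are assumptions for illustration, and `find_tex_files` is a hypothetical helper name, not part of the original notebook.

```python
import tempfile
from pathlib import Path

def find_tex_files(folder=".", glob_pattern="Acronyms*.tex"):
    """Return matching .tex paths as a sorted list of strings.

    Hypothetical helper: the folder layout and naming scheme are assumptions.
    """
    return sorted(str(p) for p in Path(folder).glob(glob_pattern))

# Demo in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Acronyms2.tex", "Acronyms1.tex", "Glossary.tex"):
        (Path(tmp) / name).write_text("")
    found = [Path(p).name for p in find_tex_files(tmp)]

print(found)  # ['Acronyms1.tex', 'Acronyms2.tex']
```

With something like this, `files = find_tex_files()` could stand in for the hard-coded list.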

Match Pattern

We need a pattern that captures the two different cases of acronym entries for the glossaries package. The first is the standard case, where no optional parameters are set:

\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}

The other has optional parameters:

\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}

In the pattern defined below, the optional portion is covered by (\[.*?\])?

The other three parameters are covered by the three {(.*?)} groups.

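# `regex` is a third-party package (pip install regex); the standard-library
# `re` module also handles this pattern.
import regex as re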
pattern = re.compile(r'\\newabbreviation(\[.*?\])?{(.*?)}{(.*?)}{(.*?)}\n')
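
To see what each group captures, here is a small sketch that runs the same pattern against the two example cases using the standard-library `re` module. The braces are escaped purely for readability, and `demo_pattern` and `sample` are names made up for this illustration.

```python
import re  # standard library; nothing beyond it is needed for this pattern

# Same pattern as above, with the literal braces escaped for readability.
demo_pattern = re.compile(r'\\newabbreviation(\[.*?\])?\{(.*?)\}\{(.*?)\}\{(.*?)\}\n')

sample = (
    '\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}\n'
    '\\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}\n'
)

# findall returns one tuple per entry: (optional, id, short, long).
# An absent optional part comes back as an empty string.
matches = demo_pattern.findall(sample)
for optional, entry_id, short, long_form in matches:
    print(repr(optional), entry_id, short, long_form)

# The pattern ends in \n, so a final line without a trailing newline is skipped.
no_newline = demo_pattern.findall('\\newabbreviation{sc}{SC}{Snack Club}')
print(no_newline)  # []
```

The last check is worth keeping in mind: if a file does not end with a newline, its final entry will not be picked up.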

Run pattern against file content

matches_all = set()
for fn in files:
    # The context manager closes each file, even if reading fails.
    with open(fn) as f:
        f_str = f.read()
    # A set keeps only one copy of entries that are identical across files.
    matches_all |= set(re.findall(pattern, f_str))
matches_all

{('', 'EVIL', 'EVIL', 'Every Villian is Lemon'),
 ('', 'abba', 'ABBA', 'Björn & Benny, Agnetha & Frida'),
 ('', 'blt', 'BLT', 'bacon lettuce tomato'),
 ('', 'crud', 'CRUD', 'create, read, update, and delete'),
 ('', 'evil', 'EVIL', 'Every Villian is Lemon'),
 ('', 'fish', 'FISH', "F' It, Stuff Happens"),
 ('', 'gps', 'GPS', 'Go Pound Sound'),
 ('', 'hh', 'HH', 'hobbit head'),
 ('', 'irbh', 'IRBH', "I'd Rather Be Hiking!"),
 ('', 'mo', 'MO', 'modus operandi'),
 ('', 'odst', 'ODST', 'Orbital Drop Shock Trooper'),
 ('', 'otr', 'OTR', 'Optimal Test Ruminant'),
 ('', 'otr', 'OTR', 'over the rainbow'),
 ('', 'pre', 'PRE', 'Prototype Ruminant Evaluation'),
 ('', 'sc', 'SC', 'Snack Club'),
 ('', 'ul', 'UL', 'ultralight'),
 ('', 'unsc', 'UNSC', 'United Nations Space Command'),
 ('[longplural="Ruminants Under Test"]', 'rut', 'RUT', 'Ruminant Under Test')}
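
The set is what removes the exact repeats here: abba and gps appear in both files but only once in the result, while otr appears twice because its long forms differ. A small sketch of that behavior, using two made-up sets named `file1` and `file2`:

```python
# Tuples compare field by field, so a set keeps one copy of identical entries
# and keeps every entry that differs in any field.
file1 = {
    ('', 'abba', 'ABBA', 'Björn & Benny, Agnetha & Frida'),
    ('', 'otr', 'OTR', 'over the rainbow'),
}
file2 = {
    ('', 'abba', 'ABBA', 'Björn & Benny, Agnetha & Frida'),
    ('', 'otr', 'OTR', 'Optimal Test Ruminant'),
}

combined = file1 | file2
print(len(combined))  # 3
```

This also means that entries differing only in case or spacing (like evil and EVIL above) survive as separate entries, which is why the duplicate check later on is still needed.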

Dataframize

Because I like pandas, the matches go into a DataFrame.

import pandas as pd
df = pd.DataFrame(matches_all,columns=['optional','acronym id','short','long'])
df

	optional	acronym id	short	long
0		evil	EVIL	Every Villian is Lemon
1		hh	HH	hobbit head
2		EVIL	EVIL	Every Villian is Lemon
3		mo	MO	modus operandi
4		pre	PRE	Prototype Ruminant Evaluation
5		ul	UL	ultralight
6		otr	OTR	over the rainbow
7	[longplural="Ruminants Under Test"]	rut	RUT	Ruminant Under Test
8		otr	OTR	Optimal Test Ruminant
9		crud	CRUD	create, read, update, and delete
10		irbh	IRBH	I'd Rather Be Hiking!
11		sc	SC	Snack Club
12		blt	BLT	bacon lettuce tomato
13		unsc	UNSC	United Nations Space Command
14		fish	FISH	F' It, Stuff Happens
15		abba	ABBA	Björn & Benny, Agnetha & Frida
16		gps	GPS	Go Pound Sound
17		odst	ODST	Orbital Drop Shock Trooper

Duplicate Handling

When you combine acronym entries from different documents, you’ll probably find that some share the same ID or the same long form. Below I identify these and set a flag to use when generating the final .tex file. Flagging them makes it easy to manually correct the file once it’s generated, which I’ve found works better than trying to automate a correction (e.g., adding “2” to the end of a duplicate entry ID).

df.loc[df['acronym id'].duplicated(keep=False),'duplicate flag'] = ' %%%%% DUPLICATE'
df.loc[df['long'].duplicated(keep=False),'duplicate flag'] = ' %%%%% DUPLICATE'
df

	optional	acronym id	short	long	duplicate flag
0		evil	EVIL	Every Villian is Lemon	%%%%% DUPLICATE
1		hh	HH	hobbit head	NaN
2		EVIL	EVIL	Every Villian is Lemon	%%%%% DUPLICATE
3		mo	MO	modus operandi	NaN
4		pre	PRE	Prototype Ruminant Evaluation	NaN
5		ul	UL	ultralight	NaN
6		otr	OTR	over the rainbow	%%%%% DUPLICATE
7	[longplural="Ruminants Under Test"]	rut	RUT	Ruminant Under Test	NaN
8		otr	OTR	Optimal Test Ruminant	%%%%% DUPLICATE
9		crud	CRUD	create, read, update, and delete	NaN
10		irbh	IRBH	I'd Rather Be Hiking!	NaN
11		sc	SC	Snack Club	NaN
12		blt	BLT	bacon lettuce tomato	NaN
13		unsc	UNSC	United Nations Space Command	NaN
14		fish	FISH	F' It, Stuff Happens	NaN
15		abba	ABBA	Björn & Benny, Agnetha & Frida	NaN
16		gps	GPS	Go Pound Sound	NaN
17		odst	ODST	Orbital Drop Shock Trooper	NaN
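
The `keep=False` argument is what makes every member of a duplicate group get flagged, rather than every member except the first. As a rough illustration of the same logic without pandas, here is a sketch using `collections.Counter`. The `entries` list is a made-up subset of the data above.

```python
from collections import Counter

# (optional, id, short, long) tuples, a subset of the entries above
entries = [
    ('', 'evil', 'EVIL', 'Every Villian is Lemon'),
    ('', 'EVIL', 'EVIL', 'Every Villian is Lemon'),
    ('', 'otr', 'OTR', 'over the rainbow'),
    ('', 'otr', 'OTR', 'Optimal Test Ruminant'),
    ('', 'hh', 'HH', 'hobbit head'),
]

id_counts = Counter(e[1] for e in entries)
long_counts = Counter(e[3] for e in entries)

# Flag an entry if its ID or its long form occurs more than once anywhere.
flags = [id_counts[e[1]] > 1 or long_counts[e[3]] > 1 for e in entries]
print(flags)  # [True, True, True, True, False]
```

Note that the ID comparison is case-sensitive: evil and EVIL are flagged here only because they share a long form.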

Create Organized Content, Write It

I take the dataframe and use it as a base for building a string which will be the contents of the final Acronyms.tex file.

I can organize the entries while I’m at it. The first letter of the entry ID is used to alphabetize the entries, and a large comment is written for each letter to clearly mark the letter groupings in the file.
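
Before the pandas version, here is the grouping idea on its own, sketched with `itertools.groupby` on a handful of IDs. The `ids` list and the `first_letter` helper are made up for this illustration.

```python
from itertools import groupby

ids = ['unsc', 'EVIL', 'abba', 'ul', 'evil']

def first_letter(entry_id):
    """Upper-cased first character, used as the section heading."""
    return entry_id[0].upper()

# Sort by (letter, id) first: groupby only merges adjacent items.
ordered = sorted(ids, key=lambda i: (first_letter(i), i))
grouped = {letter: list(items) for letter, items in groupby(ordered, key=first_letter)}
print(grouped)
# {'A': ['abba'], 'E': ['EVIL', 'evil'], 'U': ['ul', 'unsc']}
```

Within a letter, the plain string sort puts upper-case IDs ahead of lower-case ones, which is why EVIL ends up before evil.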

# upper-cased first character of the ID, used for the letter groupings
df['letter'] = df['acronym id'].str[0].str.upper()

# for alphabetizing
df = df.sort_values(by=['letter','acronym id', 'long'])
df['entry'] = '\\newabbreviation' + df['optional'] + '{' + df['acronym id'] + '}{' + \
                df['short'] + '}{' + df['long'] + '}' + df['duplicate flag'].fillna('')

# top of the file
acronym_txt = """%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % ACRONYMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
"""

# function for creating file section by letter
def to_latex_str_abc(letter, entries):
    abc_comment = """
%========================================================================================
%    """ + letter + """
%========================================================================================
"""
    # one tab-indented entry per line
    entry_txt = ''.join('\t' + entry + '\n' for entry in entries)
    return abc_comment + entry_txt

# Iterating over the groups yields (letter, sub-frame) pairs in sorted letter
# order and avoids relying on how groupby.apply passes the grouping column,
# which has changed across pandas versions.
df_by_l = [to_latex_str_abc(letter, group['entry']) for letter, group in df.groupby('letter')]
acronym_txt += ''.join(df_by_l)

# Review the result
print(acronym_txt)

# write content
with open('Acronyms.tex', 'w') as f:
    f.write(acronym_txt)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % ACRONYMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%========================================================================================
%    A
%========================================================================================
	\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}

%========================================================================================
%    B
%========================================================================================
	\newabbreviation{blt}{BLT}{bacon lettuce tomato}

%========================================================================================
%    C
%========================================================================================
	\newabbreviation{crud}{CRUD}{create, read, update, and delete}

%========================================================================================
%    E
%========================================================================================
	\newabbreviation{EVIL}{EVIL}{Every Villian is Lemon} %%%%% DUPLICATE
	\newabbreviation{evil}{EVIL}{Every Villian is Lemon} %%%%% DUPLICATE

%========================================================================================
%    F
%========================================================================================
	\newabbreviation{fish}{FISH}{F' It, Stuff Happens}

%========================================================================================
%    G
%========================================================================================
	\newabbreviation{gps}{GPS}{Go Pound Sound}

%========================================================================================
%    H
%========================================================================================
	\newabbreviation{hh}{HH}{hobbit head}

%========================================================================================
%    I
%========================================================================================
	\newabbreviation{irbh}{IRBH}{I'd Rather Be Hiking!}

%========================================================================================
%    M
%========================================================================================
	\newabbreviation{mo}{MO}{modus operandi}

%========================================================================================
%    O
%========================================================================================
	\newabbreviation{odst}{ODST}{Orbital Drop Shock Trooper}
	\newabbreviation{otr}{OTR}{Optimal Test Ruminant} %%%%% DUPLICATE
	\newabbreviation{otr}{OTR}{over the rainbow} %%%%% DUPLICATE

%========================================================================================
%    P
%========================================================================================
	\newabbreviation{pre}{PRE}{Prototype Ruminant Evaluation}

%========================================================================================
%    R
%========================================================================================
	\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}

%========================================================================================
%    S
%========================================================================================
	\newabbreviation{sc}{SC}{Snack Club}

%========================================================================================
%    U
%========================================================================================
	\newabbreviation{ul}{UL}{ultralight}
	\newabbreviation{unsc}{UNSC}{United Nations Space Command}
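
One way to sanity-check the generated file is to parse it again and count the entries. The matching pattern from earlier expects the closing brace to be followed directly by a newline, so it would skip the lines that carry a DUPLICATE flag. The sketch below uses a relaxed variant, which is my own assumption rather than part of the original script, that tolerates leading whitespace and a trailing `%` comment. `check_pattern` and `generated` are illustrative names.

```python
import re

# Assumed variant of the earlier pattern: allows a leading tab and a
# trailing "% ..." comment such as the DUPLICATE flag.
check_pattern = re.compile(
    r'^[ \t]*\\newabbreviation(\[.*?\])?\{(.*?)\}\{(.*?)\}\{(.*?)\}[ \t]*(?:%.*)?$',
    re.MULTILINE,
)

# A short excerpt in the same shape as the generated file.
generated = (
    "%    O\n"
    "\t\\newabbreviation{odst}{ODST}{Orbital Drop Shock Trooper}\n"
    "\t\\newabbreviation{otr}{OTR}{Optimal Test Ruminant} %%%%% DUPLICATE\n"
    "\t\\newabbreviation{otr}{OTR}{over the rainbow} %%%%% DUPLICATE\n"
)

parsed = check_pattern.findall(generated)
print(len(parsed))  # 3
```

Comparing `len(parsed)` for the whole file against `len(df)` would show whether every entry made it into Acronyms.tex.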

Conclusions

I can take this Acronyms.tex file, plop it into my Overleaf+LaTeX project, and refine it from there. This script becomes especially handy when you want to combine several large (200+ entry) acronym lists that are floating around.

This little project also highlights one of the benefits of building documents with LaTeX: the inputs are plain text, so manipulating them is easy to automate.