Organizing and Combining LaTeX Acronym/Glossary Entries

27 Sep 2023

Author: Austin Pursley

Note: this post was generated with JupyterLab. A better looking version is here. GitHub here.

Introduction

Problem:

You use LaTeX for your documents and the glossaries package for defining acronym and glossary entries. You organize the entries into .tex files for each project, e.g., “Acronyms.tex”, “Glossary.tex”. However, you’ve ended up with multiple versions of these .tex files from multiple projects, and now you need ALL the unique entries in one file for a new project. How do you go about doing that? You could do it manually, but that gets tedious if there are a lot of differences between files. Instead, you could use Python to automate the task. In addition, the entries can be organized along the way.

Set-up

I’m going to use acronyms in this project, but they could just as well be glossary entries, since the glossaries package handles both in nearly the same way.

First I’ll set up some example Acronyms.tex files. Note how they are unsorted, something we can improve on later.

acro1 = """
\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}
\\newabbreviation{unsc}{UNSC}{United Nations Space Command}
\\newabbreviation{odst}{ODST}{Orbital Drop Shock Trooper}
\\newabbreviation{fish}{FISH}{F' It, Stuff Happens}
\\newabbreviation{gps}{GPS}{Go Pound Sound}
\\newabbreviation{evil}{EVIL}{Every Villian is Lemon}
\\newabbreviation{otr}{OTR}{over the rainbow}
\\newabbreviation{sc}{SC}{Snack Club}
\\newabbreviation{mo}{MO}{modus operandi}
\\newabbreviation{ul}{UL}{ultralight}
\\newabbreviation{blt}{BLT}{bacon lettuce tomato}
"""

acro2 = """
\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}
\\newabbreviation{otr}{OTR}{Optimal Test Ruminant}
\\newabbreviation{pre}{PRE}{Prototype Ruminant Evaluation}
\\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}
\\newabbreviation{EVIL}{EVIL}{Every Villian is Lemon}
\\newabbreviation{irbh}{IRBH}{I'd Rather Be Hiking!}
\\newabbreviation{hh}{HH}{hobbit head}
\\newabbreviation{gps}{GPS}{Go Pound Sound}
\\newabbreviation{crud}{CRUD}{create, read, update, and delete}
"""

with open('Acronyms1.tex','w') as f:
    f.write(acro1)

with open('Acronyms2.tex','w') as f:
    f.write(acro2)

# Pretending we didn't just create these files.
files = ['Acronyms1.tex','Acronyms2.tex']
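
In a real project the file list would not be typed by hand. Here is a minimal sketch of how I might gather the files, assuming each project keeps its entries in a file whose name starts with "Acronyms". The helper name find_acronym_files and the folder layout in the demo are hypothetical, not part of the original workflow.

```python
from pathlib import Path
import tempfile

def find_acronym_files(root, name_glob='Acronyms*.tex'):
    """Return the sorted paths of matching .tex files anywhere under root."""
    return sorted(str(p) for p in Path(root).rglob(name_glob))

# Demo in a throwaway directory with a made-up layout: one folder per project.
with tempfile.TemporaryDirectory() as tmp:
    for project in ('projA', 'projB'):
        project_dir = Path(tmp) / project
        project_dir.mkdir()
        (project_dir / 'Acronyms.tex').write_text('\\newabbreviation{sc}{SC}{Snack Club}\n')
    found = find_acronym_files(tmp)
    print(len(found))
```

The returned list could stand in for the hard-coded files list above.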

Match Pattern

We need a pattern that captures the two different cases of acronym entries for the glossaries package. The first is the standard case, where no optional parameters are set:

\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}

The other has optional parameters:

\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}

In the pattern defined below, the optional portion is covered by (\[.*?\])?

The other three parameters are covered by the three {(.*?)}

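import regex as re  # third-party package; the standard-library re module also handles this pattern
# Groups: optional [...] block, entry ID, short form, long form. Each entry must end with a newline.
pattern = re.compile(r'\\newabbreviation(\[.*?\])?{(.*?)}{(.*?)}{(.*?)}\n')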
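
Before running the pattern on whole files, it is worth a quick check against the two cases described above. This is a small sketch that uses the standard-library re module instead of regex (my assumption that either is acceptable here); the sample string is made up from the example entries.

```python
import re

# Same pattern as in the post, compiled with the standard-library re module.
pattern = re.compile(r'\\newabbreviation(\[.*?\])?{(.*?)}{(.*?)}{(.*?)}\n')

sample = (
    '\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}\n'
    '\\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}\n'
)

# findall returns one tuple per entry; the optional group is '' when it is absent.
matches = pattern.findall(sample)
for optional, entry_id, short, long_form in matches:
    print(repr(optional), entry_id, short, long_form)

# Caveat: the pattern ends in \n, so a last line without a trailing newline is skipped.
no_newline = pattern.findall('\\newabbreviation{sc}{SC}{Snack Club}')
print(no_newline)  # prints []
```

If your files might not end with a newline, appending one to the file content before matching is one simple workaround.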

Run pattern against file content

matches_all = set()
for fn in files:
    # the context manager closes the file even if reading fails
    with open(fn) as f:
        f_str = f.read()
    # a set of tuples drops entries that are identical in every field
    matches_all |= set(pattern.findall(f_str))
matches_all
{('', 'EVIL', 'EVIL', 'Every Villian is Lemon'),
 ('', 'abba', 'ABBA', 'Björn & Benny, Agnetha & Frida'),
 ('', 'blt', 'BLT', 'bacon lettuce tomato'),
 ('', 'crud', 'CRUD', 'create, read, update, and delete'),
 ('', 'evil', 'EVIL', 'Every Villian is Lemon'),
 ('', 'fish', 'FISH', "F' It, Stuff Happens"),
 ('', 'gps', 'GPS', 'Go Pound Sound'),
 ('', 'hh', 'HH', 'hobbit head'),
 ('', 'irbh', 'IRBH', "I'd Rather Be Hiking!"),
 ('', 'mo', 'MO', 'modus operandi'),
 ('', 'odst', 'ODST', 'Orbital Drop Shock Trooper'),
 ('', 'otr', 'OTR', 'Optimal Test Ruminant'),
 ('', 'otr', 'OTR', 'over the rainbow'),
 ('', 'pre', 'PRE', 'Prototype Ruminant Evaluation'),
 ('', 'sc', 'SC', 'Snack Club'),
 ('', 'ul', 'UL', 'ultralight'),
 ('', 'unsc', 'UNSC', 'United Nations Space Command'),
 ('[longplural="Ruminants Under Test"]', 'rut', 'RUT', 'Ruminant Under Test')}
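
The result above has 18 entries even though the two files hold 20 lines between them. A tiny sketch of why, using made-up lists in place of the per-file matches: the set union only merges tuples that agree in every field, so an ID that is reused with a different long form survives as two entries.

```python
# Hypothetical per-file matches, shaped like the tuples that findall returns.
file_a = [('', 'gps', 'GPS', 'Go Pound Sound'),
          ('', 'otr', 'OTR', 'over the rainbow')]
file_b = [('', 'gps', 'GPS', 'Go Pound Sound'),
          ('', 'otr', 'OTR', 'Optimal Test Ruminant')]

merged = set(file_a) | set(file_b)

# The identical 'gps' tuples collapse into one; the two 'otr' tuples differ, so both stay.
otr_entries = sorted(m for m in merged if m[1] == 'otr')
print(len(merged))
print(otr_entries)
```

That is why conflicting entries still need separate handling after the merge.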

Dataframize

Because I like pandas, the matches go into a DataFrame with one named column per captured group.

import pandas as pd
df = pd.DataFrame(matches_all,columns=['optional','acronym id','short','long'])
df
    optional                             acronym id  short  long
0                                        evil        EVIL   Every Villian is Lemon
1                                        hh          HH     hobbit head
2                                        EVIL        EVIL   Every Villian is Lemon
3                                        mo          MO     modus operandi
4                                        pre         PRE    Prototype Ruminant Evaluation
5                                        ul          UL     ultralight
6                                        otr         OTR    over the rainbow
7   [longplural="Ruminants Under Test"]  rut         RUT    Ruminant Under Test
8                                        otr         OTR    Optimal Test Ruminant
9                                        crud        CRUD   create, read, update, and delete
10                                       irbh        IRBH   I'd Rather Be Hiking!
11                                       sc          SC     Snack Club
12                                       blt         BLT    bacon lettuce tomato
13                                       unsc        UNSC   United Nations Space Command
14                                       fish        FISH   F' It, Stuff Happens
15                                       abba        ABBA   Björn & Benny, Agnetha & Frida
16                                       gps         GPS    Go Pound Sound
17                                       odst        ODST   Orbital Drop Shock Trooper

Duplicate Handling

When you combine acronym entries from different documents, you’ll probably find at some point that some have the same ID or the same long form. Below I identify these and set a flag that is written out when we generate the final .tex file. Flagging them makes it easy to correct the file manually once it is generated, which I’ve found works better than trying to automate a correction (e.g., adding “2” to the end of a duplicate entry ID).

# keep=False flags every member of a duplicate group, not just the repeats
df.loc[df['acronym id'].duplicated(keep=False),'duplicate flag'] = ' %%%%% DUPLICATE'
df.loc[df['long'].duplicated(keep=False),'duplicate flag'] = ' %%%%% DUPLICATE'
df
    optional                             acronym id  short  long                              duplicate flag
0                                        evil        EVIL   Every Villian is Lemon            %%%%% DUPLICATE
1                                        hh          HH     hobbit head                       NaN
2                                        EVIL        EVIL   Every Villian is Lemon            %%%%% DUPLICATE
3                                        mo          MO     modus operandi                    NaN
4                                        pre         PRE    Prototype Ruminant Evaluation     NaN
5                                        ul          UL     ultralight                        NaN
6                                        otr         OTR    over the rainbow                  %%%%% DUPLICATE
7   [longplural="Ruminants Under Test"]  rut         RUT    Ruminant Under Test               NaN
8                                        otr         OTR    Optimal Test Ruminant             %%%%% DUPLICATE
9                                        crud        CRUD   create, read, update, and delete  NaN
10                                       irbh        IRBH   I'd Rather Be Hiking!             NaN
11                                       sc          SC     Snack Club                        NaN
12                                       blt         BLT    bacon lettuce tomato              NaN
13                                       unsc        UNSC   United Nations Space Command      NaN
14                                       fish        FISH   F' It, Stuff Happens              NaN
15                                       abba        ABBA   Björn & Benny, Agnetha & Frida    NaN
16                                       gps         GPS    Go Pound Sound                    NaN
17                                       odst        ODST   Orbital Drop Shock Trooper        NaN
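
The keep=False argument is what makes both members of a clashing pair show up flagged. A short sketch of the difference on a made-up Series of IDs (the values are only illustrative):

```python
import pandas as pd

ids = pd.Series(['otr', 'gps', 'otr', 'pre'])

# Default keep='first': only the later repeat is marked, the first 'otr' is not.
default_flags = ids.duplicated().tolist()

# keep=False: every member of a duplicate group is marked.
all_flags = ids.duplicated(keep=False).tolist()

print(default_flags)  # prints [False, False, True, False]
print(all_flags)      # prints [True, False, True, False]
```

With the default, only one of the two clashing entries would carry the comment in the final file, which makes the conflict easier to miss during manual review.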

Create Organized Content, Write It

I take the dataframe and use it as the base for building a string that will become the contents of the final Acronyms.tex file.

I can organize the entries while I’m at it. The first letter of the entry ID is used to alphabetize the entries, and a large comment is written above each group so the letter groupings are easy to spot in the file.

df['letter'] = df['acronym id'].str[0].str.upper()
letters = list(set(df['letter']))
letters.sort()

# for alphabetizing
df = df.sort_values(by=['letter','acronym id', 'long'])
df['entry'] = '\\newabbreviation' + df['optional'] + '{' + df['acronym id'] + '}{' + \
                df['short'] + '}{' + df['long'] + '}' + df['duplicate flag'].fillna('')

# top of the file
acronym_txt = """%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % ACRONYMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
"""

# function for creating one lettered section of the file
def to_latex_str_abc(letter, entries):
    abc_comment = """
%========================================================================================
%    """ + letter + """
%========================================================================================
"""
    entry_txt = ''
    for entry in entries:
        entry_txt += '\t' + entry + '\n'
    return abc_comment + entry_txt

# Iterating over the groups (sorted by letter) avoids groupby().apply() on the
# grouping column, which newer pandas versions deprecate.
df_by_l = [to_latex_str_abc(letter, group['entry']) for letter, group in df.groupby('letter')]
acronym_txt += ''.join(df_by_l)

# Review the result
print(acronym_txt)

# write content
with open('Acronyms.tex', 'w') as f:
    f.write(acronym_txt)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % ACRONYMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%========================================================================================
%    A
%========================================================================================
	\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}

%========================================================================================
%    B
%========================================================================================
	\newabbreviation{blt}{BLT}{bacon lettuce tomato}

%========================================================================================
%    C
%========================================================================================
	\newabbreviation{crud}{CRUD}{create, read, update, and delete}

%========================================================================================
%    E
%========================================================================================
	\newabbreviation{EVIL}{EVIL}{Every Villian is Lemon} %%%%% DUPLICATE
	\newabbreviation{evil}{EVIL}{Every Villian is Lemon} %%%%% DUPLICATE

%========================================================================================
%    F
%========================================================================================
	\newabbreviation{fish}{FISH}{F' It, Stuff Happens}

%========================================================================================
%    G
%========================================================================================
	\newabbreviation{gps}{GPS}{Go Pound Sound}

%========================================================================================
%    H
%========================================================================================
	\newabbreviation{hh}{HH}{hobbit head}

%========================================================================================
%    I
%========================================================================================
	\newabbreviation{irbh}{IRBH}{I'd Rather Be Hiking!}

%========================================================================================
%    M
%========================================================================================
	\newabbreviation{mo}{MO}{modus operandi}

%========================================================================================
%    O
%========================================================================================
	\newabbreviation{odst}{ODST}{Orbital Drop Shock Trooper}
	\newabbreviation{otr}{OTR}{Optimal Test Ruminant} %%%%% DUPLICATE
	\newabbreviation{otr}{OTR}{over the rainbow} %%%%% DUPLICATE

%========================================================================================
%    P
%========================================================================================
	\newabbreviation{pre}{PRE}{Prototype Ruminant Evaluation}

%========================================================================================
%    R
%========================================================================================
	\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}

%========================================================================================
%    S
%========================================================================================
	\newabbreviation{sc}{SC}{Snack Club}

%========================================================================================
%    U
%========================================================================================
	\newabbreviation{ul}{UL}{ultralight}
	\newabbreviation{unsc}{UNSC}{United Nations Space Command}
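
If you would rather not pull in pandas just for the grouping step, the same idea (sort by first letter, then write a comment header per letter) can be sketched with itertools.groupby. This is only an illustration with three made-up tuples and a shortened header, not the code that produced the output above; the helper first_letter is hypothetical.

```python
from itertools import groupby

# (entry ID, short form, long form); optional parameters left out for brevity.
entries = [
    ('unsc', 'UNSC', 'United Nations Space Command'),
    ('abba', 'ABBA', 'Björn & Benny, Agnetha & Frida'),
    ('ul', 'UL', 'ultralight'),
]

def first_letter(entry):
    """Upper-cased first character of the entry ID."""
    return entry[0][0].upper()

# groupby only merges adjacent items, so sort by letter (then by ID) first.
ordered = sorted(entries, key=lambda e: (first_letter(e), e[0]))

sections = []
for letter, group in groupby(ordered, key=first_letter):
    lines = ''.join('\t\\newabbreviation{%s}{%s}{%s}\n' % e for e in group)
    sections.append('%    ' + letter + '\n' + lines)

text = ''.join(sections)
print(text)
```

The sort before groupby is the important part: without it, entries sharing a letter but not adjacent in the input would end up in separate sections.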

Conclusions

I can take this Acronyms.tex file, plop it into my Overleaf+LaTeX project, and clean it up from there, starting with the entries flagged as duplicates. This script becomes especially handy when you want to combine several large (200+ entry) acronym lists that are floating around.

This little project also highlights one of the benefits of building LaTeX documents, which is how you can automate the manipulation of plain text inputs.