Problem: you write your documents in LaTeX and use the glossaries package to define acronym and glossary entries. You organize the entries into .tex files for each project, e.g., "Acronyms.tex" and "Glossary.tex". However, you've ended up with multiple versions of these .tex files from multiple projects, and now you need ALL the unique acronyms in one file for a new project. How do you go about doing that? You could do it manually, but that gets tedious if there are a lot of differences between the files. Instead, you could use Python to automate the task and organize the entries along the way.
I'm going to use acronyms in this project, but they could just as well be glossary entries, since the glossaries package handles both in nearly the same way.
First I'll set up some example acronym .tex files. Note that they are unsorted, which is something we can improve on later.
acro1 = """
\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}
\\newabbreviation{unsc}{UNSC}{United Nations Space Command}
\\newabbreviation{odst}{ODST}{Orbital Drop Shock Trooper}
\\newabbreviation{fish}{FISH}{F' It, Stuff Happens}
\\newabbreviation{gps}{GPS}{Go Pound Sound}
\\newabbreviation{evil}{EVIL}{Every Villian is Lemon}
\\newabbreviation{otr}{OTR}{over the rainbow}
\\newabbreviation{sc}{SC}{Snack Club}
\\newabbreviation{mo}{MO}{modus operandi}
\\newabbreviation{ul}{UL}{ultralight}
\\newabbreviation{blt}{BLT}{bacon lettuce tomato}
"""
acro2 = """
\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}
\\newabbreviation{otr}{OTR}{Optimal Test Ruminant}
\\newabbreviation{pre}{PRE}{Prototype Ruminant Evaluation}
\\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}
\\newabbreviation{EVIL}{EVIL}{Every Villian is Lemon}
\\newabbreviation{irbh}{IRBH}{I'd Rather Be Hiking!}
\\newabbreviation{hh}{HH}{hobbit head}
\\newabbreviation{gps}{GPS}{Go Pound Sound}
\\newabbreviation{crud}{CRUD}{create, read, update, and delete}
"""
with open('Acronyms1.tex','w') as f:
    f.write(acro1)
with open('Acronyms2.tex','w') as f:
    f.write(acro2)
# Pretending we didn't just create these files.
files = ['Acronyms1.tex','Acronyms2.tex']
We need a pattern that captures the two different forms of acronym entries for the glossaries package. The first is the standard form, with no optional parameters set:
\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}
The other has optional parameters:
\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}
In the pattern defined below, the optional portion is covered by (\[.*?\])?
The three required parameters are covered by the three {(.*?)} groups. The trailing \n means each entry must end its line, so the last entry in a file needs a final newline.
import re
pattern = re.compile(r'\\newabbreviation(\[.*?\])?{(.*?)}{(.*?)}{(.*?)}\n')
matches_all = set()
for fn in files:
    # the with-statement closes the file for us
    with open(fn) as f:
        f_str = f.read()
    # each match is a tuple: (optional, id, short, long)
    matches_f = pattern.findall(f_str)
    # the set drops entries that are identical across files
    matches_all = matches_all.union(matches_f)
matches_all
{('', 'EVIL', 'EVIL', 'Every Villian is Lemon'), ('', 'abba', 'ABBA', 'Björn & Benny, Agnetha & Frida'), ('', 'blt', 'BLT', 'bacon lettuce tomato'), ('', 'crud', 'CRUD', 'create, read, update, and delete'), ('', 'evil', 'EVIL', 'Every Villian is Lemon'), ('', 'fish', 'FISH', "F' It, Stuff Happens"), ('', 'gps', 'GPS', 'Go Pound Sound'), ('', 'hh', 'HH', 'hobbit head'), ('', 'irbh', 'IRBH', "I'd Rather Be Hiking!"), ('', 'mo', 'MO', 'modus operandi'), ('', 'odst', 'ODST', 'Orbital Drop Shock Trooper'), ('', 'otr', 'OTR', 'Optimal Test Ruminant'), ('', 'otr', 'OTR', 'over the rainbow'), ('', 'pre', 'PRE', 'Prototype Ruminant Evaluation'), ('', 'sc', 'SC', 'Snack Club'), ('', 'ul', 'UL', 'ultralight'), ('', 'unsc', 'UNSC', 'United Nations Space Command'), ('[longplural="Ruminants Under Test"]', 'rut', 'RUT', 'Ruminant Under Test')}
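To see what the pattern captures for each of the two entry forms, here is a small sanity check. It is only a sketch: it uses the standard-library re module and an in-memory string (the name sample is made up for this example) rather than the files above.

```python
import re

# Same pattern as in the walkthrough, compiled with the standard-library re module.
pattern = re.compile(r'\\newabbreviation(\[.*?\])?{(.*?)}{(.*?)}{(.*?)}\n')

# One entry without optional parameters and one with them.
sample = (
    '\\newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}\n'
    '\\newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}\n'
)

# findall returns one 4-tuple per entry; the first item is '' when
# the optional [...] part is absent.
entries = pattern.findall(sample)
for optional, entry_id, short, long_form in entries:
    print(repr(optional), entry_id, short, long_form)
```

The empty string in the first position is what later ends up in the "optional" column.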
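Because I like pandas, I'll load the matches into a DataFrame.
import pandas as pd
# pandas won't build a DataFrame from an unordered set, so convert it to a list first
df = pd.DataFrame(list(matches_all), columns=['optional','acronym id','short','long'])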
df
| | optional | acronym id | short | long |
|---|---|---|---|---|
| 0 | | evil | EVIL | Every Villian is Lemon |
| 1 | | hh | HH | hobbit head |
| 2 | | EVIL | EVIL | Every Villian is Lemon |
| 3 | | mo | MO | modus operandi |
| 4 | | pre | PRE | Prototype Ruminant Evaluation |
| 5 | | ul | UL | ultralight |
| 6 | | otr | OTR | over the rainbow |
| 7 | [longplural="Ruminants Under Test"] | rut | RUT | Ruminant Under Test |
| 8 | | otr | OTR | Optimal Test Ruminant |
| 9 | | crud | CRUD | create, read, update, and delete |
| 10 | | irbh | IRBH | I'd Rather Be Hiking! |
| 11 | | sc | SC | Snack Club |
| 12 | | blt | BLT | bacon lettuce tomato |
| 13 | | unsc | UNSC | United Nations Space Command |
| 14 | | fish | FISH | F' It, Stuff Happens |
| 15 | | abba | ABBA | Björn & Benny, Agnetha & Frida |
| 16 | | gps | GPS | Go Pound Sound |
| 17 | | odst | ODST | Orbital Drop Shock Trooper |
When you combine acronym entries from different documents, you'll probably find that some share the same ID or the same long form. Below I identify these and set a flag that is appended as a LaTeX comment when we generate the final .tex file. Flagging them makes it easy to correct the file manually once it's generated, which I've found works better than trying to automate a correction (e.g., adding "2" to the end of a duplicate entry ID).
df.loc[df['acronym id'].duplicated(keep=False),'duplicate flag'] = ' %%%%% DUPLICATE'
df.loc[df['long'].duplicated(keep=False),'duplicate flag'] = ' %%%%% DUPLICATE'
df
| | optional | acronym id | short | long | duplicate flag |
|---|---|---|---|---|---|
| 0 | | evil | EVIL | Every Villian is Lemon | %%%%% DUPLICATE |
| 1 | | hh | HH | hobbit head | NaN |
| 2 | | EVIL | EVIL | Every Villian is Lemon | %%%%% DUPLICATE |
| 3 | | mo | MO | modus operandi | NaN |
| 4 | | pre | PRE | Prototype Ruminant Evaluation | NaN |
| 5 | | ul | UL | ultralight | NaN |
| 6 | | otr | OTR | over the rainbow | %%%%% DUPLICATE |
| 7 | [longplural="Ruminants Under Test"] | rut | RUT | Ruminant Under Test | NaN |
| 8 | | otr | OTR | Optimal Test Ruminant | %%%%% DUPLICATE |
| 9 | | crud | CRUD | create, read, update, and delete | NaN |
| 10 | | irbh | IRBH | I'd Rather Be Hiking! | NaN |
| 11 | | sc | SC | Snack Club | NaN |
| 12 | | blt | BLT | bacon lettuce tomato | NaN |
| 13 | | unsc | UNSC | United Nations Space Command | NaN |
| 14 | | fish | FISH | F' It, Stuff Happens | NaN |
| 15 | | abba | ABBA | Björn & Benny, Agnetha & Frida | NaN |
| 16 | | gps | GPS | Go Pound Sound | NaN |
| 17 | | odst | ODST | Orbital Drop Shock Trooper | NaN |
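The flag says that an entry clashes with something, but not with what. If you want to see the clashes spelled out, a pandas-free sketch like the one below can work on the raw tuples. The function name find_conflicts and the short sample list are made up for this example; they are not part of the script above.

```python
from collections import defaultdict

# A few (optional, id, short, long) tuples in the shape produced by the regex step.
entries = [
    ('', 'otr', 'OTR', 'over the rainbow'),
    ('', 'otr', 'OTR', 'Optimal Test Ruminant'),
    ('', 'evil', 'EVIL', 'Every Villian is Lemon'),
    ('', 'EVIL', 'EVIL', 'Every Villian is Lemon'),
    ('', 'gps', 'GPS', 'Go Pound Sound'),
]

def find_conflicts(entries):
    """Return (IDs with several long forms, long forms with several IDs)."""
    longs_by_id = defaultdict(set)
    ids_by_long = defaultdict(set)
    for _optional, entry_id, _short, long_form in entries:
        longs_by_id[entry_id].add(long_form)
        ids_by_long[long_form].add(entry_id)
    # Keep only the keys that map to more than one distinct value.
    dup_ids = {k: sorted(v) for k, v in longs_by_id.items() if len(v) > 1}
    dup_longs = {k: sorted(v) for k, v in ids_by_long.items() if len(v) > 1}
    return dup_ids, dup_longs

dup_ids, dup_longs = find_conflicts(entries)
print(dup_ids)    # IDs defined with different long forms
print(dup_longs)  # long forms defined under different IDs
```

Here "otr" shows up as an ID with two meanings, while "evil" and "EVIL" show up as two IDs for the same long form, which are the same two cases the flag marks.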
I take the dataframe and use it as the base for building a string that will become the contents of the final Acronyms.tex file.
I can organize the entries while I'm at it. The first letter of the entry ID is used to alphabetize the entries, and a large comment block is written before each group so the letter groupings are clearly visible in the file.
df['letter'] = df['acronym id'].str[0].str.upper()
# for alphabetizing
df = df.sort_values(by=['letter', 'acronym id', 'long'])
df['entry'] = ('\\newabbreviation' + df['optional'] + '{' + df['acronym id'] + '}{'
               + df['short'] + '}{' + df['long'] + '}' + df['duplicate flag'].fillna(''))
# top of the file
acronym_txt = """%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ACRONYMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
"""
# function for creating file section by letter
def to_latex_str_abc(letter, entries):
    abc_comment = """
%========================================================================================
% """ + letter + """
%========================================================================================
"""
    # one tab-indented entry per line
    entry_txt = ''.join('\t' + entry + '\n' for entry in entries)
    return abc_comment + entry_txt

# groupby yields the letter groups in sorted order
for letter, group in df.groupby('letter'):
    acronym_txt += to_latex_str_abc(letter, group['entry'])
# Review the result
print(acronym_txt)
# write content
with open('Acronyms.tex', 'w') as f:
    f.write(acronym_txt)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ACRONYMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%========================================================================================
% A
%========================================================================================
    \newabbreviation{abba}{ABBA}{Björn & Benny, Agnetha & Frida}

%========================================================================================
% B
%========================================================================================
    \newabbreviation{blt}{BLT}{bacon lettuce tomato}

%========================================================================================
% C
%========================================================================================
    \newabbreviation{crud}{CRUD}{create, read, update, and delete}

%========================================================================================
% E
%========================================================================================
    \newabbreviation{EVIL}{EVIL}{Every Villian is Lemon} %%%%% DUPLICATE
    \newabbreviation{evil}{EVIL}{Every Villian is Lemon} %%%%% DUPLICATE

%========================================================================================
% F
%========================================================================================
    \newabbreviation{fish}{FISH}{F' It, Stuff Happens}

%========================================================================================
% G
%========================================================================================
    \newabbreviation{gps}{GPS}{Go Pound Sound}

%========================================================================================
% H
%========================================================================================
    \newabbreviation{hh}{HH}{hobbit head}

%========================================================================================
% I
%========================================================================================
    \newabbreviation{irbh}{IRBH}{I'd Rather Be Hiking!}

%========================================================================================
% M
%========================================================================================
    \newabbreviation{mo}{MO}{modus operandi}

%========================================================================================
% O
%========================================================================================
    \newabbreviation{odst}{ODST}{Orbital Drop Shock Trooper}
    \newabbreviation{otr}{OTR}{Optimal Test Ruminant} %%%%% DUPLICATE
    \newabbreviation{otr}{OTR}{over the rainbow} %%%%% DUPLICATE

%========================================================================================
% P
%========================================================================================
    \newabbreviation{pre}{PRE}{Prototype Ruminant Evaluation}

%========================================================================================
% R
%========================================================================================
    \newabbreviation[longplural="Ruminants Under Test"]{rut}{RUT}{Ruminant Under Test}

%========================================================================================
% S
%========================================================================================
    \newabbreviation{sc}{SC}{Snack Club}

%========================================================================================
% U
%========================================================================================
    \newabbreviation{ul}{UL}{ultralight}
    \newabbreviation{unsc}{UNSC}{United Nations Space Command}
I can take this Acronyms.tex file, plop it into my Overleaf LaTeX project, and refine it from there. This script becomes especially handy when you want to combine several large (200+ entry) acronym lists that are floating around.
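When there are more than a couple of lists, typing the file names by hand gets old. One possible way to gather them is sketched below, assuming the files sit in one folder and follow an "Acronyms*.tex" naming scheme. The helper collect_entries, the folder, and the glob pattern are all assumptions made for this example, and the demo writes two throwaway files to a temporary directory so it can run on its own.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the walkthrough.
PATTERN = re.compile(r'\\newabbreviation(\[.*?\])?{(.*?)}{(.*?)}{(.*?)}\n')

def collect_entries(paths):
    """Return the set of unique entries found across all given .tex files."""
    entries = set()
    for path in paths:
        text = Path(path).read_text(encoding='utf-8')
        entries.update(PATTERN.findall(text))
    return entries

# Demo: two small files in a temporary folder (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / 'Acronyms1.tex').write_text(
        '\\newabbreviation{gps}{GPS}{Go Pound Sound}\n', encoding='utf-8')
    (folder / 'Acronyms2.tex').write_text(
        '\\newabbreviation{gps}{GPS}{Go Pound Sound}\n'
        '\\newabbreviation{hh}{HH}{hobbit head}\n', encoding='utf-8')
    # glob picks up every file that matches the naming scheme
    found = collect_entries(sorted(folder.glob('Acronyms*.tex')))

print(sorted(found))
```

The entry that appears in both files is kept only once, because the results are collected in a set.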
This little project also highlights one of the benefits of building LaTeX documents, which is how you can automate the manipulation of plain text inputs.