[Guest Post] “Do the magic: Regular Expressions in FrameMaker”, by Marek Pawelec

When working with text, whether authoring or editing, we often need to change something. Name of some feature from A to B. Spelling error. Numerical value. A phrase that does not fit the rest of text. And usually, we just delete the wrong text and type in the correct one. If a spelling error or incorrect spelling variant is repeated across the document, we may use Find/Change to correct all instances at once.

Regular Expressions with FrameMaker 2015

Things get a bit more difficult when we want to correct more than a single word at a time – but we can use wildcards. There are two commonly used wildcard symbols, and as you probably know, “*” stands for “any string” and “?” means “single character”. So searching for “anaestheti*” will return anaesthetic, anaesthetics, anaesthetised, anaesthetist etc., while searching for “anaestheti?” will return only “anaesthetic” (if “Whole Word” option is checked).

Wildcard-based search may be sufficient in a number of situations, where all you need is to find some text and change its attributes, for example character style. It is especially true in FrameMaker since FM wildcards support some features copied from regular expressions. However, if we need to conduct more advanced searches, conditional searches or perform some kind of operations on the text we find, we should reach for full power of regular expressions (“regex”).

In This Article

Meet regular expressions

What are they? Regular expressions are described as “a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations” [Wikipedia]. Somewhat like in the case of wildcards, there is a “vocabulary” – set of symbols that can be used to find a string of characters matching our expression, only the symbol set is much bigger – and a special kind of “grammar” which enables the creation of long and complex search patterns. Plus once we find something, we can usually modify it.

To give you some examples:

So, how do we actually accomplish all these things? You need to learn basic vocabulary and grammar of regular expressions. And you can do this either by reading the introduction below, or just skip to the reference table at the end and rifle through examples. All regular expressions in the text are written with \s{2,} font, and quotation marks around them (if present) are not part of the expressions. And remember – while the expressions may look scary, they are, in fact, quite regular. Once you grasp the meaning of symbols and basic rules, you’ll be able to use them at no time.

Alphabet for Beginners

If we want to find any particular symbol/letter/number using regular expression, we just type that symbol/letter/number. So anaesthetic is a regular expression which matches only that particular sequence of lower-case letters (regular expressions are case-sensitive).

If we need to find any character or symbol, we use period (.). Single dot means single symbol of any kind.

To get somewhat more specific matches we can define character class by using square brackets [ ]. For example [abcd] will match “a”, “b”, “c” or “d”. If you are looking for a symbol within continuous range, instead of defining all elements of class, we can simply state first and last: [a-d], or [0-9] to match any digit. The order in ranges is always the same as in ASCII character table, so to match any letter regardless of case we can use class [A-Za-z].

There is also a special type of class used for negation: if the first symbol within square brackets is the circumflex (^), expression will match anything except the class content. For example [^abcd] will match any single symbol except for a, b, c or d. Class can be seen as logical alternative for set with single symbol elements.

If we need to employ alternatives for longer elements, we can use pipe character “|” to separate elements: Monday|Tuesday|Wednesday|Thursday will match any of these four weekday names. Unfortunately there is no simple way to exclude longer strings for search – matching anything but some string is usually bit tricky.

Do we have to define these classes every time? That depends on the class. There are some shortcuts defined in regular expression vocabulary:

Of course this is just the most basic set. Instead of shortcuts you can also use Unicode categories, which are much more versatile with regard to various scripts and non-ASCII characters – see the reference tables at the end of this post.

Backslash by itself is a special escape character. You know that a period means any symbol, but what do we do if we want to match only period? Simply escape it with backslash: \. . But since backslash is special character too, if we need to match backslash, we have to escape it too: \\

Hang on – anchors

There are two special symbols used to define start (^) and end ($) of the line, plus shortcut defining word boundary (\b) they are called “anchors”. To recall our earlier example, anaesthetic will match that string anywhere in text, but ^anaesthetic will match only, if the string will appear at the beginning of line (paragraph). By itself aesthetic will also match “anaesthetic” and “anaesthetics”, aesthetic\b will match “anaesthetic” but not “anaesthetics” and \baesthetic\b will match neither. It’s a bit like searching with “Whole Word” option, but with more precise control.

Grammar rule[sz]

This was essential “vocabulary”, now we’ll see some “grammar”. Let’s start by introducing operators (quantifiers), since quite often we are interested in some longer strings, not just particular symbols. Operators works for the object immediately to their left. There are three basic operators:

We can use these quantifiers for example to find both spelling versions of an[a]esthetic with a single expression. ana?esthetic will match both anesthetic and anaesthetic, because letter “a” can occur 0 or 1 time.

Benefits of Laziness

When using operators, we need to remember, that by default they behavior is “greedy”, that is, they want to match the longest possible string within paragraph (in FM with default settings) or even whole text.

We can expand our search to cover both spelling variants and different word suffixes. ana?esthe.*\b will match anaesthe or anesthe followed by any letter occurring any number of times until the end of word. Unfortunately, the greedy behavior will make the expression match up to the last word boundary it can find (since period matches also a space):

clip_image004_538982365

Now, being eager like that can be a good thing, but as they say, laziness is one of the greatest programmer virtues. So in order to avoid such excessive matching we have to use “lazy” matching operator combinations:

If we modify our expression accordingly: ana?esthe.*?\b

clip_image006_-670033708

We can match exactly what we want. The alternative way to safely match only single word in this case would be to use ana?esthe\w+ ­that way matching will end at the first non-word character (e.g. space or punctuation mark).

Bean Counter

Are these the only quantifiers you have at your disposal? No, for precise matching we can define either the exact number of occurrences we want to match or a range of occurrences using curly brackets {}:

So for our (an)aesthetic example ana?esthe\w{3} will match anaesthesia and anesthetic, but not anesthetist (\w{3} will match exactly three of any “word” characters).

Now it’s time to familiarize ourselves with the use of parentheses ( ). I’ve already written that regular expression allow us to manipulate results of our searches. This is possible, because the match is stored in memory, numbered and can be recalled. Whole match is always numbered “0”. What’s important, we can define parts of search expression to be separate groups, and we do that by putting them inside parentheses. Content of the groups is recalled by $num, e.g. $1 for group number 1.

Numbers Crunching

Let’s say we want to find a date in American format: 12/24/2015. A simple expression for matching this can look like this:

\d\d/\d\d/\d\d\d\d – or, a more elegant and readable version – \d{2}/\d{2}/\d{4}

Match two digits, slash, again two digits, slash and four digits. Now, if we want to somehow manipulate this date, we have to define groups:

(\d{2})/(\d{2})/(\d{4})
$1-$2-$3
12/24/2015
12-24-2015

Of course we can freely manipulate the results, so if we need to change date format to European, we can use this:

(\d{2})/(\d{2})/(\d{4})
$2.$1.$3
12/24/2015
24-12-2015

Quite often when writing longer expression it’s convenient to create a group, that will not be numbered. In such case we use the so-called passive group: (?: ). An example of expression with passive groups:

`(\d{2})(?:/
-
:
.)(\d{2})(?:/

(And if you wonder – yes, it would be simpler to write (\d{2})[/-:\.](\d{2})[/-:\.](\d{4}), in this case grouping is not necessary, but it’s just an example.)

What if

The last “grammar” element I’m going to show you are assertions. Simply speaking, they are conditions or “ifs”. Whatever is inside assertion is not considered part of the match, but must (or cannot) be present before or after particular content to match it. There are four of them, two positive and two negative.

First Example

Second Example

Why do we need them? Quite often we need to match (find) some string (text) only, if it matches certain criteria, or it appears before or after some other text. Let’s say we need to remove all double spaces occurring between sentences. The simplest expression for this purpose would be (without quotation marks):

Find:{2,}” (space, occur at least twice)

Change: “” (single space)

Unfortunately this will convert all sequences of multiple spaces into single space. If for any reason we want to remove only double spaces between sentences, we need to be more precise:

Find:(\.) {2,}([A-Z])” (period, store it as group #1, at least two spaces, capital letter, store it as group #2)

Change:
$1 $2 (put back that period, insert single space, put back that capital letter)

Now, since the period and capital letters are only there to define matching criteria, we can make the Find rule a bit more complex, to simplify Replace:

Find:(?<=\.) {2,}(?=[A-Z])” (at least two spaces, but only if there is a period before and capital letter behind them)

Change: “”

A bit more complex example from the real life scenario: let’s say you have a text on “ACME super rocket engine”. However, the author was somewhat inconsistent and sometimes it’s referred in text as “ACME super rocket”. You need to insert “engine” into text, but only where it’s not already there to avoid double words. To make matter more complex, sometimes text refers to just “super rocket”, without ACME, and you should not insert “engine” there either. Fortunately, we have regex:

Find:(?<=ACME )(super rocket)(?! engine)” (match “super rocket” if there is “ACME” before but no “ engine” after)

Change:
$1 engine (insert back “super rocket”, but add a space and “engine” after)

(And if your text is written in an inflected language, like Polish, you can match all grammar variants in one go:(?<=ACME )(supe\w+ rock\w+)(?! engine)” –_ in Polish “rocket” is inflected as “rakieta, rakiety, rakiecie, rakietą”, and that’s just singular._)

Bring it on

Below you will find several examples of regex matching

HTML and XML tags:

Find “Chapter” line:

Match compound number (1.3.2): \d+(\.\d+)+ (match string of digits followed by period and string of digits – the sequence “period and string of digits” may appear any number of times).

However, consider what will happen if you want to use the expression for find and replace (\d+(\.\d+)+): for 3.5 you will get two numbered groups – $1 = 3.5, $2 = .5. For 3.5.9 you will get also two numbered groups, but with different result – $1 = 3.5.9, $2 = .9. Should your expression include more numbered groups, referencing them would become unpredictable. Therefore it is a good idea to use passive group in such case: (\d+(?:\.\d+)+) will return only $1 covering whole compound number.

Find numbers followed by “in” designation and replace with the same number, space and “in”:

Find:
(\d)(in\b) (“\b” is a safeguard against matching of longer string starting with “in” e.g. internal)

Change:
$1 $2

There is an important limitation you need to be aware of when changing/replacing. While we can use various symbols for matching (e.g. \u2212 for n-dash), in the change/replace field all text is treated literally (except for $num). So if we want to insert n-dash into changed text, we need to type/paste actual n-dash, not it’s code. On the other hand, we can type in period without escaping it.

Reference

Should you be interested in learning more, there is plenty of resources available on the Internet: both courses (e.g. http://www.regular-expressions.info/) and software for building and testing regular expressions – both online (e.g. http://www.regexpal.com/) and offline (e.g. http://www.ultrapico.com/expresso.htm)

Table of basic regular expression symbols

.
Any character
[ ]
Any of the given characters (class)
[ - ]
Range of characters
[^ ]
Exclusion of matching
`
`
\
Modifier
\d
Digit
\D
Anything but digit
\s
Whitespace character
\S
Anything but whitespace
\w
Any “word” character – includes letters, numbers and underscore (_)
\W
Anything but word
\r
Carriage return
\n
Newline_FrameMaker performs all Find operations within a single paragraph, but in other software regular expression will often match the whole file, so it is useful to know that for Windows files end of line is encoded as “\r\n”._
\t
Tab
\xnnnn
Unicode character nnnn

Operators and anchors

?
Zero or one occurrence of symbol on the left
*
Zero or more occurrences of symbol on the left
+
One or more occurrences of symbol on the left
*?
Zero or as little as possible – lazy matching
+?
One or as little as possible – lazy matching
{ , }
Range
^
Start of line
$
End of line
\b
Word boundary

Grouping

( )
Group
(?: )
Passive group (not numbered)
(?= )
Positive look ahead
(?! )
Negative look ahead
(?<= )
Positive look behind
(?<! )
Negative look behind
(?# )
Comment

Unicode categories

Note: Make sure to check the “Consider Case” check box.

\p{L}
Any letter from any language (equivalent to [A-Za-z] but also for non-Latin scripts)
\p{Lu}
Uppercase letter with lowercase variant (Letter uppercase)
\p{Ll}
Lowercase letter with uppercase variant (Letter lowercase)
\p{Z}
Any whitespace or separator
\p{S}
Math symbols, currency symbols, dingbatsm – math symbol, c – currency symbol
\p{N}
Numbers
\p{P}
Punctuation marksd – pause and dashes, s – opening brackets, e – closing brackets, i – opening parentheses, f – closing parentheses