Regular Expressions

 

Regular Expression Filters in SyncBackPro

 

This section of the help file provides information and guidance about regular expression filters. Regular expressions are a system for matching patterns in text data. They provide a powerful set of tools for finding particular words or combinations of characters in strings.

 

Note that by default SyncBackPro filters are case insensitive, and it is not recommended that you use case sensitivity (via the modifiers). Also keep in mind that a regular expression can match any part of a filename, unlike a DOS expression, which must match the entire filename. SyncBackPro works with line separators as recommended at www.unicode.org; however, there are no line separators within a filename. To add flexibility, SyncBackPro treats the backslash character (\) as a line separator. This means that filenames are essentially broken down into their parts, with each part being treated as a separate line. See the Line Separator section below for how this is useful.

 

IMPORTANT: 2BrightSparks cannot provide technical support for helping you create regular expressions.

 

 

Simple matches

 

Any single character matches itself, unless it is a meta-character with a special meaning described below.

 

A series of characters matches that series of characters in the target string, so the pattern blah would match blah in the target string.

 

You can cause characters that normally function as meta-characters or escape sequences to be interpreted literally by 'escaping' them, that is, by preceding them with a backslash (\). For instance, the meta-character ^ matches the beginning of the string, but \^ matches the character ^, \\ matches \, and so on.

 

Examples:

 

 foobar matches string foobar

 \^FooBarPtr matches ^FooBarPtr
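The escaping rule can be tried out with a short script. This is only a sketch: it assumes Python's re module as a stand-in for the engine SyncBackPro uses, and the variable names are illustrative.

```python
import re

# Python's re module is a stand-in engine here (an assumption);
# SyncBackPro's own engine may differ in other respects.
plain = re.search(r"foobar", "my foobar file")          # literal characters
escaped = re.search(r"\^FooBarPtr", "x = ^FooBarPtr")   # \^ is a literal ^
unescaped = re.search(r"^FooBarPtr", "x = ^FooBarPtr")  # ^ is an anchor here

print(plain is not None)      # literal characters match themselves
print(escaped is not None)    # the escaped ^ matches the ^ character
print(unescaped is not None)  # the anchor fails: text does not start with FooBarPtr
```

Without the backslash, ^ is treated as the start-of-string anchor, so the third search finds nothing.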

 

 

Escape sequences

 

Characters may be specified using escape sequences syntax much like that used in C and Perl: \n matches a newline, \t a tab, etc. More generally, \xnn, where nn is a string of hexadecimal digits, matches the character whose ASCII value is nn. If you need a Unicode character code, you can use \x{nnnn} where nnnn is one or more hexadecimal digits.

 

 \xnn                character with hex code nn

 \x{nnnn}        character with hex code nnnn (one byte for plain text and two bytes for Unicode)

 \t                tab (HT/TAB), same as \x09

 \n                newline (NL), same as \x0a

 \r                carriage return (CR), same as \x0d

 \f                form feed (FF), same as \x0c

 \a                alarm (bell) (BEL), same as \x07

 \e                escape (ESC), same as \x1b

 

Examples:

 

 foo\x20bar matches foo bar (note space in the middle)

 \tfoobar matches foobar preceded by a tab
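As a rough illustration, the two examples above can be checked in Python. Python's re module is an assumed stand-in engine: it accepts \xnn, \t, \n, \r, \f and \a, but not the \e or \x{nnnn} forms listed in the table, so those are left out.

```python
import re

# Stand-in engine (assumption): only escapes that Python's re also accepts are used.
space_match = re.search(r"foo\x20bar", "foo bar")  # \x20 is a space
tab_match = re.search(r"\tfoobar", "\tfoobar")     # text starts with a tab
no_tab = re.search(r"\tfoobar", "foobar")          # no tab in the text

print(space_match is not None, tab_match is not None, no_tab is not None)
```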

 

 

Character classes

 

You can specify a character class, by enclosing a list of characters in square brackets ([ ]), which will match any one character from the list.

 

If the first character after the opening square bracket [ is ^, the class matches any character not in the list.

 

Examples:

 

 foob[aeiou]r finds strings foobar, foober, etc. but not foobbr, foobcr, etc.

 foob[^aeiou]r finds strings foobbr, foobcr, etc. but not foobar, foober, etc.

 

Within a list, the dash/minus character - is used to specify a range, so that a-z represents all characters between a and z, inclusive.

 

If you want - itself to be a member of a class, put it at the start or end of the list, or escape it with a backslash. If you want a closing square bracket ] then you may place it at the start of list or escape it with a backslash.

 

Examples:

 

 [-az]                matches a, z and -

 [az-]                matches a, z and -

 [a\-z]                matches a, z and -

 [a-z]                matches all twenty-six lowercase letters from a to z

 [\n-\x0D]        matches any of #10, #11, #12, #13

 [\d-t]                matches any digit, - or t

 []-a]                matches any character from ] to a
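The class rules above can be sketched in Python (an assumed stand-in engine; the helper name matches is hypothetical). Only class forms that Python's re module also accepts are used.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# A class and its negation
vowel = [s for s in ["foobar", "foober", "foobbr"] if matches(r"foob[aeiou]r", s)]
non_vowel = [s for s in ["foobar", "foobbr", "foobcr"] if matches(r"foob[^aeiou]r", s)]

# Three ways of making - a member of the class
dash_forms = [matches(p, "-") for p in [r"[-az]", r"[az-]", r"[a\-z]"]]

# Between two characters, - defines a range and is not itself a member
range_has_dash = matches(r"[a-z]", "-")

print(vowel, non_vowel, dash_forms, range_has_dash)
```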

 

 

Meta-characters

 

Meta-characters are special characters which are the essence of Regular Expressions. There are different types of meta-characters, described below.

 

Meta-characters - line separators

 

 ^        start of line

 $        end of line

 \A        start of text

 \Z        end of text

 .        any character in line

 

Examples:

 

 ^foobar        matches string foobar only if it's at the beginning of line

 foobar$        matches string foobar only if it's at the end of line

 ^foobar$        matches string foobar only if it's the only string in line

 foob.r                matches strings like foobar, foobbr, foob1r and so on
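A small sketch of the four examples above, assuming Python's re module as a stand-in engine and a made-up list of names:

```python
import re

names = ["foobar", "foobar2", "myfoobar", "foob1r"]

starts = [n for n in names if re.search(r"^foobar", n)]   # foobar at the start
ends = [n for n in names if re.search(r"foobar$", n)]     # foobar at the end
exact = [n for n in names if re.search(r"^foobar$", n)]   # nothing but foobar
any_char = [n for n in names if re.search(r"foob.r", n)]  # any character after foob

print(starts, ends, exact, any_char)
```

Because an unanchored expression can match anywhere in the text, foob.r finds a match in all four names.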

 

The ^ meta-character by default is only guaranteed to match at the beginning of the input string/text, the $ meta-character only at the end. Embedded line separators will not be matched by ^ or $.

 

You may, however, wish to treat a string as a multi-line buffer, such that the ^ will match after any line separator within the string, and $ will match before any line separator. You can do this by switching on the modifier m.

 

The \A and \Z are just like ^ and $, except that they won't match multiple times when the modifier m is used, while ^ and $ will match at every internal line separator.

 

The . meta-character by default matches any character, but if you switch off the modifier s, then . won't match embedded line separators.

 

^ matches at the beginning of an input string and, if modifier m is on, also immediately following any occurrence of \, \x0D\x0A, \x0A, \x0D, \x2028, \x2029, \x0B, \x0C, or \x85. Note that there is no empty line within the sequence \x0D\x0A.

 

$ matches at the end of an input string and, if modifier m is on, also immediately preceding any occurrence of \, \x0D\x0A, \x0A, \x0D, \x2028, \x2029, \x0B, \x0C, or \x85. Note that there is no empty line within the sequence \x0D\x0A.

 

. matches any character, but if you switch off modifier s then . doesn't match \, \x0D\x0A, \x0A, \x0D, \x2028, \x2029, \x0B, \x0C, or \x85.

 

Note that ^.*$ (an empty line pattern) does not match the empty string within the sequence \x0D\x0A, but matches the empty string within the sequence \x0A\x0D.

 

 

Meta-characters - predefined classes

 

 \w        an alphanumeric character (including underscore _)

 \W        a non-alphanumeric

 \d        a numeric character

 \D        a non-numeric

 \s        any space (same as [ \t\n\r\f])

 \S        a non space

 

You may use \w, \d and \s within custom character classes.

 

Examples:

 

 foob\dr matches strings like foob1r, foob6r and so on but not foobar, foobbr and so on

 foob[\w\s]r matches strings like foobar, foob r, foobbr, foob1r and so on but not foob=r, foob-r and so on
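The predefined classes can be exercised with a short Python sketch. Python's re module is an assumed stand-in; its \w and \s are Unicode-aware by default, so they may accept more characters than the classes described here.

```python
import re

# \d accepts a digit only
digit = [s for s in ["foob1r", "foob6r", "foobar"] if re.fullmatch(r"foob\dr", s)]

# \w and \s combined inside a custom class
word_or_space = [
    s for s in ["foobar", "foob r", "foob_r", "foob=r"]
    if re.fullmatch(r"foob[\w\s]r", s)
]

print(digit, word_or_space)
```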

 

 

Meta-characters - word boundaries

 

 \b        Match a word boundary

 \B        Match a non-(word boundary)

 

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
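To make the boundary definition concrete, the sketch below (Python's re module as an assumed stand-in, with a made-up sample string) lists the positions where cat is found as a whole word and where it is found inside a word.

```python
import re

text = "cat concat cat_1 cat"

# \bcat\b: cat with a word boundary on both sides
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat: cat that does not start at a word boundary
inside_words = [m.start() for m in re.finditer(r"\Bcat", text)]

print(whole_words, inside_words)
```

The cat in cat_1 is not a whole word because the underscore counts as a \w character.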

 

 

Meta-characters - iterators

 

Any item of a regular expression may be followed by another type of meta-character: an iterator. Using iterators you can specify the number of occurrences of the preceding character, meta-character or sub-expression.

 

 *        zero or more ("greedy"), similar to {0,}

 +        one or more ("greedy"), similar to {1,}

 ?        zero or one ("greedy"), similar to {0,1}

 {n}        exactly n times ("greedy")

 {n,}        at least n times ("greedy")

 {n,m}        at least n but not more than m times ("greedy")

 *?        zero or more ("non-greedy"), similar to {0,}?

 +?        one or more ("non-greedy"), similar to {1,}?

 ??        zero or one ("non-greedy"), similar to {0,1}?

 {n}?        exactly n times ("non-greedy")

 {n,}?        at least n times ("non-greedy")

 {n,m}?        at least n but not more than m times ("non-greedy")

 

So, digits in curly brackets of the form {n,m} specify the minimum number of times to match the item n and the maximum m. The form {n} is equivalent to {n,n} and matches exactly n times. The form {n,} matches n or more times. There is no limit to the size of n or m, but large numbers will chew up more memory and slow down execution.

 

If a curly bracket occurs in any other context, it is treated as a regular character.

 

Examples:

 

 foob.*r        matches strings like foobar, foobalkjdflkj9r and foobr

 foob.+r        matches strings like foobar, foobalkjdflkj9r but not foobr

 foob.?r        matches strings like foobar, foobbr and foobr but not foobalkj9r

 fooba{2}r        matches the string foobaar

 fooba{2,}r        matches strings like foobaar, foobaaar, foobaaaar etc.

 fooba{2,3}r        matches strings like foobaar, or foobaaar but not foobaaaar
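The iterator examples can be checked with a sketch like the following. It assumes Python's re module as a stand-in engine and uses fullmatch so that the whole string has to fit the pattern; the dictionary of cases is illustrative.

```python
import re

# pattern -> sample string to test against it
cases = {
    r"foob.*r": "foobr",          # * allows zero characters
    r"foob.+r": "foobr",          # + needs at least one character
    r"foob.?r": "foobalkj9r",     # ? allows at most one character
    r"fooba{2}r": "foobaar",      # exactly two a's
    r"fooba{2,}r": "foobaaaar",   # two or more a's
    r"fooba{2,3}r": "foobaaaar",  # four a's is too many
}

outcome = {p: re.fullmatch(p, s) is not None for p, s in cases.items()}
for pattern, ok in outcome.items():
    print(pattern, ok)
```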

 

A little explanation about greediness. "Greedy" takes as many as possible, "non-greedy" takes as few as possible. For example, starting at the first b in the string abbbbc, b+ and b* match bbbb, b+? matches b, b*? matches the empty string, b{2,3}? matches bb, and b{2,3} matches bbb.
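The greedy and non-greedy results can be reproduced with a short sketch. It assumes Python's re module as a stand-in engine, and matching is started at the first b so that every pattern is tried against the run of b characters; the helper name is hypothetical.

```python
import re

text = "abbbbc"
first_b = text.index("b")  # position of the first b

def match_at_first_b(pattern):
    """Return the text matched when matching starts at the first b."""
    return re.compile(pattern).match(text, first_b).group(0)

patterns = ["b+", "b*", "b+?", "b*?", "b{2,3}?", "b{2,3}"]
results = {p: match_at_first_b(p) for p in patterns}

for pattern, matched in results.items():
    print(pattern, repr(matched))
```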

 

You can switch all iterators into "non-greedy" mode (see the modifier g).

 

 

Meta-characters - alternatives

 

You can specify a series of alternatives for a pattern using | to separate them, so that fee|fie|foe will match any of fee, fie, or foe in the target string (as would f(e|i|o)e). The first alternative includes everything from the last pattern delimiter ((, [, or the beginning of the pattern) up to the first |, and the last alternative contains everything from the last | to the next pattern delimiter. For this reason, it's common practice to include alternatives in parentheses, to minimize confusion about where they start and end.

 

Alternatives are tried from left to right, so the first alternative found for which the entire expression matches, is the one that is chosen. This means that alternatives are not necessarily greedy. For example: when matching foo|foot against barefoot, only the foo part will match, as that is the first alternative tried, and it successfully matches the target string. (This might not seem important, but it is important when you are capturing matched text using parentheses.)

 

Also remember that | is interpreted as a literal within square brackets, so if you write [fee|fie|foe] you're really only matching [feio|].

 

Examples:

 

 foo(bar|foo) matches strings foobar or foofoo
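The left-to-right behaviour of alternatives, and the literal meaning of | inside square brackets, can be seen in this sketch (Python's re module as an assumed stand-in engine; variable names are illustrative).

```python
import re

# The first alternative that allows a match wins, even if a longer one exists
first_alt = re.search(r"foo|foot", "barefoot").group(0)
longer_first = re.search(r"foot|foo", "barefoot").group(0)

# Parentheses limit where the alternatives start and end
grouped = [s for s in ["foobar", "foofoo", "foobaz"] if re.fullmatch(r"foo(bar|foo)", s)]

# Inside square brackets | is just another character in the class
bracket_chars = sorted(set(re.findall(r"[fee|fie|foe]", "fee|fie|foe")))

print(first_alt, longer_first, grouped, bracket_chars)
```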

 

 

Meta-characters - sub-expressions

 

The bracketing construct ( ... ) may also be used for defining sub-expressions. Sub-expressions are numbered based on the left-to-right order of their opening parentheses. The first sub-expression is number 1.

 

Examples:

 

 (foobar){8,10} matches strings which contain 8, 9 or 10 consecutive instances of foobar

 foob([0-9]|a+)r matches foob0r, foob1r, foobar, foobaar, foobaaar etc.
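Numbering by opening parenthesis is easiest to see with nested groups. The sketch below assumes Python's re module as a stand-in engine; the nested pattern (foo)((b)(ar)) is an illustrative example that does not come from the text above.

```python
import re

# Groups are numbered by the position of their opening parenthesis
m = re.fullmatch(r"(foo)((b)(ar))", "foobar")
numbered = [m.group(i) for i in range(1, 5)]

# An iterator applied to a whole sub-expression
repeated = re.fullmatch(r"(foobar){8,10}", "foobar" * 9) is not None
too_few = re.fullmatch(r"(foobar){8,10}", "foobar" * 7) is not None

# Alternatives inside a sub-expression
alt = [s for s in ["foob0r", "foobaaar", "foobr"] if re.fullmatch(r"foob([0-9]|a+)r", s)]

print(numbered, repeated, too_few, alt)
```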

 

 

Meta-characters - back-references

 

Meta-characters \1 through \9 are interpreted as back-references. \<n> matches previously matched sub-expression #<n>.

 

Examples:

 

 (.)\1+                matches aaaa and cc

 (.+)\1+                also matches abab and 123123

 (['"]?)(\d+)\1        matches "13" (in double quotes), or '4' (in single quotes) or 77 (without quotes) etc
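A sketch of the back-reference examples, assuming Python's re module as a stand-in engine. The last sample string, with mismatched quotes, is added here to show what \1 rejects.

```python
import re

doubled = re.fullmatch(r"(.)\1+", "aaaa") is not None           # one character repeated
repeated_block = re.fullmatch(r"(.+)\1+", "123123") is not None  # a block repeated

# \1 must repeat whatever the first group matched: ", ' or nothing
samples = ['"13"', "'4'", "77", "\"13'"]
quoted = [s for s in samples if re.fullmatch(r"(['\"]?)(\d+)\1", s)]

print(doubled, repeated_block, quoted)
```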

 

 

Modifiers

 

Modifiers are for changing behaviour of the regular expression engine. Any of these modifiers may be embedded within the regular expression itself using the (?...) construct. If the construction is in-lined into a sub-expression then it affects only that sub-expression.

 

i        By default this is on. Do case-insensitive pattern matching (using the locale settings installed on your system). SyncBackPro uses case-insensitive searches by default and it is not recommended that you use case sensitivity.

 

m        By default this is off. Treat string as multiple lines. That is, change ^ and $ from matching at only the very start or end of the string to the start or end of any line anywhere within the string. This is important because in SyncBackPro a backslash is treated as a line separator. See the Line Separator section.

 

s        By default this is on. Treat string as a single line. That is, change . to match any character whatsoever, even line separators, which it otherwise would not match. This is important because in SyncBackPro a backslash is treated as a line separator. See the Line Separator section.

 

g        Non-standard modifier. By default this is on. Switching it off switches all following iterators into non-greedy mode. So, if modifier g is off then + works as +?, * as *? and so on.

 

Examples:

 

(?i)Saint-Petersburg        matches Saint-petersburg and Saint-Petersburg

(?i)Saint-(?-i)Petersburg        matches Saint-Petersburg but not Saint-petersburg

(?i)(Saint-)?Petersburg        matches Saint-petersburg and saint-petersburg

((?i)Saint-)?Petersburg        matches saint-Petersburg, but not saint-petersburg
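The effect of limiting a modifier to one sub-expression can be approximated in Python, with two caveats that are assumptions of this sketch rather than SyncBackPro behaviour: Python writes a scoped modifier as (?i:...) instead of placing (?i) inside a group, and Python is case sensitive unless told otherwise.

```python
import re

# (?i) at the start applies to the whole expression
whole = re.compile(r"(?i)Saint-Petersburg")

# (?i:...) applies only inside the group; Petersburg stays case sensitive
scoped = re.compile(r"(?i:Saint-)?Petersburg")

whole_hits = [s for s in ["Saint-petersburg", "SAINT-PETERSBURG"] if whole.fullmatch(s)]
scoped_hits = [s for s in ["saint-Petersburg", "saint-petersburg"] if scoped.fullmatch(s)]

print(whole_hits, scoped_hits)
```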

 

(?#text)

 

A comment; the text is ignored. Note that the comment is closed at the first close bracket ), so there is no way to put a literal close bracket ) in the comment.

 

 

Line Separator

 

SyncBackPro treats the backslash character as a line separator. All filenames start with a backslash, and all folders end with a backslash. The backslash character also delineates the parts of a file's path. By treating the backslash character as a line separator you can change how the . meta-character works and so have more flexibility. For example, let's say you only want text files in any folder whose name starts with temp. A first attempt would be:

 

\\$

.*\\temp.*\\.*\.txt

 

The first one (\\$) makes sure all folders are scanned, which is the same as the DOS expression *\. Although the second expression looks correct, it won't work correctly. It would match \temp\folder\test.txt. A second try could be:

 

\\temp.*?\\[^\\]*\.txt$

 

But this would also match \temp\folder\test.txt. Why? The part \\temp.*?\\ will correctly match \temp\ but [^\\]*\.txt$ will match test.txt no matter what is after \temp\. This is where the point about the backslash being a line separator is important. Because it's a line separator you can change the way the meta-character . works. Normally it will match any character at all. But by switching off the s modifier you can stop it matching line separators, so the following will work:

 

(?-s)\\temp.*\\.*\.txt$

 

To explain why this would work, let's use the example filename of \temp\folder\text.txt and see why it would not match:

 

(?-s) is a modifier that tells SyncBackPro to treat the filename as separate lines. Because backslash is a line separator it means you can think of the filename as being broken up into its parts with each part effectively on its own line:

 

temp

folder

text.txt

 

\\temp.*\\ matches \temp\, so we are now onto the next line/part (folder)

 

.*\.txt$ means the end of the current line must match any number of characters (but not backslash) and end with .txt. You could also have .*\.txt\Z

 

If the expression was (?-s)\\temp.*\\.*\.txt (so it doesn't have $ at the end) then it would wrongly match \temp\folder.txt\test.txt because folder.txt matches .*\.txt

 

What if you only wanted root temp* folders? The expression would be:

 

(?-s)\A\\temp.*\\.*\.txt$

 

The meta-character \A ensures that \temp\ must be at the beginning.
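The behaviour described in this section can be imitated outside SyncBackPro. The sketch below is an emulation under a stated assumption: Python does not treat the backslash as a line separator, so the (?-s) dot is written as [^\\], meaning any character except a backslash. The sample paths and variable names are made up.

```python
import re

# Emulation (assumption): [^\\] stands in for . with the s modifier off
any_temp = re.compile(r"\\temp[^\\]*\\[^\\]*\.txt$")       # (?-s)\\temp.*\\.*\.txt$
root_temp = re.compile(r"\A\\temp[^\\]*\\[^\\]*\.txt$")    # (?-s)\A\\temp.*\\.*\.txt$
no_anchor = re.compile(r"\\temp[^\\]*\\[^\\]*\.txt")       # same, without the final $

paths = [r"\temp\text.txt", r"\temp\folder\text.txt", r"\docs\temp2\a.txt"]

any_hits = [p for p in paths if any_temp.search(p)]
root_hits = [p for p in paths if root_temp.search(p)]
wrong_hit = no_anchor.search(r"\temp\folder.txt\test.txt") is not None

print(any_hits, root_hits, wrong_hit)
```

The file in \temp\folder\ is rejected because the emulated dot cannot cross a backslash, and dropping the final $ lets folder.txt satisfy the expression.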

 

 

SyncBackPro Examples

 

Notice that many of the examples below also include filters to include folders.

\\$
\.txt$
 

All text files (.txt) in all folders. The \\$ filter ensures all folders are looked at.
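As a rough sketch of how this pair of filters behaves, the code below assumes that an item is included when any filter matches it, and uses Python's re module with case-insensitive matching as a stand-in for SyncBackPro's engine. The helper is_included and the sample items are hypothetical.

```python
import re

# Hypothetical helper: assumes an item is included when any filter matches it
filters = [
    re.compile(r"\\$", re.IGNORECASE),     # folders end with a backslash
    re.compile(r"\.txt$", re.IGNORECASE),  # files ending in .txt
]

def is_included(path):
    """Return True if any of the filters matches somewhere in the path."""
    return any(f.search(path) for f in filters)

items = ["\\docs\\", "\\docs\\notes.TXT", "\\docs\\image.png"]
included = [p for p in items if is_included(p)]

print(included)
```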

\\$
(?-s)\\temp\\.*\.txt$

All text files in all folders called temp. It would not include .txt files in any sub-folders of folders called temp. To include all .txt files in all sub-folders of folders called temp you would omit the (?-s)

 
\\temp\\$
(?-s)\\temp\\.*\.txt$

 
All text files in the root folder called temp. For example, if your source directory is C:\My Documents\ then this filter is for all text files in C:\My Documents\temp\

 
.*\\test\\$

 
All folders called test. Note that no files will be copied unless another filter is added to include files.

 
.*\\parent\\$
.*\\parent\\child\\$

 
All folders called child whose parent directory is called parent. Notice the filter *\ is required otherwise it will never look inside folders called parent. Note that no files will be copied unless another filter is added to include files.

 
(?-s)\A\\temp.*?\\$

 
All root folders whose name starts with temp or is called temp. Note that no files will be copied unless another filter is added to include files. If you omit (?-s) then it would wrongly also match anything inside \temp\

 

 

All Content: 2BrightSparks Pte Ltd © 2003-2024