Friday, August 5, 2011

How To Use Basic Regular Expressions to Search Better and Save Time



Whether you've been searching with grep or looking at programs that can batch rename files for you, you've probably wondered if there was an easier way to get your job done. Thankfully, there is, and it's called "regular expressions."

(Comic from XKCD.com)

What are Regular Expressions?

Regular expressions are patterns written in a very specific syntax, and a single pattern can stand for many different strings. Also known as "regex" or "regexp," they are primarily used in search and file naming functions. One regex works like a formula that describes a number of possible strings, and a search matches any of them. Alternatively, you can specify how a group of files should be named with a regex, and your software can work through the files, producing each intended name in turn. This way, you can rename multiple files in multiple folders easily and efficiently, and you can move beyond the limitations of a simple numbering system.

Because regular expressions rely on a special syntax, your program must be capable of reading and parsing them. Many batch file renaming programs for Windows and OS X support regexps, as do the cross-platform searching tool grep (which we touched on in our Bash Scripting for Beginners Guide) and the Awk command-line tool for *Nix. In addition, many alternative file managers, launchers, and searching tools use them, and they have a very important place in programming languages like Perl and Ruby. Other development environments like .NET, Java, and Python, as well as the upcoming C++ 11, all provide standard libraries for using regular expressions. As you can imagine, they can be really useful when trying to minimize the amount of code you put into a program.

A Note About Escaping Characters

Before we get to the examples, we'd like to point something out. We're going to be using the bash shell and the grep command to show you how to apply regular expressions. The problem is that some of the special characters we want to pass to grep also mean something to bash, so the shell will interpret them before grep ever sees them. In these circumstances, we need to "escape" these characters. This can get confusing because escaping also occurs inside regexps themselves. For example, if we want to enter this into grep:

\<

we'll have to replace that with:

\\\<

Each special character here gets one backslash. Alternatively, you can also use single quotes:

'\<'

Single quotes tell bash NOT to interpret what's inside of them. While we require these steps to be taken so we can demonstrate for you, your programs (especially GUI-based ones) often won't require these extra steps. To keep things simple and straightforward, the actual regular expression will be given to you as quoted text, and you'll see the escaped syntax in the command-line screenshots.
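To see what the shell actually hands over, here is a small sketch that uses printf as a stand-in for grep (that substitution is our assumption, chosen only so the argument gets printed instead of searched for). Both spellings should arrive as the same two characters:

```shell
# printf stands in for grep so we can print the exact argument bash passes along.
# Unquoted: bash strips one level of backslashes, so \\\< arrives as \<
escaped=$(printf '%s' \\\<)

# Single-quoted: bash leaves the contents untouched, so '\<' also arrives as \<
quoted=$(printf '%s' '\<')

printf '%s\n' "$escaped"   # prints \<
printf '%s\n' "$quoted"    # prints \<
```

Since both forms deliver the same text, single quotes are usually the easier one to read.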

How Do They Expand?

Regexps are a really concise way of stating terms so that your computer can expand them into multiple options. Let's take a look at the following example:

tom[0123456789]

The square brackets – [ and ] – tell the parsing engine to match any ONE of the characters inside them. Whatever is inside those brackets is called a character set.

So, if we had a huge list of entries and we used this regex to search, the following terms would be matched:

  • tom0
  • tom1
  • tom2
  • tom3

and so on. However, the following terms would NOT be matched, and so would NOT show up in your results:

  • tom ; the regex requires a digit right after "tom"
  • tomato ; the regex does not account for any letters after "tom"
  • Tom ; the regex is case sensitive!
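If you want to try the character set yourself, here is a minimal sketch. The word list is made up for illustration (it is not the list from the screenshots), and the pattern is single-quoted so bash leaves the brackets alone:

```shell
# Hypothetical sample list, invented for this example.
list='tom
tom0
tom3
tomato
Tom7
tom9x'

# Keep only the lines where "tom" is followed by one digit.
matches=$(printf '%s\n' "$list" | grep 'tom[0123456789]')

printf '%s\n' "$matches"
# prints:
# tom0
# tom3
# tom9x
```

A bare "tom" is left out because no digit follows it, and "Tom7" is left out because of the capital T. "tom9x" still shows up because grep only needs some part of the line to match.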

You can also search with a period (.), which matches any single character, as long as there actually is a character in that position.

[Screenshot: grep results for a regular search versus a search with a period]

As you can see, grepping with

.tom

did not bring up terms that only had "tom" at the beginning. "green tomatoes" did come up, because the space before "tom" counts as a character, but terms like "tomF" have no character before "tom" and were thus ignored.

Note: Grep's default behavior is to return the whole line of text when some part of it matches your regex. Other programs may not do this, and you can make grep print only the matching part with the '-o' flag.
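Here is a rough sketch of both behaviors, again with an invented list. It assumes a grep that supports the '-o' flag, as GNU and BSD grep do:

```shell
# Hypothetical sample list, invented for this example.
list='tom
tomF
atom
green tomatoes'

# Default: print every whole line that contains a match for .tom
lines=$(printf '%s\n' "$list" | grep '.tom')

# With -o: print only the part of each line that matched
parts=$(printf '%s\n' "$list" | grep -o '.tom')

printf '%s\n' "$lines"
# prints:
# atom
# green tomatoes

printf '%s\n' "$parts"
# prints ("atom", then " tom" with its leading space):
# atom
#  tom
```

The leading space in the second result is the "any character" that the period matched in "green tomatoes".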

You can also specify alternation using a pipe (|), like here:

speciali(s|z)e

This will find both:

  • specialise
  • specialize

When using the grep command, we need to escape the special characters (, |, and ) with backslashes as well as utilize the '-E' flag to get this to work and avoid ugly errors.

[Screenshot: escaping the parentheses and pipe for grep]

As we mentioned above, this is because we need to tell the bash shell to pass these characters to grep and not to do anything with them. The '-E' flag tells grep to use the parentheses and pipe as special characters.
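As a quick sketch of alternation (the word list is hypothetical), here are the backslash-escaped form and the single-quoted form side by side. Both should hand grep the same pattern:

```shell
# Hypothetical sample list, invented for this example.
words='specialise
specialize
specialice
special'

# Single quotes keep bash away from ( | and )
quoted=$(printf '%s\n' "$words" | grep -E 'speciali(s|z)e')

# Backslashes do the same job one character at a time
escaped=$(printf '%s\n' "$words" | grep -E speciali\(s\|z\)e)

printf '%s\n' "$quoted"
# prints:
# specialise
# specialize
```

Either way, grep receives speciali(s|z)e, and '-E' tells it to treat the parentheses and pipe as special.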

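You can search by exclusion using a caret that is both inside of your square brackets and at the beginning of a set:

tom[^F0-9]

This matches "tom" followed by any one character that is not an F or a digit. Inside square brackets a pipe (|) is just another literal character, not alternation, so you don't need one to separate the F from the 0-9. If you're using grep and bash, remember to quote or escape the pattern so the shell leaves it alone.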

[Screenshot: grep with a caret inside the brackets]

Terms that were in the list but did NOT show up are:

  • tom0
  • tom5
  • tom9
  • tomF

These did not match our regex.
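One detail worth checking for yourself: inside square brackets, a pipe is treated as a literal character and not as alternation. The sketch below uses a made-up list (including a contrived "tom|x" entry) to show what changes when a pipe sits inside the set:

```shell
# Hypothetical sample list, invented for this example.
list='tom0
tom5
tom9
tomF
tomato
tom|x
tom'

# A pipe inside the brackets becomes one more excluded character
with_pipe=$(printf '%s\n' "$list" | grep 'tom[^F|0-9]')

# Without the pipe, only F and the digits are excluded
without_pipe=$(printf '%s\n' "$list" | grep 'tom[^F0-9]')

printf '%s\n' "$with_pipe"
# prints:
# tomato

printf '%s\n' "$without_pipe"
# prints:
# tomato
# tom|x
```

In both cases a bare "tom" is also missing, because a negated set still needs some character to match.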

How Can I Utilize Environments?

Often, we search based on boundaries. Sometimes we only want strings that appear at the beginning of a word, at the end of a word, or at the end of a line of code. This can be easily done using what we call anchors.

Using a caret (outside of brackets) allows you to designate the "beginning" of a line.

^tom

[Screenshot: matching at the beginning of a line]

To search for the end of a line, use the dollar sign.

tom$

[Screenshot: matching at the end of a line]

You can see that our search string comes BEFORE the anchor in this case.
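A small sketch of both line anchors, using an invented list:

```shell
# Hypothetical sample list, invented for this example.
list='tomato
atom
tom
green tomatoes'

# ^ anchors the match to the start of the line
starts=$(printf '%s\n' "$list" | grep '^tom')

# $ anchors the match to the end of the line
ends=$(printf '%s\n' "$list" | grep 'tom$')

printf '%s\n' "$starts"
# prints:
# tomato
# tom

printf '%s\n' "$ends"
# prints:
# atom
# tom
```

"green tomatoes" appears in neither result, since its "tom" is in the middle of the line.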

You can also search for matches that appear at the beginning or end of words, not whole lines.

\<tom

tom\>

[Screenshot: matching at the beginning of a word]

[Screenshot: matching at the end of a word]

As we mentioned in the note at the beginning of this article, we need to escape these special characters because we're using bash. Alternatively, you can also use single quotes:

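[Screenshot: matching at the beginning of a word, using single quotes]

[Screenshot: matching at the end of a word, using single quotes]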

The results are the same. Make sure you use single quotes, and not double quotes.
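Here is a sketch of the word anchors with a made-up list. It assumes a grep that understands \< and \>, which GNU grep does; other tools may spell word boundaries differently:

```shell
# Hypothetical sample list, invented for this example.
list='tomato
atom
tom
green tomatoes
phantom menace'

# \< only matches where a word begins
word_start=$(printf '%s\n' "$list" | grep '\<tom')

# \> only matches where a word ends
word_end=$(printf '%s\n' "$list" | grep 'tom\>')

# The backslash-escaped spelling passes grep the same pattern as the quoted one
word_start_escaped=$(printf '%s\n' "$list" | grep \\\<tom)

printf '%s\n' "$word_start"
# prints:
# tomato
# tom
# green tomatoes

printf '%s\n' "$word_end"
# prints:
# atom
# tom
# phantom menace
```

Unlike the line anchors, these pick up "green tomatoes" and "phantom menace", because the match only has to sit at the edge of a word, not the edge of the line.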

Other Resources For Advanced Regexps

We've only hit the tip of the iceberg here. You can also search for money terms delineated by the currency marker, and search for any of three or more matching terms. Things can get really complicated. If you're interested in learning more about regular expressions, then please take a look at the following sources.

  • Zytrax.com has a few pages with specific examples of why things do and do not match.
  • Regular-Expressions.info also has a killer guide to a lot of the more advanced stuff, as well as a handy reference page.
  • Gnu.org has a page dedicated to using regexps with grep.

You can also build and test your regular expressions using a Flash-based online tool called RegExr. It works as you type, is free, and can be used in most browsers.


Do you have a favorite use for regular expressions? Know of a great batch renamer that uses them? Maybe you just want to brag about your grep-fu. Contribute your thoughts by commenting!



--
Thanks&Regards 
sravan
