Simple Perl File Parsing Example

(Updated on 05/31/2017)

Logs with too much information, poorly structured code… Every developer has, at some point, wanted to automate data extraction or fix code formatting. Here I provide a simple Perl file parsing example that takes care of the file handling, so you only have to worry about the analysis itself. As a bonus, the parser is multiline and configurable!

Attachment Content

The zip file you can download at the end of this post contains two files:

  1. parser.pl: A script that parses .c and .h files.
  2. test.c: A file used to test the changes made and see the result.

How to Use This Script

To make things simpler, the script prints its own usage when run with the --help option on the command line.

When no arguments are given, .c and .h files are searched within the current folder and its subfolders. For each file found, the script verifies whether it can overwrite it (checking if the write permission is enabled) and then applies the proposed changes (more about those changes in the next topic).

For safety, the --backup option creates a copy of each file before it is changed, adding the .bkp suffix to the original file name.

Two other options make it possible to use the parser from anywhere in the operating system. If the folder containing the files isn’t the current one, use the --dir option to change the search location. To specify exactly which files should be changed, use the -i option and list only the files to be parsed, separated by either , or :.
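To give an idea of how such options can be read, here is a minimal sketch using the core Getopt::Long module. It is not the code from parser.pl: the helper name parse_options and the hash keys are my own assumptions, and only the option names come from the text above.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper (not from parser.pl): parse the options described above.
sub parse_options {
    my @args  = @_;
    my %opt   = (help => 0, backup => 0, dir => '.', files => []);
    my $input = '';

    GetOptionsFromArray(
        \@args,
        'help'   => \$opt{help},
        'backup' => \$opt{backup},
        'dir=s'  => \$opt{dir},
        'i=s'    => \$input,
    ) or die "Invalid options, try --help\n";

    # The -i list may be separated by ',' or ':'.
    $opt{files} = [ grep { length } split /[,:]/, $input ];
    return \%opt;
}

my $opt = parse_options('--backup', '--dir', 'src', '-i', 'a.c,b.h:c.c');
print join(' ', @{ $opt->{files} }), "\n";   # prints "a.c b.h c.c"
```

Keep in mind that splitting on : would also split a Windows path containing a drive letter, so a real script may need extra care there.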

General Overview

The parser.pl script’s purpose is to format C language source files. For this reason, it needs to read more than one line at a time, which is not always necessary (e.g. when the parsing only aims at extracting data from a file).

In the first few lines, a constant called LINES_TO_READ is defined; it indicates how many lines are read at a time. Change its value as needed or use 1 to remove the multiline behavior.

Following this, the command line options are checked and, if no files have been specified, a file search is done. The location of each file is stored in a list and the main loop iterates through all of them.
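The automatic search can be sketched with the core File::Find module. This is only an illustration under my own assumptions (the helper name find_sources and the way the suffixes are matched are not taken from parser.pl):

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper (not from parser.pl): collect writable .c and .h files
# found in $dir and its subfolders.
sub find_sources {
    my ($dir) = @_;
    my @found;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                # With no_chdir, $_ holds the full path of the current entry.
                push @found, $_ if -f $_ && /\.[ch]$/ && -w $_;
            },
        },
        $dir
    );
    my @sorted = sort @found;
    return @sorted;
}

print "$_\n" for find_sources('.');
```

In a sketch like this, the /\.[ch]$/ pattern is the only place to touch if other suffixes are needed.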

If backup was enabled, a safety copy is made first. Then two file descriptors are opened: one for the original file and one for a temporary file. The modifications are written line by line to the temporary file rather than in place, which avoids problems with the position of the read pointer.
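As a rough sketch of this step, assuming a helper called open_pair and a temporary file named after the original with a .tmp suffix (both are my assumptions, only the .bkp suffix comes from the script’s description):

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper (not from parser.pl): optionally back up $path, then
# open it for reading and open a temporary sibling file for writing.
sub open_pair {
    my ($path, $backup) = @_;
    if ($backup) {
        copy($path, "$path.bkp") or die "Cannot back up $path: $!";
    }
    my $tmp = "$path.tmp";   # assumed name for the temporary file
    open(my $in,  '<', $path) or die "Cannot read $path: $!";
    open(my $out, '>', $tmp)  or die "Cannot write $tmp: $!";
    return ($in, $out, $tmp);
}
```

A call such as my ($in, $out, $tmp) = open_pair('test.c', 1); would leave test.c.bkp on disk and return both handles ready for the parsing loop.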

Once both descriptors are open, an empty line is appended to the original file to make sure there is at least one \n character (or \r\n on Windows) before EOF. Its absence could cause errors while parsing the input file’s last line.
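One simple way to guarantee that final newline is to open the file in append mode and write it. The helper name append_newline below is hypothetical and the real script may do this differently:

```perl
use strict;
use warnings;

# Hypothetical helper (not from parser.pl): append one newline so that the
# last line of the file is always terminated.
sub append_newline {
    my ($path) = @_;
    open(my $fh, '>>', $path) or die "Cannot append to $path: $!";
    print {$fh} "\n";
    close($fh) or die "Cannot close $path: $!";
}
```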

Now the script makes consecutive reads, concatenating as many lines as needed into an internal variable. From there, all modifications are performed using regular expressions with the multiline flag enabled.

In this example, three simple modifications are done:

  1. All spaces located at the end of the lines are removed;
  2. A lone { character is moved to the end of its previous line;
  3. Runs of consecutive blank lines are collapsed.
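The three modifications above could look roughly like the following. These are my own regular expressions, written to illustrate the idea, not the ones shipped in parser.pl. In particular, I am assuming that a run of blank lines is reduced to a single blank line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from parser.pl): apply the three clean-ups
# to a buffer that holds several lines.
sub tidy {
    my ($multiline) = @_;

    # 1. Remove spaces and tabs at the end of each line.
    $multiline =~ s/[ \t]+(\r?\n)/$1/g;

    # 2. Move a lone '{' to the end of the previous line.
    $multiline =~ s/\r?\n[ \t]*\{[ \t]*(\r?\n)/ {$1/g;

    # 3. Collapse two or more blank lines into a single blank line (assumed).
    $multiline =~ s/(\r?\n)(?:[ \t]*\r?\n){2,}/$1$1/g;

    return $multiline;
}

print tidy("int main(void)   \n{\n    return 0;\n\n\n\n}\n");
```

This simple version does not check what the previous line contains, so a lone { that follows a comment line would be appended to the comment.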

Afterwards, the internal variable’s first line is removed and printed to the temporary file. Since two of the three changes can alter the number of lines it holds, a new round of reading takes place until the desired quantity is again present inside the internal variable. Once there is nothing else to read, the script prints all the remaining lines to the output file.
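Here is a minimal sketch of that read, modify and release cycle. The names parse_stream and $transform and the value 3 for LINES_TO_READ are assumptions of mine; only the constant’s name and the $multiline variable come from the script.

```perl
use strict;
use warnings;
use constant LINES_TO_READ => 3;   # assumed value; 1 removes the multiline behavior

# Hypothetical helper (not from parser.pl): stream $in to $out through a
# window of LINES_TO_READ lines, calling $transform on the window each round.
sub parse_stream {
    my ($in, $out, $transform) = @_;
    my $multiline = '';

    while (1) {
        # Top the buffer up until it holds the desired number of lines again.
        while (($multiline =~ tr/\n//) < LINES_TO_READ) {
            my $line = <$in>;
            last unless defined $line;
            $multiline .= $line;
        }
        last if $multiline eq '';

        $multiline = $transform->($multiline);

        if (eof($in)) {
            # Nothing else to read: print everything that remains.
            print {$out} $multiline;
            last;
        }

        # Otherwise release only the first line and keep the rest.
        $multiline =~ s/\A([^\n]*\n)// and print {$out} $1;
    }
}

# Usage with in-memory file handles, stripping trailing spaces.
my $source = "a  \nb  \nc  \nd  \n";
open(my $in,  '<', \$source)    or die "Cannot open input: $!";
open(my $out, '>', \my $result) or die "Cannot open output: $!";
parse_stream($in, $out, sub { my ($buf) = @_; $buf =~ s/[ \t]+\n/\n/g; return $buf });
close($out);
print $result;
```

Passing the modifications as a code reference is just a convenience for the sketch: it keeps the windowing logic separate from the regular expressions.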

Finally, the two file descriptors are closed and the temporary file overwrites the original one, which ends the current iteration of the main loop. After all files go through the same process, the script ends.
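This last step can be sketched with the core File::Copy module. Again, finish_file is a hypothetical name and the real script may replace the file in another way:

```perl
use strict;
use warnings;
use File::Copy qw(move);

# Hypothetical helper (not from parser.pl): close both handles, then let the
# temporary file take the place of the original one.
sub finish_file {
    my ($in, $out, $tmp, $path) = @_;
    close($in)  or die "Cannot close $path: $!";
    close($out) or die "Cannot close $tmp: $!";   # also flushes buffered output
    move($tmp, $path) or die "Cannot replace $path: $!";
}
```

Closing the output handle before moving the file matters, since buffered lines would otherwise be missing from the result.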

How To Parse A File

First and foremost, if you need to change the suffixes used in the automatic file search, make the necessary changes in the line of the script that defines them.

The script itself is commented step by step to make it easier to understand. Regarding just the parsing part, the lines to be changed are the ones that apply regular expressions to the $multiline variable.

The $multiline variable contains all lines under analysis and regular expressions are used to change the contents that will be written to the output file. Change the regex so that the script can do what you need it to do.

You shouldn’t need to change the rest of the script.

Multiline Regular Expressions

Some additional considerations should be made due to the multiline aspect of the script.

For those who aren’t experienced: because the buffer holds several lines at once, the ^ and $ characters are of little use here, that is, matching the beginning or the end of a line is not done the regular way. (In Perl they only match at every line under the /m modifier, and even then $ does not account for a \r before the \n.) Instead, you need to match the newline character(s) explicitly: \n on Linux/Unix/Mac systems or \r\n on Windows.

For compatibility between different operating systems, note the use of the sequence \r?\n.
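A small example of why that sequence helps: the same substitution works on both kinds of line endings and keeps whichever ending was there. The helper name strip_trailing is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip trailing blanks whether lines end in \n or \r\n,
# putting back the ending that was captured.
sub strip_trailing {
    my ($text) = @_;
    $text =~ s/[ \t]+(\r?\n)/$1/g;
    return $text;
}

my $unix    = strip_trailing("foo  \nbar\t\n");
my $windows = strip_trailing("foo  \r\nbar\t\r\n");

print "unix ok\n"    if $unix    eq "foo\nbar\n";
print "windows ok\n" if $windows eq "foo\r\nbar\r\n";
```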

Final Words

The attached script may be freely used and modified, except for commercial use.

If you need some help, leave your questions in the comments. Good luck!

Download Attachments
