perl script delimited file

PERL Script Tutorial – Remove Column from Delimited File

What is a Delimited File?

A delimited file is a file where the data is separated by special or unique characters. The most common form of a delimited file is the .CSV (Comma separated values) file format. These files, as you can guess, have the data separated by commas (,). For example, you can save an Excel spreadsheet document as an .CSV file and it will separate all the fields with a comma. Other examples of delimited files include .TSV (Tab separated values), quotation marks, or really any other File that has a unique character (or character pattern) that separates actual data (words, phrases) that you need to utilize.

perl script tutorial
Example of CSV (delimited) file

The above example is just a simple example. Many times delimited files can be separated by character patterns such as þþ, or *%*, and have a file extension of .dat or anything else really. It is important for the delimiter to be different than the actual data content of the delimited file — meaning if your actual data values contain the * character, it isn’t a good idea to use it as a file delimiter for that particular file.

perl script tutorial 2While the example in the above figure is small, a lot of the times a delimited file can contain hundreds (if not thousands) of columns and thousands of rows, and use patterns such as þþ to seperate the data values from each other.

Let’s say we use MS Access and export a .dat file with thousands of fields delimited by þþ, but we accidentally leave a column we shouldn’t have. Sure we can reload the original file in Microsoft Access, then delete the column, and finally re-export the file as .dat, but why when we can simply create a PERL script to format our already exported file?

Alternatively, we can use a PERL script to go through each of the columns in the delimited file, and delete the column from the delimited file based on column number. The best part about having the Perl Script is that you can use it over and over again, so if you accidentally leave an extra column or two in a delimited file in the future, you can easily fix it with the same script with little or NO modification, depending on the delimiter used.

Installing PERL

To check which version of PERL you have, open Command Prompt (Windows) or Terminal (Linux) and type the command:

perl -v

You should get the following output:

perl script tutorial 3
perl -v output

In Linux, most likely you will already have Perl installed, however in Windows you may get an error. To install Perl in Microsoft Windows, I recommend downloading Strawberry Perl. Just download the installer and run it. After Strawberry Perl is done installing, open command prompt and type perl -v and you should get an output similar to the image shown above.

Strawberry Perl is a perl environment for MS Windows containing all you need to run and develop perl applications. It is designed to be as close as possible to perl environment on UNIX systems. — StrawberryPerl.com

Once installed, you should be ready to start creating Perl scripts. If you’ve never used Perl, it isn’t too hard of a language to learn and has a huge collection of modules you can incorporate into your Perl scripts. For a “Hello World” tutorial and example, please see cpan.org Hello World Perl tutorial. It’s a very brief tutorial and I recommend reading it real quick if you have no previous Perl experience.

Creating the Perl Script

First, you need the delimited file that you will be fixing. Let’s say we have a delimited file named techSide.dat with the delimiter þþ:

perl script tutorial 4
techSide.dat delimited file example

The Issue: While the delimited file shown above is good, the client wishes to no longer keep the EndBates column inside the file as it is irrelevant to their database. We can solve this issue by regenerating another delimited file, or simply using a Perl script to fix the current delimited file. We are going to be discussing how to do the latter.

To do this, create a file named updateDat.pl in the same folder as your delimited file (techSide.dat).

Open updateDat.pl and type the following code:


#!/usr/bin/perl
#TechSide - delete column in dat file based on column number
$in_file = "techSide.dat";                       #input dat name
$out_file = "techSideNew.dat";                   #new dat name

print STDOUT "Column number to delete: ";        #ask user to enter column number to delete
$del_col = ;                              #save user input to var del_col
chomp($del_col);                                 #safer version of chop
$del_col -=2;                                    #subtract 2 from del_col
$next_col = $del_col+2;                          #add 2 to del_col and save as var next_col

open (IN, "<$in_file") or die "Can't find file $!\n";               #open techSide.dat to read otherwise provide error
open (OUT, ">$out_file") or die "Can't write file $out_file: $!\n"; #open/create techSideNew.dat to write

while ($line = ) {                     # while loop to read techSide.dat to var $line - one line at a time
	chop($line);                       # removes new line character at end of each line
	@fields = split /þ.þ/, $line;      # splits line at delimiter and puts line in array fields
	$total_col = scalar(@fields) - 1;  # returns total element count of array 
	$line = join "þþ", @fields[0..$del_col], @fields[$next_col..$total_col]; #join line omitting the column the user wants to delete
	$line = $line . "\n";     #adds new line character after JOIN - otherwise extra delimiter at end of line
	print OUT $line;          #print line into out_file - techSideNew.dat
}

close IN;
close OUT;

print "$out_file created in directory.  Enter to exit.";
<>;

Now, if you run the updateDat.pl file, it should ask you to pick a column number to delete:

Column number to delete: 2

As you can see, if we select column number 2, it should get rid of the EndBates field, and your file should look finished:

perl script tutorial
Column 2 (EndBates) deleted.

However, if the user decides to delete column number 1, or the last column, our program forgets to put the þ character in the beginning and if you select the last column to delete, our script forgets to put the last þ symbol:

perl script tutorial
Missing þ beginning of every line

To resolve this issue, we can add an if..else condition. Open up your updateDat.pl file again, and modify the code to the following:


#!/usr/bin/perl
#TechSide - delete column in dat file based on column number

$in_file = "techSide.dat"; 
$out_file = "techSideNew.dat"; 

print STDOUT "Column number to delete: ";
$del_col = ;
chomp($del_col);

$user_dat = $del_col-1;        #NEW VARIABLE

$del_col -=2;
$next_col = $del_col+2;

open (IN, "<$in_file") or die "Can't find file $!\n";
open (OUT, ">$out_file") or die "Can't write file $out_file: $!\n";

while ($line = ) {
	chop($line); 
	@fields = split /þ.þ/, $line;
	$total_col = scalar(@fields) - 1; 
	$line = join "þþ", @fields[0..$user_col], @fields[$next_col..$total_col]; #CHANGED
	
        #New IF/ELSE statement 
	if ($total_col == $user_dat){   #if user selects last column to delete
	$line = $line . "þ\n";
	} 
	elsif ($user_dat == 0){         #if user selects first column to delete
	$line = "þ$line\n";
	}
	else {                          #otherwise continue as normal
	$line = $line . "\n";
	}
	print OUT $line;
}

#print "$total_col, $user_dat";

close IN;
close OUT;
print "$out_file created. Enter to Exit.";
<>;

Now, if you run the code and decide to delete column # 1, the script will automatically add the þ in front, or if you decide to delete the last column in the original dat, it will add a trailing þ and \n (new line character). Otherwise (else) the program will continue as normal and generate the techSideNew.dat file for you.

Summary

In conclusion, as you can see, learning a little bit of PERL scripting can save you a lot of time in the long run. Sure, it may take you longer initially to create the script than to regenerate another delimited file through Microsoft Access, or other software, but once you have the script, you can use it as much as you want. You can just update one line and the perl script has the potential to work on any delimited file. Perl is actually very straight forward language to learn and has hundreds if not thousands of modules available. I recommend reading up on Regular Expressions (REGEX) as they are

Now that we have a script that can delete a column from a delimited file based on column number, do you think you can change the code to make the perl script delete based on column name? I eventually did. 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *