Thursday, June 6, 2013

Playing with text file data in Python: Building SQL delete statements from unstructured data

Yesterday I played around with excel files to build SQL update statements and today I got an opportunity do similar stuff, but this time I got the text file that had column names in the first line and values in following lines separated by spaces. The goal is delete a bunch of records from a table based on the eventids etc. It looks like this:



First, you need to open the file to read it. The rb argument is to open the file for reading in binary mode in windows.

myFile = open("D:\Scripts\deldata.txt", "rb")

I initially tried to find the number of lines in the text file thinking this might be useful while iterating through the lines and when trying to make a list out of these . I know, you could just open the text file in notepad++ and look at the end of the document, but where is the fun in that? So this is what I came up with.

myFile = open("D:\Scripts\deldata.txt", 'rb')
i = 0
for line in enumerate(myFile):
    i += 1
print i

I know that totally sucks. May be there a more elegant way?

myFile = open("D:\Scripts\deldata.txt", "rb")
numLines = sum(1 for line in open myFile)
print numLines

It turns out, you don't really need it. You can just loop through each line and manipulate the lines as needed. When I printed the lines, there were lots of white spaces and tabs in the resulting strings. After browsing through a lot of stackoverflow questions and reading documentation and unsuccessfully trying to use regular expressions, I found an easier way to deal with the white spaces. Just use string.split() method and turn the string into a list.

__author__ = 'RK'
myFile = open("D:\Scripts\deldata.txt", "rb")

for line in myFile:
    line = line.split() #remove white spaces and tabs
    for value in line:
        print ("DELETE FROM REGISTR WHERE eventid = '%s' AND subeventid = '%s' AND personid = '%s' 
                AND RegNo = '%s'" %(line[0], line[1], line[2], line[3]))