Interrogating CSVs

This week at work I had an interesting problem, I needed to bulk import many, many CSV files into our database. I wanted to do this in an automated way that would not lose any data. I came up with this idea and it worked out really well.

Luckily the filenames of the comma separated value (CSV) files where pretty well formed, 20120601_users, 20120531_daily_report, etc. I first tokenized (using split.(‘_’) in python) the CSV filename and excluded the first token which in my case was the date.

Next I opened the file and read out the field names on the first line into a tuple. I next made a dictionary of lengths with keys based off of the field name tuples. I then made a second dictionary called is_it_an_integer with the same tuple field names as keys.

I looped through the entire CSV file and I knew each instance name because it was ordered in my field name tuple. I looked at the max length of the content and overwrote the value only with the highest number I came across. I also checked if the payload matched the criteria for an integer. (I initially set this value to True). If I came across any instance that the integer test failed, I set the integer test to False. Once this value was False I stopped testing for integers.

What this gave me at the end of processing was a tuple of field names, a dictionary with their max length, and a dictionary with integer as True or False.

From this information I created a MySQL LOAD DATA INFILE command. I know the table name based off of the file name. I knew the column names. The logic then followed that for each field name, if I knew it was an integer, I made the datatype an INT. If it wasn’t an int, I looked at the length. If it was under 4 characters, I called it an VARCHAR(10). If it was under 26 character, I called it a VARCHAR(100). If it was under 256 characters, I called it a VARCHAR(255). If it was more than 255 characters, I called the datatype TEXT.

This logic is far from complete but it got me to automatically ingesting unknown CSV files into the database. I could manually fix things up from here knowing I didn’t lose any data. The biggest problem with this logic is that is doesn’t take into account DATE, and DATETIME datatypes. I think solving that issue will be next.





Leave a Reply

Your email address will not be published.