On a recent project we had the challenge of working with and manipulating a large collection of data. Our client provided us with a 50 CSV files ranging from 30 MB to 350 MB in size and all in all containing approximately 20 million rows of data and 15 columns of data. Our end goal was to import and manipulate the data into a MySQL relational database to be used to power a frontend PHP script that we also developed. Now, working with a dataset this large or larger is not the simplest of tasks and in working on it we wanted to take a moment to share some of the things you should consider and know when working with large datasets like this.
I can’t stress this first step enough! Make sure that you take the time to analyze the data you are working with before importing it at all. Getting an understand of what all of the data represents, what columns related to what and what type of manipulation you need to will end up saving you time in the long run.
Importing large data files like the ones we worked with (and larger ones) can be tough to do if you go ahead and try a regular CSV insert via a tool like PHPMyAdmin. Not only will it fail in many cases because your server won’t be able to handle a file upload as large as some of your data files due to upload size restrictions and server timeouts, but even if it does succeed, the process could take hours depending our your hardware. The SQL function LOAD DATA INFILE was created to handle these large datasets and will significantly reduce the time it takes to handle the import process. Of note, this can be executed through PHPMyAdmin, but you may still have file upload issues. In that case you can upload the files manually to your server and then execute from PHPMyAdmin (see their manual for more info) or execute the command via your SSH console (assuming you have your own server)
LOAD DATA INFILE '/mylargefile.csv' INTO TABLE temp_data FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY 'n'
Large or small database it’s always good to take a little time to consider which database engine you are going to use for your project. The two main engines you are going to read about are MYISAM and InnoDB and each has their own benefits and drawbacks. In brief the things to consider (in general) are as follows:
– Lower Memory Usage
– Allows for Full-Text Searching
– Table Level Locking – Locks Entire Table on Write
– Great for Read-Intensive Applications
– Uses More Memory
– No Full-Text Search Support
– Faster Performance
– Row Level Locking – Locks Single Row on Write
– Great for Read/Write Intensive Applications
Your databases design/structure is going to be a large factor in how it performs. Take your time when it comes to planning out the different fields and analyze the data to figure out what the best field types, defaults and field length. You want to accommodate for the right amounts of data and try to avoid varchar columns and overly large data types when the data doesn’t warrant it. As an additional step after you are done with your database, you make want to see what MySQL suggests as field types for all of your different fields. You can do this by executing the following SQL command:
ANALYZE TABLE my_big_table
The result will be a description of each columns information along with a recommendation for what type of datatype it should be along with a proper length. Now you don’t necessarily need to follow the recommendations as they are based solely on existing data, but it may help put you on the right track and get you thinking
For a dataset as large as this it’s infinitely important to create proper indexes on your data based off of what you need to do with the data on the frontend, BUT if you plan to manipulate the data beforehand refrain from placing too many indexes on the data. Not only will it will make your SQL table larger, but it will also slow down certain operations like column additions, subtractions and additional indexing. With our dataset we needed to take the information we just imported and break it into several different tables to create a relational structure as well as take certain columns and split the information into additional columns. We placed an index on the bare minimum of columns that we knew would help us with the manipulation. All in all, we took 1 large table consisting of 20 million rows of data and split its information into 6 different tables with pieces of the main data in them along with newly created data based off the existing content. We did all of this by writing small PHP scripts to parse and move the data around.
A big part of working with large databases from a programming perspective is speed and efficiency. Getting all of the data into your database is great, but if the script you write to access the data is slow, what’s the point? When working with large datasets it’s extremely important that you take the time to understand all of the queries that your script is performing and to create indexes to help those queries where possible. One such way to analyze what your queries are doing is by executing the following SQL command:
EXPLAIN SELECT some_field FROM my_big_table WHERE another_field='Rehatced';
By adding EXPLAIN to the start of your query MySQL will spit out information describing what indexes it tried to use, did use and how it used them. I labeled this point ‘Finding a balance’ because although indexes can help your script perform faster, they can just as easily make it run slower. You need to make sure you index what is needed and only what is needed. Every index consumes disk space and adds to the overhead of the table. Every time you make an edit to your table, you have to rebuild the index for that particular row and the more indexes you have on those rows, the longer it will take. It all comes down to making smart indexes, efficient SQL queries and most importantly benchmarking as you go to understand what each of your queries is doing and how long it’s taking to do it.
As we worked on the database and frontend script, both the client and us started to notice little things that needed changing and that required us to make changes to the database. Some of these changes involved adding/removing columns and changing the column types. As we had already setup a number of indexes on the data, making any of these changes required the server to do some serious work to keep the indexes in place and handle any modifications. On our small VPS server, some of the changes were taking upwards of 6 hours to complete…certainly not helpful to us being able to do speedy development. The solution? Turn off indexes! Sometimes it’s better to turn the indexes off, make your changes and then turn the indexes back on….especially if you have a lot of different changes to make. With the indexes off, the changes took a matter of seconds to minutes versus hours. When we were happy with our changes we simply turned our indexes back on. This of course took quite some time to re-index everything, but it was at least able to re-index everything all at once, reducing the overall time needed to make these changes one by one. Here’s how to do it:
Disable Indexes: ALTER TABLE my_big_table DISABLE KEY
Enable Indexes: ALTER TABLE my_big_table ENABLE KEY
Don’t neglect your server when it comes to making your database and script run quickly. Your hardware needs just as much attention and tuning as your database and script does. In particular it’s important to look at your MySQL configuration file to see what changes you can make to better enhance its performance. A great little tool that we’ve come across is the MySQL Tuner (http://mysqltuner.com/). It’s a quick little Perl script that you can download right to your server and run via SSH to see what changes you might want to make to your configuration. Note that you should actively use your frontend script and database for several days before running the tuner so that the tuner has data to analyze. Running it on a fresh server will only provide minimal information and tuning options. We found it great to use the tuner script every few days for the two weeks to see what recommendations it would come up with and at the end we had significantly increased the databases performance.
Working with SQL can be challenging to begin with and working with extremely large datasets only makes it that much harder. Don’t be afraid to go to professionals who know what they are doing when it comes to large datasets. Ultimately you will end up with a superior product, quicker development and quicker frontend performance. When it comes to large Databases sometimes it’s take a professionals experienced eyes to find all the little caveats that could be slowing your Databases performance.