For most of our projects that involve large datasets we use either PostgreSQL or MySQL to conduct analysis and feed data-driven web sites. In general, we’re happy with these tools, particularly PostgreSQL and its spatial functionality, and find that they meet our data needs. PostgreSQL has also been fairly innovative, recently adding functionality focused on web applications including native support for JSON. PCWorld did a nice article on this.
PostgreSQL’s adaptation of JSON was likely driven, in part by the success of recent non-relational, ‘NoSQL’ databases such as MongoDB. Although we haven’t yet found the need to use MongoDB in a ‘real-world’ setting, we wanted to at least familiarize ourselves with the database in case the need arises.
This post relies very heavily on MongoDB tutorials on their main page (and to a certain extent you might be better off going to the source). For example, they have a nice install tutorial and a getting started tutorial that you might take a look at.
1. Download and install MongoDB
Installation was straightforward. We installed version 2.6.3 on a 64-bit Windows machine using the download links here. No problems.
2. Start the Server
As you might expect, similar to PostgreSQL and MySQL you’ll need to start the server before working with databases. To do this, in a console, you would execute a command like the following where mongod (shorthand for mongod.exe) is the server and the argument –dbpath specifies the data path:
"C:/Program Files/MongoDB 2.6 Standard/bin/mongod" --dbpath "d:/mongodb_data"
Note that the waiting for connections
message in the console indicates that the mongod.exe process is running successfully. Also, of course, depending on your version and computer set up, the path to mongod may be different.
3. Start MongoDB
Previously, you started the server so you're now ready to connect. In order to do this, open a new console window and run the mongo executable:
"C:/Program Files/MongoDB 2.6 Standard/bin/mongo"
This command will run the MongoDB shell and connect to a database.
4. Play With MongoDB
To see what database you're currently using you can type:
db
To see what databases are available to you try:
show dbs
To change to another database (replace 'newDBname' with the name of the database):
use newDBname
Note that to create a new database you can use the same code as above and if the named database does not exist it will be created. So I'm creating a new database called newtestDB using:
use newtestDB
Now I have a new database and I'm ready to add some initial data. First I'll create a couple of objects. MongoDB uses JSON-style documents so you can simply use JavaScript to create objects. For example, the following code creates two objects (but they are temporary) :
j = { name : "mongo" }
q = { name : "postgres" }
If you want the objects j
and q
to persist you would add them to what is called a 'collection'. A collection is a 'grouping of MongoDB documents' and is similar to an SQL table. You can create the collection and add objects to the collection with one command:
db.firstdata.insert(j)
db.firstdata.insert(q)
This code creates a new collection called firstdata
and inserts j
and then inserts q
into the same collection.
If you want to see what is in the collection you can use find
as in:
db.firstdata.find()
Or, to find a specific document you can use something like:
db.firstdata.find({name: 'mongo'})
5. Import Real Data
Naturally, you're not likely to manually type in your data so you need to know the syntax for import. MongoDB has a handy utility for doing just this. For the purposes of this demonstration, I'm going to import a comma-separated file with about 7.5 million records and 8 variables (the table is called udc_percentile
). The code I use is:
"C:/Program Files/MongoDB 2.6 Standard/bin/mongoimport" --db newtestDB --collection udc_percentile --type csv --headerline --file "udc_percentile.csv"
This command will run the MongoDB shell and connect to a database.
In total, the import took about 4.5 minutes.
6. Add an Index and Time Your Query
For the final piece of the experiment with MongoDB I wanted to time a query before and after adding a compound index. In order to do the timing, MongoDB has what they call a 'profiler' (details can be found here). In order to turn on the profiler I used the following code:
db.setProfilingLevel(2)
where 2 is the profiling level -- setting the profiler to collect data for all database operations.
Now I'll run a query and take a look at the time. This query looks for documents where the CHEM_CODE = 'ENDO' etc and the $gt ('greater than') specifies that I'm looking for PCT_VALUE>1:
db.udc_percentile.find({CHEM_CODE:'ENDO', SITE_CODE: '22005', APPLIC_YR: 2007, PCT_VALUE: {$gt: 1}})
To retrieve details on the performance of the query you would use the following command:
db.system.profile.find( { op: { $ne : 'command' } } ).pretty()
Here is a snapshot of a small piece of the output. Notice where it says millis
which is the number of milliseconds for the query and you can see that this query took 2811 milliseconds (2.8 seconds).
We can improve performance by adding an index. You can find more detail on MongoDB indexes here. For this particular collection we will add the index on CHEM_CODE, SITE_CODE and APPLIC_YR using the code below (note that the number 1 refers to an ascending index):
db.udc_percentile.ensureIndex( { CHEM_CODE: 1, SITE_CODE: 1, APPLIC_YR: 1 } )
Now we can re-run our query from above and check on the milliseconds.
Definitely a big improvements -- from 2811 milliseconds to 6 milliseconds. For comparison, I imported the same table to a PostgreSQL database using the same drive, same computer. The initial import (using a COPY statement) took 52 seconds. The initial query with no index took 14051 milliseconds. The query with an index was comparable to MongoDB at 10 ms.
The code I used for import/select/index in PostgreSQL was:
COPY udc_percentiles FROM 'udc_percentile.csv' DELIMITER ',' CSV HEADER;
SELECT * FROM udc_percentiles where chem_code = 'ENDO' and site_code ='22005' and applic_yr = 2007 and PCT_VALUE>1;
CREATE INDEX blah ON udc_percentiles (chem_code, site_code, applic_yr);
SELECT * FROM udc_percentiles where chem_code = 'ENDO' and site_code ='22005' and applic_yr = 2007 and PCT_VALUE>1;