Dead simple MongoDB to Hive Connector

At OneFold, we build software that allows our users to use SQL to query across multiple data sources. For example, one of our customers uses OneFold to run queries across his Mixpanel, MongoDB and Stripe data (via SQL JOIN). At the heart of the OneFold platform is an ETL engine that transforms and loads semi-structured data (e.g. JSON) from various data sources into traditional table-based data warehouses like Redshift, Google BigQuery and Hive.

One of the reasons we created OneFold was to simplify the loading of semi-structured data into traditional warehouses. With semi-structured data like JSON, there are three challenges that we've identified:

  1. JSON has weak data types. An attribute "zipcode" can hold an integer in one JSON object and a string in another. With data warehouses, the user needs to define the data type of a column ahead of time, and it's usually hard to change afterwards.

  2. JSON doesn't have a schema. Because of this flexibility, new attributes are added over time, and data engineers usually have to play catch-up, adding new columns to accommodate these new attributes.

  3. Nested and array data types. JSON has nested and array structures that don't translate well into a typical data warehouse table schema.

Recently, we open-sourced our MongoDB to Hive ETL engine. Now, MongoDB has open-sourced a connector (which can be found here) that lets users create a Hive table whose underlying data lives in MongoDB, but the user needs to specify the schema during table creation, which can be a big challenge.

Our ETL engine doesn't require the user to know the structure of the MongoDB collection. In fact, the goal of our ETL engine is to eliminate all of the challenges listed above. It first parses through all the incoming JSON objects and creates a schema. Then it compares the new schema with the one in the data warehouse and applies the delta. Nested attributes are flattened, and array attributes are loaded into separate child tables with the proper parent-child key.
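
To make this concrete, here is a toy Python sketch of the core transform. It's an illustration of the idea, not the actual OneFold code: nested objects are flattened into underscore-separated columns, and array attributes are split out into child records that point back to the parent row via a hash code.

import hashlib
import json

def flatten(obj, prefix=""):
    # Flatten nested dicts into underscore-separated columns;
    # collect arrays separately so they can become child tables.
    row, arrays = {}, {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            sub_row, sub_arrays = flatten(value, name + "_")
            row.update(sub_row)
            arrays.update(sub_arrays)
        elif isinstance(value, list):
            arrays[name] = value
        else:
            row[name] = value
    return row, arrays

def transform(record):
    # One JSON record becomes a parent row plus child rows,
    # linked by a hash code standing in for the parent-child key.
    row, arrays = flatten(record)
    row["hash_code"] = hashlib.sha1(
        json.dumps(record, sort_keys=True, default=str).encode()).hexdigest()
    children = {}
    for name, items in arrays.items():
        children[name] = [
            dict(item, parent_hash_code=row["hash_code"])  # array of objects
            if isinstance(item, dict)
            else {"parent_hash_code": row["hash_code"], "value": item}  # array of scalars
            for item in items]
    return row, children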

You can download our MongoDB to Hive ETL engine here. Here’s what it does:

  1. Connects to your MongoDB and extracts the specified collection into a local file, which is then copied to HDFS.
  2. A MapReduce job generates the schema (a copy is saved back to MongoDB for reference).
  3. A second MapReduce job transforms the data, splitting array attributes into separate files in the HDFS output folder.
  4. Creates Hive tables using the schema generated in step 2.
  5. Loads the Hive tables using the HDFS files generated in step 3.

Simple case

Say you have a MongoDB collection called test.users with one record in it:


> db.users.find();
{ "_id" : ObjectId("55426ac7151a4b4d32000001"), "mobile" : { "carrier" : "Sprint", "device" : "Samsung" }, "name" : "John Doe", "age" : 24, "utm_campaign" : "Facebook_Offer", "app_version" : "2.4", "address" : { "city" : "Chicago", "zipcode" : 94012 } }

To load this into Hive, run:


./onefold.py --mongo mongodb://[mongodb_host]:[mongodb_port] \
             --source_db test \
             --source_collection users \
             --hiveserver_host [hive_server_host] \
             --hiveserver_port [hive_server_port]

Results:


-- Initializing Hive Util --
Creating file /tmp/onefold_mongo/users/data/1
Executing command: cat /tmp/onefold_mongo/users/data/1 | json/generate-schema-mapper.py | sort | json/generate-schema-reducer.py mongodb://xxx:xxx/test/users_schema > /dev/null
Executing command: cat /tmp/onefold_mongo/users/data/1 | json/transform-data-mapper.py mongodb://xxx:xxx/test/users_schema,/tmp/onefold_mongo/users/data_transform/output > /dev/null
...
Executing command: hadoop fs -mkdir -p onefold_mongo/users/data_transform/output/root
Executing command: hadoop fs -copyFromLocal /tmp/onefold_mongo/users/data_transform/output/root/part-00000 onefold_mongo/users/data_transform/output/root/
..
Executing HiveQL: show tables
Executing HiveQL: create table users (app_version string,utm_campaign string,id_oid string,age int,mobile_device string,name string,address_city string,hash_code string,mobile_carrier string,address_zipcode int) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
Executing HiveQL: load data inpath 'onefold_mongo/users/data_transform/output/root/*' into table users
-------------------
    RUN SUMMARY
-------------------
Extracted data with _id from 55426ac7151a4b4d32000001 to 55426ac7151a4b4d32000001
Extracted files are located at: /tmp/onefold_mongo/users/data/1
Hive Tables: users
Schema is stored in Mongo test.users_schema

In Hive, you can see that a new table users is created:


hive> add jar [install_path]/java/HiveSerdes/target/hive-serdes-1.0-SNAPSHOT.jar;

hive> desc users;
app_version             string                  from deserializer
utm_campaign            string                  from deserializer
id_oid                  string                  from deserializer
age                     int                     from deserializer
mobile_device           string                  from deserializer
name                    string                  from deserializer
address_city            string                  from deserializer
hash_code               string                  from deserializer
mobile_carrier          string                  from deserializer
address_zipcode         int                     from deserializer
Time taken: 0.073 seconds, Fetched: 10 row(s)

hive> select * from users;
2.4     Facebook_Offer  55426ac7151a4b4d32000001        24      Samsung John Doe        Chicago 863a4ddd10579c8fc7e12b5bd1e188ce083eec2d        Sprint  94012
Time taken: 0.07 seconds, Fetched: 1 row(s)

Now let’s add a record with new fields

In Mongo, one new record is added with some new fields:


> db.users.find();
...
{ "_id" : ObjectId("55426c42151a4b4d9e000001"), "hobbies" : [ "reading", "cycling" ], "age" : 34, "work_history" : [ { "to" : "present", "from" : 2013, "name" : "IBM" }, { "to" : 2013, "from" : 2003, "name" : "Bell" } ], "utm_campaign" : "Google", "name" : "Alexander Keith", "app_version" : "2.5", "mobile" : { "device" : "iPhone", "carrier" : "Rogers" }, "address" : { "state" : "Ontario", "zipcode" : "M1K3A5", "street" : "26 Marshall Lane", "city" : "Toronto" } }

New fields were added to the nested address object, and address.zipcode is now a string (it used to be an integer). A new hobbies field is introduced that is a string array, and a new work_history field that is an array of nested objects.
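
Under the hood, this kind of schema evolution is just a merge with type-widening rules: when a column's observed type conflicts with the recorded one, the column is widened to the most permissive common type. Here's a toy Python sketch of the idea (illustrative only, not the engine's actual code):

def infer_type(value):
    # Map a JSON value to a warehouse-ish column type.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "string"

def merge_types(old, new):
    # Widen on conflict: int + float -> float, anything else -> string.
    if old is None or old == new:
        return new
    if set([old, new]) == set(["int", "float"]):
        return "float"
    return "string"

def merge_schema(schema, row):
    # Fold one flattened row into the running schema, returning the
    # DDL deltas (new columns, type changes) that need to be applied.
    deltas = []
    for column, value in row.items():
        new_type = merge_types(schema.get(column), infer_type(value))
        if column not in schema:
            deltas.append("add column %s %s" % (column, new_type))
        elif schema[column] != new_type:
            deltas.append("change %s to %s" % (column, new_type))
        schema[column] = new_type
    return deltas

This is exactly what happens to address_zipcode in the run below: int merged with string widens to string, which turns into an alter table statement.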

Run the command again with the parameters --write_disposition append and --query '{"_id":{"$gt":ObjectId("55426ac7151a4b4d32000001")}}' to tell the program to fetch from MongoDB only records with an _id larger than the previous one, and to append to the existing Hive table during load:


./onefold.py --mongo mongodb://[mongodb_host]:[mongodb_port] \
             --source_db test \
             --source_collection users \
             --hiveserver_host [hive_server_host] \
             --hiveserver_port [hive_server_port] \
             --write_disposition append \
             --query '{"_id":{"$gt":ObjectId("55426f15151a4b4e46000001")}}'

Results:


-- Initializing Hive Util --
...
Executing command: hadoop fs -mkdir -p onefold_mongo/users/data_transform/output/root
Executing command: hadoop fs -copyFromLocal /tmp/onefold_mongo/users/data_transform/output/root/part-00000 onefold_mongo/users/data_transform/output/root/
Executing command: hadoop fs -mkdir -p onefold_mongo/users/data_transform/output/work_history
Executing command: hadoop fs -copyFromLocal /tmp/onefold_mongo/users/data_transform/output/work_history/part-00000 onefold_mongo/users/data_transform/output/work_history/
Executing command: hadoop fs -mkdir -p onefold_mongo/users/data_transform/output/hobbies
Executing command: hadoop fs -copyFromLocal /tmp/onefold_mongo/users/data_transform/output/hobbies/part-00000 onefold_mongo/users/data_transform/output/hobbies/
...
Executing HiveQL: alter table `users` change `address_zipcode` `address_zipcode` string
Executing HiveQL: alter table `users` add columns (`address_state` string)
Executing HiveQL: alter table `users` add columns (`address_street` string)
Executing HiveQL: create table `users_hobbies` (parent_hash_code string,hash_code string,`value` string) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
Executing HiveQL: create table `users_work_history` (parent_hash_code string,hash_code string,`from` int,`name` string,`to` string) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
...
-------------------
    RUN SUMMARY
-------------------
Extracted data with _id from 55426f52151a4b4e5a000001 to 55426f52151a4b4e5a000001
Extracted files are located at: /tmp/onefold_mongo/users/data/1
Hive Tables: users users_hobbies users_work_history
Schema is stored in Mongo test.users_schema

In Hive, two new tables are created: users_hobbies and users_work_history:


hive> show tables;
users
users_hobbies
users_work_history

hive> desc users_hobbies;
OK
parent_hash_code        string                  from deserializer
hash_code               string                  from deserializer
value                   string                  from deserializer
Time taken: 0.068 seconds, Fetched: 3 row(s)

hive> desc users_work_history;
OK
parent_hash_code        string                  from deserializer
hash_code               string                  from deserializer
from                    int                     from deserializer
name                    string                  from deserializer
to                      string                  from deserializer
Time taken: 0.067 seconds, Fetched: 5 row(s)

You can join the parent and child tables like this:


hive> select * from users 
  join users_hobbies on users.hash_code = users_hobbies.parent_hash_code 
  join users_work_history on users.hash_code = users_work_history.parent_hash_code;

Voila! As you can see, the ETL engine takes care of all the guesswork related to schema creation and maintenance, and it handles nested and array data types quite well. More information on its usage can be found in the GitHub repo.

Email Open Tracking using AWS CloudFront

At Plum District, we send out a lot of marketing emails – on some days more than 5 million. It's a big part of our business, and we spend a lot of time tracking send and open rates and running analytics on this data. Luckily, SendGrid can funnel all that data to us via their Event Notification API, which delivers all the events (processed, opened, clicked, etc). We recently acquired another company, but unfortunately they send their emails via Amazon SES, which doesn't have any tracking.

In this article, I'll discuss an innovative way to track email open rates using Amazon CloudFront. The basic mechanism is really just pixel tracking: CloudFront provides a detailed access log that's dumped directly into S3. So we can host the pixel on CloudFront, put the pixel in the email (plus any optional HTTP params that we want to track), and count how many times that pixel is loaded, which gives us the open count plus a bunch of other information, including demographics, most active timeframes, etc.

Here are the steps:

  1. Create an S3 bucket, if you don't have one, to store the pixel. In this case, I've put the pixel under s3n://plum-mms/images/1.gif. Feel free to borrow the pixel here. It's just a 1×1 transparent GIF. Also, make sure that bucket has the appropriate permission, i.e. Everyone → Open / Download.

  2. Create another S3 bucket for logs. In our case, we’ve created a bucket named plum-mms-logs.

  3. Configure a CloudFront distribution using the S3 bucket you created in step 1. Make sure you have the following configured when creating:

    Logging: On. This tells CloudFront to enable logging.
    Cookie Logging: Off. We don't really need that information.
    Log Bucket: This tells CloudFront where to dump the access log. In my case, I selected plum-mms-logs.s3.amazonaws.com, which corresponds to the bucket I created in step 2.
  4. Capture the domain name associated with your new CloudFront distribution. In our case, it’s d2x9v85k2ohcuy.cloudfront.net. You can test that http://d2x9v85k2ohcuy.cloudfront.net/images/1.gif returns the GIF file in a browser.

  5. Insert the pixel in your email template. We want to capture who has opened an email, so we've included the subscriber ID, as well as the email category, as GET parameters. Here's a bit of Ruby code to generate the pixel URL:

    
    pixel_tracking_url = nil
    if subscriber && category
      pixel_tracking_url = "http://d2x9v85k2ohcuy.cloudfront.net/images/1.gif?sid=#{subscriber.id}&category=#{category}"
    end
    
    
  6. And in your email template (.erb file), you can add the code anywhere in the email:

    
    <% if @pixel_tracking_url %>
      <img src="<%= @pixel_tracking_url%>" width="1" height="1" alt=""/>
    <% end %>
    
    

Now we are ready to roll! You can send the email to a few test email accounts, and see if you are getting the logs. It usually takes a few hours for CloudFront to push the log files out to your logging bucket.
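
Once the logs start landing, counting opens is just a matter of walking the gzipped log files and tallying the sid query parameter. Here's a rough Python sketch, assuming you've synced the log bucket down to a local logs/ folder; the column names come from the #Fields: header of the standard tab-separated CloudFront access log format:

import glob
import gzip
import urlparse  # urllib.parse on Python 3

opens = {}

for path in glob.glob("logs/*.gz"):  # local copy of the log bucket
    log = gzip.open(path)
    fields = []
    for line in log:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names from the header
            continue
        if line.startswith("#") or not fields:
            continue
        cols = dict(zip(fields, line.rstrip("\n").split("\t")))
        if cols.get("cs-uri-stem") != "/images/1.gif":
            continue  # not our tracking pixel
        query = urlparse.parse_qs(cols.get("cs-uri-query", ""))
        for sid in query.get("sid", []):
            opens[sid] = opens.get(sid, 0) + 1
    log.close()

# top 10 subscribers by open count
for sid, count in sorted(opens.items(), key=lambda kv: -kv[1])[:10]:
    print sid, count

From here it's easy to join the sid back to your subscriber table and slice by category, time of day, and so on.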

HOWTO: Deploy a fault tolerant Django app on AWS – Part 2: Moving static and media files to S3

In the last article, I discussed our attempt to remove points of failure in our infrastructure, and increase redundancy. We moved our single database instance running locally to RDS where fault tolerance is built-in through their multi-zone offering.

In this article, I'll continue this journey by moving our Django static and media files from the local file system to S3. Static files are files in the [app]/static folder, where typically Javascript, CSS, static images and 3rd party Javascript libraries are stored. Media files are user-generated files, uploaded through the use of FileField and ImageField in Django models, e.g. the profile picture of a user, or a photo of an item. By default, when you create a Django application using the standard "django-admin.py startproject", all the media files are stored in the [app]/media folder and static files in the [app]/static folder. The locations are controlled by the following parameters in settings.py:


# Absolute filesystem path to the directory that will hold user-uploaded files.
# Example: "/home/media/media.lawrence.com/media/"
MEDIA_ROOT = ''

# URL that handles the media served from MEDIA_ROOT. Make sure to use a
# trailing slash.
# Examples: "http://media.lawrence.com/media/", "http://example.com/media/"
MEDIA_URL = ''

# Absolute path to the directory static files should be collected to.
# Don't put anything in this directory yourself; store your static files
# in apps' "static/" subdirectories and in STATICFILES_DIRS.
# Example: "/home/media/media.lawrence.com/static/"
STATIC_ROOT = ''

# URL prefix for static files.
# Example: "http://media.lawrence.com/static/"
STATIC_URL = '/static/'

So, why do we need to move these files off the EC2 local file system? It's a prerequisite to spinning up multiple EC2 instances that host the Django application. Specifically, we can't have media files sitting in two locations. For example, when a user updates his or her profile picture, the POST request goes to one server, so the new image would be stored on that server's local file system, which is bad because the other app server won't have access to it (unless you set up some shared folder between the instances, which is what was typically done before Jeff Bezos gave us S3). By moving the static and media files to S3, both servers use the same S3 endpoints to store and retrieve these files. Another HUGE plus is that the web servers (Apache or nginx) no longer have to handle these static file requests, so the disk and network load on the web servers drops drastically.

Enough talking. First things first: we need to download and install django-storages and boto.


pip install django-storages boto

Now, create an S3 bucket. This part is easy. Log into the AWS console, click over to S3 and click Create Bucket. Give it a name. For this example, we'll use "spotivate". All our static and media files will be accessed through http://spotivate.s3.amazonaws.com/static/... and http://spotivate.s3.amazonaws.com/media/... respectively.

Also, we need to get the AWS key and secret that boto needs to access S3. You can find them on your AWS Security Credentials page.

Now we have all the info we need to change the Django settings. The instructions here are loosely based on various articles I've read, but Phil Gyford's article was the most helpful. Following his instructions, I first created spotivate/s3utils.py with the following content:


from storages.backends.s3boto import S3BotoStorage

StaticS3BotoStorage = lambda: S3BotoStorage(location='static')
MediaS3BotoStorage = lambda: S3BotoStorage(location='media')

Then, in settings.py, I added storages to INSTALLED_APPS, along with a bunch of other variables that tell Django where to put and read media and static files:


INSTALLED_APPS = (
    ...
    ...
    'storages',
)

...
...

###################################
# s3 storage
###################################

DEFAULT_FILE_STORAGE = 'spotivate.s3utils.MediaS3BotoStorage' 
STATICFILES_STORAGE = 'spotivate.s3utils.StaticS3BotoStorage' 

AWS_ACCESS_KEY_ID="xxxxxxxxxx"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxx"
AWS_STORAGE_BUCKET_NAME = 'spotivate'

S3_URL = 'http://%s.s3.amazonaws.com/' % AWS_STORAGE_BUCKET_NAME
STATIC_DIRECTORY = '/static/'
MEDIA_DIRECTORY = '/media/'
STATIC_URL = S3_URL + STATIC_DIRECTORY
MEDIA_URL = S3_URL + MEDIA_DIRECTORY

Voila. We are almost done. To upload all the static files to S3, run the following command:


python manage.py collectstatic

This will copy all the files in your current static folder to S3. What about media files? We need to upload those to S3 at least once. Why only once? Because once the settings above are deployed, new uploads (such as updated profile pics) will be posted straight to S3. I found a great Python package called boto-rsync that does the job beautifully.


pip install boto_rsync
boto-rsync media s3://spotivate/media -a [AWS_ACCESS_KEY_ID] -s [AWS_SECRET_ACCESS_KEY]

Verify in the AWS console that all static and media files have indeed been copied to S3. Deploy the server, and hit a page. You should see that references to Javascript, CSS and media files all point to S3.

It actually didn't turn out so easy for me the first time around. I found that many CSS files were still served from the local file system. After looking at the template, I realized that I had this in it:


<link href="/static/web/bootstrap230/css/bootstrap.css" rel="stylesheet" type="text/css" charset="utf-8">
<link href="/static/web/jcarousel/css/style.css" rel="stylesheet" type="text/css" charset="utf-8">
<link href="/static/web/css/spotivate_new.css" rel="stylesheet" type="text/css" charset="utf-8">

I wasn't using the Django "staticfiles" functionality properly. I had effectively hard-coded the static path, when I should have been using the static template tag instead. The lines above should be changed to:


{% load staticfiles %}
...
...
<link href="{% static "web/bootstrap230/css/bootstrap.css" %}" rel="stylesheet" type="text/css" charset="utf-8">
<link href="{% static "web/jcarousel/css/style.css" %}" rel="stylesheet" type="text/css" charset="utf-8">
<link href="{% static "web/css/spotivate_new.css" %}" rel="stylesheet" type="text/css" charset="utf-8">

The server is now functioning properly, but we are not done yet. What if we need to modify Javascript? How do changes get copied to S3 during deployment? This doc provides good instructions on this topic.
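
If your deploy is scripted, pushing static changes to S3 is just one more step. Here's a minimal sketch assuming a Fabric-style deploy task and the paths used earlier in this series (illustrative, not our actual deploy script):

from fabric.api import cd, run

def deploy():
    # Hypothetical deploy task: pull code, then push static changes to S3.
    with cd("/home/ec2-user/src/spotivate"):
        run("git pull")
        # collectstatic re-uploads changed static files through the
        # S3 storage backend configured in settings.py above.
        run("python manage.py collectstatic --noinput")
        run("sudo service httpd restart")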

Now, with the static and media files moved over to S3, and the database moved over to RDS, I've effectively removed all state from the app server. I can spin up another EC2 instance, drop my code there, and spread the traffic across two servers. If one goes down, we are still in business! And did I mention that pages load a lot faster too?

HOWTO: Deploy a fault tolerant Django app on AWS – Part 1: Migrate local MySQL to AWS RDS

For a while, Spotivate was running on a single EC2 instance. Everything was in it — MySQL, Django, static files, etc. Yes, we knew this was a terrible setup: single point of failure, bad performance, and so on. Here come the excuses: we had better things to do, like customer development, sales, design and product development. We had no time for ops! Plus, our traffic wasn't really that high, especially in the beginning, and our CPU / IO load was low. And we knew we could fix things fairly easily. Then one day, our EC2 instance went down for half an hour. Oops! We called AWS support: they had a disk failure. Our last snapshot was a day old, so our site was down that whole time.

We figured we had to do it right. And AWS makes it super easy. Our goals:

  • Remove all single points of failure, thus making the system fully fault tolerant.
  • As a result, response times should improve, especially when under load.

Here's the plan: move the database to RDS first, then move the static and media files to S3 (covered in Part 2). In this article, I'll talk about the steps we took to move our MySQL to RDS.

If you don't know what RDS is, read more about it here. Basically, it's AWS's managed database server. RDS comes loaded with features; here's a summary of what's relevant:

  • Easy to deploy via the Management Console or command line.
  • Automatic backup (you get to choose how many days and when).
  • Multi-availability zone deployment means AWS automatically creates a primary DB instance and synchronously replicates the data to a standby instance in a different Availability Zone, thereby removing this as a single point of failure.
  • Replication that allows you to create read-only replicas. This is especially valuable for Spotivate, since our personalized email server puts a heavy load on the DB. With a read replica, the performance of our website won't be affected while we send out our weekly emails.

Well, let’s get on with it.

Step 1: Go to your Management Console and select RDS

Launch Database Instance


Step 2: Find a database server that fits the bill. In our case, MySQL.

Select database type


Step 3: Here’s where you pick the MySQL version and the instance size.

RDS Step 3

Multi-AZ Deployment: Select "Yes", which creates a standby instance in a different AZ. That's the whole point of this article, right?
Allocated Storage: Choose a storage size that's appropriate. Go small, as you can easily upgrade later with minimal downtime. Generally, estimate enough for 3 months down the road.
DB Instance Identifier: This is just the prefix to the public DNS.
Master Username: Your database user name, typically "root".
Master Password: Your database root user password.


Step 4: Here you specify the database name, port, etc.

You also get to create (or assign) a database security group for this database. This is a little different from an EC2 security group: for a database security group, you assign which EC2 security group to use, and any EC2 instance that belongs to that EC2 security group has access to the database. By default, everything else is turned off, including ping. For more info, visit here.

RDS Step 4


Step 5: Backup Settings

Here, you specify the backup retention period, and when to backup. Make sure your backup window and maintenance window don’t overlap.

RDS Step 5


Step 6: That’s it. Review and Launch.

RDS Step 6


Step 7: Test it out.

After the DB has been launched (takes several minutes – enough time for coffee), you can find the public DNS on the detail page. This hostname is accessible externally and within EC2; however, the security group by default prohibits any external access to the database server. Only EC2 instances that belong to the assigned security group have access. From my web server, I can use the typical "mysql" command to connect to the new RDS instance.

RDS Step 7


Step 8: Import.

Our database is fairly small, so we can just dump the database and pipe it to the new instance. Here's a fun command that you can use (make sure you stop your web server first to avoid consistency issues):

mysqldump [your current db] | mysql --host=[rds host name] --user=root --password=[root password] [your current db]

That's it! All you need to do now is change your Django settings to use the new database instance. Bring down your local MySQL and restart your Django server to see if everything is running properly. If so, use chkconfig to keep the local MySQL from starting up again.
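
On Amazon Linux, assuming the stock service name mysqld, that looks like:

sudo chkconfig mysqld off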

Next time, I’ll talk about the migration of our static files to S3.

HOWTO: Share a Nav Bar between Django and WordPress

At Spotivate, we use Django as our backend infrastructure. A few months ago, we wanted to put up a blog, and WordPress was the obvious choice due to the number of tools and plugins available. We also needed a degree of WordPress customization that neither WordPress.com nor Tumblr could provide. Ideally, both Django and WordPress would be hosted on the same server and accessible through the same domain name (www.spotivate.com). Not a lot has been written about how Django and WordPress can live happily under one roof, and even less on how to share UI components between the two. In our case, we wanted to share the main nav bar between the two frameworks, so that users don't feel that they are leaving the Spotivate experience when they are reading our blog.

Here were the design requirements:

  1. Django and WordPress both run on the same (EC2) instance, both through Apache HTTP Server, and both using the same MySQL instance.
  2. There is a main nav bar on the Spotivate web site, and we wanted this nav bar on our blog as well. The nav bar indicates whether the current user is logged in (and if so, shows a user thumbnail), the currently selected tab, and various statistics about the logged-in user.
  3. URL Namespace
    • /blog goes to WordPress
    • Everything else goes to Django

Here’s what our main nav bar looks like:

Spotivate Main Nav Bar

A bit of background: our Django environment runs under Apache through WSGI. Before WordPress, we had it set up so that all traffic goes to Django (with the exception of static files). Here's a snippet from our httpd.conf file that enabled this:

<VirtualHost *:80>
ServerName www.spotivate.com
WSGIScriptAlias / "/home/ec2-user/src/spotivate/server/apache/django.wsgi"

<Directory "/home/ec2-user/src/spotivate/server/apache">
Order allow,deny
Allow from all
</Directory>

Alias /static/admin/ "/usr/lib/python2.6/site-packages/django/contrib/admin/static/admin/"
<Directory "/usr/lib/python2.6/site-packages/django/contrib/admin/static/admin/">
Order allow,deny
Allow from all
</Directory>

Alias /static/ "/home/ec2-user/src/spotivate/server/static/"
<Directory "/home/ec2-user/src/spotivate/server/static/">
Order allow,deny
Allow from all
</Directory>

</VirtualHost>

We installed WordPress under our Django "static" folder, but it can be any folder really. We followed the normal WordPress installation procedure and installed the database in our current MySQL instance.

Then we changed our httpd.conf by adding the following directives to our existing VirtualHost block:

<VirtualHost *:80>
...
...
# wordpress blog
Alias /blog "/home/ec2-user/src/spotivate/server/static/blog"
<Directory "/home/ec2-user/src/spotivate/server/static/blog">
Order allow,deny
Allow from all
AllowOverride FileInfo
</Directory>
...
...
</VirtualHost>

Restart the httpd server. Now, we are able to go to http://www.spotivate.com/blog/wp-admin and log in. The next step is to tell WordPress that it lives under /blog by going to Settings >> General. There, change both WordPress Address and Site Address to www.spotivate.com/blog.

Spotivate Blog General Settings

At this point, we have ourselves a fairly functional stand-alone WordPress blog, showing the default Hello World post under http://www.spotivate.com/blog. Also, all our Django code still works. Great! Now, onto the harder task: adding our nav bar to WordPress by mucking around with header.php and footer.php. We are using the Genesis WordPress framework, but this trick should work for most themes.

The nav bar and any accompanying logic (CSS / Javascript) will be served from Django. Before we had the blog, the nav bar was a piece of HTML in our base template, which also contained a reference to the CSS file that controls how it looks, and a reference to the Javascript file that handles hover events, drop-down menus, event logging, etc. We still want to host all of this code in Django, since only Django knows who the logged-in user is. To make the nav bar work externally, we need to chop up our base template so that the nav bar can be reused in WordPress. Then, over in WordPress, we modify header.php and footer.php to call Django for the nav bar code and add it to the page dynamically via Javascript.

On Django side, we create two URL mappings:

/blog_head

This will return the markup that goes into the <head> tag of the blog, including our CSS and all the necessary Javascript (i.e. our Javascript, Google Analytics, jQuery, and the FB and Twitter APIs). Have a look!

/blog_navbar

This will return the HTML of the nav bar: no body or html tags, just the div. Check out the source to see what I mean, both as an authenticated and an unauthenticated user.
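
The Django side is just two tiny views plus the URL mappings. Here's a minimal sketch; the module and template names are illustrative, not our actual code:

# views.py
from django.shortcuts import render

def blog_head(request):
    # The shared <head> includes: CSS, Javascript, analytics snippets.
    return render(request, 'blog/head.html')

def blog_navbar(request):
    # Just the nav bar div; the template renders the logged-in or
    # logged-out variant based on request.user.
    return render(request, 'blog/navbar.html')

Wire these up to /blog_head and /blog_navbar in urls.py and the WordPress side below can fetch them.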

Once we got Django to serve the above URLs properly, we can modify header.php by adding this block of code just before </head>. This pretty much adds all our Javascript and CSS code to our blog, and also allows us to reuse CSS definitions there.

...
...
<!-- Django integration: read from /blog_head -->
<?php
$contents = file_get_contents('http://www.spotivate.com/blog_head');
echo $contents;
?>
</head>

Finally, modify footer.php by adding this block of code just before </body>.

...
...
<!-- Django integration: read from /blog_navbar -->
<div id="spotivate-blog-navbar" style="z-index:1000; position: fixed; height: 45px;">
</div>
<script>
$(function() {
$('#spotivate-blog-navbar').load('/blog_navbar');
});
</script>
</body>

The first bit of code defines an empty nav bar div with id spotivate-blog-navbar, with fixed positioning and a pre-defined height. We wanted the nav bar to float over any blog content, just like on our main website. Then there's a bit of Javascript that loads /blog_navbar (which calls our Django view mentioned above) and stuffs the content into the spotivate-blog-navbar div.

To be honest, it's not the most pleasing user experience, as the user will see the blog post first, and a split second later the nav bar shows up. But it works!

Here’s how our website looks:

Spotivate Website

And here’s how our blog looks:

Spotivate Blog

We spent quite a bit of time making sure the experience feels integrated, and we feel like we have achieved that.

Do you have any other tips and gotchas while integrating Django and WordPress?

HOW TO: Install WordPress on Amazon EC2

First things first: I am kinda new to WordPress. (Yes, I said I am behind in technologies in my previous post, didn't I?) I set up a WordPress blog for Spotivate a few months ago. Today, I set up my own WordPress on an Amazon EC2 micro instance. Here are the steps I took.

Christophe Coenraets has an amazing tutorial on how to install WordPress on EC2 here, so I am not going to repeat it. It really was that simple, and took less than 5 minutes as advertised.

A few comments:

  • I didn’t create a small instance. Instead, I chose a (Free) micro instance. Why? Well, it’s free. Also, micro will do for now given I have no traffic. Plus, I want to show you how to migrate / upgrade from micro to small later!
  • There are a few typos in the tutorial:
    • mysql_secure_Installation should be mysql_secure_installation
    • tar -xzvf latest.tar.gzcd should be tar -xzvf latest.tar.gz
  • If you followed the instructions given by Christophe, the owner of the blog folder is "root", while Apache runs as "apache" by default. You will have problems uploading plugins and media files later on, since apache != root. Change the owner of /var/www/html/blog to apache:apache by running this command:
    sudo chown -R apache:apache /var/www/html/blog

Now that the URL http://www.jorgechang.com/blog is up, I want to make it so that http://www.jorgechang.com brings the user there as well. There are a few ways I can do that.

  1. Move my blog folder to the root, as described here. Yes, it looks complicated, but it's really not; it's just a matter of moving the blog directory and reconfiguring a few things like Permalinks. I decided against that because I want to keep the URLs of my blog posts under /blog/, e.g. /blog/hello-world. I have other projects in mind, and want freedom over my URL namespace in the future. REMEMBER: once your blog post is published, you need to make sure that the URL works forever, as other people will likely link to your post if it's any good. You need to maintain backward compatibility whenever you change your URL structure, so it's better to keep all blog-related activities isolated to /blog.
  2. Set up an HTTP redirect so that end users are redirected to http://www.jorgechang.com/blog.

I am going with the redirect method. Now, there are different kinds of redirects. SEOmoz has an excellent article here that describes HTTP redirection in detail. Basically, there are three main types of redirect:

  1. 301 (Moved permanently)
  2. 302 (Moved temporarily)
  3. Meta Refresh

Option 3 requires the most work for everyone, since I need to write an index.html with a meta tag, and the end user's browser needs to do more work (load the page, parse it, execute the meta refresh, etc), which makes it slower.

The difference between 1 and 2 is very subtle, and mainly impacts how search engines crawl and index your pages. 301 is the most suitable option, since I don’t plan on having anything other than my blog on my home page in the foreseeable future.

Edit /etc/httpd/conf/httpd.conf and stick this block of code to the end of the file:

RewriteEngine On
RedirectMatch 301 /index.html /blog

Restart Apache by running this command:

sudo service httpd restart

Now, the URLs for my blog posts look something like /blog?p=1. That doesn't look very pretty, and it also hurts SEO. Here's how you can make them look more like /blog/hello-world.

Again, edit /etc/httpd/conf/httpd.conf, look for the following blocks of code, and change AllowOverride from None to All. Restart the Apache HTTP server afterwards.

<Directory />
Options FollowSymLinks
AllowOverride All
</Directory>
<Directory "/var/www/html">
Options Indexes FollowSymLinks
AllowOverride All
Order allow,deny
Allow from all
</Directory>

Create an .htaccess file in your blog folder. If you followed the instructions outlined by Christophe Coenraets, your blog folder is /var/www/html/blog. A lot of examples online tell you to create an empty .htaccess file, chmod it to 666, and let the WordPress admin handle the changes. I highly recommend against that due to the security risk; most people forget to change it back to 644. Instead, simply create the file with the following content:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /blog/
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /blog/index.php [L]
</IfModule>

Now, go to your WordPress admin. Navigate over to Settings >> Permalinks. Here’s what you should see.

Wordpress Permalink Settings

Pick one that is suitable for you. Yoast suggested that it's best to stick with "Post name" to give your posts a timeless look, so I followed his recommendation. Click Save Changes and voila, your blog post URLs are now readable and timeless.

Next: Themes, plug-ins, custom CSS, nav bar, and more! (Did I say the more I learn, the more behind I feel?)

Hello world!

We are living in a very exciting time. New technologies are invented every day, and although it's hard to stay up-to-date with all the latest stuff, it's also fun to discover tools and services that others have built, and how they can in turn help you invent tools and services for others!

Although I have been in the software industry since 1998, I feel “behind” sometimes. The funny thing is the more I learn, the more behind I feel. Such is the phenomenon that is the Internet, where information about everything is just a Google search away.

This blog will be a documentary of my journey through discovering and learning new technologies — mainly ones related to web and mobile development. My meager attempt to stay less behind. Once in a while, I’ll write about entrepreneurship and my thoughts on starting and running a startup, and building a product.

A bit about myself. I am the Co-Founder at Spotivate. Before that I held management and engineering roles at Telenav, SugarSync and Ingrian Networks (now Safenet).

You can reach me at jorge (at) jorgechang.com.

Thanks for reading!