Migrating underWater desert Blogging to Jekyll

WordPress is a great system. A really great system, actually, especially for non-technical users. But it carries a huge amount of overhead for what most people use it for, and it is not nearly as fast or reliable as I would like it to be. So I decided to move my blog from WordPress to Jekyll. I had been using the same blog theme for about four years, so I also wanted to refactor that. Here's what I did to migrate it.

Step 1: Install Ben Balter's Migrate to Jekyll Plugin

First, I downloaded Ben's migrate plugin for WordPress. I unzipped it into the plugins directory of my WordPress install (which I deploy from my workstation using git), enabled the plugin, and clicked Tools -> Export to Jekyll. The server chews on it for a while and spits out a decent zip. The zip file has the templates (which I didn't use, as I was rebuilding them from scratch and didn't care what the old ones looked like), the posts, and the pages you have on your blog.

There were a few annoyances with the output, though. They are not a problem with the exporter, which worked as I expected it would. They mostly come from the rust that has built up in the code and database of my blog. That rust was one of the main reasons I wanted to switch to Jekyll. Since a Jekyll site is simply plain text files, it is much simpler to iterate over the files and make the changes.

Step 2: Build the Jekyll Deploy How I Wanted it to Look

Next, I worked on the layouts, includes, HTML files, config, and Rakefile, and set everything up how I wanted it to be. This was simple web design, and there are lots of tutorials which were quite helpful in this endeavour. Once the Jekyll deployment looked and acted how I wanted it to, it was time to chew on the files.

Step 3: Build a Ruby Script to Reformulate the Posts How I Wanted

The first problem I had with the export was that the YAML front matter was loaded with a lot of crap that I neither needed nor wanted in Jekyll. So I thought I would build a small Ruby script to clean the YAML front matter. Here's the part of the script that does that.

  def clean_the_yaml( yaml_block, first_para )
    new_yaml = {}
    new_yaml["layout"] = "post"
    # Swap inner double quotes for single quotes, then wrap the title in
    # double quotes so colons and other punctuation do not break the YAML.
    if yaml_block["title"] && yaml_block["title"] != ""
      new_yaml["title"] = "\"" + yaml_block["title"].gsub("\"", "'") + "\""
    else
      new_yaml["title"] = ""
    end
    new_yaml["published"] = "true"
    new_yaml["comments"] = "true"
    new_yaml["meta"] = "true"
    if yaml_block["categories"]
      new_yaml["category"] = yaml_block["categories"].first.downcase 
    else
      new_yaml["category"] = yaml_block["category"] || "unclassified"
    end
    new_yaml["tags"] = yaml_block["tags"] if yaml_block["tags"]
    if yaml_block["excerpt"]
      new_yaml["excerpt"] = yaml_block["excerpt"].gsub("\n", "")
      new_yaml["excerpt"] = new_yaml["excerpt"][2..-1].strip if new_yaml["excerpt"][0] == ">"
    else
      new_yaml["excerpt"] = first_para.gsub("\"", "'")
    end
    # Drop a pair of wrapping double quotes if present, turn any remaining
    # double quotes into single quotes, then wrap the excerpt again.
    excerpt = new_yaml["excerpt"].strip
    wrapped = excerpt.length >= 2 && excerpt.start_with?("\"") && excerpt.end_with?("\"")
    excerpt = excerpt[1..-2] if wrapped
    new_yaml["excerpt"] = "\"" + excerpt.gsub("\"", "'") + "\""
    final_yaml = "---\n\n"
    new_yaml.each{ | head, val | final_yaml << head + ": " + val.to_s + "\n" }
    final_yaml << "\n---\n\n"
  end

That part of the script mainly strips the old rust out of the YAML front matter and ensures that my punctuation will not get in the way of Jekyll doing its job. For instance, I had lots of colons in my titles, and these are a problem in YAML unless you escape them or put the string in double quotes. I also wanted every post to have some sort of an excerpt, so if a post did not have one in WordPress, the script imports the first paragraph, escapes it, and guards against problems with punctuation in the paragraph.
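To make the colon problem concrete, here is a minimal sketch using Ruby's standard yaml library. The sample titles are made up for illustration and are not taken from the real posts.

```ruby
require "yaml"

# Hypothetical sample front matter lines; not from the actual blog.
unquoted = "title: Jekyll: a love story\n"
quoted   = "title: \"Jekyll: a love story\"\n"

begin
  YAML.load(unquoted)
  unquoted_parses = true
rescue Psych::SyntaxError
  # The second colon looks like the start of a nested mapping,
  # which is not allowed inside a plain scalar value.
  unquoted_parses = false
end

# Inside double quotes the colon is just another character.
parsed_title = YAML.load(quoted)["title"]

puts "unquoted parses? #{unquoted_parses}"
puts "quoted title:    #{parsed_title}"
```

The unquoted line fails to parse, while the quoted one comes back with the colon intact, which is why the cleaner wraps every title and excerpt in double quotes.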

The second thing I had to overcome was that the exporter did not handle images embedded in the text the way I wanted. I am not sure whether the exporter was made to export images or not, but I think the images were a problem because I run a multisite deployment, and WordPress handles multisite files much differently than single-site files.

When I looked on my machine in the blogs.dir for the blog I was exporting, I saw another problem: WordPress saves a lot of different versions of the same file. This was going to be a problem when I tried to import the photos into the Jekyll asset pipeline I had built.
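To illustrate the multiple-versions problem: WordPress typically names its resized copies by appending the dimensions to the file name, such as photo-150x150.jpg next to photo.jpg. The sketch below filters a made-up directory listing down to the originals. The helper name and the listing are hypothetical, and this is not what the script further down does (it relies on find and sort instead).

```ruby
# Resized copies usually end in "-WIDTHxHEIGHT" before the extension.
RESIZED = /-\d+x\d+(\.\w+)\z/

# Hypothetical helper: keep only the names that look like originals.
def originals(file_names)
  file_names.reject { |name| name.match?(RESIZED) }
end

# Made-up listing of one upload directory.
listing = [
  "beach.jpg",
  "beach-150x150.jpg",
  "beach-300x200.jpg",
  "dune-1024x768.png",
  "dune.png"
]

puts originals(listing).inspect
```

One caveat with this approach is that an original that really is named something like map-800x600.jpg would be filtered out as well.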

I was used to Ruby's system command, but system only returns true or false (whether the command succeeded), not the command's output, so I could not drop the result into a variable. And then I came upon my new favorite rubyism: backticks. These were great. They work just like backticks in bash/zsh scripting, so they were quite intuitive. They run a system command and put its output into a string, which you can split, strip, or do whatever you need with. This let me build a find command that I could use to get over the multiple files issue.
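The difference between system and backticks can be sketched like this. The commands are placeholders and assume a Unix-like shell with echo available.

```ruby
# system runs the command and only reports whether it succeeded;
# the command's own output goes straight to the terminal (here: /dev/null).
succeeded = system("echo hello > /dev/null")

# Backticks run the command and return its standard output as a String.
output = `echo hello`

# That string can be split and stripped like any other.
lines = `echo one; echo two`.split("\n").map(&:strip)

puts succeeded.inspect
puts output.inspect
puts lines.inspect

# The exit status of the last command is still available through $?.
puts $?.success?
```

Note that backticks interpolate #{...} just like double-quoted strings, so anything interpolated into the command ends up in the shell unescaped.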

I also needed to reformulate how the images were called within the text. WordPress adds a caption (which I almost always used), and the exporter used Markdown's reference link syntax, which I don't particularly like. So I wanted to rework that part of the source posts. Here's the part of the script that did that.

def clean_the_content( entry_block )
  pictures_pattern = /\A(\[\!\[(.*?)\])(.*?\z)/
  file_pattern = /\A\s*\[\]:(.*?\/(.*?))\z/
  delete_lines = []
  for para in entry_block do
    if para[pictures_pattern]
      pict_line = $1
      caption = $2
      rest_of_it = $3
      picture_to_copy = false
      next if rest_of_it[/\A\(\{\{/]
      p_index = entry_block.index para
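      next_para = entry_block[p_index + 1]
      # Compare literally so captions containing regex metacharacters still match,
      # and skip the check when the picture is the last paragraph.
      delete_lines << next_para if next_para && next_para.start_with?(caption)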
      if entry_block[(p_index + 5)] =~ file_pattern
        full_pict_ref = $1.strip
        picture_to_copy = $2.split("/").last
        delete_lines << entry_block[p_index + 5]
        picture_to_copy = copy_over_pictures( picture_to_copy )
      end
      if picture_to_copy
        entry_block[p_index] = "#{pict_line}({{ site.url }}{{ site.root }}{{ site.images_dir }}/{{ page.date | date: \"%Y\" }}/#{picture_to_copy})][#{full_pict_ref}]"
      end
    end
  end
  delete_lines.each{ | line | entry_block.delete( line ) }
  return entry_block
end

def copy_over_pictures( picture_name )
  location = `find #{@pictures_pull_dir} -name '#{picture_name}*' -type f | sort -nr`
  if location != ""
    location = location.split("\n").first.strip  
    picture_name = location.split("/").last
    # mkdir -p creates the year directory only if it is missing.
    `mkdir -p #{@pictures_push_dir}/#{@publish_year}`
    `cp #{location} #{@pictures_push_dir}/#{@publish_year}/#{picture_name}`
    return picture_name
  else
    `echo "Error importing to #{@publish_year}: no file found matching #{picture_name}" >> cleaner-errors.log`
    return false
  end
end

That was it. Just a few regexes, a few deletes, and a few system commands.
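To see what the two patterns from clean_the_content capture, here is a small sketch that runs them against two made-up lines. The sample lines are assumptions shaped to satisfy the patterns; the real exporter output may look different.

```ruby
# The two patterns, copied from clean_the_content.
pictures_pattern = /\A(\[\!\[(.*?)\])(.*?\z)/
file_pattern = /\A\s*\[\]:(.*?\/(.*?))\z/

# Hypothetical lines; the URL is a placeholder.
image_line = "[![Sunset over the dunes][1]][2]"
ref_line = " []: http://example.com/files/2012/05/sunset.jpg"

pict = pictures_pattern.match(image_line)
ref = file_pattern.match(ref_line)

pict_line = pict[1]                  # the opening "[![caption]" part
caption = pict[2]                    # just the caption text
rest_of_it = pict[3]                 # whatever follows on the line
full_pict_ref = ref[1].strip         # the full URL of the picture
file_name = ref[2].split("/").last   # the bare file name

puts pict_line
puts caption
puts rest_of_it
puts full_pict_ref
puts file_name
```

Using match and indexing into the result avoids relying on the $1, $2, $3 globals, which get overwritten by the next regex match.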

Step 4: Put it all Together && See What We Have

After I built the script and tested it on a few of the more recent posts, I needed to run it on the entire directory of posts. I kept the exported posts in a separate directory outside the repo, so if anything got screwed up by the script I could just rerun it. I ended up having to do that a couple of times, as I had to tweak my original version of the script a bit to arrive at what is presented above.

rm -rf _posts && cp ~/Downloads/jekyll-export/_posts . -R && git add _posts && ./cleaner.rb

One of the first things I realized was that I needed to add the _posts dir to the repo again, because when you run Jekyll it calls git ls-files. So I ran the above command until I was happy with the results, and then I finally ran this command.

rm -rf _posts && cp ~/Downloads/jekyll-export/_posts . -R && git add _posts && ./cleaner.rb && rake site:publish

That was it! The full cleaner script can be found in this gist. If you want to use it, download it into the root of your Jekyll site, chmod +x cleaner.rb, and you will be all set.

Lessons Learned.

Testing of scripts is huge. I still don't really know how to do this but I tested it with sample data and was prepared for things to go wrong.
Backticks. FTW.
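
On the testing point, one possible starting place is a small unit test with Minitest, which ships with standard Ruby installs. The sketch below tests a hypothetical first_paragraph helper. That helper is not part of the cleaner script; it only stands in for the kind of small function that is easy to check with sample data.

```ruby
require "minitest/autorun"

# Hypothetical helper: return the first non-empty paragraph of a post body.
def first_paragraph(body)
  body.split(/\n\s*\n/).map(&:strip).reject(&:empty?).first.to_s
end

class FirstParagraphTest < Minitest::Test
  def test_returns_first_paragraph
    body = "\n\nHello there.\n\nSecond one."
    assert_equal "Hello there.", first_paragraph(body)
  end

  def test_empty_body_gives_empty_string
    assert_equal "", first_paragraph("")
  end
end
```

Saving this as a file and running it with ruby executes both tests and prints a short summary.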

~ # ~