An Introduction to Scraping with Hpricot

Posted about 7 years back at zerosum dirt(nap) - Home

For one of my hobby projects, I’ve been building a comic book release schedule webapp in Ruby. Obviously, a large part of that involves locating data sources for comic book publishers and importing those sources. Unfortunately, none of the major publishers have seen fit to make their release schedules available in RSS or Atom or any other structured format for that matter. Sigh.

Fortunately, all is not lost. With the Hpricot gem and a little scraping know-how, we can overcome almost any parsing obstacle, as long as the data is in a somewhat predictably arranged state. Let’s see how it works…

For our example, we’ll consider DC Comics, home of Superman, Batman, Aquaman, and… Super-Chief (Apache Chief? No, he’s different). DC makes their weekly release schedule available through their website at this URL. That’s nice and convenient. But it’d certainly be more convenient if they had a feed available. (If they do have a feed available, hidden deep within their website, and I haven’t found it, please let me know!)

As we click through to the next and previous weeks, it becomes pretty clear that passing the dat=<year><month><day> parameter gives you the appropriate listing. Note that they display a month at a time, so all you really have to do is ask for dat=<year><month>01 every time. We’re going to build a little scraper that just grabs the current month’s books, but armed with the knowledge of how this works, you should find grabbing 3-4 months’ worth of books at a time to be no challenge whatsoever (comic book publishers usually solicit about 3 months in advance).
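
Grabbing several months at once mostly comes down to generating one URL per month. Here is a minimal sketch of that idea; month_urls is a hypothetical helper, and the base URL and the '%y%m01' date pattern are simply borrowed from the scraper in this post, so adjust them if the site expects something else.

```ruby
require 'date'

# Build one listing URL per month, starting with the month that contains `start`.
# NOTE: month_urls is a hypothetical helper; base_url and the '%y%m01' pattern
# are copied from the scraper in this post.
def month_urls(start, months = 3, base_url = "http://www.dccomics.com/comics/")
  first_of_month = Date.new(start.year, start.month, 1)
  (0...months).map do |offset|
    # Date#>> advances by whole months and handles the year rollover.
    "#{base_url}?dat=#{(first_of_month >> offset).strftime('%y%m01')}"
  end
end

puts month_urls(Date.today)
```

Each URL can then be fed through the same parsing loop, one month at a time.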

OK. So now let’s take a peep at the structure of the document itself. We can do this by just viewing source in a browser. It seems that every comic listed in the release schedule has a link to a full description of the issue, with cover art previews, a short synopsis, writers/artists listed, etc. And every one of those links seems to have a CSS class of ‘contentLink’. Oh, lucky day.

This is certainly starting to smell like a job for Hpricot, the super fast (and delightful!) HTML parser for Ruby, written by the enigmatic why the lucky stiff. Gem install that sucker!

gem install hpricot

Now let’s fire up IRb and chew on some delicious Ruby syntax:

require 'hpricot'
require 'open-uri'

URL_DC = "http://www.dccomics.com/comics/"

# Define the callback first so it exists by the time the loop calls it.
def read_comic(title, url)
  puts "#{title} - #{url}"
end

doc = Hpricot(open("#{URL_DC}?dat=#{Time.now.strftime('%y%m01')}"))
books = (doc/"a.contentLink")
books.each { |book| read_comic(book.innerHTML.strip,
  "#{URL_DC}#{book.attributes['href']}") }

Run that, and you’ll get a list of stuff that looks like this:

THE ALL-NEW ATOM #12 – http://www.dccomics.com/comics/?cm=7447
BATMAN: TURNING POINTS – http://www.dccomics.com/comics/?cm=7251

Each output line lists a title with a URL, for each comic solicited in a given month. How does it work? Well, first we open the URL and feed it into Hpricot. Then the line books = (doc/"a.contentLink") uses a CSS selector to yank out just the elements that match the selector. We could have also used XPath-style syntax to accomplish the same thing. Anyway, those elements we’re selecting are all the links to comics being released this month. Hpricot hands us an array of these elements, and then we iterate over them, calling the read_comic function and passing it the title (the innerHTML of the link, stripped of excess whitespace), and the URL (an absolute link built from the href attribute of the link).
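
String concatenation works for building the absolute URL here because every href on the page is a bare query string like ?cm=7447. As a hedged alternative, the standard library can resolve an href against the page URL the way a browser would. In this sketch absolute_url is a hypothetical helper, and the second href is made up to show a case where concatenation would go wrong.

```ruby
require 'uri'

# Resolve an href against the page it came from, the way a browser would.
# absolute_url is a hypothetical helper, not part of Hpricot.
def absolute_url(page_url, href)
  URI.join(page_url, href).to_s
end

# A query-only href, as seen on the release schedule page.
puts absolute_url("http://www.dccomics.com/comics/", "?cm=7447")
# A made-up root-relative href, which plain concatenation would mangle.
puts absolute_url("http://www.dccomics.com/comics/", "/graphic_novels/?gn=1")
```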

Next, let’s beef up the read_comic function to do something useful. We’ll have it make another remote connection, this time to the URL specified for the detailed comic description, parse out the talent, description, and some other information about the issue and stuff it into a model object. But first let’s examine the source of one of those pages. The Trials of Shazam! #7 should do nicely.

We note in our examination of the page source that the data we want to scrape is all contained in span tags, with different classes, as listed below. Note that this time we’ve chosen to use XPath-style syntax for the selectors. Note also that the span tag with class="display_copy" appears twice. The first time, it contains what appears to be the description of the issue, and the second time it lists the publication date. So instead of returning a single element, display_copy gets an Array of 2 (or possibly more) elements.

def read_comic(title, url)
  doc = Hpricot(open(url))
  display_talent = (doc/"span[@class=display_talent]").innerHTML
  display_copy = (doc/"span[@class=display_copy]") # 2 elements
  puts "====="
  puts "title: #{title}"
  puts "talent: #{display_talent}"
  puts "copy (0): #{display_copy[0].innerHTML}"
  puts "copy (1): #{display_copy[1].innerHTML}"
end

Now we’re iterating through each book from the remote source, and dumping out its title, the writer and artist responsible for it, a quick synopsis, and some other information (publication date, etc). Alright. If we just had a Comic model in our application, we could be somewhere!

So let’s make one. In fact, let’s do it in Ruby, with ActiveRecord. First the schema:

DROP DATABASE IF EXISTS comics;
CREATE DATABASE comics;
USE comics;

CREATE TABLE comics (
  id int(11) NOT NULL AUTO_INCREMENT,
  name VARCHAR(255),
  publisher VARCHAR(255),
  talent VARCHAR(255),
  description TEXT,
  published_on DATETIME,
  PRIMARY KEY (id)
);

Load this up and then add the following code to the top of your comics scraper. In fact, put it in a file called comics.rb so you can execute it on the command line.

require 'active_record'

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :username => 'root',
  :password => '',
  :database => 'comics')

class Comic < ActiveRecord::Base
end

Now we’ve established a connection to the database via ActiveRecord and defined a Comic model that inherits from ActiveRecord::Base, thus wrapping our database schema and giving us some handy getters and setters. Our next step will be to trade in the read_comic function in favor of an import class method on the Comic model.

class Comic < ActiveRecord::Base
  def self.import(title, url)
    doc = Hpricot(open(url))
    display_talent = (doc/"span[@class=display_talent]").innerHTML
    display_copy = (doc/"span[@class=display_copy]") # 2 elements

    comic = Comic.new(:name => title)
    comic.publisher = "DC"
    comic.talent = display_talent
    comic.description = display_copy[0].innerHTML
    comic.published_on = Date.parse(display_copy[1].innerHTML.
      sub('on sale', ''))

    comic
  end
end

When Comic.import receives a title and a URL, it makes a connection to the URL specified and fires up Hpricot. It uses Hpricot to parse out the information we’re looking for, and then creates an instance of the Comic class. We set the talent and the description (the first of the display_copy spans), and then parse the date out of the second display_copy span.
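
The date step is the most fragile part of the import, so it can help to see it in isolation. This is a rough sketch only: parse_on_sale is a hypothetical helper, and the sample string assumes the span reads something like "on sale June 13, 2007", which may not match the live page exactly.

```ruby
require 'date'

# Turn the text of the second display_copy span into a Date.
# ASSUMPTION: the span reads something like "on sale June 13, 2007"; the exact
# wording on the live page may differ. parse_on_sale is a hypothetical helper.
def parse_on_sale(text)
  return nil if text.nil?
  cleaned = text.sub(/on sale/i, '').strip
  Date.parse(cleaned)
rescue ArgumentError
  nil # unparseable date: let the caller decide what to do
end

p parse_on_sale("on sale June 13, 2007")
p parse_on_sale("on sale") # nothing left to parse, so nil
```

Returning nil instead of raising keeps one odd page from aborting the whole import.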

We’ll remove all the output from there and put it in the book loop, since it’s clearly not the job of the model code to be rendering a view of any sort. Our new book loop will use Comic.import on each element of the books Array, creating the model, saving it, and then printing out some attributes. Here’s the final code for comics.rb:

require 'rubygems'
require 'active_record'
require 'open-uri'
require 'hpricot'

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :username => 'root',
  :password => '',
  :database => 'comics')

URL_DC = "http://www.dccomics.com/comics/"

class Comic < ActiveRecord::Base
  def self.import(title, url)
    doc = Hpricot(open(url))
    display_talent = (doc/"span[@class=display_talent]").innerHTML
    display_copy = (doc/"span[@class=display_copy]") # 2 elements?

    comic = Comic.new(:name => title)
    comic.publisher = "DC"
    comic.talent = display_talent
    comic.description = display_copy[0].innerHTML
    comic.published_on = Date.parse(display_copy[1].innerHTML.
      sub('on sale', ''))

    comic
  end
end

doc = Hpricot(open("#{URL_DC}?dat=#{Time.now.strftime('%y%m01')}"))
books = (doc/"a.contentLink")
books.each do |book|
  comic = Comic.import(book.innerHTML.strip, 
    "#{URL_DC}#{book.attributes['href']}")
  if comic.save
    puts "====="
    puts "name: #{comic.name}"
    puts "description: #{comic.description}"
    puts "release date: #{comic.published_on}"
  else
    puts "uh-oh! we should handle errors!"
  end
end

And here’s the final result:

name: TRIALS OF SHAZAM! #7 (OF 12)
description: Freddy must find Hercules for his next trial,
which is considerably more difficult than he expected,
since Herc is behind bars!
release date: 2007-06-13

Obviously we can do a lot more with this. We can build a series model, that has_many issues or episodes. We can build a publisher model. We can suck in the images and use RMagick to generate thumbnails. We can discriminate between graphic novels, trade paperbacks, and issues of a standard series book. We can roll this into a Rails application, and allow the results to be browsable, users to add comics to their pull lists, create collections, comment on them, rate them, and so on. Actually, that’s exactly what I’m working on for my hobby project (if you’re interested, email me and I’ll let you take a look — I’m hoping to release it relatively soon-ish).

To go further with scraping, we’ll need to pay particular attention to handling errors, because it’s an inexact science and, since we have no hard format, things are subject to change or break in weird ways. That’s the obvious downside to scraping. But when you have no other alternative for automating mass import of data like in this scenario, it’s certainly a good thing to know how to do.
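
One simple way to start on error handling is to collect failures per URL instead of letting the first bad page stop the run. The sketch below uses a stand-in importer so it runs without a network or database; import_all is a hypothetical helper, and in this post's code the callable could wrap Comic.import.

```ruby
# Import every URL, collecting failures instead of aborting on the first one.
# `importer` is any callable that takes a URL and returns a record.
# import_all is a hypothetical helper.
def import_all(urls, importer)
  imported = []
  failed = {}
  urls.each do |url|
    begin
      imported << importer.call(url)
    rescue StandardError => e
      # Network errors, missing spans (NoMethodError on nil), bad dates, etc.
      failed[url] = "#{e.class}: #{e.message}"
    end
  end
  [imported, failed]
end

# Stand-in importer: pretend the second page has changed its markup.
fake = lambda do |url|
  raise ArgumentError, "invalid date" if url.end_with?("cm=2")
  "comic from #{url}"
end

imported, failed = import_all(["?cm=1", "?cm=2", "?cm=3"], fake)
puts "imported #{imported.size}, failed #{failed.size}"
failed.each { |url, reason| puts "  #{url}: #{reason}" }
```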

If you want to learn more, _why’s Hpricot site is chock full of useful information, and you may also want to check out scRUBYt, which combines Hpricot and WWW::Mechanize into a full-on web scraping “toolkit”.

Episode 51: will_paginate

Posted about 7 years back at Railscasts

In edge rails (soon to be Rails 2.0), the built-in pagination has been moved into a plugin: classic_pagination. I recommend jumping over to the will_paginate plugin as shown in this episode.

Free-for-all: Tab Helper (Summary)

Posted about 7 years back at The Rails Way - all

The first RailsWay free-for-all came off quite well. Many of you posted your favorite solutions to the problem of tab-based navigation, as posed by Nate Morse.

Jamis’ Take

Of all the solutions posted, my personal favorite was the pragmatic and simple CSS-based solution given by Mr. eel (Nate Morse came to the same solution independently):

I take a completely different approach. I ID the body of the page with the name of the current controller. Then I use a descendant CSS selector to highlight the current tab based on the body id and an id given to each link. I don’t bother with replacing the current tab link with a span. If the user wants to click that link again… then it’s the same as refreshing. Totally up to them.

With html like:

<body id="users">
  <ul>
    <li><a href="/users" id="usersNav">Users</a></li>
    <li><a href="/comments" id="commentsNav">Comments</a></li>
    <li><a href="/posts" id="postsNav">Posts</a></li>
  </ul>

I would use CSS like this

#users #usersNav,
#comments #commentsNav,
#posts #postsNav {
  background:red;
  font-weight:bold;
}

What a great approach. Although I would make the choice of the body ID explicit (rather than depending on the controller name), it is otherwise really nice. It shrugs off the whole issue of “should the current tab be a link” by saying it just doesn’t matter—every tab is always a link. Such pragmatism gets right to the heart of the Rails Way: implement just what matters, and nothing more.

Koz’s Take

A number of solutions relied on tightly coupling the controller and tabs. While this may seem like a time-saver at first, I believe that it’s unlikely to remain useful as your application grows. You’ll find yourself moving functionality into strange locations in order to make your tabs highlight correctly.

The problem is amplified with a RESTful application where your choice of controllers is dictated by the resources that you’re managing. You may have a list of comments in several different sections of your application, but not want to highlight the ‘comment’ tab whenever you display them.

Personally, I prefer the really simple approach of a before filter and a navigation partial.

def set_current_tab
  @current_tab = :people
end
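
The navigation partial is not shown, but its core is a comparison between each tab and the value set by the filter. A minimal sketch in plain Ruby, where tab_class and the "current" class name are hypothetical and the tab names are only examples:

```ruby
# Decide the CSS class for one tab in a navigation partial.
# tab_class and the "current" class name are hypothetical; the value of
# @current_tab set by the before filter would be passed in as current_tab.
def tab_class(tab, current_tab)
  tab == current_tab ? "current" : nil
end

[:people, :comments, :posts].each do |tab|
  css = tab_class(tab, :people)
  puts css ? %(<li class="#{css}">#{tab}</li>) : "<li>#{tab}</li>"
end
```

Because the filter sets the tab explicitly, the highlight no longer depends on which controller happens to render the page.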

Thanks, everyone, for your submissions!

Episode 50: Contributing to Rails

Posted about 7 years back at Railscasts

The best way to contribute to the Rails project is to submit a patch. This episode shows how to do exactly that. There's also a surprise at the end that you don't want to miss!

Flex component development mindset

Posted about 7 years back at work.rowanhick.com

Having just picked up Programming Flex 2, I was dismayed that components seem (almost) to be an afterthought. Yes, they're documented, with 2 chapters dedicated to them, but their importance from an architecture perspective seems to be glossed over. After a day of coding that left keyboard-sized imprints in my forehead, I'm going to say this: definitely develop your Flex applications with a component-first approach (well, I will anyway!). Don't bother with a cruddy component-less prototype. You will save far more time in the long run designing your app from the ground up with components in mind and skipping the ugly component-less prototype step.

Why do you want to do this? Components are a VeryGoodThing™. Components are so easy to use that it's almost criminal to start building your first rough cut without them. One of the biggest benefits is how well you can capture your business interface needs within the components, almost turning your MXML and ActionScript into a domain-specific language. For example, in an eCommerce application, the hierarchy might look like this:

AppStack ( < View Stack )
- Dashboard ( < Canvas )
- OrdersStack ( < View Stack )
--- OrdersList ( < Panel )
--- OrderView ( < Panel )
- CustomersStack ( < View Stack )
- InventoryStack ( < View Stack )
- ReportsStack ( < View Stack )

With examples of public methods:

OrdersList.refreshOrders()
OrdersList.switchToCustomersOrders(customerID:Number)
OrderView.changeToEditMode()
OrderView.showBackOfficeInformation()

This takes no imagination to understand the application. With a little bit of foresight, it reduces your prototyping and development time, as you can translate your wireframes into components pretty much from the get-go. Keep your readers subscribed: next up we'll look at an actual component implementation, from pencil sketch through to working code.

Game Full ——– on like you’ve never seen it

Posted about 7 years back at work.rowanhick.com

An overexcited Brit can lower the tone on any web app. Today's race in the America's Cup with ETNZ (go boys!!!) vs Alinghi saw some dramatic action. All of the amazing graphics and real-time communication do nothing to convey the sheer drama of the event. One of the text commentators got a little excited in the commentary; it's amazing how much atmosphere can be generated with a few random characters. Click the thumbnail link and look at the text at the bottom to see what I mean. What a race (oh, and we won!!)

Episode 49: Reading the API

Posted about 7 years back at Railscasts

The Rails API docs are very useful but can be difficult to read. This episode will give some tips on reading the docs and mention a few alternative sites for accessing the API. Update: sorry about the broken movie, it should work now.

Code Digest #2

Posted about 7 years back at Revolution On Rails

When you program for a living, you write lots of code. There is often some code that you are fond of. We started the Code Digest series to present such code written by the RHG developers. We encourage other teams and individual developers to share similar snippets in their blogs so we all can learn from each other and become better rails developers.


Mai Nguyen


Simple AJAX error messaging

When simple javascript validation is not enough and your product managers insist on ajax for error messaging, guess what: you have to implement ajax error messaging. One such case is 'username availability'. This simple example displays an error message on blur of the username field. It doesn't hit the server unless the value of the field is well-formed (at least that will save you *some* network traffic ...)

Controller would look something like:
class FooController < ::ApplicationController

  def is_username_taken
    # Nothing to report when the username is still free.
    unless Person.find_by_username(params[:username])
      render :nothing => true
      return
    end

    message_html_id = params[:message_html_id]
    render :update do |page|
      page.replace_html message_html_id, "Username is not available, please choose another."
      # if you want error styling associated with the failed state
      page << "document.getElementById('" + message_html_id + "').className = 'failed'"
    end
  end

end



View would look something like:
<dl>
<dt><label>Username: </label></dt>
<dd id="username_input">
<%= text_field :person, :username, :class => "input", :type => 'text', :id => 'person_username_input' %>
</dd>
<dd id="username_messaging"><%= @person.errors.on(:username)%></dd>
</dl>



Your javascript would look something like:
var UserNameUnique = Class.create();
UserNameUnique.prototype = {
  initialize: function(field_id, message_id) {
    this.message_id = message_id;
    // getElementById returns null (not undefined) when the element is missing
    this.field = document.getElementById(field_id);
    if (!this.field) { return; }
    // Observe blur on field
    Event.observe(this.field, 'blur', this.checkName.bindAsEventListener(this));
  },
  checkName: function() {
    if (this.field) {
      var name = this.field.value;
      // don't hit server unless username is well-formed
      var re = /^[A-Za-z][a-zA-Z0-9]{2,25}$/;
      if (name.match(re)) {
        new Ajax.Request('/foo/is_username_taken', {asynchronous: true, parameters: 'username=' + name + '&message_html_id=' + this.message_id});
      }
    }
  }
};

new UserNameUnique('person_username_input', 'username_messaging');


You could also use the Rails helper observe_field instead of writing your own javascript, but there is more flexibility in writing your own javascript (such as the need to check well-formed values before hitting the server).
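
Since client-side checks can be bypassed, the same well-formedness rule is worth repeating on the server before the database lookup. A small sketch of that rule in Ruby, assuming the intent of the pattern is one letter followed by 2 to 25 letters or digits; well_formed_username? is a hypothetical helper name.

```ruby
# Server-side twin of the client-side pattern: one letter, then 2 to 25
# letters or digits. \A and \z anchor the whole string in Ruby.
# well_formed_username? is a hypothetical helper name.
def well_formed_username?(name)
  /\A[A-Za-z][A-Za-z0-9]{2,25}\z/.match?(name.to_s)
end

%w[bob b0b 1bob ab].each do |candidate|
  puts "#{candidate}: #{well_formed_username?(candidate)}"
end
```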


Mark Brooks


Creating the options list for select tags can be annoying, especially if there is a requirement that the current "option" be displayed as the default option. With javascript, it isn't such a big deal, but one must take care of the degraded case as well.

In any event, the base object is an array of two-item hashes representing both the name of a video channel and the number of videos in that channel. By way of example:
@channelinfo = [
{"name"=>"Fitness","count"=>4},
{"name"=>"Diabetes", "count"=>1},
{"name"=>"Pregnancy", "count"=>11}
]



The option values are the name fields of each hash. The key is, we want the current option value to be the first one in the option list, and the rest to be in alpha order by name.

So let's create an option builder from the bottom up.

First, we only need the channel names for the select tag. This gives us what we need:
options = @channelinfo.collect { |channel| channel['name'] }


That gives us a list of channel names. Using the list above, it would be ["Fitness", "Diabetes", "Pregnancy"].

However, we need to make sure that the current channel is first in the list. Let's say that current channel is Diabetes. Since we already have that value in @current_channel, we can exclude it from our options list:
options = @channelinfo.collect do |channel|
  channel['name']
end.reject do |channelname|
  channelname == @current_channel
end



Now the resulting list will look like ["Fitness", "Pregnancy"]. However, we still need the current channel to be at the front of the list, so we add it back as the first element:
options = @channelinfo.collect do |channel|
  channel['name']
end.reject do |channelname|
  channelname == @current_channel
end.unshift(@current_channel)



Now our options list looks like ["Diabetes", "Fitness", "Pregnancy"].

Two points. First, we want to make sure that, while the current channel is at the head of the list, the rest of the list items are in alpha order, so we add a sort directive in the appropriate place. Also, it is probably a good idea to exclude any nils that might pop up in the collection on the original data object, since it comes from a service and it is possible, however unlikely, that a hash might get spit out without a 'name' property. The more rigorous code looks like this now:
options = @channelinfo.collect do |channel|
  channel['name']
end.compact.reject do |channelname|
  channelname == @current_channel
end.sort.unshift(@current_channel)



Now we can generate our options list using:
options.collect do |channelname|
  "<option>" + channelname + "</option>"
end.join(",")


although if you want to, you can simply combine the whole thing as follows:
@channelinfo.collect do |channel|
  channel['name']
end.compact.reject do |channelname|
  channelname == @current_channel
end.sort.unshift(@current_channel).collect do |channelname|
  "<option>" + channelname + "</option>"
end.join(",")



to get the same result, and eliminate the unnecessary options binding.
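
To try the ordering logic outside a controller, here is the same pipeline wrapped in a method and run against sample data. channel_options is a hypothetical name, and the extra hash without a 'name' key is added only to exercise the compact step.

```ruby
# Same pipeline as above, wrapped in a method so it can be run on its own.
# channel_options is a hypothetical name.
def channel_options(channelinfo, current_channel)
  channelinfo.collect { |channel| channel['name'] }
             .compact
             .reject { |name| name == current_channel }
             .sort
             .unshift(current_channel)
end

channelinfo = [
  {"name" => "Pregnancy", "count" => 11},
  {"name" => "Fitness",   "count" => 4},
  {"count" => 2}, # no 'name': dropped by compact
  {"name" => "Diabetes",  "count" => 1}
]

p channel_options(channelinfo, "Diabetes")
```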


Todd Fisher


Here is an extension that tries to execute a block multiple times before giving up. Handy for network operations and such.

module Kernel
  def could_fail(retries = 3, &block)
    tries = 0
    begin
      yield
    rescue Exception
      tries += 1
      if tries < retries
        retry
      else
        raise
      end
    end
  end
end


Call it as result = could_fail { some_operation_that_might_fail_first_time }
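
To watch the retry behavior without a real network call, the sketch below uses a block that fails twice and then succeeds. It is a standalone variant under a hypothetical name, with_retries, and it differs from the extension above in one deliberate way: it rescues StandardError rather than Exception, so Ctrl-C still stops the program.

```ruby
# Standalone variant of the helper above for experimenting in a script.
# with_retries is a hypothetical name; it rescues StandardError rather than
# Exception so that Ctrl-C (Interrupt) still stops the program.
def with_retries(retries = 3)
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    retry if tries < retries
    raise
  end
end

attempts = 0
result = with_retries do
  attempts += 1
  raise IOError, "flaky" if attempts < 3
  "ok after #{attempts} attempts"
end
puts result
```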

Todd Fisher


When you are writing a script that needs to auto-install gems, you are likely to run into the problem that it stops because there are multiple platform versions available (jruby, win32, etc.) and the gem command expects you to pick one that matches your platform. This patch forces the use of a specific platform so no user interaction is needed. Original idea from Warren, updated to support rubygems 0.9.4.

module GemTasks
  def setup
    return if $gems_initialized
    $gems_initialized = true
    Gem.manage_gems

    # see => http://svn.bountysource.com/fishplate/scripts/debian_install.pl
    Gem::RemoteInstaller.class_eval do

      alias_method :find_gem_to_install_without_ruby_only_platform, :find_gem_to_install

      def find_gem_to_install( gem_name, version_requirement, caches = nil )
        if caches # old version of rubygems used to pass a caches object
          caches.each {|k,v| caches[k].each { |name,spec| caches[k].remove_spec(name) unless spec.platform == 'ruby' } }
          find_gem_to_install_without_ruby_only_platform( gem_name, version_requirement, caches )
        else
          Gem::StreamUI.class_eval do

            alias_method :choose_from_list_without_choosing_ruby_only, :choose_from_list

            # Pick the "(ruby)" entry automatically instead of prompting.
            def choose_from_list( question, list )
              result = nil
              result_index = -1
              list.each_with_index do |item,index|
                if item.match(/\(ruby\)/)
                  result_index = index
                  result = item
                  break
                end
              end
              return [result, result_index]
            end

          end

          find_gem_to_install_without_ruby_only_platform( gem_name, version_requirement )
        end
      end

    end
  end
end



Val Aleksenko


Since class-level instance variables are not inherited by subclasses, you need to take some extra steps when writing a plugin that uses them. Depending on the number of such variables, I have been either defining a method instead of class-level instance variables or forwarding them to subclasses.
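
A tiny illustration of the underlying behavior may help. The class names and the :replica_one value are hypothetical; the last step mirrors the "define a method instead" approach in spirit, not the plugin's actual code.

```ruby
# A class-level instance variable belongs to one class object only.
class PluginBase
  @readonly_model = :replica_one # stored on PluginBase itself

  class << self
    attr_reader :readonly_model
  end
end

class PluginChild < PluginBase; end

on_base   = PluginBase.readonly_model   # the value set above
on_child  = PluginChild.readonly_model  # nil: the subclass has its own, unset, variable
p on_base, on_child

# Defining a singleton method instead: class methods ARE inherited.
PluginBase.define_singleton_method(:readonly_model) { :replica_one }
after_fix = PluginChild.readonly_model
p after_fix
```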

Example #1. acts_as_readonlyable needs to provide a single class level instance variable. Defining a method instead.

def acts_as_readonlyable(*readonly_dbs)
  define_readonly_model_method(readonly_models)
end

def define_readonly_model_method(readonly_models)
  (class << self; self; end).class_eval do
    define_method(:readonly_model) { readonly_models[rand(readonly_models.size)] }
  end
end



Example #2. acts_as_secure uses a bunch of variables. Forwarding them to subclasses.

[PLUGIN RELEASE] Metrics

Posted about 7 years back at Revolution On Rails


From Jeffrey Damick


Introduction

This gem collects metrics for controllers, database queries, and specific blocks of code or methods. It is designed to be light-weight and have minimal impact on production builds while providing performance indicators of the running application.



Disclaimer

This software is released to be used at your own risk. For feedback please drop us a line at rails-trunk [ at ] revolution DOT com.
Using this plugin should not be your first step in application optimization/scaling or even the second one.



Example

class SomeClassToTest
  collect_metrics_on :my_method

  def my_method(blah = nil)
    true
  end
end


Output:
[ERROR] [2007-06-21 23:21:19] [trunk] [Metrics]|[76716]|[MysqlAdapter.log]|0.012727|args=["root localhost trunk_test", "CREATE DATABASE `trunk_test`"]
[ERROR] [2007-06-21 23:19:56] [trunk] [Metrics]|[35158]|[Request to [Test::SomeControllerWithMetricsId]]|0.001373|action = index|path =some?
[ERROR] [2007-06-21 23:19:56] [trunk] [Metrics]|[33676]|[SomeClassToUseModuleMixin.another_method]|0.000020|args=["also"]


for more samples and test cases see test/metrics_test.rb



Usage

The metrics are written to: logs/<environment>_metrics.log

Configuration can be updated in metrics/config/metrics.yml. You may copy this file to your RAILS_ROOT/config/metrics.yml and customize it for your application; the RAILS_ROOT location will be checked first.
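
The lookup order described here amounts to "first existing file wins". The plugin's real lookup code is not shown in this post, so the following is only a sketch of that idea: metrics_config_path is a hypothetical helper, and both root directories are passed in explicitly.

```ruby
require 'tmpdir'
require 'fileutils'

# Return the first metrics.yml that exists, preferring the application's copy.
# ASSUMPTION: rails_root and plugin_root are passed in explicitly; this is not
# the plugin's own implementation.
def metrics_config_path(rails_root, plugin_root)
  candidates = [
    File.join(rails_root, 'config', 'metrics.yml'),
    File.join(plugin_root, 'config', 'metrics.yml')
  ]
  candidates.find { |path| File.exist?(path) }
end

Dir.mktmpdir do |dir|
  app    = File.join(dir, 'app')
  plugin = File.join(dir, 'metrics')
  FileUtils.mkdir_p(File.join(app, 'config'))
  FileUtils.mkdir_p(File.join(plugin, 'config'))
  FileUtils.touch(File.join(plugin, 'config', 'metrics.yml'))

  puts metrics_config_path(app, plugin) # only the plugin default exists
  FileUtils.touch(File.join(app, 'config', 'metrics.yml'))
  puts metrics_config_path(app, plugin) # the application copy now wins
end
```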



Sample metrics.yml

production:
  min_real_time_threshold: 1.0
  single_line_output: true
  some_module/test_controller: 0.0




Installation

As plugin:
script/plugin install svn://rubyforge.org/var/svn/metrics/trunk/vendor/plugins/metrics



License

metrics is released under the MIT license.



Support

The plugin RubyForge page is http://rubyforge.org/projects/metrics

Ticketish Email Integration

Posted about 7 years back at benmyles.com - Home

Note: Also posted on the Integral Impressions blog.

One of the central features of Ticketish is email integration. Users can create new tickets by email and add comments to tickets by replying to any email that has the ticket id in the subject. I'll show how this integration is achieved, from setting up Postfix to writing the Ticket and Comment handlers.

Each project in Ticketish receives its own email address. The email address is in the format "[project.permalink]@[account.subdomain].ticketi.sh". Sending an email to the project's email address creates a new ticket. Once a new ticket is created, a notification email is sent. Replying to the notification email (or any further emails regarding the ticket) adds a comment to the ticket. Further, the user can add attachments to any emails and have them show up as attachments to a comment.

So, how does it work? Let's start with the big picture. The mail transfer agent Postfix runs on the same server as the Ticketish mongrels. Postfix receives mail for all ticketi.sh subdomains. When Postfix receives a new message, it fires up script/runner for Ticketish and passes the message in via STDIN. At this point we don't care if the project exists or not, that's for Rails to figure out. Postfix just needs to send any emails addressed to *.ticketi.sh to script/runner.

Let's dig a little deeper, and see how to accomplish this behavior with Postfix.

The first Postfix configuration file we'll look at is main.cf. I won't paste the whole file, just the interesting parts that differ from the default main.cf file Postfix comes with.


/etc/postfix/main.cf

myhostname = ticketi.sh
mydomain = ticketi.sh
mydestination = $myhostname, localhost.$mydomain,
localhost, regexp:/etc/postfix/mydestination


This is mostly straight-forward. We set the hostname and domain to "ticketi.sh". The interesting part is that we're using a regular expression lookup table with mydestination. The mydestination setting means "this is the final destination for the following domains". Obviously, we want to be the final destination for all Ticketi.sh domains, so we use a regular expression.


/etc/postfix/mydestination

/.*\.ticketi\.sh/ OK


Pretty simple. You need to make sure you generate the database file from that lookup table by running postalias:


# postalias /etc/postfix/mydestination
# ls /etc/postfix/mydestination.db
/etc/postfix/mydestination.db
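
Before relying on the lookup pattern, it is worth checking what it actually matches. The sketch below tries it in Ruby against made-up domains; Ruby's regex engine is only an approximation of Postfix's, and the /i flag reflects my understanding that Postfix regexp tables match case-insensitively by default. The anchored variant is a hypothetical tightening, not something from the original setup.

```ruby
# The pattern from /etc/postfix/mydestination, tried against sample domains.
# The domains are made up for illustration.
unanchored = /.*\.ticketi\.sh/i
anchored   = /\A.*\.ticketi\.sh\z/i # hypothetical stricter variant

%w[acme.ticketi.sh ticketi.sh acme.ticketi.sh.example.com].each do |domain|
  loose  = unanchored.match?(domain)
  strict = anchored.match?(domain)
  puts "#{domain}: unanchored=#{loose} anchored=#{strict}"
end
```

Without anchors, a domain that merely contains ".ticketi.sh" also matches, so a pattern ending in $ may be the safer choice in the lookup table.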


Now that Postfix knows to receive mail for *.ticketi.sh, we need to tell it where to send that mail. First, we'll create a wrapper around script/runner that can receive incoming emails from Postfix.


/etc/postfix/ticketish_agent.sh

#!/bin/bash
HOME=/home/lsws /usr/bin/ruby \
/home/lsws/apps/ticketish/current/script/runner \
'data = STDIN.read; CommentHandler.receive(data) \
|| TicketHandler.receive(data)' 2>> \
/home/lsws/ticketish_agent.log


The actual Rails code in runner is pretty simple. We assign the incoming email to the data variable, and then do 'CommentHandler.receive(data) || TicketHandler.receive(data)'. This code tries to add the message as a comment first, but if the comment handler returns nil (as it will if the message isn't a comment) it'll create it as a new ticket instead.
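
The fall-through relies on each handler returning nil when the message is not its concern. Here is a toy version with lambdas standing in for the two handlers. The "[#42]" subject convention is invented for this sketch; the post only says the ticket id appears in the subject.

```ruby
# Stand-ins for the two handlers: the comment handler returns nil when the
# message is not a reply, which lets `||` fall through to the ticket handler.
# ASSUMPTION: the "[#42]" subject convention is invented for this sketch.
comment_handler = lambda do |subject|
  id = subject[/\[#(\d+)\]/, 1]
  id && "comment on ticket #{id}"
end

ticket_handler = lambda { |subject| "new ticket: #{subject}" }

route = lambda do |subject|
  comment_handler.call(subject) || ticket_handler.call(subject)
end

puts route.call("Re: [#42] Login page is broken")
puts route.call("Login page is broken")
```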

Now we need to add a service to Postfix by adding a few lines at the bottom of master.cf.


/etc/postfix/master.cf

mailman unix - n n - - pipe
  flags= user=lsws
  argv=/etc/postfix/ticketish_agent.sh


We've just called the service "mailman" and it pipes the message through to the wrapper we just created. Note the user= field. As you might guess, this is the user the process runs as. Be sure to use the user your Rails application runs under.

Finally, we need to tell Postfix to deliver all local mail to this new service (all local mail, since Postfix is entirely dedicated to Ticketish). We do this by defining the local_transport in main.cf.


/etc/postfix/main.cf

local_transport = mailman


Creating the CommentHandler and TicketHandler is pretty simple. We just use the following structure:


ticketish/app/models/comment_handler.rb

class CommentHandler < ActionMailer::Base
  def receive(email)
    # ...
  end
end


You can do whatever you like once you have the email. For example, to extract the project name and domain name from the recipient address of the email:


project_name, domain_name =
email.to.first.split("@")
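
Going one step further, the domain part can be reduced to the account subdomain described earlier ("[project.permalink]@[account.subdomain].ticketi.sh"). This is a rough sketch on plain strings; parse_recipient and the sample addresses are hypothetical.

```ruby
# Split a recipient address of the form
# [project.permalink]@[account.subdomain].ticketi.sh into its parts.
# parse_recipient and the sample addresses are hypothetical.
def parse_recipient(address)
  project_name, domain_name = address.downcase.split("@", 2)
  return nil if domain_name.nil?
  subdomain = domain_name[/\A([^.]+)\.ticketi\.sh\z/, 1]
  return nil if subdomain.nil? || project_name.empty?
  { :project => project_name, :subdomain => subdomain }
end

p parse_recipient("website@acme.ticketi.sh")
p parse_recipient("someone@example.com") # not a ticketi.sh address, so nil
```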


So there you have it. Easy email integration with Postfix and Rails. If you haven't already heard of Ticketish, take a look. It's our new simple ticketing application, currently in beta. We're still sending out beta invites, so feel free to add your name.

Episode 48: Console Tricks

Posted about 7 years back at Railscasts

The Rails console is one of my favorite tools. This episode is packed with tips and tricks on how to get the most out of the console.

Thanks Akismet!

Posted about 7 years back at zerosum dirt(nap) - Home

I’m probably a bit behind the game on this one, but huge props are due to Akismet for making my blog life just a bit more pleasant. Before it was installed last week, I was deleting hordes of comment spam every day. Today, none.

In other blog-related news, I’m still planning on moving productions over to Mephisto, but have been hard pressed for time lately. Fortunately (when I get around to it), it has Akismet support baked right in.

UPDATE: finally moved over to a new blogging platform! About time, eh?

Episode 47: Two Many-to-Many

Posted about 7 years back at Railscasts

There are two different ways to set up a many-to-many association in Rails. In this episode you will see how to implement both ways along with some tips on choosing the right one for your project.