musings about technology and software development..

A taste of Ruby on Rails, part 2

NOTE: This is a continuation of part 1.

Ruby on Rails is a great web development platform, but it has a few drawbacks that you should understand before switching from ASP.NET.

Catch #1: Library support, or lack thereof
When you're staring at Notepad and you want to support uploaded images and auto-resize them, it's not easy to figure out where to start. With .NET, you have the standard library to poke around in, and you can usually find what you need with the help of IntelliSense. With Ruby, everything starts with a Google search. After deciding among five identical-sounding choices, you download an image plugin written two years ago by a kind developer, which doesn't install on Windows because said developer only has a Mac. So you download the tarball of source files, which you can't compile without a special version of Visual C++ and which doesn't work against the latest version of Ruby you just downloaded. After it finally installs, you end up burning an afternoon debugging it. Of course, it's free. But only in terms of money -- it costs a lot of time (not spent coding, mind you), and time is more important than money (that's why you're even considering using RoR, right?).

Catch #2: Convention over Configuration
One of the key tenets of Rails is convention over configuration. What this means is that the framework has an expectation of a "right" way to do things, and if you do it that way, lots of goodness happens for you automatically. The net result is that you will find yourself making compromises just to stay on the right side of the syntactic sugar.

Catch #3: Stabilization
At Microsoft, perhaps only a third of the time is spent writing new code. Making the code functional is the easy part. Making it work in all the different user scenarios, under heavy load, with different data sets, and otherwise stabilizing the code, that's what we spend two-thirds of the time on. And what are you doing during much of this time? Debugging.

Anyone who's ever used a debugger knows how critical it is to walk through code to comprehend what's going on, especially as your code gets more complicated. And, for anyone used to Visual Studio, the debugging experience in Ruby leaves a lot to be desired. So the thing I spent one-third of my time on is now super fast, but the thing that I spent two-thirds of my time on is now frustratingly slow.

Catch #4: Performance
I can't stress enough how important speed is when it comes to the user experience. Milliseconds matter. Like other interpreted languages, Ruby will execute more slowly than languages that compile to native code. Moreover, the tools to measure and profile those milliseconds are still in their infancy. If you've never looked at a profile of your code, you are probably wrong about where your code spends its cycles.

RoR is a great tool to have in your inventory, and like every language it has its deficiencies. And, like any deficiencies, they can be mitigated -- for example, by writing better unit tests to reduce time spent debugging, and by using AJAX to improve user-perceived response time. As long as you understand the trade-offs, you can decide what's best for your project. For my one-man "fun" web development projects, RoR is a winner.

A taste of Ruby and Rails

Although I've done C, Perl, PHP, and Java-based web programming in past lives, I've spent the last 8 years or so developing on ASP.NET for a living. As a result, I believed Ruby on Rails (RoR) was a toy not meant to be used for anything serious. Because it's easy to learn, people use it to build cheesy websites. But that doesn't preclude you from building serious websites, Twitter being the standard-bearer for Rails. Here's what I can tell you: a programming language is merely a tool. And if data-driven web programming is your nail, then RoR is quite the hammer.

Here's the key difference: C developers think of a pointer as a basic building block.  C# developers think of hashtables and lists as basic building blocks.  RoR developers think of database tables as basic building blocks.  Working on SharePoint, anytime we needed to add a new database table, it was a big deal.  You had to write a bunch of CRUD operations and stored procedures.  You had to write a bunch of UI to expose the CRUD operations.  You had to write an object model.  You had to write upgraders.  You had to make sure it got backed up properly.  All of this took, on average, a month for a developer to code and unit test.  With Rails, all of this is inherent to the architecture.  You design the database schema, and all of this functionality is immediately available, freeing you from the drudgery and allowing you the time to actually design and build something useful.

You might say, well I can do all of these things with LINQ, the Entity Framework, or some third-party bolt-on solution.  And you'd be right.  But that's a bit like flying economy on a long-haul flight to Italy with a layover in London.  Oh, you'll get there, but only after much discomfort, a strained neck, and having paid extra for your checked bags.  Why bother when you can take a comfortable non-stop flight in first class -- and did I mention it was free?

Of course, it is free only in the monetary sense.  One of the main critiques of RoR is that you pay a price in performance (partially because it is not a compiled language).  But if it's good enough for Twitter's billion tweets a month, I think I'll manage.  In a follow-up post, I will go over some of the other problems I've encountered which are not often discussed in online forums.

Office has shipped!

Big congratulations to those working on Office 2010 and SharePoint 2010, which is finally DONE! This really was a monumental release for us. About halfway through the release, the manager for each product in Office was asked to give a short demo of something cool their team was building. I remember having to demo our product, watching the other demos, and thinking, "Damn, that's amazing stuff, I hope people think my demo is half as cool." And later, when my team switched to owning a platform component everyone depended on, I remember thinking, "Damn, we better get this thing working soon or those demos will never actually ship." Well, everyone pulled together, and now you can try for yourself all those amazing features we've built for you.

I found this statistic interesting: Microsoft announced their earnings today, and for the last quarter, the Business Division (which includes Office) represents about $2.6 billion in net profit for the company. To put that in perspective, that's more than all of Google. In fact, it's about the same as Google and Amazon put together. Just for Office. As an Office developer, it makes you think twice about how important each line of code is.

Collection of essays on software engineering

If you haven't already read Joel Spolsky's books on software (Joel on Software and Best Software Writing), I'd highly recommend them.  But while those are geared towards working on large projects at big companies, "Getting Real" from 37signals is a collection of essays about software engineering at a startup (and most of the lessons apply even if you are a team of one).  Better yet, it's free, so what have you got to lose?

Internet Math

This image from College Humor is intended as a parody, but there's quite a bit of truth in there:

It seems like most "new" startups are simply XX + Social, or YY + Mobile.  But that business model seems to hinge on XX and YY failing to notice that the startup is trying to eat their lunch, rather than immediately adding the same functionality and squashing said startup like a bug.

Best feature of Outlook 2010

Office 2010 is almost ready to ship!  I'm an Outlook user by day, and Gmail user by night.  But I find that Gmail doesn't scale well when you are being flooded with e-mail -- for example, basic UI metaphors like shift-click don't work, and labels just don't cut it compared to Outlook rules.  So, here's my favorite new feature from Outlook 2010 for dealing with floods of e-mail:

Basically, it deletes any e-mails that are entirely contained within replies later in the conversation. This is great for high traffic discussion aliases and long-winded threads.  There's just something really gratifying about pressing a button and seeing half my Inbox disappear..

Uh-oh for Windows?

For most people, the two biggest advantages of a PC over a Mac are that Macs cost more, and you can't play (most) games on a Mac.  Most Mac owners I know either have a separate gaming rig or dual boot to Windows just for video games.

Today marks an inflection point in the Mac vs PC war: Steam has been ported to Mac!  The only games I play on a PC anymore are those from Valve (Left 4 Dead, Counter-Strike, Half-Life, etc.) and from Blizzard (StarCraft, Warcraft, etc.).  Most other games are better experienced on a console.  Well, both of those sets of games are now going to be released for the Mac on the same day as the PC!

As someone who owns Microsoft stock, I see this as a big problem.  You do not want an OS whose main differentiator is that it's cheaper, or one that relies on mass-market inertia.  My computer use is split among internet use, coding, creativity software, office software, and video games.  If I were to buy a computer today, I would, for the first time, actually consider a Mac.  For the first time, Mac has achieved parity with PC across my usage scenarios.

This is a dangerous time for Microsoft.. tread carefully.

Color calibration, or lack thereof

Every monitor displays color differently.  If you've ever used dual monitors, you know what I'm talking about.  The picture below is my Lenovo T500 on the left, a Dell 2005WFP on the right:

I suppose how much of a color difference you see in the two monitors above depends on your monitor's color profile, but for me, the standalone monitor comes across as having greener greens and redder reds.  In fact, my laptop portrays this blog as a nice cool blue, whereas on my monitor it is a hideous shade of green.  My intention is most certainly the blue variant, but I have no idea what other people are seeing.

Anyways, this is really important for web design and photography.  So, I am using this as an excuse to go buy a Dell U2410 IPS monitor and a Spyder3 color calibrator.  That will ensure I am seeing what I am "supposed" to see, but presumably it remains a crapshoot for the remaining 99% of the world with uncalibrated monitors.  They, no doubt, will take a look at this blog and see some unflattering and garish hue.  Yuck.

Microsoft Azure Services

Microsoft is getting ready to release their cloud computing platform, Azure, and there's a pretty good overview written by David Chappell.  One snippet which I found amusing was:
Windows Azure platform AppFabric provides cloud-based infrastructure services. Microsoft is also creating an analogous technology known as Windows Server AppFabric. [...] Don’t be confused; throughout this paper, the name “AppFabric” is used to refer to the cloud-based services. Also, don’t confuse the Windows Azure platform AppFabric with the fabric component of Windows Azure itself. Even though both contain the term “fabric”, they’re wholly separate technologies addressing quite distinct problems.
Don't be confused? Really? Then don't call everything "fabric"!  I thought Microsoft had learned from the "Windows Live" naming debacle. Somebody needs to buy Microsoft a thesaurus..

Algorithms for storing and querying hierarchical trees

I've often found myself needing to represent hierarchical data in my database -- navigation trees, forum threads, organizational charts, taxonomies, etc.  I've been trying different approaches to maintaining a hierarchy, and thought others might be interested in my findings.  For purposes of illustration, our sample tree is the following:

        1
      /   \
     2     4
   /  \
  3    5

Approach #1: Adjacency list
The idea here is simple: each row in the table stores the ID of that node's parent.

table: nodes
  id   parent_id 

This is trivial to implement, but hierarchical queries become hard. In order to query for all nodes under a given branch, you have to recurse through its children. If you don't have too many nodes, you can just read the entire table into memory and cache it -- which is sufficient for most web site navigation structures, for example.
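As a rough illustration of the "read the whole table into memory" idea, here is a minimal Python sketch. The rows are my reading of the sample tree above, and the helper names (`build_children`, `subtree`) are made up for this example rather than taken from any library.

```python
from collections import defaultdict

# The sample tree as (id, parent_id) rows; the root has no parent.
# These rows are an assumption based on the diagram in this post.
ROWS = [(1, None), (2, 1), (3, 2), (4, 1), (5, 2)]


def build_children(rows):
    """Index the adjacency list by parent so lookups avoid rescanning rows."""
    children = defaultdict(list)
    for node_id, parent_id in rows:
        if parent_id is not None:
            children[parent_id].append(node_id)
    return children


def subtree(children, root_id):
    """Return root_id followed by all of its descendants, depth-first."""
    result = [root_id]
    for child_id in children.get(root_id, []):
        result.extend(subtree(children, child_id))
    return result


children = build_children(ROWS)
print(subtree(children, 2))
```

The recursion is the part a plain SQL query cannot express without help, which is why this approach leans on caching the table in the application.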

Approach #2: Store the Path as a string
Here, the idea is that each node stores its path as a string. For example, a node might have a path of "1_8_13". Thus, you could find everything under node "8" (its children and their descendants) by querying for all nodes with a path of "1_8_%".

table: nodes
       id        path 

This gives you the benefit of hierarchical queries, but only if you add an index on the "path" column, forcing SQL to do the heavy lifting. And, since it's a string column, queries will not be as fast as they would be against an integer column.
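Here is a small sketch of the path approach using Python's built-in sqlite3 module. I am assuming that each path ends with the node's own ID (as in the "1_8_13" example) and that the rows match the sample tree; the `descendants` helper is hypothetical. One wrinkle worth knowing: "_" is itself a single-character wildcard in SQL LIKE, so an unescaped pattern such as "1_2_%" would also match a path like "1_21". The sketch escapes it; picking a different separator would work too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_nodes_path ON nodes (path)")

# Assumed rows for the sample tree; each path ends with the node's own ID.
conn.executemany(
    "INSERT INTO nodes (id, path) VALUES (?, ?)",
    [(1, "1"), (2, "1_2"), (3, "1_2_3"), (4, "1_4"), (5, "1_2_5")],
)


def descendants(conn, node_id):
    """Return the IDs of every node below node_id (not node_id itself)."""
    (path,) = conn.execute(
        "SELECT path FROM nodes WHERE id = ?", (node_id,)
    ).fetchone()
    # Escape "_" so LIKE treats it as a literal separator, then match
    # anything that extends this node's path by at least one more segment.
    pattern = path.replace("_", r"\_") + r"\_%"
    rows = conn.execute(
        "SELECT id FROM nodes WHERE path LIKE ? ESCAPE '\\' ORDER BY id",
        (pattern,),
    )
    return [row[0] for row in rows]


print(descendants(conn, 2))
```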

Approach #3: Nested subsets
The idea here is that each subtree is kept within a contiguous range of IDs, and each node stores the range covered by its subtree. In the example, the subtree of 1 is (obviously) within the range of 1..5. However, you'll notice the subtree of 2 (nodes 3 and 5) cannot be described by the range 3..5, because node 4 falls inside that range but is not under node 2. As a result, we need a separate mutable ID in order to maintain the subsets.

table: nodes
       id        mutable_id   min_mutable_id    max_mutable_id  

Note how we had to swap the IDs of 4 and 5, so that node 2 could have a valid nested subset range of 3..4. This can easily happen on insertions as well and force us to recompute large parts of the table if shifting is required. However, hierarchical reads are fairly inexpensive, as they just become numerical range queries.
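To make the range query concrete, here is a sketch using Python's sqlite3 module. The rows are my assumption of what the table looks like after the swap (node 5 gets mutable ID 4, node 4 gets mutable ID 5), and I chose to have each stored range include the node itself, so node 2 covers 2..4 and its descendants alone are 3..4. The `subtree_ids` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE nodes (
           id             INTEGER PRIMARY KEY,
           mutable_id     INTEGER NOT NULL UNIQUE,
           min_mutable_id INTEGER NOT NULL,
           max_mutable_id INTEGER NOT NULL
       )"""
)

# Assumed rows: (id, mutable_id, min_mutable_id, max_mutable_id).
# Each range includes the node's own mutable ID (a choice made for this sketch).
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 5), (2, 2, 2, 4), (3, 3, 3, 3), (4, 5, 5, 5), (5, 4, 4, 4)],
)


def subtree_ids(conn, node_id):
    """Return the IDs in node_id's subtree using one numeric range query."""
    rows = conn.execute(
        """SELECT child.id
             FROM nodes AS root
             JOIN nodes AS child
               ON child.mutable_id BETWEEN root.min_mutable_id
                                       AND root.max_mutable_id
            WHERE root.id = ?
            ORDER BY child.mutable_id""",
        (node_id,),
    )
    return [row[0] for row in rows]


print(subtree_ids(conn, 2))
```

Reads stay cheap because the query never recurses. The cost shows up on writes, when inserting a node may mean renumbering everything to its right.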

Approach #4: Expanded tree
The idea here is that you store the normal adjacency list, but maintain another table of the tree already recursively expanded-out:

table: nodes
  id   parent_id 

table: nodes_expanded
  id   expanded_parent_id 
Essentially, the expanded table acts as a hierarchy cache.  For example, to get all nodes under the "2" subtree, just find all nodes with (expanded_parent_id == 2), which will return matches on 2, 3, and 5 as expected.  The main benefit of this approach is that all your SQL queries are based on exact match, whereas the last two approaches use range queries. Likewise, while an insertion will require you to futz with the "nodes_expanded" table, the data in the "nodes" table stays intact. With the nested subsets approach, you may find your main "nodes" table locked on reads while all the IDs get shuffled around.
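Here is a sketch of how the expanded table might be maintained, again with Python's sqlite3 module. It assumes every node is listed under itself (which is what makes the query for "2" return 2, 3, and 5) and only handles appending a new leaf whose parent already exists; moving or deleting subtrees would need more work. The `insert_node` and `subtree_ids` helpers are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.execute(
    """CREATE TABLE nodes_expanded (
           id                 INTEGER NOT NULL,
           expanded_parent_id INTEGER NOT NULL,
           PRIMARY KEY (id, expanded_parent_id)
       )"""
)


def insert_node(conn, node_id, parent_id):
    """Add a leaf node; the parent (if any) must already be in the tables."""
    conn.execute(
        "INSERT INTO nodes (id, parent_id) VALUES (?, ?)", (node_id, parent_id)
    )
    # Every node is listed under itself (an assumption of this sketch).
    conn.execute("INSERT INTO nodes_expanded VALUES (?, ?)", (node_id, node_id))
    if parent_id is not None:
        # The new node sits under everything its parent is listed under.
        conn.execute(
            """INSERT INTO nodes_expanded (id, expanded_parent_id)
               SELECT ?, expanded_parent_id
                 FROM nodes_expanded
                WHERE id = ?""",
            (node_id, parent_id),
        )


# Sample tree from this post, inserted parents-first.
for node_id, parent_id in [(1, None), (2, 1), (3, 2), (4, 1), (5, 2)]:
    insert_node(conn, node_id, parent_id)


def subtree_ids(conn, root_id):
    """Return the subtree of root_id with a plain equality match."""
    rows = conn.execute(
        "SELECT id FROM nodes_expanded WHERE expanded_parent_id = ? ORDER BY id",
        (root_id,),
    )
    return [row[0] for row in rows]


print(subtree_ids(conn, 2))
```

Note that the insert only ever touches "nodes_expanded" beyond the single new row in "nodes", which is the property that keeps the main table free of shuffling.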

So, to summarize:
Adjacency list
  • Easy to implement
  • Minimum storage
  • Slow calculation of subtrees (can mitigate with in-memory caching)
Path substrings
  • Easy to implement
  • Handles hierarchical queries
  • Relies on SQL index on a string column
  • Inefficient storage (only using 0-9 and "_" in the char range)
Nested subsets
  • Handles hierarchical queries
  • Insertions can be expensive
  • Insertions can result in lock contention
Expanded tree
  • Handles hierarchical queries
  • Hierarchy is pre-cached as a simple "equality" join
  • Requires maintaining separate "nodes_expanded" table
  • Insertions can be expensive, but not against the main "nodes" table

Later, I hope to implement each approach and benchmark them against one another.  Any other algorithms worth investigating?