musings about technology and software development..

A taste of Ruby on Rails, part 2

NOTE: This is a continuation of part 1.

Ruby on Rails is a great web development platform, but it has a few drawbacks that you should understand before switching from ASP.NET.

Catch #1: Library support, or lack thereof
When you're staring at Notepad, and you want to support uploaded images and auto-resize them, it's not easy to figure out where to start. With .NET, you have the standard library to poke around in and can usually find what you need with the help of IntelliSense. With Ruby, everything starts with a Google search. After deciding amongst five identical-sounding choices, you download an image plugin written 2 years ago by a kind developer, which doesn't install on Windows because said developer only has a Mac. So you download the tarball of source files, which you can't compile without a special version of Visual C++ and which doesn't work against the latest version of Ruby you just downloaded. After it finally installs, you end up burning an afternoon debugging it. Of course, it's free. But only in terms of money -- it costs a lot of time (not spent coding, mind you), and time is more important than money (that's why you're even considering using RoR, right?).

Catch #2: Convention over Configuration
One of the key tenets of Rails is convention over configuration. What this means is the framework has an expectation of a "right" way to do things, and if you do it that way, lots of goodness happens for you automatically. The net result is you will find yourself making compromises in your design just to stay within the conventions and keep that automatic goodness.
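
To make the trade-off concrete, here is a minimal sketch in Python (not Rails itself) of what a naming convention buys you and where it pinches. The helper `default_table_name`, the `Model` base class, and the naive "append an s" pluralization are all hypothetical; Rails' real inflector handles far more cases.

```python
import re


def default_table_name(class_name: str) -> str:
    """Derive a table name from a class name by convention.

    Hypothetical helper: CamelCase -> snake_case, then a naive plural
    formed by appending "s".
    """
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
    return snake + "s"


class Model:
    # Subclasses set this only when they need to opt out of the convention.
    table_name = None

    @classmethod
    def table(cls) -> str:
        return cls.table_name or default_table_name(cls.__name__)


class BlogPost(Model):
    pass  # follows the convention, so no configuration is needed


class Person(Model):
    table_name = "people"  # the naive convention would guess "persons"
```

`BlogPost` gets its table name for free, while `Person` has to either configure an override or rename its table to fit the rule, which is the kind of compromise described above.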

Catch #3: Stabilization
At Microsoft, perhaps only a third of the time is spent writing new code. Making the code functional is the easy part. Making it work in all the different user scenarios, under heavy load, with different data sets, and otherwise stabilizing the code, that's what we spend two-thirds of the time on. And what are you doing during much of this time? Debugging.

Anyone who's ever used a debugger knows how critical it is to walk through code to comprehend what's going on, especially as your code gets more complicated. And, for anyone used to Visual Studio, the debugging experience in Ruby leaves a lot to be desired. So the thing I spent one-third of my time on is now super fast, but the thing that I spent two-thirds of my time on is now frustratingly slow.

Catch #4: Performance
I can't stress enough how important speed is when it comes to the user experience. Milliseconds matter. Like other interpreted languages, Ruby will execute slower than ones which compile to native code. Moreover, the tools to measure and profile those milliseconds are still in their infancy. If you've never looked at a profile of your code, you are probably wrong about where your code spends its cycles.
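
The habit of profiling applies in any language. As an illustration only, here is a small Python sketch (not Ruby) that uses the standard `cProfile` and `pstats` modules to see where time goes. The functions `slow_sum` and `handle_request` are made-up stand-ins for real page-handling code.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    # Deliberately wasteful: builds a full list before summing it.
    return sum([i * i for i in range(n)])


def handle_request() -> int:
    # Hypothetical stand-in for a page handler.
    return slow_sum(200_000)


def profile_report(func) -> str:
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)
    return out.getvalue()


report = profile_report(handle_request)
```

Printing `report` lists each function with its call count and cumulative time, which is usually enough to confirm or refute a guess about the hot spot.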

RoR is a great tool to have in your inventory, and like every platform it has its deficiencies. And, like any deficiencies, they can be mitigated -- for example, writing better unit tests to reduce time debugging, and using AJAX to improve user-perceived response time.  As long as you understand the trade-offs, you can decide what's best for your project. For my one-man "fun" web development projects, RoR is a winner.

A taste of Ruby and Rails

Although I've done C, Perl, PHP, and Java-based web programming in past lives, I've spent the last 8 years or so developing on ASP.NET for a living. As a result, I believed Ruby on Rails (RoR) was a toy not meant to be used for anything serious. Because it's easy to learn, people use it to build cheesy websites. But that doesn't preclude you from building serious websites, Twitter being the standard-bearer for Rails. Here's what I can tell you: a programming language is merely a tool. And if data-driven web programming is your nail, then RoR is quite the hammer.

Here's the key difference: C developers think of a pointer as a basic building block.  C# developers think of hashtables and lists as basic building blocks.  RoR developers think of database tables as basic building blocks.  Working on SharePoint, anytime we needed to add a new database table, it was a big deal.  You had to write a bunch of CRUD operations and stored procedures.  You had to write a bunch of UI to expose the CRUD operations.  You had to write an object model.  You had to write upgraders.  You had to make sure it got backed up properly.  All of this took, on average, a month for a developer to code and unit test.  With Rails, all of this is inherent to the architecture.  You design the database schema, and all of this functionality is immediately available, freeing you from the drudgery and allowing you the time to actually design and build something useful.

You might say, well I can do all of these things with LINQ, the Entity Framework, or some third-party bolt-on solution.  And you'd be right.  But that's a bit like flying economy on a long-haul flight to Italy with a layover in London.  Oh, you'll get there, but only after much discomfort, a strained neck, and having paid extra for your checked bags.  Why bother when you can take a comfortable non-stop flight in first class -- and did I mention it was free?

Of course, it is free only in the monetary sense.  One of the main critiques of RoR is you pay a price in performance (partially because it is not a compiled language).  But if it's good enough for Twitter's billion tweets a month, I think I'll manage.  In a follow-up post, I will go over some of the other problems I've encountered which are not often discussed in online forums.

Office has shipped!

Big congratulations to those working on Office 2010 and SharePoint 2010, which is finally DONE! This really was a monumental release for us. About halfway through the release, the manager for each product in Office was asked to give a short demo of something cool their team was building. I remember having to demo our product, watching the other demos, and thinking, "Damn, that's amazing stuff, I hope people think my demo is half as cool." And later, when my team switched to owning a platform component everyone depended on, I remember thinking, "Damn, we better get this thing working soon or those demos will never actually ship." Well, everyone pulled together, and now you can try for yourself all those amazing features we've built for you.

I found this statistic interesting: Microsoft announced their earnings today, and for the last quarter, the Business Division (which includes Office) represents about $2.6 billion in net profit for the company. To put that in perspective, that's more than all of Google. In fact, it's about the same as Google and Amazon put together. Just for Office. As an Office developer, it makes you think twice about how important each line of code is.

Collection of essays on software engineering

If you haven't already read Joel Spolsky's books on software (Joel on Software and Best Software Writing), I'd highly recommend them.  But while those are geared towards working on large projects at big companies, "Getting Real" from 37signals is a collection of essays about software engineering at a startup (and most of the lessons apply even if you are a team of one).  Better yet, it's free, so what have you got to lose?

Internet Math

This image from College Humor is intended as a parody, but there's quite a bit of truth in there:

It seems like most "new" startups are simply XX + Social, or YY + Mobile.  But that business model seems to hinge on XX and YY failing to notice that the startup is trying to eat their lunch, rather than immediately adding the same functionality and squashing said startup like a bug.

Best feature of Outlook 2010

Office 2010 is almost ready to ship!  I'm an Outlook user by day, and Gmail user by night.  But I find that Gmail doesn't scale well when you are being flooded with e-mail -- for example, basic UI metaphors like shift-click don't work, and labels just don't cut it compared to Outlook rules.  So, here's my favorite new feature from Outlook 2010 for dealing with floods of e-mail:

Basically, it deletes any e-mails that are entirely contained within replies later in the conversation. This is great for high traffic discussion aliases and long-winded threads.  There's just something really gratifying about pressing a button and seeing half my Inbox disappear..

Uh-oh for Windows?

For most people, the two biggest advantages of a PC over a Mac are that Macs cost more, and you can't play (most) games on a Mac.  Most Mac owners I know either have a separate gaming rig or dual boot to Windows just for video games.

Today marks an inflection point in the Mac vs PC war: Steam has been ported to Mac!  The only games I play on a PC anymore are those from Valve (Left 4 Dead, Counter-Strike, Half-Life, etc) and from Blizzard (StarCraft, Warcraft, etc).  Most other games are better experienced on a console.  Well, both of those sets of games are now going to be released for the Mac on the same day as the PC!

As someone who owns Microsoft stock, this is a big problem.  You do not want an OS where your main differentiator is that it's cheaper, or to rely on mass-market inertia.  My computer use is split amongst internet use, coding, creativity software, office software, and video games.  If I were to buy a computer today, for the first time, I would actually consider a Mac.  For the first time, Mac has achieved parity with PC across my usage scenarios. 

This is a dangerous time for Microsoft.. tread carefully.

Color calibration, or lack thereof

Every monitor displays color differently.  If you've ever used dual monitors, you know what I'm talking about.  The picture below is my Lenovo T500 on the left, a Dell 2005WFP on the right:

I suppose how much of a color difference you see in the two monitors above depends on your monitor's color profile, but for me, the standalone monitor comes across as having greener greens and redder reds.  In fact, my laptop portrays this blog as a nice cool blue, whereas on my monitor it is a hideous shade of green.  My intention is most certainly the blue variant, but I have no idea what other people are seeing.

Anyways, this is really important for web design and photography.  So, I am using this as an excuse to go buy a Dell U2410 IPS monitor and a Spyder3 color calibrator.  That will ensure I am seeing what I am "supposed" to see, but presumably it remains a crapshoot for the remaining 99% of the world with uncalibrated monitors.  They, no doubt, will take a look at this blog and see some unflattering and garish hue.  Yuck.

Microsoft Azure Services

Microsoft is getting ready to release their cloud computing platform, Azure, and there's a pretty good overview written by David Chappell.  One snippet which I found amusing was:
Windows Azure platform AppFabric provides cloud-based infrastructure services. Microsoft is also creating an analogous technology known as Windows Server AppFabric. [...] Don’t be confused; throughout this paper, the name “AppFabric” is used to refer to the cloud-based services. Also, don’t confuse the Windows Azure platform AppFabric with the fabric component of Windows Azure itself. Even though both contain the term “fabric”, they’re wholly separate technologies addressing quite distinct problems.
Don't be confused? Really? Then don't call everything "fabric"!  I thought Microsoft had learned from the "Windows Live" naming debacle. Somebody needs to buy Microsoft a thesaurus..

Algorithms for storing and querying hierarchical trees

I've often found myself needing to represent hierarchical data in my database -- navigation trees, forum threads, organizational charts, taxonomies, etc.  I've been trying different approaches to maintaining a hierarchy, and thought others might be interested in my findings.  For purposes of illustration, our sample tree is the following:

        1
      /   \
     2     4
   /  \
  3    5

Approach #1: Adjacency list
The idea here is simple: you store each node's parent in a table:

table: nodes
  id   parent_id 

This is trivial to implement, but hierarchical queries become hard. In order to query for all nodes under a given branch, you have to recurse through its children. If you don't have too many nodes, you can just read the entire table into memory and cache it -- which is sufficient for most web site navigation structures, for example.
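
As a rough sketch of the "read it all into memory" option, assuming Python with an in-memory SQLite database standing in for the real one, the sample tree can be loaded once and walked from any branch. The helper names `load_children` and `subtree` are illustrative, and the walk uses an explicit stack instead of recursion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER)")
# Sample tree: 1 is the root, 2 and 4 are its children, 3 and 5 sit under 2.
conn.executemany(
    "INSERT INTO nodes (id, parent_id) VALUES (?, ?)",
    [(1, None), (2, 1), (3, 2), (4, 1), (5, 2)],
)


def load_children(conn):
    """Read the whole table once and build a parent -> children map."""
    children = {}
    query = "SELECT id, parent_id FROM nodes ORDER BY id"
    for node_id, parent_id in conn.execute(query):
        children.setdefault(parent_id, []).append(node_id)
    return children


def subtree(children, root_id):
    """Return root_id followed by all of its descendants."""
    result = [root_id]
    stack = [root_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            result.append(child)
            stack.append(child)
    return result


children = load_children(conn)
```

Calling `subtree(children, 2)` walks only the cached map, so the database is hit once no matter how many branches are queried afterwards.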

Approach #2: Store the Path as a string
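Here, the idea is that each node stores its path as a string. For example, a node might have a path of "1_8_13". Thus, you could find all the descendants of node "8" by querying for all nodes with a path like "1_8_%". (One caveat: "_" is itself a single-character wildcard in SQL LIKE patterns, so either escape it or pick a different separator.)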

table: nodes
       id        path 

This gives you the benefit of hierarchical queries, but only if you add an index on the "path" column, forcing SQL to do the heavy lifting. And, since it's a string column, your performance will not be as fast as if it were integer-based.
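
Here is a minimal sketch of the path approach, again assuming Python with SQLite. The paths for the sample tree follow the "1_8_13" format from the text, and node 21 is an extra, made-up node added only to show why the separator has to be escaped: an unescaped pattern of "1_2_%" would also match its path "1_21".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_nodes_path ON nodes (path)")
conn.executemany(
    "INSERT INTO nodes (id, path) VALUES (?, ?)",
    [
        (1, "1"),
        (2, "1_2"),
        (3, "1_2_3"),
        (4, "1_4"),
        (5, "1_2_5"),
        (21, "1_21"),  # hypothetical extra node, not part of the sample tree
    ],
)


def descendants(conn, node_id):
    """Return the ids of every node whose path starts with node_id's path."""
    row = conn.execute("SELECT path FROM nodes WHERE id = ?", (node_id,)).fetchone()
    path = row[0]
    # Escape "_" so LIKE treats it as a literal separator, not a wildcard.
    pattern = path.replace("_", "\\_") + "\\_%"
    rows = conn.execute(
        "SELECT id FROM nodes WHERE path LIKE ? ESCAPE '\\' ORDER BY id",
        (pattern,),
    )
    return [r[0] for r in rows]
```

With the escape in place, `descendants(conn, 2)` matches only paths that begin with the literal prefix "1_2_".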

Approach #3: Nested subsets
The idea here is that each subtree is kept within a range of IDs, and the range of its subtree is stored in the node. In the example, the subtree of 1 is (obviously) within the range of 1..5. However, you'll notice the subtree of 2 is NOT within the range of 3..5 because node 4 violates that rule. As a result, we need a mutable ID in order to maintain the subset.

table: nodes
       id        mutable_id   min_mutable_id    max_mutable_id  

Note how we had to swap the IDs of 4 and 5, so that node 2 could have a valid nested subset range of 3..4. This can easily happen on insertions as well and force us to recompute large parts of the table if shifting is required. However, hierarchical reads are fairly inexpensive, as they just become numerical range queries.
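
The range query can be sketched as follows, assuming Python with SQLite. One assumption to flag loudly: in this sketch each node's stored range starts at its own mutable ID, so node 2 stores 2..4 and its descendants occupy 3..4. A variant that stores only the descendant range would work the same way with different bounds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes ("
    " id INTEGER PRIMARY KEY,"
    " mutable_id INTEGER NOT NULL,"
    " min_mutable_id INTEGER NOT NULL,"
    " max_mutable_id INTEGER NOT NULL)"
)
# Nodes 4 and 5 carry swapped mutable IDs so the subtree of 2 is contiguous.
# Assumption: each range includes the node's own mutable ID.
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 5),
        (2, 2, 2, 4),
        (3, 3, 3, 3),
        (5, 4, 4, 4),
        (4, 5, 5, 5),
    ],
)


def subtree(conn, node_id):
    """Return the ids in node_id's subtree using a single range comparison."""
    rows = conn.execute(
        "SELECT child.id FROM nodes AS child"
        " JOIN nodes AS root"
        "   ON child.mutable_id BETWEEN root.min_mutable_id AND root.max_mutable_id"
        " WHERE root.id = ?"
        " ORDER BY child.mutable_id",
        (node_id,),
    )
    return [r[0] for r in rows]
```

The read is a plain numeric range check, but inserting a new child under node 3 would force every mutable ID after it to shift, which is the cost described above.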

Approach #4: Expanded tree
The idea here is that you store the normal adjacency list, but maintain another table of the tree already recursively expanded-out:

table: nodes
  id   parent_id 

table: nodes_expanded
  id   expanded_parent_id 
Essentially, the expanded table acts as a hierarchy cache.  For example, to get all nodes under the "2" subtree, just find all nodes with (expanded_parent_id == 2), which will return matches on 2, 3, and 5 as expected.  The main benefit of this approach is that all your SQL queries are based on exact match, whereas the last two approaches use range queries. Likewise, while an insertion will require you to futz with the "nodes_expanded" table, the data in the "nodes" table stays intact. With the nested subsets approach, you may find your main "nodes" table locked on reads while all the IDs get shuffled around.
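
A small sketch of the expanded table, assuming Python with SQLite. The helper `insert_node` is hypothetical, and it follows the example above in having every node list itself as one of its own expanded parents, so that querying for 2 returns 2, 3, and 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.execute(
    "CREATE TABLE nodes_expanded ("
    " id INTEGER NOT NULL,"
    " expanded_parent_id INTEGER NOT NULL)"
)


def insert_node(conn, node_id, parent_id):
    """Add a node and pre-expand its chain of ancestors."""
    conn.execute(
        "INSERT INTO nodes (id, parent_id) VALUES (?, ?)", (node_id, parent_id)
    )
    # Each node lists itself, so a subtree query also returns its root.
    conn.execute("INSERT INTO nodes_expanded VALUES (?, ?)", (node_id, node_id))
    if parent_id is not None:
        # Copy every expanded parent of the parent (including the parent).
        conn.execute(
            "INSERT INTO nodes_expanded (id, expanded_parent_id)"
            " SELECT ?, expanded_parent_id FROM nodes_expanded WHERE id = ?",
            (node_id, parent_id),
        )


def subtree(conn, node_id):
    """Return the ids under node_id with a single equality match."""
    rows = conn.execute(
        "SELECT id FROM nodes_expanded WHERE expanded_parent_id = ? ORDER BY id",
        (node_id,),
    )
    return [r[0] for r in rows]


for node_id, parent_id in [(1, None), (2, 1), (3, 2), (4, 1), (5, 2)]:
    insert_node(conn, node_id, parent_id)
```

Inserting a leaf only appends rows to "nodes_expanded". Moving or deleting a whole branch is where the extra bookkeeping would show up, and this sketch does not cover that.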

So, to summarize:
Adjacency list
  • Easy to implement
  • Minimum storage
  • Slow calculation of subtrees (can mitigate with in-memory caching)
Path substrings
  • Easy to implement
  • Handles hierarchical queries
  • Relies on SQL index on a string column
  • Inefficient storage (only using 0-9 and "_" in the char range)
Nested subsets
  • Handles hierarchical queries
  • Insertions can be expensive
  • Insertions can result in lock contention
Expanded tree
  • Handles hierarchical queries
  • Hierarchy is pre-cached as a simple "equality" join
  • Requires maintaining separate "nodes_expanded" table
  • Insertions can be expensive, but not against the main "nodes" table

Later, I hope to implement and benchmark each approach against each other.  Any other algorithms worth investigating?

Windows 7 Shortcuts

Just thought I'd share some shortcut keys I use all the time:

Windows + D: Show Desktop
Windows + Tab: 3D Flip
Windows + #: Runs the #'th program on your Quick Launch

And in Explorer:

Shift+Right-Click on a folder/file: Additional options like "Open command window here"
Alt+Up: Goes up a folder level in Windows Explorer (plus Alt+Left/Right for Back/Forward)

My love for dependencies ...

Once upon a time, we had a developer whose full-time job was debugging random issues in some particular feature.  That feature had a dependency on an external team who had no vested interest in this feature, and therefore using their library was a bit like using chopsticks (their library) to eat steak (of course our feature is the delicious steak).  Sure, you can use the chopsticks, but every time you do you question whether you'd be better off without them and just eating the steak with your hands.

Couldn't get any worse, right?

So, when a different team approached us with a product that was a perfect fork and knife that they used to eat steak every day for the last three years, we chomped at the bit to get a hold of it.  Long story short, their utensils were made of plastic and were constantly breaking, and now we have two developers whose full-time jobs are debugging random issues in this feature. 

We long for the days of having chopsticks to eat our steak.  Do not take dependencies lightly.

Hard Drive Backup with Live Mesh

I hope everyone out there is backing up their data.  Up until now, I've used the tried-and-true method of copying my files periodically to another drive.  Of course, in the event of data catastrophe, I would lose all my changes since the last xcopy .. which was .. about 9 months ago.  A file backup gestation period, if you will.

In any case, I'm now using Live Mesh.  It's cross-platform and you get 5 GB of online storage for free (you can sync unlimited data between machines).  I've synchronized my music, videos, and documents between all my machines, which is pretty fantastic.  In case you want to try it, here's what I would have liked to know beforehand:
  • You cannot synchronize your Desktop folder.
  • Your first 5 GB of synchronized files ends up in the cloud. Choose wisely.
  • You have to log in with a LiveID, but it doesn't share cookies with the browser.  So, if you ever want to sync with a friend, create a new LiveID to share.
  • When you add a folder to be sync'd, it will show up on every other machine as a virtual folder.  This can be very confusing when you've named them all "Documents" -- prefix folder names with the computer name.
My next step is to set up a sync with a friend in another state, in case my home with all my computers burns down.  Overall, it was pretty easy to set up, although I now have a paranoia that one node will decide to delete something, and spontaneously trigger all my files to be deleted on every machine simultaneously.

Concurrency bug..

OK, spot the bug in the code:

    object m_lockObject = new object();
    object[] m_collection = null;

    public object[] GetCollection() {
        lock (m_lockObject) {
            if (m_collection != null) {
                // already initialized
                return m_collection;
            } else {
                // needs to be initialized
                m_collection = new object[5];
                return m_collection;
            }
        }
    }

.. the bug is that a second call could come in after m_collection is new'd up, but before it's populated, resulting in an empty collection being returned.  The first call works, the second call sometimes fails, and the third call onwards likely succeeds.  Bugs like this can be a pain to track down as, depending on what these objects do, the symptoms will appear really strange...
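
One way to avoid this class of bug is to fill in the collection completely before any other caller can see it. The sketch below is in Python rather than C#, and it rests on an assumption: that the part of the original not shown here populates the elements after the field has already been assigned. The class `CollectionHolder` and the `_make_item` initializer are hypothetical.

```python
import threading


class CollectionHolder:
    def __init__(self):
        self._lock = threading.Lock()
        self._collection = None

    def get_collection(self):
        with self._lock:
            if self._collection is None:
                # Build and fill a local list first...
                items = [self._make_item(i) for i in range(5)]
                # ...and only then publish it through the shared field.
                self._collection = items
            return self._collection

    @staticmethod
    def _make_item(index):
        # Hypothetical element initializer, for illustration only.
        return {"index": index}
```

Because the shared field is assigned only after the local list is complete, a caller either triggers the initialization itself or sees a fully populated collection, never a half-built one.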

Managing your time wisely

I'm one of those people who strive to be “efficient”. I learned this playing games like StarCraft. To win, you have to click like a madman to control everything at once. The best players were above 200 clicks per minute. And, you better type at light speed, otherwise you will get clobbered while writing messages to your teammates. At work, this means I don't sit around waiting for a process recycle or rebuild; I always go quick-check something else while I wait. I've read that your brain thinks at around 400 wpm (words per minute), so even if you type at a zippy 150 wpm you are wasting brain cycles. When I watch people type at a very reasonable 60 wpm, it takes every ounce of restraint in my body not to rip the keyboard away and type for them.

So yes, patience is not one of my virtues. As a result, I cannot believe a 3.0 GHz quad-core computer makes me wait. Ever. Every time Outlook hangs while I'm in the middle of typing my e-mail, I can't help but flip it the bird. What on earth is it doing? If not for NetBIOS name restrictions, my computers' names would be !$(^$@( and %!%^(*.

Anyways, some tips for dealing with e-mail:

  • Reply to the e-mail the first time you read it. It takes a few minutes to context switch into a problem, so make sure to only do it once. Don't “save this mail for later”, because you'll either forget, waste time reading it again, or make the other guy wait. Even a brief initial response is often enough for the sender to figure the problem out.
  • Delete the e-mail as soon as you reply.  Don't worry, it'll be in your trash for a while, and you have your "sent items" to fall back on too.  But the net result will be a clean Inbox which reads like a to-do list so you won't lose track of things. 
  • If your response is going to be more than a paragraph or two, go talk in person. I could not believe how long it takes to craft a well thought-out e-mail -- try timing yourself sometime. And even then, the recipient usually just asks you to schedule a meeting and it quickly becomes clear they didn't even bother reading the mail.
Some tips for software development:
  • Invest in your development environment.  Spend time learning all the shortcut keys, discovering ways to customize the tools you use every day, and getting the hardware that will make you most productive.  Start with a new monitor. A big one.
  • Given that your typing speed is a constant, reduce the amount you have to type. I create batch files for everything -- "n" for notepad, "d" for diff, "b" to rebuild, shortcuts to take me directly to common directory paths, etc. I ditched my hardware KVM switch because of the two-second switching lag -- using software to swap desktops is instantaneous. It may not sound like much, but instantaneous is an order of magnitude better and can change the way you work.
  • Automate repetitive tasks. If you find yourself doing the same thing over and over, you can save tons of time by automating it. I've written tons of tools that do repetitive, labor-intensive tasks automatically, and your peers will appreciate it too when you share it with them.
Strategies that haven’t worked out for me:
  • Closing the door doesn’t prevent people from stopping by, and it shouldn’t. The fact that they invested the time to pay you a visit means that it must be important to them. Ignoring them may only save you five minutes, but cost them an hour.
  • Having separate dedicated boxes for coding and e-mailing doesn’t allow me to focus single-mindedly on programming. It just makes me switch between machines all the time.
  • When I come in early in the morning, I don't get any additional work done. If I don't get sleep, I will spend the morning sipping tea and reading the news. Even more than usual.
What strategies work for you?