Thursday, March 5, 2009

Is It Really My Job To Fix That!?

That's Not On My Resume

I have to tell you that working in a software services firm isn't always a picnic. Being a vendor (read: outsider) who is responsible for delivering working solutions on other people's infrastructure, in a foreign LAN, is nothing but a brazen undertaking. It's akin to trying to construct a building in a war-torn, third-world country that happens to be in the middle of ongoing natural disasters. The drama sometimes matches that of most reality TV series, and while it's rarely dull, there are easier ways in life to get a win.

Problems

When the hardware isn't changing and internal IT isn't ignoring you, it's most likely that you'll get bad OS builds or underpowered VMs, or that you'll fail to take note of the hidden agendas in a politically charged atmosphere. This is all in addition to the regular project management woes. I'm just talking about delivery here; requirements and software lifecycle are another boat entirely.

You need to be really sharp to see all the possible ways you could fail, and even then it'll probably be something small, something that never even crossed your worried mind, that ends up threatening you and your project. Game on.

An Instance of Success

Today we ran into an interesting problem while installing some ETL on a BI server. The machine was running SSIS RTM on Win2k3 x86 Service Pack 1. The ETL was taking a bunch of records from a source SQL Server database (SP2) and migrating them to a destination SQL Server database (SP2). The ETL would run for about 20ish minutes before consistently throwing the following exceptions:

An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Protocol error in TDS stream".

An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Communication link failure".

An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "TCP Provider: An existing connection was forcibly closed by the remote host. ".
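For anyone triaging a similar log, here is a minimal Python sketch that pulls the source, HRESULT and description out of records shaped like the three above and flags the ones that point at the network transport rather than at the package itself. The function names and the hint list are my own assumptions for illustration; they are not part of SSIS or SQL Native Client, and the hints only cover the three messages quoted in this post.

```python
import re

# Matches records shaped like the three quoted above. Only this shape is
# assumed; other providers may format their error records differently.
RECORD = re.compile(
    r'Source: "(?P<source>[^"]*)"\s+'
    r'Hresult: (?P<hresult>0x[0-9A-Fa-f]+)\s+'
    r'Description: "(?P<description>[^"]*)"'
)

# Hypothetical hint list: just the three descriptions seen in this post.
TRANSPORT_HINTS = (
    "protocol error in tds stream",
    "communication link failure",
    "forcibly closed by the remote host",
)


def parse_records(log_text):
    """Return one dict per OLE DB record found in log_text."""
    return [
        {
            "source": m.group("source"),
            "hresult": m.group("hresult"),
            "description": m.group("description").strip(),
        }
        for m in RECORD.finditer(log_text)
    ]


def looks_like_transport_failure(record):
    """True when the description matches one of the transport hints."""
    text = record["description"].lower()
    return any(hint in text for hint in TRANSPORT_HINTS)


SAMPLE_LOG = '''
An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Protocol error in TDS stream".
An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Communication link failure".
An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "TCP Provider: An existing connection was forcibly closed by the remote host. ".
'''

if __name__ == "__main__":
    for rec in parse_records(SAMPLE_LOG):
        tag = "transport?" if looks_like_transport_failure(rec) else "other"
        print(tag, rec["hresult"], rec["description"])
```

If every record in a failed run gets flagged this way, that is at least a hint to look at the network stack on both ends before tearing the package apart.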

Hmm...what to do. The ETL worked great in stage and in our own QA environment. After an exhaustive troubleshoot (on our client's machine) the fix ended up being...disabling TCP Chimney offloading. Not even on our BI server but on the destination SQL Server!
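For reference, here is a rough Python sketch of how that setting can be turned off from a script. I'm not claiming these are the exact steps we ran: this post doesn't record which Windows version the destination server had, so the "2003" and "2008" keys and the helper names are assumptions for illustration. The two `netsh` commands are the ones Microsoft documents for Windows Server 2003 (with the Scalable Networking Pack or SP2) and for Windows Server 2008 respectively. The helper defaults to a dry run, so nothing changes unless you ask it to.

```python
import subprocess

# netsh commands for disabling TCP Chimney offload, keyed by a
# hypothetical "family" label chosen for this sketch.
CHIMNEY_DISABLE_COMMANDS = {
    # Windows Server 2003 with the Scalable Networking Pack / SP2
    "2003": ["netsh", "int", "ip", "set", "chimney", "DISABLED"],
    # Windows Server 2008
    "2008": ["netsh", "int", "tcp", "set", "global", "chimney=disabled"],
}


def chimney_disable_command(windows_family):
    """Return the netsh argument list for the given family label."""
    try:
        return list(CHIMNEY_DISABLE_COMMANDS[windows_family])
    except KeyError:
        raise ValueError("unknown Windows family: %r" % (windows_family,))


def disable_chimney(windows_family, dry_run=True):
    """Return the command line; run it only when dry_run is False.

    Running it needs an elevated prompt on the target machine.
    """
    cmd = chimney_disable_command(windows_family)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # Dry run: just show what would be executed.
    print(disable_chimney("2003"))
    print(disable_chimney("2008"))
```

Check with whoever owns the box before flipping it, since the setting affects every connection on that server and not just yours.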

"What!?" you say. "Your problems with delivery weren't even vaguely related to the technologies you were developing!?" Exactly. Something not even vaguely related to ETL became a huge friction point between us and a client. The worst part is that there's always a risk of this. Vendors come in and constantly take it on the chin (or sometimes just look incompetent) because something they depend on isn't working.

What makes it worse is that you may never have seen that machine in your life, and it's the one that just might bury you. Half the time you don't even have the necessary rights on the given machines to fix it, let alone conduct a decent troubleshoot. All of this just might make a man want to take up farming.

This time we did have the necessary rights on the machine. We got lucky, and there's even a chance that we might actually get paid for the troubleshoot...this time. Who knows what kind of circus fixes we'll be trying to pull off tomorrow. The good news is that after a while some partners actually start to trust you.

Hope that fix helps some vendor somewhere. We sure could have used it.

My Best,
Tyler