Wednesday, October 29, 2008

Don't Set Your Sharepoint App To Full Trust

I Know You've Done It

It happens all the time: SharePoint web applications get put in Full Trust simply because the developer can't be bothered to learn about the options, never mind exercise them.

Don't feel bad if this sounds familiar, I've done it too. In fact we're both just two of many, a growing army of developers that simply "fix" the problem by setting the <trust level="Full" originUrl="" /> in the web.config. We move on with our lives and gleefully watch runtime exceptions disappear, thinking we're doing the world a favor. Unfortunately we're also exposing our applications to unnecessary security risk, and when the SharePoint content database gets mangled because some developer decided to let all code run in full trust, you're going to wish you'd read to the end of this post.

CAS For Dummies

Code Access Security (CAS) is all about protecting a bunch of code in a runtime environment (as in the .NET runtime) from other code. It's meant to provide hooks for you, the developer, so that you can decide what kinds of code can call your libraries. You can demand that the calling code has at least a minimum set of privileges before you let it call your members/classes/assemblies. This is pretty handy, especially if you're planning on providing some dangerous functionality in the form of an API.

Let's pretend that you're writing an API that lets users write to a sensitive disk area. Well what happens if some sketchy code wants to use your library to do its dirty work!? Even if the user running the code has access to the disk, the user doesn't necessarily know what the application is doing; they could have gotten it off of Astalavista.com...they only really know what the UI is telling them.

We're trying to stop some sketchy application from using your API to perform dangerous operations on disk.

Shields Up

Enter code access security. Leveraging CAS you can put special attributes on your code to ensure that the calling code has at least certain privileges. If it doesn't, the runtime throws a security exception.

In the block below, the AccessSecretArea() method only runs if the calling code already has access to "c:\windows\system32\secretplace". We can be assured that this method isn't going to elevate the privileges of untrustworthy code downloaded off the Internet and run unwittingly by some callous user. Remember that this has nothing to do with the rights of the user, it has to do with the trustworthiness of the calling code.

[FileIOPermissionAttribute(SecurityAction.LinkDemand,
Read = "C:\\windows\\system32\\secretplace")]
public string AccessSecretArea()
{
    //Go get content from a sensitive area on the computer...
}

In the code below we ensure that the calling code at least has rights to access the SharePoint object model. Otherwise the .NET runtime tells the calling code to hit the road.

[SharePointPermission(SecurityAction.LinkDemand, ObjectModel=true)]
public SPSite OpenWeb(string url)
{
//Go get some Sharepoint API object.
}

How Permissions Are Set

So how does your code get permissions assigned to it?

It can happen in a couple of ways:

  • Code Access Security Policy on the computer (held in Administrative Tools->.NET Configuration). This normally sets trust levels for non web application code from different zones (Local, Intranet, Internet, Trusted Sites, Untrusted Sites)
  • Trust level explicitly set in web.config or bubbled down from the web.config held in c:\windows\microsoft.net\Framework\[version]\Config\web.config. Web Applications can dictate the trust level they want to run at with the <trust level="Full"> tag. The options for ASP.NET are Full, High, Medium, Low, Minimal. Out of the box, ASP.NET web applications run in full trust. Good idea?...who knows.
  • Permissions explicitly set in custom policy files. These files let you give certain assemblies in an application more trust than others. SharePoint creates two such custom policy files WSS_Minimal (default) and WSS_Medium. You can set them like so: <trust level="WSS_Medium" />

What Permissions Come With Which Trust Level?

Glad you asked. The table below reads like so: when your app is running at trust level [column name], it gets [contents of cell] privileges for the [row name] permission group. Let's have a look.




Permission | Full | High | WSS_Medium | Medium | Low | WSS_Minimal | Minimal
Environment | * | * | Read: TEMP, TMP, OS, USERNAME, COMPUTERNAME | Read: TEMP, TMP, OS, USERNAME, COMPUTERNAME | - | - | -
FileIO | * | * | Read, Write, Append, PathDiscovery: Application Directory | Read, Write, Append, PathDiscovery: Application Directory | Read, PathDiscovery: Application Directory | - | -
IsolatedStorage | * | * | AssemblyIsolationByUser, Unrestricted UserQuota | AssemblyIsolationByUser, Unrestricted UserQuota | 1 MB quota, AssemblyIsolationByUser | - | -
Reflection | * | ReflectionEmit | - | - | - | - | -
Registry | * | - | - | - | - | - | -
Security | * | Execution, Assertion, ControlPrincipal, ControlThread, RemotingConfiguration | Execution, Assertion, ControlPrincipal, ControlThread, RemotingConfiguration | Execution, Assertion, ControlPrincipal, ControlThread, RemotingConfiguration | Execution | Execution | Execution
Socket | * | * | - | - | - | - | -
WebPermission | * | * | Connect to origin host (if configured) | Connect to origin host (if configured) | - | - | -
DNS | * | * | * | * | - | - | -
Printing | * | Default Printing | Default Printing | Default Printing | - | - | -
OleDBPermission | * | - | - | - | - | - | -
SqlClientPermission | * | SqlClient | AllowBlankPassword=false | AllowBlankPassword=false | - | - | -
EventLog | * | - | - | - | - | - | -
Message Queue | * | - | - | - | - | - | -
Service Controller | * | - | - | - | - | - | -
Performance Counters | * | - | - | - | - | - | -
Directory Service | * | - | - | - | - | - | -
SharePointPermission | * | - | Object Model | - | - | - | -
WebPartPermission | * | - | Connections | - | - | Connections | -

* = All Privileges for that permission group
- = No Privileges for that permission group

This speaks to why people often get the following exception when running code in SharePoint that tries to access the object model. Notice in the table above that WSS_Minimal (which is the default trust level that WSS ships with) doesn't have CAS rights to get at the object model. Hence the exception.

Request for the permission of type 'Microsoft.SharePoint.Security.
SharePointPermission, Microsoft.SharePoint.Security,Version=12.0.0.0,
Culture=neutral, PublicKeyToken=71e9bce111e9429c' failed.

When these kinds of exceptions happen (CAS exceptions, that is), you have three options. You can either:

  1. Raise the trust level of the application so you get the permissions you need.
  2. GAC your assemblies (so that they run in full trust and get the permissions needed).
  3. Leave your assemblies in the Bin folder and author a custom policy file.

The pros/cons are detailed below. They were taken from here.


Option: Increase the trust level for the entire application (i.e. change the trust level in web.config).
Pros: Easy to implement. In a development environment, increasing the trust level allows you to test an assembly with increased permissions while still letting you recompile assemblies directly into the BIN directory without resetting IIS.
Cons: This option is the least secure. It affects all assemblies used by the virtual server. There is no guarantee the destination server has the required trust level, so Web Parts may not work once installed on the destination server.

Option: Create a custom policy file for your assemblies.
Pros: Recommended approach and the most secure option. An assembly can operate with a unique policy that meets its minimum permission requirements. By creating a custom security policy, you can ensure the destination server can run your Web Parts.
Cons: Requires the most configuration of all three options.

Option: Install your assemblies in the GAC.
Pros: Easy to implement. This grants Full trust to your assembly without affecting the trust level of assemblies installed in the BIN directory.
Cons: This option is less secure. Assemblies installed in the GAC are available to all virtual servers and applications on a server running Windows SharePoint Services, which potentially grants a higher level of permission to your assembly across a larger scope than necessary. In a development environment, you must reset IIS every time you recompile assemblies. Licensing issues may arise due to the global availability of your assembly.

Customizing A Policy File

A great alternative to raising the trust level for the entire application, or GAC'ing your assemblies is to simply create a custom policy file that gives the assembly in question the trust level or permissions you have in mind.

Because this post is already getting long, I'm going to save the set of instructions for a future post.

Epilogue

The moral of the story is that by blindly cranking the trust level of your application you're allowing ANY code in that web application to run with additional CAS privileges you just handed out (see table above).

While this might have been cool when you were an ASP.NET developer and there were a finite number of web parts/features running around, remember that SharePoint is a platform where other developers, administrators and even worse...information workers are actively installing prototype web parts from god knows where. If you value your SharePoint instance, you really don't want to trust all that code. In fact, you should consider living in a bomb shelter and trusting no one...

X Files - Trust no one.

Best,
Tyler

Sunday, October 26, 2008

Seriously...Stop Writing Code

Pick Me!

In my last year of university I had the privilege of being flown out to Toronto to attend VS Live! 2004. It was the first developer conference I'd been to (at least of scale), and even though it was years ago there are a lot of moments that still resonate with me today.

One such moment was during an event called Midnight Madness. Some .NET evangelist was working the crowd into a near frenzy. The excitement may have been related to the subsidized bar, but either way the crowd was fully engaged.

I remember the speaker calling out to the crowd as he described his perfect development team.

"How many of you have written your own UI controls?" he asked encouraging the crowd to respond by putting up his hand.

[Hands shot up and people cheered.]

"How many of you have written your own data structures, sorting algorithms, or math libraries?" he continued by putting his hand in the air again?

[The crowd continued to roar with some attendees now on their feet.]

"Well unfortunately..." he said pausing for effect. "I wouldn't hire any of you to be on my team..." he finished watching the crowd deflate.

[A hurt silence hung in the air while a lot of confused developers sat back down].

The speaker went on to explain one of the most important principles I've ever learned.

Good Developers Write Less Code

The first time you hear this it's a bit of a doozy. It's an incredibly hard discipline to instill because developers love to write code. It's probably related to why we became developers in the first place!

What makes the problem worse is that the more clever you are, the more likely you are to figure out a good solution to a problem...and code it up. Moreover, you might see some opportunity for reuse in your solution...and make a library out of it. If what you just wrote already exists (even a little) then you're probably doing your team and your company a huge disservice.

Every line of code that you write has a huge cost associated with it, every line.

Each line needs to be:

  • Debugged
  • Commented (and comments need to be maintained as well)
  • Stored and Versioned
  • Tested/QA'd
  • Maintained/Updated
  • Read and understood by teammates
  • Deployed

The cost of ownership for all that code you produce is huge, even if you happen to stay with the company and the requirements don't change (which is unlikely).

Code rots. It doesn't matter how elegant or well implemented the solution is, the system will inevitably get old and the requirements will shift. What was once your prized piece of work will soon become someone else's real problem. When was the last time you heard a developer say:

Developer 1: "Wow, I sure am glad I inherited all this code from [Developer 2]! Maintaining this system is going to be effortless in days to come, I can't wait until the change requests start to appear!"

You just witnessed English that has never been, and will never be, uttered. Good developers aren't necessarily awesome coders; they're great at learning, troubleshooting and understanding. And most of all, they're extremely reluctant to write code, ideally because they understand the cost of owning it.

With a Gun To Your Head

Of course sooner or later you're going to have to write something...it is after all your job. The point I'm trying to make is that it should only be under extreme duress that you reach for the keyboard. Only after doing some serious research and asking around should you consider touching the home row. There are a lot of alternatives that can be exhausted prior to getting creative with your own solution. Tools like code generation, excellent 3rd party libraries, and existing services/applications can not only speed up a project timeline but also save a lot of money in the long run.

Is there really value in writing your own CRUD stored procedures for each entity in your data store? Consider generating them with some code gen tools.

Is there any value in writing your own Ajax grid or WYSIWYG editor? Buy one. If you're worried about the company going under then buy the source too.

Exceptions

There are of course exceptions, Joel Spolsky writes about a key exception that you can use as a guiding principle when it comes to deciding whether or not this is something you want to own and maintain in house.

If it's a core business function -- do it yourself, no matter what.

The point is that you're always going to want control over key business functions. If you outsource everything then there's no way you can be exceptional or offer core business functionality that's better than your competitors.

Enter the balancing act of what to own and what to outsource.

It's worth noting that I would be very careful about what you consider your core business functions. The more code you write, the more that differentiated service is going to cost you, and as a result, the more it will cost your customers too.

Keep the costs in mind. Writing those solutions is fun, especially when the problems are interesting, but we're not just developers, we're analysts too. The gold star really belongs to the developer who is selfless enough to have the customer's (and more importantly the company's) best interests at heart.

My Best,
Tyler

Thursday, October 23, 2008

Experts-Exchange...Still Cloaking

The Site

Experts-Exchange has been around for quite some time. Even today the site still gets a lion's share of traffic. Alexa ranked Experts-Exchange as one of the 1500 most visited sites for the first 6 months of 2008.

I personally stopped using the site years ago when you had to upgrade your account to a premium membership to view answers. What surprised me at the time was that the site continued to get great search result ranking. Only this week however did I finally discover what's been happening all this time.

Cloaking

How has the Experts-Exchange gotten such good search result ranking if they don't show answers to non authenticated users? Simple...they do show answers to anonymous users, just not all of them. Experts-Exchange targets specific search engine user agents and renders out a different page for them. This is a practice called cloaking.

Experts-Exchange very purposefully shows one page to the Google Bot and an entirely different page to other user agents (say MSIE). This is a blatant breach of the Google quality guidelines and would normally get you removed from the search index.

Shadiness in Action

You can taste the difference yourself by performing the following steps:

  1. Download and install Firefox and User Agent Switcher (you may also want to use Tamper Data to validate the request's user agent).
  2. Go get a list of common user agents here and import them into the User Agent Switcher add on.
  3. Use Firefox to navigate to an Experts-Exchange question of your choice using a normal browser agent (you can inspect your user agent by looking at the request in Tamper Data). When I ran the URL out of stock Firefox the request looked like: [Image: Request to Experts-Exchange using the normal (default) user agent.]
    The response to this request is the following page (bottom of page shown). [Image: What the bottom of the page looks like when requesting with a Firefox user agent.]
  4. Now use User Agent Switcher to change your user agent to Google Bot and make a request to the same URL. The request should look like: [Image: Request to Experts-Exchange using the GoogleBot user agent.]
    And of course you get a dramatically different page. [Image: Experts-Exchange page with GoogleBot user agent.]
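If you'd rather script the comparison than click through a browser, a rough C# sketch will show the same thing. The question URL and user agent strings below are placeholders I made up; only the pattern matters.

using System;
using System.IO;
using System.Net;

class CloakCheck
{
    static string Fetch(string url, string userAgent)
    {
        //Request the page while pretending to be the given user agent.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        string url = "http://www.experts-exchange.com/some-question.html"; //hypothetical

        string asBrowser = Fetch(url, "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/2008 Firefox/3.0");
        string asGoogleBot = Fetch(url, "Googlebot/2.1 (+http://www.google.com/bot.html)");

        //If the site is cloaking, the two responses will differ dramatically.
        Console.WriteLine("Browser UA response length:   {0}", asBrowser.Length);
        Console.WriteLine("GoogleBot UA response length: {0}", asGoogleBot.Length);
    }
}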

Free Answers

One might think that this is great, that you can get free answers from what's normally a paid service. The icing on the cake is that Google will even promote these answers to the top of their index based on content that only you the savvy user can see! Hurray for us (at least those who know how to step around the cloak).

Free answers are great, right? It's worth mentioning that there are even more free answers to be had. For some URLs you'll see the answer anyway if you simply scroll to the bottom of the screen, regardless of your user agent. Even if that weren't available you could always try to dig the answer out of Google's page cache.

The unfortunate part is that this has been going on for years and Google doesn't really seem to care. Here's a site that's blatantly cloaking, and not terribly shy about it. There are not only Wikipedia entries that talk about this in detail, but blog entries going back at least a year describing this particular version of search engine abuse.

I'm positive that Google is aware that sites like this are showing specialized content to their crawler, but I'm blown away that they haven't done anything about it, especially given that this particular site has been doing it for quite some time.

When you see companies like this break the rules without any action from search engines, what's really there to make you follow the rules? Imagine if lots of sites started to cloak; trust me, you'd get irritated in a hurry with the search results you'd get. This is one problem I don't know how to fix, but at least we all know how to get the content we want off of Experts-Exchange.

That's one answer you won't find on Experts-Exchange...

Shaking the head,
Tyler

Friday, October 17, 2008

How Well Do SharePoint Lists Scale?

List Scalability

List scalability is a hot topic in the WSS/MOSS arena, and for good reason. Pretty much everything in SharePoint is stored in a list. Knowing if you're likely to hit a wall with scalability is a proactive step that can save you a tonne of time (and help you decide if a List is an appropriate container in the first place).

Rules of Thumb

When it comes to putting a tonne of data into a SharePoint list I would consider the following guidelines:

  1. No more than 2000 items per view/container. That could be 1900 items and 100 folders but either way, no more than 2000 per view. A folder is an example of a container, so is the "root" level of a list.
  2. No more than 5 million items per list. This is the max, it's only really do-able by nesting many folders together which in turn hold more items.
  3. No more than 2000 lists per sub site (SPWeb).

Methods of Access

It's also worth noting that depending on how you access data from a list you can get dramatically different performance (especially for large data sets). The following white paper from Microsoft (which was described as a must read by Joel Oleson) compares the performance of getting data out of lists with the following techniques:

  • SPList with For/Each
  • SPList with SPQuery
  • SPList with DataTable
  • List Web service
  • Search
  • PortalSiteMapProvider (yes you can query a list with a portal site map provider...who would have known).

For those simply too busy/lazy to give the article a read, I'll just say that in general getting data from the PortalSiteMapProvider is worth looking into (since it's cached) if the data isn't going to be changing that often.
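As a rough sketch (and only a sketch -- the list name and query below are made up, and this assumes a MOSS publishing site where the provider is available), the cached approach looks something like this:

using System.Web;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Publishing.Navigation;

public class CachedListQuery
{
    public static SiteMapNodeCollection GetItems(SPWeb web)
    {
        SPQuery query = new SPQuery();
        query.Query = "<Where><Eq><FieldRef Name='Name'/>" +
                      "<Value Type='Text'>Joe</Value></Eq></Where>";
        query.RowLimit = 100;

        //The provider serves results out of its object cache where it can,
        //which is why it does so well for data that doesn't change often.
        PortalSiteMapProvider provider = PortalSiteMapProvider.WebSiteMapProvider;
        PortalWebSiteMapNode webNode =
            provider.FindSiteMapNode(web.ServerRelativeUrl) as PortalWebSiteMapNode;

        return provider.GetCachedListItemsByQuery(webNode, "ListName", query, web);
    }
}

Keep in mind the results come from a cache, so they can lag slightly behind recent writes to the list; the white paper covers the caveats.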

Search and SPList with SPQuery were also strong performers when it came to getting data from lists with 100k+ items.

What's more important is that SPList with For/Each and SPList with DataTable (both below) just don't scale well for large lists.

//THIS CODE DOES NOT SCALE
SPSite site = new SPSite("http://someurl.com");
SPWeb web = site.OpenWeb();
SPList list = web.Lists["ListName"];
SPListItemCollection items = list.Items; //Pulls every item in the list

foreach (SPListItem item in items)
{
    //do something clever
}

//THIS CODE DOES NOT SCALE
SPSite site = new SPSite("http://someurl.com");
SPWeb web = site.OpenWeb();
SPList list = web.Lists["ListName"];
DataTable items = list.Items.GetDataTable(); //Also pulls every item in the list

//Doesn't scale even if you use a data view
foreach (DataRow row in items.Rows)
{
    //do something clever
}

You can probably see what the above snippets have in common: they both get EVERYTHING out of the list, which is a big no-no. What scales a lot better is retrieving just the subset you're after with an SPQuery, or other like-minded code.

//THIS CODE SCALES
SPSite site = new SPSite("http://someurl.com");
SPWeb web = site.OpenWeb();
SPList list = web.Lists["ListName"];
SPQuery query = new SPQuery();
query.Query =
@"<Where><Eq><FieldRef Name='Name'/><Value Type='Text'>
Joe</Value></Eq></Where>";
query.RowLimit = 100;
SPListItemCollection items = list.GetItems(query);

foreach (SPListItem item in items)
{
    //do something clever
}

Know The Limits

SharePoint will do a LOT of work for you, but there's a balance that needs to be observed. Knowing when the platform is at its limits is key to differentiating what the platform can and cannot do for a project.

Hopefully the above code will help you get a little more out of your lists. If you're more interested in general scalability (how many lists to a site, how many sites to a site collection etc...) there's a great TechNet article here on the topic.

Best,
Tyler

Wednesday, October 15, 2008

Stop ASP.NET From Serving Specific File Types

No Service Please

The great thing about Web Servers is that they will serve pretty much anything on disk. This can get interesting if there's content that you need your application to be able to get at but don't want served up. Sure, you could mess with file permissions or put these files in the App_Data directory (which doesn't serve any content). But if this problem happens to plague you in v1.1, or you're in v2.0 and don't want to refactor any code, then there's another option. You can simply tell your web application to no longer serve files of a certain type. What's even better is you don't have to write a single line of code.

Stopping ASP.NET From Serving File Types

The following example will stop PDFs from being served by ASP.NET.

The first step is to have ASP.NET handle requests for .PDFs. We do this with the following steps:

  1. Open up the Internet Information Services Manager (Start->Run->InetMgr).
  2. Right click on the web site you'd like to alter and click properties.
  3. Click on the Home Directory tab and then click Configuration.
  4. Find the .aspx extension and click Edit. Copy the path to the ISAPI .dll and click Cancel (it should be something like c:\windows\microsoft.net\framework\v2.0.50727\aspnet_isapi.dll).
  5. Click Add, paste in the path to the ASP.NET ISAPI .dll, put .PDF as the extension and choose All Verbs. Click OK. [Image: Adding the .PDF mapping to the ASP.NET ISAPI .DLL.]

IIS will now have ASP.NET handle all requests for .PDF files in this web site. We now use the ASP.NET System.Web.HttpForbiddenHandler to stop the serving of these files. Open up the web.config for your web application and insert the following code in the <httpHandlers> section.

<system.web>
<httpHandlers>
<add verb="*" path="*.pdf"
type="System.Web.HttpForbiddenHandler" />
</httpHandlers>
</system.web>

That's it! Now whenever someone tries to get at a .PDF for this web application an HttpException will be thrown with an error message of "This type of page is not served." You can now go about catching that error with a custom error handler and displaying a less user-hostile message (should you wish). Either way these file types are now protected.
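For example, a minimal Global.asax sketch could redirect to something friendlier. The page name here is my own, and I'm assuming the forbidden handler surfaces as an HTTP 403:

using System;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        //The HttpForbiddenHandler throws an HttpException ("This type of page is not served.")
        HttpException httpEx = Server.GetLastError().GetBaseException() as HttpException;

        if (httpEx != null && httpEx.GetHttpCode() == 403)
        {
            Server.ClearError();
            Response.Redirect("~/NotServed.aspx"); //hypothetical friendly page
        }
    }
}

You could just as easily lean on the <customErrors> section in web.config if you'd rather not write any code at all.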


Other Uses

This is actually how ASP.NET protects file types from being served (.cs, .csproj, .resx). You can see these HttpForbiddenHandler mappings in the root web.config (C:\WINDOWS\microsoft.net\Framework\v2.0.50727\CONFIG\web.config). Ideally it will save you from writing some code one day.

Hope that made sense.

Best,
Tyler

Saturday, October 11, 2008

Make Sure Your Application's Email Gets There

Return To Sender

So many applications send email these days it's near mind boggling. Unfortunately spam continues to grow and anti-spam tools continue to get more aggressive. This often leaves developers in the middle of an email jihad, realizing that getting email to clients consistently can be a complex task.

Intranet scenarios present a little less of a challenge since you may have access to the mail administrator for the domain and you can actually figure out what kind of rules are being applied to incoming mail. Internet scenarios however require a little more attention. Not only is it awkward to QA the many different types of mail accounts (and servers) you'll be sending to, but many of them will simply ignore your mail (as opposed to sending a bounce) when your mail looks suspicious.

Developer Education

What makes the matter worse is that precious few developers (myself included) get sat down at some point in their career and tutored in what a good email looks like. Knowledge of SPF, PTR, MX and A records, which are key to getting mail reliably to a destination, is usually compartmentalized to network, DNS and mail administrators. Either way, when the mail doesn't arrive, the burden often falls on the developer to figure out why.

Here are some tips, starting with what's easiest to implement, to help your email reliably reach its destination.

Use A Real Domain Name in the From Address

Don't make up domains when it comes to from addresses. A lot of email servers will check to see if the sending domain actually exists. If it doesn't then the mail will be categorized as spam and disregarded.

Odds are if you send email from someuser@somefakedomain.com it will get dropped. You can choose any value you want for someuser, just make sure the from address domain is real (that is, there is a valid A record for that domain).
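In .NET terms that just means being deliberate about the From address you hand to System.Net.Mail. The addresses and server name below are made up for illustration:

using System.Net.Mail;

class MailSketch
{
    static void Main()
    {
        //The domain in the from address should really exist (i.e. resolve in DNS).
        MailMessage message = new MailMessage(
            "noreply@yourrealdomain.com",     //from - a domain that actually exists
            "someone@example.com",            //to
            "Password reset",                 //subject
            "Hello from the application..."); //body

        new SmtpClient("mail.yourrealdomain.com").Send(message);
    }
}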

Ensure Your Domain Isn't Blacklisted

It's almost funny how often this comes up. Because mail servers often get hacked and then start spamming, blacklists are maintained by various 3rd party services to help stop the bleeding. These lists are often used by mail servers to help filter out email sent from questionable domains. It's easier than you'd think to end up on a blacklist; your domain might be blacklisted for any of the following reasons:

  • Your mail exchanger is an open relay (you can test this with 3rd party services).
  • A workstation in your domain has been infected with a virus and is sending out spam.
  • You're sending out valid email (newsletters, email blasts, etc...) which has been mistaken for spam by multiple recipients...and they've blacklisted you.

The good news with being blacklisted is you usually get a bounce from a recipient's mail server stating why they rejected your mail and what blacklist you're on. From there you usually end up contacting the company that maintains the list and engaging with them to remove you from the blacklist.

Have a PTR Record for Your Domain

You need to have a PTR record for your domain, and that PTR record needs to match the from address of the email message.

When mail servers get an email with a from address of someuser@somedomain.com, they'll often check to see if the IP address that sent the email is either:

  • An IP address to a domain that is dynamically assigned (most email coming from dynamic IP is viewed as spam and disregarded).
  • From a domain other than the one listed in the from address. IE. If your from address says the email is coming from somedomain.com, but the IP address that sent the email is in otherdomain.com then this looks a little suspicious and the email may be disregarded.

It's of note that neither of the above rules can be applied unless you have a PTR record for your domain, so that the mail server on the receiving end can do a reverse lookup. If your domain doesn't have a PTR record then a reverse lookup can't be done and your emails could be ignored. Comcast, for instance, disregards (at the time of writing) all mail from domains that don't have a valid PTR record. If you don't have one then your network admin staff can set one up.
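You can sanity check what a receiving mail server will see by doing the reverse lookup yourself; here's a quick sketch using System.Net.Dns (the IP address below is made up, swap in your sending server's address):

using System;
using System.Net;

class PtrCheck
{
    static void Main()
    {
        //Reverse (PTR) lookup on your sending IP - roughly what the receiving
        //mail server does before deciding whether to trust your message.
        IPHostEntry entry = Dns.GetHostEntry(IPAddress.Parse("203.0.113.25"));
        Console.WriteLine(entry.HostName); //ideally a host in your sending domain
    }
}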

Ensure Your SMTP Servers are Setting a FQDN in the Email Header

If you're not setting a fully qualified domain name in the email header then it may not be obvious where this mail is coming from. Half the battle in sending out email is doing as much leg work as possible so that the receiving mail server has no doubt the email is valid (or how to go about validating the email).

If you're using the IIS SMTP Server this is set under Delivery->Advanced. [Image: Setting a Fully Qualified Domain Name in SMTP Server.]

Have Valid MX and SPF Records for Your Domain

Many mail servers won't accept mail from domains that don't have an MX record; ensure your domain has one (domains that receive mail will already have this).

SPF records allow a domain to list the machines in the domain that are actually allowed to send mail. They are becoming increasingly important when it comes to getting your mail accepted.

Increasing Complexity

Like so many other systems developers work with today, mail is becoming increasingly complex. While it's difficult to know everything when it comes to the topic of mail and the internet, it helps to be well versed in the common problems you can run into. Unfortunately just knowing how to send email with the API often isn't enough these days.

Best,
Tyler

Wednesday, October 8, 2008

The Database Interview Question No One Gets

The Question

Every now and then I end up interviewing potential employees for database-related positions. To me, this includes ETL developers, DBAs, and web developers. I lump web developers in there because even though a deep knowledge of ASP.NET/JSPs/PHP/Rails is important for web work, it's likely that almost all of your application state is going to be persisted in some database. As a result, knowing how those database things work isn't just for DBAs anymore. Anyways, I digress; back to the question.

How do I know if a table is in 3NF?

Rationale

This question really speaks to what 3NF is and how to apply it to a table. Before we get to the answer, let me spend a sentence or two talking about why I think the above roles should know the answer.

When we talk about creating a schema we often talk about normalization. When I ask people why we normalize I sometimes get pretty foggy answers. People will say things like "to add clarity to the data model...", which is usually prologue to me asking them to be more specific. Enter the question above, which quickly seems to separate the wheat from the chaff.

I think it's fair to assume that if you declare that you know what normalization is...then you should be able to identify one of the most common normal forms and its rules.

Good developers and IT professionals are exceptional learners; they rarely stop with the short answer. If you've built more than a couple of schemas you should probably know what 3NF is and why you would (and would not) normalize to 3NF.

As you become proficient in a topic it's inevitable that you begin to learn some surrounding knowledge about the subject. Would you ever hire a C developer who claimed to be an expert, but had never heard of Dennis Ritchie? How about a Perl developer who had never heard of regular expressions? People who are experts with a tool know exactly what it's good for, and more importantly what it's bad at.

Answer

So how about it, is the following table in 3NF (answers at bottom)?

Monarch Name | Monarch Number | Royal House
Edward       | II             | Devonshire
Edward       | III            | Plantagenet
Richard      | V              | Plantagenet
Henry        | IV             | Lancaster

How about this one?

Name                 | Year | Winner         | Winner Date of Birth
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open       | 1999 | Bob Albertson  | 28 September 1968
Des Moines Masters   | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977

To figure out if a table is in 3NF let's look at an abridged version of the definition. 3NF is all about identifying functional dependencies in a table. Another way to say this is: for all rows in this table, this column implies the value of some other column.

Consider the following employee table.

Employee Name   | Pay Grade | Store No. | Pay
Chuck Taylor    | 5         | 66        | $7.90
Daryl Weep      | 6         | 75        | $8.50
Josh Wellington | 5         | 33        | $7.90
Horris Lowbrow  | 7         | 11        | $9.25

Notice how whenever the Pay Grade is 5, the Pay is $7.90. If this behavior repeats for all rows (and values) in the table then it is said to be a functional dependency. For the above table Pay Grade implies Pay.
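If it helps to see that in code, here's a rough sketch (my own, not a standard routine) that checks whether one DataTable column functionally determines another:

using System.Data;
using System.Linq;

static class FunctionalDependency
{
    //True if every distinct value of the determinant column maps to exactly one
    //value of the dependent column - i.e. determinant implies dependent for all rows.
    public static bool Implies(DataTable table, string determinant, string dependent)
    {
        return table.AsEnumerable()
                    .GroupBy(row => row[determinant])
                    .All(group => group.Select(row => row[dependent]).Distinct().Count() == 1);
    }
}

//For the employee table above, Implies(employees, "Pay Grade", "Pay") returns true.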

For a table to be in 3NF it can't have any functional dependencies where the column doing the implying isn't part of the primary key and isn't a superkey (a superkey is a potential key for the table).

The following table has many functional dependencies (StudentId -> all columns, SSN -> all columns), but in each case the determining column is either part of the primary key or a superkey. As a result it's still in 3NF.

StudentID | SSN       | Name  | Faculty
1         | 544718392 | Jerry | Engineering
3         | 651273928 | 75    | Law
9         | 341841892 | 33    | Medicine
15        | 483938928 | 11    | Engineering

I guess in a nutshell I'm looking for an interviewee to finally say:

[Tyler]: How do I tell if a table is in 3NF?

[Great Dev]: Well first we look for functional dependencies between table columns. If we find any column that implies another, isn't part of the primary key, and isn't a superkey...then the table isn't in 3NF.

If we wanted to make the table 3NF we'd move the functional dependency out to another table.

Perfect answer! You can tell this candidate is at least well versed when it comes to normalization and 3NF.

Answer Key

If you're mildly curious the first table IS in 3NF. There are no functional dependencies to be found.

The 2nd table is NOT in 3NF since:

  • The Winner column implies the Winner Date of Birth column.
  • The Winner column is not part of the primary key.
  • The Winner column is not part of a superkey.

If we wanted to rectify this we'd break it up into two tables.

Tournament           | Year | Winner
Indiana Invitational | 1998 | Al Fredrickson
Cleveland Open       | 1999 | Bob Albertson
Des Moines Masters   | 1999 | Al Fredrickson
Indiana Invitational | 1999 | Chip Masterson

Player         | Date of Birth
Chip Masterson | 14 March 1977
Al Fredrickson | 21 July 1975
Bob Albertson  | 28 September 1968

It's important to know that there are a tonne of times where you shouldn't normalize, but that's for another post. It's also good to know that there are very common breaches of 3NF that are widely accepted. Consider a User table with Address fields. Street Address always implies Zip Code, but address information is seldom broken out into its own table. That being said, knowing is half the battle.

Exodus

Maybe it's not a fair question (I'm sure I'll hear feedback either way), but I often get frustrated with Senior Engineers that have yet to get deep into their tools or topics. If you heard a term repeatedly that you didn't understand, how long would it take you to go explore it? I'm also not saying that people should run around with academic definitions memorized in their heads. All I'm getting at is you can't ever stop learning, and getting deep into the tools that immediately surround your day job is really just going after the low-hanging fruit. Either way I hope this post made 3NF a little simpler for someone out there.

Best,
Tyler

Friday, October 3, 2008

Database Isolation Levels: A Refresher

Back to Basics

It's exceedingly easy to forget almost everything you learned in school, even if you happened to pay attention at the time. Inevitably some of those same topics you may have covered in your youth will rear their head in your day job (especially if you took an IT-related degree/diploma).

The other day I got into a discussion about database concurrency anomalies and how setting isolation levels can mitigate concurrency issues. When pressed to speak to different kinds of concurrency issues, all I could remember off the top of my head was the "dirty read". This post is an effort to re-cement the topic into my mind (at least for a while).

Concurrency in Databases

Remember that the DBMS has the unfortunate responsibility of fielding requests for CRUD operations for many different clients at the same time. When multiple users access the same data at the same time, sooner or later someone is going to step on another's toes. When a shoving match ensues over who gets to read/write on the same piece of data, the DBMS looks to you the developer to tell it how to behave when a lock needs to occur.

Remember that the problems below don't necessarily affect just stored procedures. It's equally easy to have them affect you from Java or C#, whether you're using transactions or not. It all depends on the isolation level that you asked for (or that you got defaulted to). It's worth mentioning that in both SQL Server and Oracle the default isolation level is Read Committed. This is also true for most RDBMSs.

Dirty Reads (Uncommitted Dependency)

  1. Transaction1 begins modifying a record.
  2. Before Transaction1 commits the update, Transaction2 reads the change.
  3. Transaction1 rolls back the change.

In this scenario Transaction2 is said to have performed a dirty read. It has read data which was never committed to the data store. You usually don't need to worry about these in SQL Server or Oracle as the default isolation level for these DBMS' is Read Committed (below), which only reads data that has been committed to the data store.

Non Repeatable Read (Inconsistent Analysis)

  1. Transaction1 reads a record.
  2. Meanwhile Transaction2 updates the same record and commits its changes.
  3. Transaction1 reads the same record again (in the scope of the same transaction) but the value has changed!

In this scenario Transaction1 could have been running in Read Committed (which reads only records that have been committed to the data store), but still have values change underneath it in the midst of a transaction. To protect yourself against this type of anomaly you need to run in an isolation level of Repeatable Read (or higher).

Phantom Reads

  1. Transaction1 reads a set of records (or maybe aggregates a value off a set).
  2. Transaction2 inserts records into that set.
  3. Transaction1 reads that set of records again and notices that some records have magically appeared.

Phantom reads can happen even if you're running an isolation level of Repeatable Read. You can really only protect against Phantom Reads by running in an isolation level of Serializable (Oracle or SQL Server) or Snapshot (SQL Server).

Who Cares?

The conversation of "what isolation level are these running at?" usually only comes up when you get into trouble or you're planning for big scalability. Normally the default isolation level (Read Committed) offers enough protection to get you through the traditional OLTP application and most ETL jobs you'll end up writing. There are scenarios however when knowledge of these isolation levels can save the day.

Relaxing the isolation level on your transactions can yield a lot more performance for your application by allowing more users to concurrently access data. Running in an isolation level of Read Uncommitted is great for the case that you simply don't care if you end up with dirty reads.

However, if you need gestapo like isolation on a series of actions against the data store (building an important financial report where you need the data to remain consistent while you work with it), an isolation level like Serializable or Snapshot could be just what you're looking for.

Isolation Levels

It's important to note that isolation levels only affect the behavior surrounding reads in SQL Server (and most other DBMS'). A transaction always gets an exclusive lock on any data it modifies and holds that lock until the transaction completes, regardless of the isolation level set for that transaction.

Below is a table of isolation levels available in SQL Server/Oracle, the anomalies they protect against and the locks that they use to enforce the behavior.

Isolation Level (SQL Server) | Dirty Read | Non Repeatable Read | Phantom Read | Behavior
Read Uncommitted | Yes | Yes | Yes | No locks are acquired or honored when reading data. Highest degree of concurrency.
Read Committed | No | Yes | Yes | Application only reads values that have been committed to the data store.
Repeatable Read | No | No | Yes | Application puts locks on all records it reads during the life span of the transaction.
Serializable | No | No | No | Application puts locks on all records it reads and prohibits (locks) insertions/deletions of records into the set of records that it is operating on.

Using Isolation Levels

Below are some examples of how to use a specific isolation level in both T-SQL and C#.

You can specify isolation level in T-SQL either just before you start a transaction:

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
GO
BEGIN TRANSACTION
SELECT * FROM publishers
SELECT * FROM authors
...
COMMIT TRANSACTION

Or right after you name a table using an isolation level hint.

SELECT *
FROM Orders (nolock)
WHERE UserName = 'Peter'

You can also specify the isolation level in C# using the SqlConnection.BeginTransaction method.

SqlConnection connection = new SqlConnection(connectionString);
connection.Open();
SqlTransaction transaction = connection
    .BeginTransaction(IsolationLevel.ReadCommitted);
...
transaction.Commit();

Wrapping it Up

Like we mentioned earlier, using the defaults will work 90 percent of the time. However, being aware of how isolation levels fit into the puzzle and how they can help you just might be the piece of information that saves the day. I've seen web applications that frequently deadlocked become quite usable, and reporting apps that consistently got the "wrong" data get cured, all by spending a couple of hours looking at isolation levels.

Best,
Tyler

Wednesday, October 1, 2008

Subversion and NTLM with a Little Active Directory on Top

When Different Tools Fit in the Same Belt

I'm a big fan of Subversion when it comes to source control. It's feature rich, the price is the best kind (free as in freedom), and it has some great UI options which help developers learn/explore the repository without too steep a learning curve.

I currently work in a (mostly) .NET shop which means we have precious few BSD/Linux distros kicking around, and we also do pretty much all of our authentication via NTLM and Active Directory. Here's a story of some unlikely tools working unexpectedly well together.

Eclectic Authentication Scenario

We wanted to do the following:

  • Allow certain NT Users in given Active Directory groups to access the repository.
  • Allow certain outside (non company employees) users to access the repository. These users would not have an AD account and get in via a proxy pass.
  • Disallow all other users.

The diagram below tries to summarize the scenario. It's worth mentioning that I'm only going to go over the steps for setting up Subversion/Apache to work with NTLM for domain and local accounts. The whole Apache/ISA reverse proxy is a little big for me to get into in this post. That, and I'd probably end up messing it up anyway. Onward.

Diagram of NTLM authentication via the domain for internal users and non AD NTLM authentication for external users.

The Setup

This is a potential solution to the problem (a.k.a. what we did). I'm positive there are some Subversion experts out there that are deeper with the tool and could come up with a cleaner solution. That being said, I couldn't find any of their blogs.

It's worth mentioning that in this solution all the authentication needs to be done through Apache; Subversion itself is quite unaware of Active Directory and is unable to authenticate through NTLM. Apache on the other hand can handle NTLM quite well so long as you install the SSPI Authentication module.

OK. Let's get to it.

1. On some Win2k3 machine, download and install Apache/Subversion; a good walkthrough for this can be found here. This includes installing the SSPI Apache mod.

2. Backup and then change the <Location> element in your Apache httpd.conf file.

<Location /SVN>
DAV svn
SVNParentPath "C:/SVN"
AuthName "SVN Server"
AuthType SSPI
SSPIAuth On
SSPIAuthoritative On
SSPIDomain [LocalMachine]
SSPIOmitDomain Off #Require users to use [domain]\username
SSPIOfferBasic On #Required for TortoiseSVN to get at repository
#it doesn't support NTLM.
SSPIUsernameCase lower #Allows you to give username in lower case.
#SSLRequireSSL (Not a bad idea, if interested read this)
#We limit access by virtual directory instead of svn access file.
#AuthzSVNAccessFile "C:/SVN/svnaccess.conf"
<LimitExcept GET PROPFIND OPTIONS REPORT>
Require group [LocalMachine]\SVNAccess
</LimitExcept>
</Location>

3. Define a local group on the Subversion machine (SVNAccess in this case) that allows both domain groups and local users to access the repository. All internal users will get authorized because of their AD group membership; all external users will get authorized because of their local group membership. [Image: Local group on the Subversion/Apache machine that contains all the users/groups that have access to a particular virtual directory (the SVN root in this case).]

That's it you're done! Users will now be logging in with either their domain\username credential (if they have one) or a [machinename]\username if they're an external user.

Of Note

  • It's a darn good idea to have your reverse proxy use https and encrypt the channel, especially if your TortoiseSVN client is using basic authentication (read: usernames/passwords sent in clear text).

  • If you don't have all this internal/external user business then you can just omit step 3, set SSPIDomain to [DOMAINNAME], and add multiple Require group entries (one Require group line per AD group inside the <LimitExcept> block).

Especially these days, IT departments are being asked to do more with less. Tools like Apache and Subversion definitely allow that to happen even with constantly shrinking budgets. Hopefully your creativity and ingenuity continue to grow as your budget shrinks.

God speed,
Tyler Holmes