| By Mike Nicewarner | Article Rating: |
|
| February 1, 2005 12:00 AM EST | Reads: |
1,138 |
The purpose of this article is to discuss and describe the terms "Surrogate Key" and "Natural Key," and explain when they are appropriate to use in a data model. Furthermore, the concepts of "Intelligence" and "Generated" will be applied to both types of keys.
The goal is for the reader to understand the plusses and minuses of all forms of identifying records uniquely in a database system. The reader is also challenged to think critically about the business situation before adopting any given identification methodology.
Terms Defined
Surrogate Key
A surrogate key is a column or columns that are not defined by business requirements, but are added to a table simply to identify records uniquely. For example, a business might define a set of related attributes that it wants to keep track of. In that list there may be a number of items that could be considered unique (candidate keys), but none appear to be stable enough to be the primary identifier for the set of attributes. The data analyst might suggest an additional attribute in the form of an Entity Name plus the text "Identifier." For instance, for an entity called "Survey," this would be "Survey Identifier." There is no business purpose or meaning for this new attribute, and therefore the resulting table column would be more resistant to business changes.
Natural Key
A natural key is a column or columns that a business chooses to uniquely identify records. The data analyst allows them because they are considered stable enough to be the primary identifier. For instance, an existing business process might define a Product Code that uniquely identifies equipment in a company. By selecting this as the equipment's primary key, the business accepts the resulting database structures more readily.
Intelligent Key
An intelligent key derives its name from the intelligence behind the values assigned to the columns in the key. For instance, a Product Code may be a 10-character field (defined as CHAR(10) in the DBMS). This can be mapped as follows: the first two characters are the business unit, the next two indicate the month of manufacture, the next three is a product-type code, the last three is a sequence number. This specific case could also be called a concatenated key since it's actually a combination of four separate fields in one single field.
Generated Key
A generated key has no intelligence to the contents, and is typically assigned a value by the computer through some internal mechanism like a sequential or random number generator or some other method. The point is that the business doesn't care about the value assigned to any given key, just that a unique value is assigned that can be referred to by dependent tables. For example, an Invoice Number might be defined as a 10-digit number, and assigned values sequentially, beginning at some starting point and incrementing by one for each new record added to the Invoice table.
The Keys in Use
In most databases, relationships between tables are handled in two primary ways, and both can be found. First is the single-column key. In this method, each table is assigned a single primary key column, all foreign keys are also single-column, and all referential integrity is on pairs of columns. The second method has the child table inherit the primary key column(s) from the parent as part of its own primary key because it's a dependent of the parent table. In this case, the key column of the child table is not sufficient to identify the records uniquely, so it needs both the parent primary key and the child table primary key.
Single-column primary keys are well suited to surrogate keys, since the surrogate key has no business meaning and is stable by itself. They are typically system-generated. Child tables also have a single unique primary key column. The reference from the child to the parent is on the single parent column and carried as a mandatory foreign key column in the child. A child of the child would also have its own unique primary key column, and would have a single-column foreign key to the child table.
Dependent table configurations are suitable for natural keys. The parent table has whatever columns are chosen to identify the record uniquely as its primary key. The child table's primary key is composed of the parent primary key plus a column from the child that makes each record unique. Since natural keys are defined by the business, there's a good chance that at least one of them will be an intelligent key.
It's extremely rare to see just single-column primary keys in a database, or for all tables to depend on their parents for parts of their primary keys. Rather, some deference is usually given to a judicious mix of single- and multi-column primary keys. For instance, associative tables typically have their primary key defined as the combination of the primary keys of the parent tables, which may themselves be single-column primary key tables.
Strengths & Weaknesses
A design that favors single-column surrogate primary keys will have very efficient joins between tables, since most of them will be single-column joins. Tables will also be narrower than their natural key cousins, meaning less deadweight. However, getting from a deeply nested great-great-great-grandchild table to the ultimate parent table may take some serious navigation. Read that as complex queries. Relational database engines are optimized for these kinds of queries, and the overall reduction in table deadweight tends to offset the complexity of the queries.
The natural key structure is, simply enough, more natural for the developer and the super-user accessing the tables through their favorite query tool to deal with. Child tables make sense intuitively, since the primary key columns can be used to discover the lineage.
Unfortunately, this comes at a fairly steep price. If any key is intelligent, it's going to be almost everywhere, and any business change will be devastating.
Take the case of a data-type change because the Product Code described above has maxed out the three-digit sequence code. The business started using letters after it passed 999, and now even that's not enough. It wants to make the last part of the Product Code five digits, increasing the Product Code from CHAR(10) to CHAR(12). Every table that uses that Product Code as part of the primary key will have to be structurally changed. Many databases won't let you make that change casually, and typically require that the data be unloaded, the table dropped and rebuilt, then the data reloaded. In large database systems, this could be a monumental undertaking.
Best Practices
So, what's the best thing to do? Well, that depends, of course. There is no absolute mandate that says there's only one way to manage keys. However, to minimize the impact of future business changes, there are a few rules of thumb with a good track record.
- Make every attempt in the initial design to find an absolutely stable primary identifier. Press the business to be honest and say upfront whether there's any possibility that the natural key they recognize will ever change in value or data-type. Be ready with arguments to support using surrogate keys. They are inherently stable, perform well in joins, have referential integrity, and provide simple joins from table to table.
- Establish a maximum limit for nested dependent tables. One published guide claims that five levels of dependent tables is a red flag. Re-evaluate the tables and consider introducing surrogate keys somewhere along the line.
- If using single-column primary keys, remember that deeply nested tables may need a direct foreign key link to a great-grandparent table. It can provide a shortcut up the chain if the application and data access tendencies require it. NOTE: This introduces problems, and violates at least second normal form. However some denormalization can be tolerated because database engines aren't perfect, and complex joins might perform poorly.
- Don't automatically assume that multiple natural keys are bad. Many situations such as Invoice and Invoice Detail and Order and Order Detail tables are obvious places to use the business-understood Line Number as the Detail table primary key, combined with the parent table primary key. Trying to force a surrogate key on the Detail tables would be cumbersome and a relatively poor performer.
- As already stated, associative tables typically use the combined keys of the parent tables as the primary key. However, if the associative table is itself a parent of a lot of other tables, be ready to come up with a single-column primary key for it.
Your experience may vary. If you disagree, great. Please send your feedback on what I've said to the email address below. I'm always willing to learn and, heck, I might have left out some important details. My hat's off to every DBA who has to maintain a database he inherited with the kind of Product Code-from-hell I described. I actually went through that ordeal in one of the companies I worked for. My sympathy to anyone who has had to make massive changes throughout a DB2 subsystem because of an intelligent key that was anything but.
This article has been reprinted with permission from www.datamodel.org.
Published February 1, 2005 Reads 1,138
Copyright © 2005 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Mike Nicewarner
Mike Nicewarner, a database design manager working in Nebraska and president of the DataModel.Org organization, has been designing databases and modeling enterprise data with PowerDesigner since it was SDP and S-Designor.
- Why SOA Needs Cloud Computing - Part 1
- Cloud Expo and The End of Tech Recession
- The Transition to Cloud Computing: What Does It Mean For You?
- A Rules Engine Built in PowerBuilder
- Sybase Named “Silver Sponsor” of iPhone Developer Summit
- How PowerBuilder Got Its Groove Back
- The Cloud Has Cross-Border Ambitions
- Ulitzer Names The World's 30 Most Influential Virtualization Bloggers
- Ulitzer Named "New Media" Partner of Greatly Anticipated iStrategy Event in Berlin
- Risks and Enterprise Mobility?
- Steps for Success in Enterprise Mobility?
- Are Mobile Luddites Resisting Mobility?
- The Difference Between Web Hosting and Cloud Computing
- Sybase CTO to Speak at 4th International Cloud Computing Expo
- Why SOA Needs Cloud Computing - Part 1
- Cloud Expo and The End of Tech Recession
- The Transition to Cloud Computing: What Does It Mean For You?
- Five Reasons to Choose a Private Cloud
- Seeding The Cloud: The Future of Data Management
- The Threat Behind the Firewall
- Economy Drives Adoption of Virtual Lab Technology
- Tips for Efficient PaaS Application Design
- A Rules Engine Built in PowerBuilder
- Sybase Named “Silver Sponsor” of iPhone Developer Summit
- Where Are RIA Technologies Headed in 2008?
- PowerBuilder History - How Did It Evolve?
- The Top 250 Players in the Cloud Computing Ecosystem
- Custom Common Dialogs Using SetWindowsHookEx
- DDDW Tips and Tricks
- OLE - Extending the Capabilities of PowerBuilder
- DataWindow.NET How To: Data Entry Form
- Book Excerpt: Sybase Adaptive Server Anywhere
- Sybase ASE 12.5 Performance and Tuning
- Working with SOA & Web Services in PowerBuilder
- Office 2003 Toolbar: A New Look For Your Old PowerBuilder App
- Dynamically Creating DataWindow Objects
































