Beyond SELECT — Part 1: Constraints

You’ve just built your first database application. You’re proud of your accomplishments — and you should be. You mastered the fundamentals of SQL: creating a table, putting data into it, and querying it. You took the basic approach: when it came time to sum a column of numbers, you wrote a FOR loop. Everything works, but shouldn’t your database be doing more for you?

Many people are in the position just described. When time came to use a database, the first web hit for “SQL Tutorial” became the bible. SELECT, INSERT, and DELETE — the bread and butter of data manipulation. The SQL database has satisfied the need for data storage. What else can it do?

It can do a lot more. From aggregate functions to stored procedures, this article will help your database sing and dance. (Note: the author is not responsible for damage, physical, mental or emotional, caused by database servers and software singing and dancing)

This article is part 1 of a multi-part article, Beyond SELECT. This introduction to more than basic SQL is written with multiple databases in mind. Feature availability is documented for PostgreSQL, and MySQL. If you have information on the use of features here with other databases, please leave a comment and I will be glad to update this article. This is part 1 of a multi-part article. Links to the next parts will be added at the end as they are published. I am eager to do similar writing on a freelance basis for any publications. If you are interested, please leave a comment.

Constraints — Make a smarter database

Keys, and Primary Keys

A key is a value that uniquely identifies a row in a table. (PostgreSQL, MySQL 5) A table can have many keys, and most tables have at least one. One key is called the primary key, and is intended to be the method of identifying a row in a table.

Most new SQL users are familiar with the concept of a primary key in the form of an auto-incrementing integer column. In MySQL, this type of column is called AUTO_INCREMENT. In PostgreSQL, the SERIAL identifier is the equivilant. These data types are easy to use and understand, and make excellent primary keys in many cases.

In cases where integer primary keys are used, undesirable data duplication can be posible. For example, let’s create a simple table describing an employee, and recording his name and social security number:

CREATE TABLE employee (
    employee_id SERIAL,
    name VARCHAR(200),
    ssn VARCHAR(11)
);

With this data definition, a simple data entry mistake could create two employee rows representing the same employee. Maybe the HR director has big thick fingers, and he hit the enter button twice when creating a record. You could put application logic in place to prevent two people with the same social security numbers from being created, but your database can do that for you. Let’s drop the employee_id row, and use the SSN as the table’s primary key instead:

CREATE TABLE employee (
    ssn VARCHAR(11)
        PRIMARY KEY,
    name VARCHAR(200),
);

By taking this simple step, we’ve reduced the size of our employee table and also made it more “error-proof”. If you attempt to enter two employees with the same SSN, the database will refuse and give you an error.

Not every table will have a single field primary key. For example, a posting to an online forum can’t be keyed off the time, since multiple people could post at the same time. Sometimes it is easier to use an integer primary key, and sometimes it is a good idea to use a composite primary key. A composite primary key takes two columns, and ensures that no other row in the table exists with those same two columns. It uniquely identifies a row based upon multiple values.

Our company has just gone multinational. It seems unlikely, but somehow John Q in Canada has the same social insurance number as Jane B in Tennessee. Now we need to alter our database to allow both users in, but we still don’t want to go to an integer primary key. We can use a composite primary key on the country of employment AND the social (insurance | security) number of the employee:

CREATE TABLE employee (
    ssn VARCHAR(11),
    country VARCHAR(2),
    name VARCHAR(200),
    PRIMARY KEY (country, ssn)
);

Now (‘US’, ’223-0423-85′, ‘Jane B’) is a distinct row from (‘CA’, ’223-0423-85′, ‘John Q’), and both can be entered into the database. Of course, this is a contrived example since Canadian SIN numbers and US SSN numbers have different formats, but that doesn’t really matter.

What is the difference between a primary key and a key? Is there such a thing as a secondary key? A tertiary key? As it happens, there’s nothing much “primary” about a primary key. More than anything else, the “primary” is documentation, letting people know how this table was intended to be used. You can create any number of keys on your table by creating unique indicies.

For example, let’s say that you decided not to use a composite primary key for the employee table. Many other tables reference employee, so it is easiest to just use the original employee_id that we started with. However, we still don’t want multiple employees to have the same social security number, so we create a unique index on that column:

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200),
    ssn VARCHAR(11)
);
CREATE UNIQUE INDEX employee_ssn_key ON employee (ssn);

Now only one row can have any given social security number, and only one row can have any given employee_id. This table has two equally valid keys. Keys should be created on any unique data, and ideally the primary key should not be a manufactured arbitrary number. However, you will find that integer primary keys can reduce the size and complexity of your database once you start using many foreign keys.

Foreign Key Constraints

A foreign key occurs when a row in one table references a row in another table. (PostgreSQL, MySQL 5) This is the relational part of a relational database system, and it is very common. Most people learn how to make two tables reference each other, but a surprising number don’t know that the database itself can help enforce that. Let’s create a little example:

CREATE TABLE manager (
    manager_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200),
    evil BOOLEAN
);
CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200),
    manager_id INTEGER
);

In this scenario, each employee has a manager, which can be looked up based upon their manager_id. It’s actually a pretty bad design, for a couple different reasons. Most importantly, I could enter an employee with a manager_id that doesn’t exist. That could be intentional, or it could be a giant mistake. We’re going to introduce a foreign key constraint that will make sure all employees have a manager that exists:

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200),
    manager_id INTEGER
        REFERENCES manager (manager_id)
);

Suddenly it is impossible to enter a manager_id that does not exist. The database is doing the hard work of checking every input manager_id for us, and all it took was a couple of words! But we actually didn’t quite accomplish what we want. The manager_id can still have a NULL value entered into it. Is there some kind of constraint that can fix that?

NULL / NOT NULL Constraints

The NOT NULL constraint is quite possibly the simplest we’re going to take a look at. (PostgreSQL, MySQL 5) Let’s throw the words NOT NULL into a table, and see what affect that has:

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200)
        NOT NULL,
    manager_id INTEGER
        REFERENCES manager (manager_id)
        NOT NULL
);

Adding NOT NULL has made it so that the name and manager_id field must be provided. The employee_id field is already NOT NULL, because it is a database primary key. Now we have employees that must have names, and must have managers, and their managers must exist in the manager table.

There is a constraint (or, really, a lack of constraint) that is opposite of NOT NULL. Writing "NULL" in after a field allows a value to be NULL. This is actually the default for all columns, and it does not have to be explicitly stated. Personally, I like writing NULL for every field that can be NULL — it’s a reminder to myself, when I’m looking at the schema in the future.

Don’t you think it’s about time our employees got paid?

CHECK Constraints

A check constraint enforces a defined rule on a table or column. (PostgreSQL, unsupported in MySQL 5?) It is another tool that helps you design databases that only take logical and sensical data. Let’s create a table of employees and how much they get paid:

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    monthly_salary NUMERIC
        NOT NULL
);

This table works great, until an employee complains about the payroll system deducting money from his bank account. A quick look shows that the HR director and his fat fingers are the cause once again — he entered $-640 into the payroll application! He must have been aiming for the 9 key, you figure. Let’s put a check constraint onto that column to prevent this from ever happening again. While we’re at it, let’s limit the monthly salary to values under $15,000. You can always come back and change it later if necessary, but for now it will prevent another data entry typo.

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    monthly_salary NUMERIC
        NOT NULL
        CHECK (monthly_salary > 0)
        CHECK (monthly_salary < 15000)
);

Now we cannot enter salary values outside of (0 ... 15000) per month. You might need to rebuild this table and increase the salary limit, once your boss finds out how you prevented a terrible payroll mistake. ... ha ha. ha.

Be Sociable, Share!

8 Comments on Beyond SELECT — Part 1: Constraints

  1. Ian Sparks
    2006/08/11 at 1:20 pm (8 years ago)

    Nice tutorial. I think it’s important to stress that constraints are part of your business logic and you should be careful not to add them prematurely. They can also be a pain during data migrations or changes to the schema because they will require you to enter a row in tableA before you can add any rows to tableB and that can be very inconvenient and limit your options.

    If you’re building a database for an application and you know what the business rules are go ahead and put in your constraints. If you’re still feeling your way and the busines rules are not really known yet, stay loose – you can always add them in later once the requirements are firm.

  2. Ryan Elisei
    2006/08/11 at 1:52 pm (8 years ago)

    This is a great tutorial, especially for PostgreSQL.

    I think it’s a good idea to also introduce the concept that data validation often happens at the application level. Perhaps even discuss the pro’s and con’s.

    Very nice work.

  3. Jason Lai
    2006/08/11 at 3:16 pm (8 years ago)

    Pretty good introduction. I’m looking forward to the next installment.

  4. Reinier Zwitserloot
    2006/08/11 at 5:42 pm (8 years ago)

    I don’t particularly like this tutorial. First you do away with a simple integer counter as primary key, and then later you convolute the primary key into a combination of SSN and country. That’s -really- annoying: Let’s say I want to pass a reference to a person around in my application. Whatever way you turn this, the easiest thing you can possibly do here is to just pass the tuple (tableName, integer) which uniquely identifies -ANY- record so long as all records always have a primary key that’s an integer, and that has a unique field name (Let’s say, they are always called ‘unid’). Now if this new bit of code actually needs the data, a simple select will get it, quickly, because the primary key is guaranteed to be indexed (and ints index quickest and with the smallest overhead, computers being what they are).

    That’s simply not a useful abstraction – you’re gaining a tiny amount of speed and size (but in some cases actually you lose some, ie with a composite primary key) but you are making development a lot more difficult. You can always add a different type of constraint (like a second key on the SSN/country pair to ensure they remain unique) or you can build this constraint outside of the dbase server. Whatever happends to work nicest.

    Moving on, stuff like ‘CHECK’ doesn’t work in all database engines and where it does, notation tends to be different. The nice thing about just making a simple SELECT and doing whatever extra checks you need outside the dbase schema is that you end up with very simple ‘portable’ SQL. For high load huge schemas at some point you will just have to start designing to the strengths of the specific dbase app you are using, or commit to adding a LOT of metadata for some big iron middle layer like Java Hibernate or some such. Once you elect to go middleware these tutorials become effectively pointless.

    Google for ‘database normal form’ and learn something useful. Alternatively, talk about basic dbase calculations in the form of SUM and JOIN – both of those, once you grok them, can help really speed up queries significantly, something that using the natural key as primary key never will. Also talk about how you can add indices to speed up selects on those specific fields, and perhaps hint at using the database local variant of TOP/LIMIT, which restricts output to a limited number of fields, just in case you actually just put in a query for an entire (huge) dbase table, freezing up the system for quite a hwile. (unfortunately the exact keyword tends to differ between dbase engines. Extra points for oracle for being the dumbest implementation of that ever. It’s something like WHERE columnid 100 and columnid

  5. Mathieu Fenniak
    2006/08/14 at 8:40 am (8 years ago)

    Reinier – Thank you for your comments.

    Regarding keys – I have stated multiple times in this article that integer primary keys are a useful tool and that I would recommend them in a variety of scenarios. In fact, I even showed the use of a secondary key by using an integer primary key and a unique constraint on the SSN field, just as you suggested.

    Regarding CHECK – I am aware that it is not supported in all database engines. As far as I know, it is not supported in MySQL, as already stated in this article. I believe the place for this kind of verification is in the database, in the middleware if applicable, and in the client application. Data integrity is more important to me than ‘portable’ SQL.

    I do not appreciate your comment, “Google for ‘database normal form’ and learn something useful”. I do know many useful things. I know how to tie my shoes. I know how to purchase velcro shoes to avoid tying my shoes. I know how to buy sandles to avoid using velcro. I know not to wear socks with sandles (my wife taught me). And I know what database normal forms are.

    And finally, notice that this is part 1 of a series of articles. Part 1 is entitled “Constraints”. I intend to write a great deal more that includes many of the topics you have suggested. I do not intend to write about database normalization — it does not fit in with this article’s topic of just-past-basic SQL.

    And yes, Oracle certainly does have the dumbest implementation (or lack-thereof) of LIMIT/OFFSET — at least we agree on that. ;)

  6. Peteris Krumins
    2006/08/19 at 12:53 am (8 years ago)

    I wear socks with sandles and I have no idea why it is considered geeky.
    It is much more comfortable with socks than without.

  7. brandon mcginty
    2006/12/24 at 10:14 am (7 years ago)

    Wonderful job.
    Have been looking for this kind of tutorial for quite a long time.
    Thanks Much!

  8. omul
    2007/08/27 at 2:48 am (7 years ago)

    Since I finally understood the constraint basics, I find it very useful. Thanks!