What Is a Database Concurrency Conflict?
Let’s start with a quick recap of what database concurrency conflicts are and why you need to solve them in the first place.
Most database applications are multi-user applications. This means that, at any given point in time, you can expect multiple people and/or processes to be reading from and writing to a database. Given that, it is only a matter of time before two of them try to update the same piece of data. Because a typical update cycle consists of:
- Read into memory
- Update in memory, and
- Write back to the database
there will be occasions where two users both read the same data into memory. Suppose user 1 updates the data and writes those changes back to the database before user 2 does the same thing. Now you have a concurrency conflict because user 2 read the data before user 1 wrote his changes back to the database. If user 2 writes his data back to the database, he will overwrite the changes made by user 1, causing them to be lost. Basically, whoever saves their changes last wins, overwriting the changes made by whoever saved first.
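The lost-update cycle above can be sketched in a few lines. This is an illustrative Python/SQLite sketch, not the article's .NET code, and the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debtor (id INTEGER PRIMARY KEY, phone TEXT, balance REAL)")
conn.execute("INSERT INTO debtor VALUES (1, '555-0100', 100.0)")

# Step 1: both users read the same row into memory.
user1 = dict(zip(("id", "phone", "balance"),
                 conn.execute("SELECT * FROM debtor WHERE id = 1").fetchone()))
user2 = dict(user1)

# Step 2: each user changes a *different* field in memory.
user1["phone"] = "555-0199"
user2["balance"] = 250.0

# Step 3: both write the whole row back; user 2 saves last and wins.
for user in (user1, user2):
    conn.execute("UPDATE debtor SET phone = ?, balance = ? WHERE id = ?",
                 (user["phone"], user["balance"], user["id"]))

phone, balance = conn.execute(
    "SELECT phone, balance FROM debtor WHERE id = 1").fetchone()
# User 1's phone change has been silently overwritten with the old value.
print(phone, balance)
```

Neither user did anything wrong; the full-row write-back is what silently discards user 1's change.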
This kind of database concurrency issue can occur with humans, automated processes, or a combination of the two. A conflict is more likely to occur with human users because their read/update/write cycle usually takes much longer. That said, the same issue can occur between automated processes, and there it is harder to solve: in the case of an update by a human you can ask what the user wants (do they want to overwrite changes made by another user?) and respond to that, while a process needs to have all actions fully automated.
The Current State
Let’s first take a look at what others generally say and do about solving database concurrency issues. Typically solving the problem is divided into two basic approaches:
- Pessimistic concurrency control
- Optimistic concurrency control
Below I will give a brief description of the two and the different options for handling them. My aim here is just to clarify the problem; I will in no way offer complete coverage of concurrency handling and all of its aspects.
Pessimistic Concurrency Control
Pessimistic concurrency control prevents a collision by requiring the user/process to take some action before the application can change the data. This action can be a number of things but typically involves locking the data in the database, thereby preventing another user from acquiring the same lock and changing the data.
- Simple to implement. Because database servers support and enforce locking mechanisms, they make it very easy to implement pessimistic concurrency control. Because a user must acquire the lock before making any changes, the database server can inform the user of a conflict before any change is made.
- Very secure. The database server implements locking reliably, so you as a developer can be assured that nothing will be able to ignore the lock and change the data anyway.
- Not very scalable. Locking data in a database requires an open connection to a database. This means every user must have at least one open connection to the database, which means more resources and licenses. When using older database servers, locking might also prevent other users from reading the data.
- Prone to deadlocks. Suppose two users both want to change two pieces of data, A and B. User 1 first locks A and user 2 first locks B. Now both users want to lock the second piece of data but cannot because another user has already locked the piece they want. You have a deadlock if both users decide to wait until the data is available.
- Locks can last a long time. If a user starts changing some data, he or she has a lock on the data until they save it. If the user is distracted somehow or goes to a meeting without saving his changes, the data remains locked and no one else can make any changes until the first changes have been committed.
You can also implement pessimistic locking by using soft locks instead of real database locks. This means that you update a field to indicate that a user is busy with the data and has it “locked.” This avoids the scalability problem because you don’t need to keep the connection open. However, this approach has several drawbacks: the database doesn’t enforce a true lock, so another piece of code can decide to ignore it, and your code must remove the lock manually. If it doesn’t, the data remains locked forever.
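A soft lock can be sketched as a lockable column, here a hypothetical `locked_by` field. The WHERE clause makes the check-and-set atomic, but as noted above, nothing stops code that simply ignores the column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, locked_by TEXT)")
conn.execute("INSERT INTO orders VALUES (42, 100.0, NULL)")

def acquire(user, order_id):
    # Only one user can turn NULL into their name; the filter makes the
    # test-and-set a single atomic statement.
    cur = conn.execute(
        "UPDATE orders SET locked_by = ? WHERE id = ? AND locked_by IS NULL",
        (user, order_id))
    return cur.rowcount == 1

def release(user, order_id):
    # The lock must be cleared explicitly; a forgotten release means the
    # row stays "locked" forever.
    conn.execute("UPDATE orders SET locked_by = NULL WHERE id = ? AND locked_by = ?",
                 (order_id, user))

got1 = acquire("alice", 42)   # the row was unlocked, so this succeeds
got2 = acquire("bob", 42)     # alice holds the soft lock, so this fails
release("alice", 42)
got3 = acquire("bob", 42)     # lock released, so bob succeeds now
print(got1, got2, got3)
```

No database connection needs to stay open between `acquire` and `release`, which is the scalability win; the price is that the "lock" is only a convention.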
Even though pessimistic locking has its place, it has a number of drawbacks and doesn’t combine very well with the disconnected nature of data in a .NET application.
Optimistic Concurrency Control
When using optimistic concurrency control, the application doesn’t lock the data to prevent concurrency issues; instead it detects and resolves conflicts when writing changes back to the database. Developers typically use one of several approaches with optimistic concurrency control:
- No checking and last one in wins. This is not really concurrency control: the server ignores the problem and the last user to update the data overwrites the changes made by the first user. The SQL UPDATE command has only the primary key as the filter in the SQL WHERE clause. This kind of concurrency control is only suitable for single-user applications.
- Comparing the fields the user changed. In this case, as part of the update, the application verifies that the fields the user wants to change still hold the values they held when the user read them. If the originally read data and the data in the database aren’t the same, the server doesn’t commit the user’s change and alerts the user. The SQL UPDATE command has the primary key and the changed fields with their old values in the SQL WHERE clause.
- Comparing all fields. Using this version of the optimistic concurrency approach, prior to committing an update, the application checks all fields instead of just the fields being changed. Even though this seems excessive, it is actually better than just checking the changed fields because of the nature of the standard ADO.NET classes. A “feature” of the ADO.NET data classes is that they execute a SQL UPDATE command with all fields, not just the fields that the user changed. This means that even though no other user has updated the specific field(s) that the user wants to change, another user may have changed a different column in the same row, and the full-row update would silently overwrite that change. In this case the SQL UPDATE command has a SQL WHERE clause containing all fields in the table.
- Comparing the row version. In this case, the data has an extra row version field, also known as a timestamp field, which the server changes every time it updates the data. Filtering is much simpler and faster than comparing all fields because the server only needs to filter on the primary key and the row version in the SQL UPDATE. This may appear to be the same behavior as comparing all fields, but exercise care: the two are not equivalent. If one user sends an update that writes all fields with their current values, the row version will still be updated. A second user who then wants to change something in the same row will get a conflict when the server checks the row version, but would not get a conflict when comparing all fields.
- Scalable. The application doesn’t need to keep a connection open to the database allowing this to scale to a large number of users.
- Simple to implement. Especially when using a row version, it is relatively easy to implement.
- Little risk of deadlocks. Because the database server keeps no locks preventing applications from doing their work there is also no risk of a deadlock. The only remaining risk is in the application logic itself.
- Not very secure. Because the database cannot enforce its use, the possibility remains that a rogue application decides to ignore the row version and update the data regardless.
- Single-row oriented. The approach treats each row as its own unit of work, something especially true when using a row version field.
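The row-version variant can be sketched as follows. SQL Server's rowversion column maintains itself; in this illustrative SQLite sketch the UPDATE bumps it by hand, and all names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debtor (id INTEGER PRIMARY KEY, "
             "credit_limit REAL, row_version INTEGER)")
conn.execute("INSERT INTO debtor VALUES (1, 500.0, 1)")

def save(debtor_id, new_limit, version_read):
    # The UPDATE succeeds only if nobody changed the row since it was
    # read; a successful write also bumps the version for the next reader.
    cur = conn.execute(
        "UPDATE debtor SET credit_limit = ?, row_version = row_version + 1 "
        "WHERE id = ? AND row_version = ?",
        (new_limit, debtor_id, version_read))
    return cur.rowcount == 1   # zero rows touched means a conflict

# Both users read version 1. The first save wins; the second is told
# about the conflict instead of silently losing data.
ok1 = save(1, 750.0, 1)
ok2 = save(1, 600.0, 1)
print(ok1, ok2)
```

Checking `rowcount` after the UPDATE is the whole detection mechanism: no locks are held, so the conflict only surfaces at write time.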
Back to the Real World
Other writers have described the problems with database concurrency in much more detail in numerous articles, but those articles share one common flaw: they view and discuss the whole problem from a database perspective instead of a user’s perspective. What do I mean? Well, every explanation and proposed solution to database concurrency that I’ve read focuses on conflict detection for a single whole row in a table. In the real world an update will often span multiple rows, tables, and sometimes even more than one database, and the way users see their data often doesn’t map to single rows being updated.
I’ll take the following examples from an accounting package but the principles apply to most, if not all, medium- to large-size applications.
A user sees data in this way:
- Simple reference data. This kind of data usually maps to a single row. For example, you might have a list of VAT codes with the appropriate data.
- Complex application data. This typically consists of data that spans more than one row or table. Think of an invoice or a purchase order as an example.
- Regular application data. This kind of data is often stored in a single row in the database but the user actually considers it as a set of related but semi-distinct sets of data with different owners. An example of this would be article data. While an application might store all data about an article in a single row, users might divide the data based on who “owns” that part of the data.
- Basic information. This might consist of a SKU and a description used by everyone.
- Sales information. The row might include a sales price, VAT information used when invoicing, and the number of units sold but not shipped. All this data is owned and updated by the sales department and read-only for others.
- Purchasing information. The row might include a preferred supplier, the maximum purchase price, and the number on order but not yet received. The purchasing department owns and maintains this data.
- Inventory. The row might include the number of items in stock. The warehousing department owns the inventory data and they will update the number of items to ship or receive as they process orders.
Simple Reference Data
For simple reference data, the standard way most applications handle data concurrency works very well because the set of data the user sees maps exactly to a row in the database. This means that this base is well covered and needs very little extra attention. Ironically, this is also the kind of data that doesn’t change very often, so the chance of a concurrency issue occurring is very low.
Complex Application Data
Many applications do not handle complex data very well at all. Consider the case where two users both open the same order, each user changes a different line in the order, and then they both try to save the order. Technically speaking there is no conflict because each user is changing a different row in the database. However, a user thinks about a single order and not about different rows in a database, so as far as he or she is concerned this is a multi-user conflict with two people changing the same piece of data. To some extent you can solve this problem procedurally in an organization, and some people might justly argue that the chance of the above scenario happening is very small.
While that may or may not be true, sooner or later the order will move from the sales department to another department, maybe for approval or shipping. At this point you’ve involved two very different parts of the organization; one updates the detail lines and another takes ownership of the order by updating the header record. Again, technically speaking you don’t have a conflict because different users will update two different tables, but in the real world this is a very real multi-user conflict.
To solve such a scenario you need to consider the separate rows in the database as a whole instead of as individual items. You could change the way the application stores an order in the database and consider storing the data in a single row. Using an XML field in SQL Server 2005, you could store all details as XML elements. While this would solve the concurrency issue, the approach also has a number of drawbacks, most notably the lack of relational integrity checks between the order details inside the XML and the rest of the database; these checks would have to be done in code because the database itself cannot perform them.
Another approach for solving the complex data concurrency issues would be to check a row version of the order header row when updating the detail rows. This would basically work as follows:
- Start a database transaction.
- Check the row version of the order header row.
- Update the order header row. You must do this, even if you don’t make an actual change to the order header, to update the row version.
- Update, delete or insert all order detail rows.
- Commit the transaction.
The problem here is that the database cannot really help you enforce this scheme: you must make sure that no process ever updates an order detail row without also updating the order header row version. To solve this problem you can update the order header row version using a trigger defined on the order detail table. This way you can always be sure that the header row version is updated, but it does mean that the in-memory order header row must be refreshed if any of the detail rows have been updated. Again this is not an ideal situation, but it is at the very least consistent and testable. You can test this by changing the same order twice, the second time directly after you’ve saved the first change. If you can save the second change, then the row version in memory is the same as the row version in the database.
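The header-row-version scheme above, including the trigger, can be sketched as follows. This is an illustrative SQLite sketch with a made-up schema, not the article's production code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_header (id INTEGER PRIMARY KEY, row_version INTEGER DEFAULT 1);
CREATE TABLE order_detail (id INTEGER PRIMARY KEY, order_id INTEGER, qty INTEGER);
-- The trigger guarantees no code path can touch a detail row without
-- bumping the header's version.
CREATE TRIGGER detail_bumps_header AFTER UPDATE ON order_detail
BEGIN
    UPDATE order_header SET row_version = row_version + 1
    WHERE id = NEW.order_id;
END;
INSERT INTO order_header (id) VALUES (7);
INSERT INTO order_detail VALUES (1, 7, 10);
""")

def save_detail(order_id, detail_id, qty, header_version_read):
    conn.execute("BEGIN")                     # step 1: start a transaction
    (current,) = conn.execute(
        "SELECT row_version FROM order_header WHERE id = ?",
        (order_id,)).fetchone()               # step 2: check the header version
    if current != header_version_read:
        conn.rollback()                       # someone else changed the order
        return False
    conn.execute("UPDATE order_detail SET qty = ? WHERE id = ?",
                 (qty, detail_id))            # step 3: trigger bumps the header
    conn.commit()                             # step 4: commit
    return True

ok1 = save_detail(7, 1, 12, 1)    # saves; the header version is bumped
ok2 = save_detail(7, 1, 15, 1)    # stale header version: conflict detected
print(ok1, ok2)
```

The second save fails precisely because the in-memory header version was not refreshed after the first save, which is the test described above.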
Regular Application Data
Most solutions don’t handle this case very well either. The main problem is that both the database and the .NET class for working with data are based around a complete row. Both the tools inside Visual Studio and classes like the SqlCommandBuilder are based around a SQL SELECT command and will generate a SQL UPDATE command for every field in the SQL SELECT. This is fine for simple reference data because a user typically updates the whole row but in the case of regular application data, that doesn’t make much sense.
Let’s take a look at a simple example to see the shortcomings. Suppose that you store debtor information in a database. This information contains the address, the current balance, and the credit limit for a debtor. Your application automatically updates the balance every time a debtor pays a bill or places a new order; the user cannot update the balance directly. Basically, the balance is a calculated field that exists purely for performance reasons. Now one person in the customer relations department changes the address of the client while someone in the finance department changes the debtor’s credit limit. Technically you have a concurrency conflict because both users want to update the same row at the same time. However, the customer relations department owns the address data while the finance department owns the credit limit, so procedurally there is no conflict. You cannot really solve the issue by giving each user only a subset of the data: the finance department needs to know where a debtor lives in order to assess their credit worthiness, but since they will never update the address, they only need read-only access to the address fields. The same is true the other way around; the customer relations department may need to be aware of the debtor’s credit limit, but no one from that department will ever update that information.
Even if both changes in this example are made by people who have permission to update all debtor information, you can hardly call this a concurrency conflict. Clearly a SQL UPDATE statement that simply includes all fields from the SQL SELECT isn’t going to cut it here. You could solve the problem by updating each field as a separate entity. That would avoid the problem described above, but it oversimplifies things: certain fields, such as the components of the address, logically belong together and should be updated as a group.
Solving the Problems
You really need the concept of read-only fields and a way to group the remaining fields into separate SQL UPDATE commands. To do this you need a CommandBuilder-like class that takes a data definition as input and produces a collection of SQL UPDATE command objects to update the data. For the debtor example, the SQL UPDATE command collection would contain one command for the address and a second for the credit info, presuming the current user has the rights to update both. The code then goes through the collection and populates each command with parameters containing the old and new values for each field the user wants to update. If all the old and new values for a specific command match, there is clearly no change in that part of the data and the command doesn’t need to execute.
Once the code has loaded all SQL UPDATE commands, it can execute those commands. In this case you do not want to wrap these updates in a single transaction. If all SQL UPDATE commands execute successfully, there was no conflict and the user can go on with their job. If a conflict occurs, it is limited to a specific subset of the data, and your application can update the other parts of the data without any delay. Because of the group-wise nature of the updates, you cannot use a SQL row version. Instead you need to include the complete list of fields being updated, together with the primary key, in the SQL WHERE clause of each command.
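The grouped-update idea can be sketched like this. The article's actual listings use ADO.NET; this is a language-neutral Python/SQLite sketch with hypothetical group names and columns modeled on the debtor example:

```python
import sqlite3

# Hypothetical field grouping: each group gets its own UPDATE command.
GROUPS = {
    "address": ["street", "city"],
    "credit":  ["credit_limit"],
}

def build_updates(table, key, old_row, new_row):
    """Return (sql, params) pairs, one per group that actually changed."""
    commands = []
    for columns in GROUPS.values():
        if all(old_row[c] == new_row[c] for c in columns):
            continue                     # no change in this group: skip it
        set_clause = ", ".join(f"{c} = ?" for c in columns)
        # The WHERE clause carries the old values of only the columns
        # this command touches, plus the primary key.
        where = " AND ".join(f"{c} = ?" for c in columns)
        sql = f"UPDATE {table} SET {set_clause} WHERE id = ? AND {where}"
        params = ([new_row[c] for c in columns] + [key]
                  + [old_row[c] for c in columns])
        commands.append((sql, params))
    return commands

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debtor (id INTEGER PRIMARY KEY, street TEXT, "
             "city TEXT, credit_limit REAL)")
conn.execute("INSERT INTO debtor VALUES (1, 'Main St 1', 'Utrecht', 500.0)")

old = {"street": "Main St 1", "city": "Utrecht", "credit_limit": 500.0}
new = dict(old, credit_limit=750.0)      # only the credit group changed

commands = build_updates("debtor", 1, old, new)
# Each command runs on its own; a rowcount of 0 flags a conflict limited
# to that group, leaving the other groups unaffected.
conflicts = [sql for sql, params in commands
             if conn.execute(sql, params).rowcount == 0]
print(len(commands), len(conflicts))
```

Only one UPDATE is generated here because the address group is unchanged, which is exactly why a conflict on the credit fields would never block an address change.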
Listing 1 demonstrates a simple example of how an update could work. In this example I use the ExtendedProperties column to store the field grouping information. Columns with the same group name are bundled together into a single SQL UPDATE command.
The main program uses a typed table adapter to load the titles table from the Pubs database. All remaining functions are part of the CommandBuilder class created in the main function.
In the ConfigureDataTable() function (Listing 2) each column receives an UpdateGroup extended property. You’ll use this during the update to determine which fields are grouped together. This information isn’t determined at update time because you might use the same table for multiple purposes in different business objects with different field groupings. Additionally, some fields might be read-only, something that my sample code does not take into account.
The UpdateTable() function (Listing 3) first retrieves a collection of SQL UPDATE commands. Note that for this example I have skipped the SQL INSERT and SQL DELETE commands as you handle these in the same way as normal.
In Listing 4 you can see the GetUpdateCommands() function loops through all the field groups and creates a different SQL UPDATE command for each one. All separate commands are bundled together into a collection and returned.
The SplitColumnIntoGroups() function (Listing 5) takes all the columns in a table and splits them into separate update groups. This would be a good point to exclude read-only columns and possibly the primary key columns as these are typically not updatable.
In Listing 6, the CreateUpdateCommand() function creates a single SQL UPDATE command for each group of fields. The SQL WHERE clause consists of the primary key of the row as well as the fields that need to be updated. The CreateUpdateCommand() function compares each field against both its old value and its new value, so that the same change made by two users isn’t flagged as a conflict; it doesn’t matter that a field is overwritten with the same value.
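The old-or-new comparison can be sketched as a single WHERE predicate. This is an illustrative Python/SQLite sketch with made-up names, not Listing 6 itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title (id INTEGER PRIMARY KEY, price REAL)")
conn.execute("INSERT INTO title VALUES (1, 19.99)")

def save_price(title_id, old_price, new_price):
    # The row still matches when another user already wrote the *same*
    # value, so two users making the identical change don't collide.
    cur = conn.execute(
        "UPDATE title SET price = ? WHERE id = ? AND (price = ? OR price = ?)",
        (new_price, title_id, old_price, new_price))
    return cur.rowcount == 1

ok1 = save_price(1, 19.99, 24.99)   # normal update succeeds
ok2 = save_price(1, 19.99, 24.99)   # the identical change again: no conflict
ok3 = save_price(1, 19.99, 29.99)   # a *different* stale change is a conflict
print(ok1, ok2, ok3)
```

Comparing only against the old value would have reported the second save as a conflict even though nothing meaningful was lost.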
While my technique isn’t a complete solution for all update concurrency issues, I believe it is a step in the right direction. It is a work in progress and I feel that others will discover the best way to handle the separate cases. In the meantime, I hope that my solution helps you create applications that are friendlier and less technology-oriented in their nature.