One of the projects I work on involves processing large datasets and saving them into SQL Server databases. Recently, the team added Google Analytics data to the download process and we found ourselves faced with the prospect of loading hundreds of thousands of records daily. Initially, we chose to process this data using a rather naive process. We loaded the records into the database one at a time using a SQL Server stored procedure (one of our business' requirements is to use stored procedures). It didn't take long to realize that this wasn't a practical way to load large datasets in a reasonable amount of time. The question: How do we load large datasets into SQL Server from our application's C# code?

Zip Code Loader Problem

For this article, you'll be solving a common business problem: creating a data table of zip codes and related metadata. In this example, you'll create a C# list of zip code records from a CSV (Comma Separated Value) file downloaded from here: http://simplemaps.com/resources/us-cities-data.

This file contains the city, state, longitude, and latitude properties for approximately 27,000 zip codes. This code snippet is a representation of the data contained in the CSV zip code file.

zip,state,city,lat,lng
35004,AL,Acmar,33.584132,-86.51557
35005,AL,Adamsville,33.588437,-86.959727
35006,AL,Adger,33.434277,-87.167455
35007,AL,Keystone,33.236868,-86.812861
35010,AL,New Site,32.941445,-85.951086

The next step is to create a C# class that represents this structure. The next snippet is a C# class definition that matches the structure of the CSV file you'll be converting.

public class ZipCodeRecord
{
    public string ZipCode { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public decimal Longitude { get; set; }
    public decimal Latitude { get; set; }
}

Loading the CSV File

Strangely enough, the .NET Framework doesn't have a native library for processing CSV files. To perform this operation, use an open source application called CSVHelper. The source code for CSVHelper can be found at this location on Github: https://joshclose.github.io/CsvHelper/.

Installing CSV Helper is simple. The developers of that library have created a NuGet package to assist in its installation. To install the CSVHelper package perform the following steps:

  1. From Visual Studio select TOOLS > NuGet Package Manager > Package Manager Console
  2. Type: Install-Package CsvHelper

This installs the CsvHelper library into your project and adds the proper references. Once the library is installed, you can use the code in Listing 1 to open the CSV file and create a list of ZipCodeRecord objects from its contents.

Listing 1: Code used to parse a CSV file into a list of objects

private List<ZipCodeRecord> LoadZipCodeFile()
{
    var path = Path.GetDirectoryName(
        Assembly.GetExecutingAssembly().Location);

    var retval = new List<ZipCodeRecord>();

    var reader = File.OpenText("cities.csv");
    var csv = new CsvReader(reader);

    while (csv.Read())
    {
        var record = new ZipCodeRecord();
        record.ZipCode = csv.GetField(0);
        record.State = csv.GetField(1);
        record.City = csv.GetField(2);
        record.Longitude = csv.GetField<decimal>(3);
        record.Latitude = csv.GetField<decimal>(4);
        retval.Add(record);
    }

    return retval;
}

Naive Data Loading

Once you've successfully parsed the CSV file, it's time to store it in a database table. The first step to storing the data in SQL Server is to create a table and the corresponding stored procedure for inserting the data. Listing 2 is the T-SQL code for creating a new database, table, and stored procedure.

Listing 2: SQL Server Zip Code Database and Table Structure

Create Database CodeMagazineZipCode
Go

Use CodeMagazineZipCode
GO


Create Table ZipCodes
    (Id int primary key identity,
    ZipCode varchar(50),
    State varchar(50),
    City varchar(300),
    Longitude decimal(14,6),
    Latitude decimal(14,6))
GO

Create Procedure CodeMag_Add_ZipCode_Naive
@ZipCode varchar(50),
@State varchar(50),
@City varchar(300),
@Longitude decimal(14,6),
@Latitude decimal(14,6)
as

Insert Into ZipCodes (ZipCode,State,City,Longitude,Latitude)
    Values
(@ZipCode, @State, @City, @Longitude,@Latitude)

GO

Once you've created your database, table, and corresponding stored procedure, it's time to write some C# code to insert the zip code information into the database. Listing 3 opens a SQL connection to your database, iterates through the list of zip code data, and calls a stored procedure to insert the actual data. The zip code table has approximately 27,000 records in it and writing the records one at a time takes approximately 1 minute 30 seconds to write the data. In most worlds, this is an unacceptably long time to write a large batch of records.

Listing 3: Naive C# implementation of writing zip code records one at a time.

private void LoadRecordsNaive()
{
    var recsToLoad = LoadZipCodeFile();

    var connString = "Server=(local);
    Database=CodeMagazineZipCode;
    Trusted_Connection=True;";
    using (var conn = new SqlConnection(connString))
    {
        conn.Open();
        var cmd = new SqlCommand();
        cmd.CommandText = "CodeMag_Add_ZipCode_Naive";
        cmd.CommandType=CommandType.StoredProcedure;
        cmd.Connection = conn;

        foreach (var zipCodeRecord in recsToLoad)
        {
            cmd.Parameters.Clear();
            cmd.Parameters.Add("@ZipCode", zipCodeRecord.ZipCode);
            cmd.Parameters.Add("@State", zipCodeRecord.State);
            cmd.Parameters.Add("@City", zipCodeRecord.City);
            cmd.Parameters.Add("@Longitude", zipCodeRecord.Longitude);
            cmd.Parameters.Add("@Latitude", zipCodeRecord.Latitude);
            cmd.ExecuteNonQuery();

        }
    }

}

Understanding SQL Server Data Types

In SQL Server 2000, Microsoft added support for a new data type called Table. The Table datatype is an in-memory construct that behaves like a table stored on disk with one distinction: It's memory based and thus ultra-fast. It has one particular benefit: It's a data type that can have operations performed on it, including the ability to be passed to other routines. The syntax for using the table data type is as follows:

DECLARE @Movies TABLE (
    Name varchar(100),
        YearReleased char(4)

Insert Into @Movies (Name, YearReleased)
    Values ('Star Wars','1977')
Insert Into @Movies (Name, YearReleased)
    Values ('Jaws','1974)

Select * from @Movies Order By Year Released

As you can see, a Table variable behaves like any other table with the added benefit of being fully memory-resident. The beauty is that you can create a custom SQL Server data type using a TABLE structure as the basis for its underlying type. Improving the speed of the zip code loading process requires the use of a data type.

Creating the High-Speed Implementation

Listing 4 represents the script for creating a custom data type along with the stored procedure used to insert data passed from the .NET application

Listing 4: The script passing the data type and stored procedure from the .NET app

Use CodeMagazineZipCode
GO

Create Type ZipCodeType AS Table
    ( ZipCode varchar(50),
    State varchar(50),
    City varchar(300),
    Longitude decimal(14,6),
    Latitude decimal(14,6))

GO

Create Procedure CodeMag_Add_ZipCode_HighSpeed
@ZipCodeData ZipCodeType READONLY
as

Insert Into ZipCodes (ZipCode, State, City,Longitude,Latitude)
    Select ZipCode, State, City,Longitude,Latitude
        From @ZipCodeData
GO

Gotchas

If you need to alter your Table-based SQL Server data type, you'll need drop any stored procedures that access it before you can change it. SQL Server binds the structure of that data type to any code that might access it. If you attempt to alter a custom data type you will receive the following error:

Msg 352, Level 15, State 1, Procedure CodeMag_Add_ZipCode_HighSpeed, 
Line 10
The table-valued parameter "@ZipCodeData" must be declared with 
the READONLY option.

Another item of note is that you'll receive an error if you don't define your stored procedure parameter to be READONLY.

Sending the Data

After creating your data type and stored procedure, you need to send the data to the server using an ADO.NET SqlCommand object. Sending data to a custom table date type requires two constructs to be used. The first is an ADO.NET DataTable object. The DataTable must have columns for each respective column in your custom data type. For the custom data type defined in Listing 4, your data table should be defined as follows:

/load the data into a datatable
var dt = new DataTable("ZipCodeData");
    dt.Columns.Add("ZipCode");
    dt.Columns.Add("City");
    dt.Columns.Add("State");
    dt.Columns.Add("Longitude");
    dt.Columns.Add("Latitude");

If you don't have a matching number of columns, you'll receive an error when you pass data to your stored procedure. If you see the following error, you know you need more columns in your data table:

"Trying to pass a table-valued parameter with 4 column(s) where 
the corresponding user-defined table type requires 5 column(s)."

The second construct required is a SqlParameter object defined as a Structured data type. The following code is how you define a Structured SqlParameter type.

var param = new SqlParameter();
param.ParameterName = "@ZipCodeData";
param.SqlDbType = SqlDbType.Structured;
cmd.Parameters.Add(param);

Finally, you need to iterate through your zip code collection and add rows for each item. Listing 5 demonstrates how to send your data to your SQL Server data type.

Listing 5: Send data to the SQL Server data type

private void LoadRecordsHighSpeed()
{
    var recsToLoad = LoadZipCodeFile();

    var connString = "Server=(local);
    Database=CodeMagazineZipCode;
    Trusted_Connection=True;";

    using (var conn = new SqlConnection(connString))
    {
        conn.Open();
        var cmd = new SqlCommand();
        cmd.CommandText = "CodeMag_Add_ZipCode_HighSpeed";
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Connection = conn;
        var param = new SqlParameter();
        param.ParameterName = "@ZipCodeData";
        param.SqlDbType = SqlDbType.Structured;
        cmd.Parameters.Add(param);

        //load the data into a datatable
        var dt = new DataTable("ZipCodeData");
        dt.Columns.Add("ZipCode");
        dt.Columns.Add("City");
        dt.Columns.Add("State");
        dt.Columns.Add("Longitude");
        dt.Columns.Add("Latitude");
        foreach (var zipCodeRecord in recsToLoad)
        {
            var row = dt.NewRow();
            row["ZipCode"] = zipCodeRecord.ZipCode;
            row["City"] = zipCodeRecord.City;
            row["State"] = zipCodeRecord.State;
            row["Longitude"] = zipCodeRecord.Longitude;
            row["Latitude"] = zipCodeRecord.Latitude;
            dt.Rows.Add(row);
        }

        cmd.Parameters["@ZipCodeData"].Value = dt;
        cmd.ExecuteNonQuery();
    }
}

Conclusion

The speed differences between the two implementations of the zip code loader are astonishing. The naive version takes approximately 1 minute 26 seconds and the high speed version takes 0.62 seconds. Yes, that's correct; the high speed one loaded 26,000+ records in less than a second. When dealing with millions of records, the savings really add up with this approach.