LINQ

At PDC 2005, Microsoft introduced a brand new technology known as LINQ, which stands for “Language Integrated Query.”

The feature-set hiding behind this acronym is truly mind-boggling and worthy of a lot of attention. In short, LINQ introduces a query language similar to SQL Server’s T-SQL, in C# and VB.NET. Imagine that you could issue something like a “select * from customers” statement within C# or VB.NET. This sounds somewhat intriguing, but it doesn’t begin to communicate the power of LINQ.

Needless to say, processing information of any sort is of utmost importance in software. Much of this “information” is stored in databases in the form of rows and tables. To process that data, developers use relatively sophisticated query and data manipulation mechanisms. Yet not all data is stored in databases. I would even argue that today, most data is not stored in databases. Much of it is stored in places like XML files, HTML pages, e-mails, and the like. The ability to query this sort of information is currently much less developed than for databases. Furthermore, data is not useful when it just sits in databases or XML files. Instead, applications bring data into memory to process it. Once data leaves its original place of storage, the fundamental need to handle and manipulate that data does not change, yet in current versions of .NET (as well as many, but not all, other programming languages), the ability to handle data at that point is relatively poor. It is easy to retrieve a list of customers joined with their invoice information from SQL Server, but it is not easy to take customer information held in memory in .NET and join it with the customer’s e-mails. From a .NET point of view, different types of information are usually available in object form. Unfortunately, Microsoft has not provided a good way to join lists of objects or perform any other sort of query operation on them.

LINQ solves this problem.

In fact, LINQ solves this problem and many others as well. This makes LINQ one of the features at the very top of my “technologies I want today” list. Unfortunately, Microsoft has only made LINQ available as a CTP (Community Technology Preview), which means that it isn’t even in beta yet. Ultimately, the expectation is that LINQ will ship with Visual Studio “Orcas.” You can install the LINQ CTP bits on top of Visual Studio 2005, which provides a number of additional assemblies as well as new versions of the C# and VB.NET compilers. Using this combination, you can write and compile the new LINQ syntax in Visual Studio. (Note: IntelliSense and syntax coloring do not always work properly for the new features, since the Visual Studio editor is not yet aware of them.)

A First Example

In SQL Server, queries are pretty simple. For instance, you can easily query all records from a Customer table in the following fashion:

SELECT * FROM Customer

The return value is a “result set” (which really behaves and appears very much like a table) that contains all fields from the Customer table. The overall situation is relatively simple and predictable for the compiler (or interpreter) that has to process this statement. “Customer” is always a table (or equivalent construct, such as a view, which is really also a table in terms of behavior and functionality). Inside the Customer table you’ll find rows of data, and each row is composed of a number of fields, all of which you expect to be part of the result set.

Using LINQ you can perform a similar operation right in C# or VB.NET. The main difference is that C# doesn’t deal with tables but objects, and in particular, lists of objects (be it collections or arrays or any similar construct). To start out with a simple example, I will use one of the simplest data sources LINQ can use: an array of strings. Here it is in C#:

string[] names =
    {"Markus", "Ellen", "Franz", "Erna" };

Or, the VB.NET equivalent:

Dim names As String() =
    {"Markus", "Ellen", "Franz", "Erna" }

Using the new LINQ syntax, your code can query this “data source” in much the same way as it would query a table in T-SQL. Here is a simple VB.NET query that retrieves all “records” from that array:

Select name From name in names

This is somewhat similar to the T-SQL statement above. The main difference is the “from x in y” syntax that you might find a little confusing at first. Let’s take a look at what really happens here. Fundamentally, you’ll retrieve data from a list of objects called “names.” This list is an array-object which is the equivalent of the Customer table in my previous T-SQL statement. The main difference is that while it is completely clear that a Customer table contains rows, it is not at all clear what objects are in collections of other objects. Therefore, you also need to specify what you expect inside that collection. In my case, I’ve stated that I want to refer to each object inside the names array as “name.” You might compare this to a for-each loop:

For Each name As String In names
    ' name.xxx
Next

You must remember to name each element you expect inside the collection to subsequently use that named reference to specify the expected result (among other things). In the above example, I stated that I want to select the entire string (called “name”) as my result set, since that is really all there is to select in this simple example.

Of course, since VB.NET is an object-oriented environment, the result set must also be a list of objects. I therefore need to assign the SELECT statement (or the result of the SELECT statement) to a variable reference:

Dim result As IEnumerable(Of String) = _
    Select name From name in names

SELECT statements return a list of type IEnumerable<T>. In other words, the result set is a list typed as the generic version of IEnumerable. In my case, the elements in that generic IEnumerable list are strings, since each “name” in the SELECT statement is a string.

Of course, C# also supports LINQ natively. Consider this C# version.

IEnumerable<string> result =
    from name in names select name;

The main difference here is that C# always puts the “select” part as the last part of the command. This looks a bit odd at first, but I could argue that it makes more sense. For instance, a few paragraphs above where I described what the VB.NET example does, I had to start out my explanation with the “from” part. Also, it is more convenient for IntelliSense. Once you’ve typed in the “from name in names” part, IntelliSense can display a sensible list of possible “selectable” members, while the VB version cannot do so by the time you’re likely to enter the select part. Ultimately, this comes down to personal preference since the functionality is identical. (This statement seems to be true for the majority of features in C# and VB.NET.)

A More Useful Example

My example so far was perfectly functional, but it was also completely useless because the result set is identical to the source version. This example makes more sense.

IEnumerable<string> result =
    from name in names
    orderby name
    select name;

In this case my result set is ordered by the name. I can print these names in the following fashion:

foreach (string s in result)
{
    Console.WriteLine(s);
}

This prints the following result to the Output window:

Ellen
Erna
Franz
Markus

I can, of course, also add a WHERE clause to my query.

IEnumerable<string> result =
    from name in names
    orderby name
    where name.StartsWith("E")
    select name;

And my result looks like this.

Ellen
Erna

LINQ will bring practically all standard query operators (GROUP BY, SUM, JOIN, UNION,…) to the core C# and VB.NET languages.
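To give an impression of what a few of these operators look like against in-memory data, here is a minimal sketch. It is written against the System.Linq API as it later shipped, so the CTP syntax may differ in details. The class name OperatorsDemo and the e-mail addresses are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OperatorsDemo
{
    // The same sample array used in the earlier examples.
    public static readonly string[] Names = { "Markus", "Ellen", "Franz", "Erna" };

    // GROUP BY: count the names per first letter, ordered by that letter.
    public static List<string> CountByFirstLetter()
    {
        var query =
            from name in Names
            group name by name[0] into g
            orderby g.Key
            select g.Key + ": " + g.Count();
        return query.ToList();
    }

    // SUM: total number of characters across all names.
    public static int TotalLength()
    {
        return Names.Sum(name => name.Length);
    }

    // JOIN: pair each name with a matching (hypothetical) e-mail record.
    public static List<string> JoinWithEmails()
    {
        var emails = new[]
        {
            new { Name = "Ellen", Address = "ellen@example.com" },
            new { Name = "Franz", Address = "franz@example.com" }
        };
        var query =
            from name in Names
            join mail in emails on name equals mail.Name
            select name + " <" + mail.Address + ">";
        return query.ToList();
    }

    // UNION: merge two lists and drop duplicates.
    public static List<string> UnionWith(IEnumerable<string> other)
    {
        return Names.Union(other).ToList();
    }

    public static void Main()
    {
        foreach (string line in CountByFirstLetter())
            Console.WriteLine(line);
        Console.WriteLine(TotalLength());
        foreach (string line in JoinWithEmails())
            Console.WriteLine(line);
        foreach (string name in UnionWith(new[] { "Erna", "Otto" }))
            Console.WriteLine(name);
    }
}
```

Grouping, joining, and the set operation all work on plain arrays here; no database is involved at any point.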

Note: The C# flavor of LINQ is currently documented in a much more complete fashion. For this reason, I am using mostly C# examples. Nevertheless, VB.NET supports LINQ just as well as C# does. Some would even argue that VB.NET supports LINQ better than C#.

The Magic of Objects

Everything in .NET is objects, and therefore, LINQ is based exclusively on objects. This little fact turns the LINQ query language into something that is a lot more powerful than a query language that just deals with data. To understand why, I must show you a few examples.

Listing 1 shows a Customer class, which I will use for some examples, as well as a helper method that instantiates a list of customers. Once you have this list of customer objects in memory, you can query from that list like so:

List<Customer> customers =
    Helper.GetCustomerList();
IEnumerable<Customer> result =
    from c in customers
    orderby c.CompanyName
    where c.ContactName.StartsWith("A")
    select c;

This returns a list of customers where the contact person’s name starts with an “A.” My example also sorts the result by company name. Note that the result set is an enumerable list of Customer objects, since I selected “c”, and “c” is the name I assigned to each customer in the list. Of course, the result could have also been something entirely different, such as a single property of that Customer object.

IEnumerable<string> result =
    from c in customers
    orderby c.CompanyName
    where c.ContactName.StartsWith("A")
    select c.Country;

In this example, the result is a list of country names (strings) for the same customers.

Note that not just the SELECT clause changed, but the declaration of the result type as well. In the previous example I used IEnumerable<Customer>, while the current example results in IEnumerable<string>. The result type is dictated by the SELECT part of the command and cannot be altered in any other way. Therefore, one could argue that it is redundant and developers should not have to declare the type of the result variable. As it turns out, Anders Hejlsberg (the “father” of C#) agrees with that viewpoint and has added a new feature to C# 3.0 known as “type inference.” Using this feature, I could also define the last example in the following fashion:

var result =
    from c in customers
    orderby c.CompanyName
    where c.ContactName.StartsWith("A")
    select c.Country;

The declaration of “result” as “var” simply indicates to the compiler that it is to infer the real type based on the expression. The compiler can analyze the SELECT statement and therefore figure out that “var” really needs to be “IEnumerable<string>” (in this example). Don’t confuse “var” with the “variant” type that scripting languages use. Instead, “var” is still a strongly typed declaration. You just leave it up to the compiler to figure out what the type should be.

You can also use type inference in other instances. Look at these perfectly fine and strongly typed C# 3.0 statements:

var name = "Markus";
var frm = new System.Windows.Forms.Form();
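The following sketch illustrates that “var” is resolved at compile time and that the variable keeps its inferred type afterwards. It assumes the behavior of “var” as it later shipped in C# 3.0; the class name VarDemo is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class VarDemo
{
    // Returns the types the compiler inferred for two "var" declarations.
    public static Type[] InferredTypes()
    {
        var name = "Markus";              // inferred as string
        var lengths = new List<int>();    // inferred as List<int>
        lengths.Add(name.Length);         // member access is checked by the compiler

        // name = 42;  // would not compile: an int cannot be assigned to a string

        return new[] { name.GetType(), lengths.GetType() };
    }

    public static void Main()
    {
        foreach (Type t in InferredTypes())
            Console.WriteLine(t);
    }
}
```

The commented-out assignment is the important part: a variant would accept it at run time, while a “var” variable rejects it at compile time.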

But, I digress. There still are many unexplored things you can do with objects as data sources. In the examples so far, I’ve shown you how to perform simple queries that use features available on .NET standard types such as strings. Selecting names starting with “A” is the equivalent of the following SQL Server statement:

SELECT * FROM Customers
    WHERE ContactName like 'A%'

SQL Server knows a number of standard types (such as strings) and can thus apply certain operations, such as an “=” or “like” operator. In LINQ, on the other hand, data sources could be any type of objects, and the features and abilities of those objects are only limited by your imagination. The Customer class I’ve used in my examples has such custom functionality. Here is another example. You could use the IsOddCustomer() method, which tells you whether or not the customer number is odd (or even). You could use this method in LINQ queries:

var result =
    from cust in customers
    where cust.IsOddCustomer()
    select cust;

This has significant implications since it means that you have complete control over the behavior of the WHERE clause (or any other part of the statement for that matter). For instance, it is possible to include a significant amount of business logic in the code called by the WHERE clause, which would not be feasible in the same way in SQL Server. For instance, the method called could in turn call Web services or invoke other objects. (Note that it is fine to do this in terms of architecture, because this code is likely to run in the middle tier).
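To make this concrete, here is a minimal sketch of a class with such a method and a query that calls it. The Customer class below is a hypothetical stand-in (its members, the CustomerNumber property, and the sample data are my assumptions for illustration and are not taken from Listing 1).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a Customer class; the members are assumptions.
public class Customer
{
    public int CustomerNumber { get; set; }
    public string CompanyName { get; set; }
    public string ContactName { get; set; }
    public string Country { get; set; }

    // Custom logic that a WHERE clause can call like any other method.
    public bool IsOddCustomer()
    {
        return CustomerNumber % 2 != 0;
    }
}

public static class CustomerDemo
{
    // Invented sample data for this sketch.
    public static List<Customer> GetCustomerList()
    {
        return new List<Customer>
        {
            new Customer { CustomerNumber = 1, CompanyName = "Alfreds Futterkiste",
                           ContactName = "Maria Anders", Country = "Germany" },
            new Customer { CustomerNumber = 2, CompanyName = "EPS Software Corp.",
                           ContactName = "Markus Egger", Country = "USA" },
            new Customer { CustomerNumber = 3, CompanyName = "Around the Horn",
                           ContactName = "Thomas Hardy", Country = "UK" }
        };
    }

    // The WHERE clause runs the method on every element of the list.
    public static List<string> OddCompanies(IEnumerable<Customer> customers)
    {
        var result =
            from cust in customers
            where cust.IsOddCustomer()
            orderby cust.CompanyName
            select cust.CompanyName;
        return result.ToList();
    }

    public static void Main()
    {
        foreach (string company in OddCompanies(GetCustomerList()))
            Console.WriteLine(company);
    }
}
```

IsOddCustomer() is trivial here, but the query would look the same if the method contained substantial business logic.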

Another aspect of using objects instead of data in a query language is that the result set can be any type of objects. In the following example, I’ll assume an array of objects of type ShortCustomer. Each ShortCustomer object is a lightweight representation of a customer with just two properties: Country and PrimaryKey. I can use a LINQ query to SELECT all primary keys for customers from a certain country, but instead of returning that primary key directly, I can use it to instantiate new Customer objects.

var result =
    from c in customerList
    where c.Country == "USA"
    select new Customer(c.PrimaryKey);

This means that for each selected primary key, the code must instantiate a new Customer object. (Presumably, the Customer class loads the complete customer information into the object when launched this way, but this is completely up to that class). The result of this query is a list of Customer objects. This is interesting, because in essence, the result set is a list of objects that was not contained in the original query source at all, other than the objects being identified by their primary key.

Now I’ll spin this example a bit further as well and do this:

var result =
    from c in customerList
    where c.Country == "USA"
    select new CustomerEditForm(c.PrimaryKey);

This returns a list of edit forms for each customer from the US. The only problem at this point is that those forms are not displayed yet, so we still need to make them visible.

foreach (Form frm in result)
    frm.Show();

Of course, this query may end up opening a very large number of windows, so you probably don’t want to use it in real-life applications. However, this example demonstrates how you could apply the LINQ query language to anything .NET has to offer and not just data. Of course, this also works the other way around. For instance, you could query all controls on a form that have certain content (or no content, or…) and then join the result with data from a different data source and union it together with… well, you get the idea.

Anonymous Types and Object Initialization

When you run a query, you often expect a result set that contains a limited selection of information contained in the data source. Consider this SQL Server example:

SELECT CompanyName, ContactName FROM Customer

This returns two fields from the much larger list of fields in the Customer table. Of course, you might want to do this in C# and VB.NET. However, since every result must be an object (or a list of objects), this requires some object type with exactly these two properties. Chances are that you don’t have such a class, and creating such a class for each query result would be cumbersome and seriously take away from the power of the query language. Therefore, new features are needed and C# 3.0 will offer them! Two in particular are important for this scenario: object initialization and anonymous types.

Object initialization deals (surprise!) with the initialization of public members (properties and fields). Consider this conventional example:

Customer cust = new Customer();
cust.CompanyName = "EPS Software Corp.";
cust.ContactName = "Markus Egger";

Using object initialization I could also write this example in a single source line (single statement):

Customer cust = new Customer() {
    CompanyName = "EPS Software Corp.",
    ContactName = "Markus Egger"};

Note: Due to column width constraints in the magazine, this statement ends up as three lines, but it is only a single line of source code as far as the compiler is concerned.

This feature is particularly useful in LINQ queries:

var result =
    from n in names
    select new Customer() {ContactName =
        n.FirstName + " " + n.LastName};

The second important C# 3.0 feature, anonymous types, allows you to create a new object type simply based on necessity and requirements derived from the type’s usage. Here’s an example:

new {CompanyName = "EPS Software Corp.",
    ContactName = "Markus Egger"};

While similar to my previous example, the new operator does not specify the name of the class that is to be instantiated. In fact, you don’t need to instantiate a defined class. Instead, the compiler realizes that you need a type with two properties based on the fact that the code attempts to initialize them. Therefore, the compiler creates a class behind the scenes that has the required properties (and fields) and uses it on the spot. Note that the only way to use such a type is through type inference (see above).

var customer = new {
    CompanyName = "EPS Software Corp.",
    ContactName = "Markus Egger"};

You can use both object initialization and anonymous types in queries. For instance, you can query from a list of Customer objects (Listing 1) and return brand new objects with two properties.

var result =
    from cust in customers
    select new {
        Name = cust.CompanyName,
        Contact = cust.ContactName};

Note that this example also would not be possible without type inference since there would be no way to define the type of the “result” variable, since the name of that type is unknown.
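Here is a small self-contained sketch of such a query, including the code that consumes the result. The class name AnonymousTypeDemo and the sample data are hypothetical; the source list itself is built from anonymous objects to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnonymousTypeDemo
{
    public static List<string> Describe()
    {
        // Invented sample data; a list of Customer objects would work the same way.
        var customers = new[]
        {
            new { CompanyName = "EPS Software Corp.", ContactName = "Markus Egger", Country = "USA" },
            new { CompanyName = "Alfreds Futterkiste", ContactName = "Maria Anders", Country = "Germany" }
        };

        // Project each element into a new anonymous type with two properties.
        var result =
            from cust in customers
            select new { Name = cust.CompanyName, Contact = cust.ContactName };

        // The loop variable must also be declared with "var", because the
        // type has no name. Its properties are still checked by the compiler.
        var lines = new List<string>();
        foreach (var item in result)
            lines.Add(item.Name + " (" + item.Contact + ")");
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe())
            Console.WriteLine(line);
    }
}
```

Accessing item.Country inside the loop would be a compile-time error, since the projected type only has Name and Contact.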

Object Syntax

Purists may have noted that the LINQ syntax is similar to T-SQL, but it is not entirely C#-like. In other words: almost everything else in C# is expressed as objects, while LINQ introduces the longest C# command sequence ever. As it turns out, the SELECT syntax is only window dressing. Behind the scenes, the compiler actually turns every SELECT statement into pure object syntax. Consider this example:

var result =
    from c in customers
    where c.ContactName == "Egger"
    select c.Country;

You could also write this statement like so:

var result =
  customers.Where( c => c.ContactName == "Egger" )
  .Select( c => c.Country );

This will be normal C# 3.0 syntax. However, not a lot of people are using C# 3.0 yet, so I need to explain a few details. I’ve already discussed the new “var” keyword used by type inference (see above). I also need to explain a new feature of C# 3.0 called lambda expressions, which you see here as the passed parameters. Lambda expressions are an evolution of C# 2.0’s anonymous methods. Using lambda expressions you can pass code instead of data as method parameters. Basically, the Where() and Select() methods accept a delegate as their parameter, and the lambda expression provides the code for the delegate to execute.

The expression itself appears a bit unusual at first but is easy to understand. It starts out with input parameters (“c” in this case) followed by the “=>” operator, followed by the return value (or alternatively, a complete method body). You could also express the lambda expression c => c.ContactName == "Egger" as a complete method. Unlike a lambda expression, a named method has to spell out its parameter and return types:

public bool MyMethod(Customer c)
{
    return (c.ContactName == "Egger");
}

Note that the lambda expression itself relies on type inference to determine the parameter type as well as the return type. You could also explicitly type parameters for lambda expressions.

(Customer c) => c.ContactName == "Egger"

Lambda expressions are very powerful and can do everything delegates and anonymous methods can do, plus a few extra tricks I will show you below. Unfortunately, a complete exploration of features provided by lambda expressions is beyond the scope of this article.
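The following sketch puts the three forms discussed so far side by side: query syntax, method syntax with lambda expressions, and a delegate that points at a named method. It assumes the System.Linq API as it later shipped. The Contact class and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal type, invented for this sketch.
public class Contact
{
    public string ContactName { get; set; }
    public string Country { get; set; }
}

public static class SyntaxDemo
{
    public static readonly List<Contact> Contacts = new List<Contact>
    {
        new Contact { ContactName = "Egger", Country = "USA" },
        new Contact { ContactName = "Anders", Country = "Germany" }
    };

    // A named method with the same body as the lambda c => c.ContactName == "Egger".
    public static bool IsEgger(Contact c)
    {
        return c.ContactName == "Egger";
    }

    // 1. Query syntax.
    public static List<string> QuerySyntax()
    {
        return (from c in Contacts
                where c.ContactName == "Egger"
                select c.Country).ToList();
    }

    // 2. Method syntax with lambda expressions.
    public static List<string> MethodSyntax()
    {
        return Contacts.Where(c => c.ContactName == "Egger")
                       .Select(c => c.Country)
                       .ToList();
    }

    // 3. Method syntax with a delegate that refers to the named method.
    public static List<string> NamedMethod()
    {
        Func<Contact, bool> filter = IsEgger;
        return Contacts.Where(filter).Select(c => c.Country).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(QuerySyntax()[0]);
        Console.WriteLine(MethodSyntax()[0]);
        Console.WriteLine(NamedMethod()[0]);
    }
}
```

All three methods return the same list, which contains the single entry "USA" for this sample data.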

The remaining mystery is the puzzling appearance of the Where() and Select() methods. For the object-syntax version to work, every object in .NET would have to have these methods. And in fact, with LINQ, they do! The reason is a mechanism that is also new in C# 3.0 called extension methods. These are special static methods defined on a static class. Whenever such a class is in scope (either because it is in the current namespace or by way of a using statement), the extension method appears on all objects of the type it extends, as long as that type does not already have a method of identical signature.

You should only use this somewhat radical feature when you really need to. However, in some scenarios, it is extremely useful. As an example, consider the string type and the possible need to add new methods to that class. In many scenarios you can do this through subclassing, but if you want to add a method to all strings, you cannot do that with subclassing. Using extension methods, this isn’t technically possible either, but through a little compiler magic, you can at least create the illusion of an added method. Consider the following example which shows a “ToXml()” extension method:

public static class EM
{
   public static string ToXml(
      this object extendedObject)
   {
       return "<value>" +
         extendedObject.ToString() + "</value>";
   }
}

Note that this is a method that is only different from standard static methods in that it uses the “this” modifier with the first parameter. The “this” modifier indicates that the first parameter is a reference to the object that is extended (extension methods must always have at least one parameter, which is a reference to the object it extends).

With your extension method created, you can use it on all objects as long as the “EM” class is in scope (either because it is in the current namespace, or because its namespace is imported with a using directive). Therefore, the following statement is now valid:

string name = "Markus";
string xmlName = name.ToXml();

Note that the parameter does not appear in this version. Instead, the parameter is the object the method is seemingly used on. Behind the scenes, the compiler changes this example to standard object notation.

string name = "Markus";
string xmlName = EM.ToXml(name);

Extension methods create the illusion of added methods, and LINQ uses this ability extensively to add methods such as Select(), Where(), and Join(). Note that you can only add extension methods to objects that do not already have methods of identical name and signature.
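To show how a query method can be attached to lists in this way, here is a sketch of a simplified filter operator. This is not LINQ’s actual implementation; MyWhere is a hypothetical name I chose so it does not clash with the real Where(), and the sketch assumes the extension method and Func<T, bool> features as they later shipped.

```csharp
using System;
using System.Collections.Generic;

public static class SketchExtensions
{
    // Hypothetical, simplified stand-in for a Where()-style operator.
    // The "this" modifier makes it callable on any IEnumerable<T>.
    public static IEnumerable<T> MyWhere<T>(
        this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (T item in source)
        {
            // Hand back only the elements the lambda expression accepts.
            if (predicate(item))
                yield return item;
        }
    }
}

public static class MyWhereDemo
{
    public static List<string> NamesStartingWithE()
    {
        string[] names = { "Markus", "Ellen", "Franz", "Erna" };
        // Looks like an instance method on the array, but compiles to
        // SketchExtensions.MyWhere(names, ...).
        return new List<string>(names.MyWhere(n => n.StartsWith("E")));
    }

    public static void Main()
    {
        foreach (string name in NamesStartingWithE())
            Console.WriteLine(name);
    }
}
```

Because the method extends IEnumerable<T>, it works on arrays, generic lists, and any other enumerable collection without those types knowing anything about it.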

Extension methods have a number of side effects that turn out to be quite useful. For one, developers can use individual pieces of functionality LINQ provides, without having to use other, possibly unwanted LINQ functionality. For instance, this example takes the contents of an array and returns them grouped by string length; a feature that is not available on arrays without LINQ.

string[] names =
    {"Markus", "Ellen", "Franz", "Erna" };
var result = names.GroupBy(s => s.Length);

Considering the different features available through LINQ (sorting, grouping, calculations, joins, unions,…), this certainly is a rather interesting side-effect.
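As a sketch of how the grouped result can be consumed, the following code walks through each group and lists its members. It assumes the GroupBy() method as it later shipped in System.Linq, where each group exposes a Key and can be enumerated; the class name GroupByDemo is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupByDemo
{
    public static List<string> DescribeGroups()
    {
        string[] names = { "Markus", "Ellen", "Franz", "Erna" };
        var lines = new List<string>();

        // Each group carries its key (the string length) and its members.
        foreach (var group in names.GroupBy(s => s.Length))
            lines.Add(group.Key + ": " + string.Join(", ", group.ToArray()));

        return lines;
    }

    public static void Main()
    {
        foreach (string line in DescribeGroups())
            Console.WriteLine(line);
    }
}
```

The groups appear in the order in which each key is first encountered in the array, so the six-letter group comes first.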

Another side effect of extension methods is that extension methods are only added on objects that do not already have a method of that name and signature. This means that developers can purposely implement such methods to explicitly replace how certain features of LINQ work. For instance, if you do not like how LINQ calculates averages using the Average() function (or “average” keyword), then you can implement your own Average() method which you can then use on your object. (Standard LINQ functionality keeps being used on all other objects).
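The following sketch shows this replacement mechanism for Average(). The TrimmedScores class and its trimming rule (dropping the lowest and highest value) are hypothetical and only serve to make the two results differ. The sketch assumes the System.Linq API as it later shipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical list type that brings its own Average() method.
public class TrimmedScores : List<double>
{
    // Instance method: ignores the lowest and the highest score.
    public double Average()
    {
        if (Count < 3)
            return Enumerable.Average(this);   // too few values to trim
        double total = this.Sum() - this.Min() - this.Max();
        return total / (Count - 2);
    }
}

public static class AverageDemo
{
    public static TrimmedScores CreateScores()
    {
        return new TrimmedScores { 1, 5, 6, 7, 100 };
    }

    public static void Main()
    {
        TrimmedScores scores = CreateScores();

        // The compiler prefers the instance method over the extension method.
        Console.WriteLine(scores.Average());

        // Viewed as a plain IEnumerable<double>, the standard LINQ
        // extension method is used instead.
        Console.WriteLine(((IEnumerable<double>)scores).Average());
    }
}
```

For this sample data, the custom method averages 5, 6, and 7, while the standard method averages all five values.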

DLINQ

The LINQ functionality I have introduced so far introduces query features to an object-oriented environment. Nevertheless, you’d find it useful if you could use the same functionality and feature-set seamlessly against “real” databases. DLINQ provides this type of functionality. DLINQ is a set of special classes provided in addition to regular LINQ features. DLINQ will provide object-oriented representations of database objects such as tables and fields. Listing 2 shows a DLINQ representation of the Customers table in the Northwind SQL Server demo database. Note that the class itself is just a standard C# class, but the attributes are DLINQ attributes. You can use them to map relational data onto objects and to express information that is not available in C# (such as a field being of type “nchar(5)”).

Once database objects are represented in object-notation, you can use LINQ to query data directly from the database. To do so, you first open a connection to the database. In DLINQ you’ll do this through a DataContext object, which is conceptually very similar to a database connection. Once the context is established, the table-mapping class has to be instantiated. DLINQ does this through a Table<> generic that is typed as the special mapping class. With these objects in place you can use standard LINQ syntax. The following example queries all customers from the SQL Server Northwind database whose company name starts with an “A”.

DataContext context =
    new DataContext("Initial Catalog=Northwind;" +
    "Integrated Security=sspi");
    
Table<CustomerTable> customers =
    context.GetTable<CustomerTable>();
    
var result =
    from c in customers
    where c.CompanyName.StartsWith("A")
    select c;

Note that the syntax used in the actual query is standard C# LINQ syntax. SQL Server does not support methods on field names, nor does it support “StartsWith()” in any way. Nevertheless, this works perfectly fine. The WHERE clause in this statement is internally handled as a lambda expression by the C# compiler (see above). One of the advanced features of lambda expressions is the ability to compile them either as IL (Intermediate Language) that can be executed on the CLR, or as a pure data representation of themselves known as an expression tree. Whether the compiler creates IL or expression trees depends on your exact use of the lambda expression. In all previous demos I’ve used in this article, the compiler would have compiled the lambda as IL. In this example, the compiler will create an expression tree.

Expression trees are language neutral since they are only data. This allows DLINQ to translate the expression into something the database can understand and thus, you can execute the example above on SQL Server.

To see how expression trees are represented in memory, you could take a lambda expression and assign it to an expression tree delegate, which forces the compiler to create an expression tree instead of IL code. You can then explore individual pieces of data within the expression tree.

Expression<Func<int,bool>> expr =
    para1 => para1 < 10;
    
BinaryExpression body =
    (BinaryExpression)expr.Body;
ParameterExpression left =
    (ParameterExpression)body.Left;
ConstantExpression right =
    (ConstantExpression)body.Right;
MessageBox.Show("Expression: " +
    left.Name + " " + body.NodeType + " " +
    right.Value);

The output of this code snippet is this:

Expression: para1 LT 10
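For comparison, here is a console sketch of the same inspection written against the System.Linq.Expressions API as it later shipped. In that version the node type of a “less than” comparison is named LessThan rather than LT. The class name ExpressionTreeDemo is hypothetical. The sketch also shows that an expression tree can still be turned into executable code on demand.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeDemo
{
    // Reads the pieces of a simple "parameter <operator> constant" expression.
    public static string Describe(Expression<Func<int, bool>> expr)
    {
        var body = (BinaryExpression)expr.Body;
        var left = (ParameterExpression)body.Left;
        var right = (ConstantExpression)body.Right;
        return left.Name + " " + body.NodeType + " " + right.Value;
    }

    public static void Main()
    {
        // Assigning the lambda to Expression<...> makes the compiler emit
        // a data structure instead of IL.
        Expression<Func<int, bool>> expr = para1 => para1 < 10;
        Console.WriteLine(Describe(expr));   // para1 LessThan 10

        // The tree can be compiled into a regular delegate when needed.
        Func<int, bool> compiled = expr.Compile();
        Console.WriteLine(compiled(5));      // True
    }
}
```

Describe() only handles this one shape of expression; a real translator has to walk arbitrary trees, which is exactly the work a query provider takes on.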

Some readers might be wondering whether that can really work with all possible expressions. Actually, yes. In a pure .NET environment, the list of possible expressions is unlimited, and some such expressions can be so complex that translations would be impossible. However, when running queries against a database you have a limited and well-defined list of possible expressions. After all, since you can only map SQL Server character fields to .NET strings, that limits the list of possible expressions to the features available for strings and character fields. The only exception is the ability to define custom field types in SQL Server 2005, but in that case, you’d create the custom fields using a .NET language, so no translation is needed.

XLINQ

What DLINQ is to data, XLINQ is to XML, and more. XLINQ provides a handful of extra objects that provide the ability to run standard LINQ queries against XML data sources. Consider the following XML string:

<customers>
 <customer>
  <companyName>Alfreds Futterkiste</companyName>
  <contactName>Maria Anders</contactName>
 </customer>
 <customer>
    ... More customer records ...
 </customer>
</customers>

You could load this XML string into an XElement object which you can then use as a LINQ-query data source just like any other object-based data source.

XElement names = XElement.Parse(xmlString);
var result =
    from n in names.Descendants("customer")
    where n.Element("companyName")
        .Value.StartsWith("A")
    select n.Element("contactName").Value;

This query returns a list of strings that contains the names of all contacts for customers whose company name starts with “A”. As you can see, XLINQ provides objects that represent alternate ways of parsing XML. You can also use these objects outside of LINQ.
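Here is a self-contained sketch of such a query, written against the System.Xml.Linq API as it later shipped. It uses Element() to reach a single child element and assumes that every customer element contains both a companyName and a contactName child. The class name XmlQueryDemo and the second customer record are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlQueryDemo
{
    // Invented sample document for this sketch.
    public const string Xml =
        "<customers>" +
        "<customer><companyName>Alfreds Futterkiste</companyName>" +
        "<contactName>Maria Anders</contactName></customer>" +
        "<customer><companyName>EPS Software Corp.</companyName>" +
        "<contactName>Markus Egger</contactName></customer>" +
        "</customers>";

    public static List<string> ContactsForCompaniesStartingWith(string prefix)
    {
        XElement root = XElement.Parse(Xml);

        // Descendants() yields the customer elements; Element() picks one
        // named child of each, whose text is available through Value.
        var result =
            from c in root.Descendants("customer")
            where c.Element("companyName").Value.StartsWith(prefix)
            select c.Element("contactName").Value;

        return result.ToList();
    }

    public static void Main()
    {
        foreach (string contact in ContactsForCompaniesStartingWith("A"))
            Console.WriteLine(contact);
    }
}
```

If a child element could be missing, Element() would return null and the query would need an additional check before reading Value.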

XLINQ also provides another interesting feature: The ability to create XML. This is also done using XElement and other XLINQ objects. Consider this XML snippet:

<root>
    <sub>Test</sub>
</root>

You can create this snippet using XLINQ objects.

XElement xml = new XElement("root",
    new XElement("sub","Test"));
Console.Write(xml.ToString());

This example creates a new XElement instance by passing the name of the element as the constructor parameter. The content of that element is another XElement, which is also passed to the constructor of the first element. The second element is also instantiated with an element name. The second parameter provides the content, which in this case is the string “Test”.

You can use this in queries as well. The next example queries all customers from the USA and returns them as an XElement structure, which in turn can be converted to a string.

XElement customerXml = new XElement("Customers",
    from c in customers
    where c.Country == "USA"
    select new XElement("Customer",
        new XAttribute("ID",c.CustomerId),
        new XElement("Company",c.CompanyName)));

The result of this example is this:

<Customers>
  <Customer ID="ALFKI">
    <Company>Alfreds Futterkiste</Company>
  </Customer>
  <Customer ID="XXX">
     ... More customer records ...
  </Customer>
</Customers>
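The following sketch runs the same kind of query against a small in-memory list so that the generated XML can be inspected. It assumes the System.Xml.Linq API as it later shipped. The class name XmlBuildDemo, the sample records, and the customer ID "EPSSW" are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlBuildDemo
{
    public static XElement Build()
    {
        // Invented sample data for this sketch.
        var customers = new[]
        {
            new { CustomerId = "ALFKI", CompanyName = "Alfreds Futterkiste", Country = "Germany" },
            new { CustomerId = "EPSSW", CompanyName = "EPS Software Corp.", Country = "USA" }
        };

        // The query result is passed as the content of the root element;
        // every selected XElement becomes a child node.
        return new XElement("Customers",
            from c in customers
            where c.Country == "USA"
            select new XElement("Customer",
                new XAttribute("ID", c.CustomerId),
                new XElement("Company", c.CompanyName)));
    }

    public static void Main()
    {
        // ToString() returns the indented XML text.
        Console.WriteLine(Build().ToString());
    }
}
```

Only the customer from the USA ends up in the document, with the ID as an attribute and the company name as a nested element.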

The VB.NET version of LINQ takes this idea even a step further. VB.NET incorporates XML directly into the core language, allowing developers to create the same result in the following fashion:

Dim x = _
    <Customers>
        Select <Customer ID=(c.CustomerId)>
          <Company>(c.CompanyName)</Company>
        </Customer> _
        From c In customers _
        Where c.Country = "USA" _
    </Customers>

The support of XML natively in the VB.NET language has been the source of numerous discussions. Some feel that it makes the language “messy”. Personally, I like that developers have the choice between the purist, object-oriented approach of C#, and the straightforward, productivity-driven approach of VB.NET.

Conclusion

LINQ is powerful. Much more so than can be expressed in a single article. Unfortunately, LINQ also isn’t available yet. It is one of those technologies that is so immediately applicable that it is hard to wait for the release version. That release version should also include other functionality such as insert, update, and delete commands. I look forward to all of these things, just as I am looking forward to more information being available for the VB.NET version of LINQ, which appears to be at least as promising as the C# version.

Listing 1: The Customer class used in several examples in this article

Listing 2: The DLINQ representation of the SQL Server Customers example table (Northwind database)
