String processing is an important aspect of today's applications.

Web applications in particular, rely heavily on string transformations and special formatting of data, as well as generating strings for output to the Web dynamically in the form of HTML, XML or even binary data. In this article, Steve demonstrates and compares performance of some of the powerful string functions in Visual FoxPro.

For the past few years, it seems I'm constantly working with strings in VFP. Years ago, I developed a product called Steven Black's INTL Toolkit, which translates VFP applications on the fly. More recently I developed the Visual FoxPro Wiki, at http://fox.wikis.com, which is a very text-intensive website. I've also used VFP to generate a variety of data and text-centric web applications. So I've learned a thing or two about working with strings, and this is what I'd like to share with you now.

For this article, I'll use the text of a book. Perhaps you've heard of it - it's Tolstoy's War And Peace. I downloaded the complete text of War And Peace at http://altair.stmarys-ca.edu/studwork/integral/warandpe.txt, where among many other places books are available online.

How big is War And Peace? It's 3,271,833 characters long. That's a big string as far as demos go. Surprisingly, most of the things we'll do in this article easily have sub-second execution time in VFP. The results below are using my Pentium I 233 (P233) laptop ? you can expect much better results on more current hardware.

Does a sub-string exist?

There are many ways to determine if a sub-string exists within a string. The $ command returns True or False depending on if a sub-string is contained in a string. This command is fast. Try this: LOCAL t1, t2, x

x = FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size
t1 = SECONDS()
? "THE END" $ x    && .T.
t2 = SECONDS()
?[ "THE END" $ x], t2-t1    && 0.180 seconds

The AT() and ATC() functions are also great for determining if a sub-string exists. The former having the advantage of being case insensitive. Moreover, both functions return the position of the sub-string, which is more useful than the Boolean value $ gives you.

LOCAL t1, t2, x, lnAtPos
x = FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size
t1 = SECONDS()
lnAtPos = ATC("the end",x)
t2 = SECONDS()
?lnAtPos           && 97837
?[ATC( "the end", x)], t2-t1    && 0.110 seconds
?SUBS(x, lnAtPos,20)   &&  "the end of it..."

The OCCURS() function tells you if a sub-string exists, and how many times the sub-string occurs. This code will count the number of occurrences of a variety of sub-strings in War And Peace.

LOCAL t1, t2, x, i
LOCAL ARRAY aStrings[5]
aStrings[1] = "Russia"  && 775 occurences
aStrings[2] = "Anna"    && 293 occurences
aStrings[3] = "Czar"    && 4 occurences
aStrings[4] = "windows" && 23 occurences
aStrings[5] = "Pentium" && 0 occurences

x = FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size

FOR i = 1 TO ALEN( aStrings)
    t1= SECONDS()
    ?OCCURS(aStrings[i],x)
    t2 = SECONDS()
    ?[OCCURS( "]+aStrings[i]+[", x)], t2-t1, "seconds"  && 0.401 seconds avg
ENDFOR

Locating sub-strings is something VFP does really well. All these functions are easily sub-second on a 3.3 MB string.

Locating sub-strings

One of the basic tasks in almost any string manipulation is locating sub strings within larger strings. Four useful functions for this are: AT(), RAT(), ATC(), and RATC(). These functions locate the ordinal position of sub-strings starting from the left (AT()) or from the right (RAT()). Both of these functions have case-insensitive variants: ATC(), and RATC(). All these functions are very fast and scale well with file size. For example, let's look for “THE END” in War And Peace.

LOCAL t1, t2, x
x = FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size
t1 = SECONDS()
?AT("THE END",x)  && 3271915
t2 = SECONDS()
?[AT( "THE END", x)], t2-t1    && 0.180 (seconds)

You can also check for the nth occurrence of a sub-string, as illustrated below, where we find the 1st, 101st, 201st...701st occurrence of the word “Russia” in War And Peace.

LOCAL t1, t2, x, i, nOcc
x= FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size
FOR i= 0 TO 7b
nOcc= i*100+1
t1= SECONDS()
?AT("Russia",x, nOcc  )
t2= SECONDS()
?[AT("Russia", x, ]+Transform(nOcc)+[)], t2-t1  && 0.180 sec on average
ENDFOR

Two other functions are seemingly useful for locating strings: ATLINE() and ATCLINE(). These return the line number of the first occurrence of a string. Two things to note: First the result of these functions is sensitive to the value of SET MEMOWIDTH, and secondly the performance of these functions is relatively pathetic.

LOCAL t1, t2, x, lnline
x = FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size

t1 = SECONDS()
lnline = ATLINE("Anna",x)
t2 = SECONDS()
?lnLine                                  && 16
?[ATLINE("Anna",x)], t2-t1, "seconds"    && 0.771 seconds

t1 = SECONDS()
lnline = ATLINE("CZAR",x)
t2 = SECONDS()
?lnLine                                  && 0
?[ATLINE("CZAR",x)], t2-t1, "seconds"    && 1075.467 seconds (!!)

t1 = SECONDS()
lnline= ATLINE("THE END",x)
t2 = SECONDS()
?lnLine                                     && 54223
?[ATLINE("THE END",x)], t2-t1, "seconds"    && 1081.835 seconds (!!)

Which leads to the following general observation: Any function that is sensitive to SET MEMOWIDTH is extremely slow on large strings and does not scale well.

Traversing text line-by-line

Iterating through text, one line at a time, is a common task. VFP developers have been doing it for years using the MEMLINES() and MLINE() functions.

LOCAL x, y, i, t1, t2
SET MEMOWIDTH TO 8000
x = FILETOSTR(home()+"Genhtml.prg")  && 767 lines

t1 = SECONDS()
FOR i= 1 TO MEMLINES(x)
    y= MLINE(x,i)
ENDFOR
t2 = SECONDS()
?"Using MLINE()", t2-t1, "seconds"  && 22.151 seconds

This is pathetic performance! 20+ seconds to iterate through 767 lines! Fortunately, there's a trick to using MLINE(), which is to pass the _MLINE system memory variable as the third parameter.

LOCAL x, y, i, t1, t2
SET MEMOWIDTH TO 8000
x = FILETOSTR(home()+"Genhtml.prg")  && 767 lines

t1 = SECONDS()
_mline= 0
FOR i = 1 TO MEMLINES(x)
    y = MLINE(x,1,_MLINE)
ENDFOR
t2 = SECONDS()
?"Using MLINE() with _MLINE", SECONDS()-z, "seconds"  && 0.451 seconds

This results in a fifty-fold improvement! A surprising number of VFP developers don't know this idiom with _MLINE, even though it's been documented in the FoxPro help since version 2 at least.

However, starting with VFP 6, all these tricks are obsolete, since ALINES() is a screaming new addition to the language. Let's see how these routines look and perform with ALINES().

LOCAL x, i, y, ti, t2
LOCAL ARRAY laArray[1]
x = FILETOSTR(home()+"Genhtml.prg")  && 767 lines

t1 = SECONDS()
FOR i = 1 TO ALINES(laArray,x)
    y = laArray[i]
ENDFOR
t2 =SECONDS()
?"Using ALINES() and traverse array:", t2-t1, "seconds"  && 0.020 seconds

Another twenty-fold improvement in speed! clearly, if you are using MLINE() in your applications, and you are using VFP 6, then it's time to switch to ALINES(). There are two things to keep in mind: ALINES() is limited by VFP's 65,000 array element limit, and ALINES()does not rely on SET MEMOWIDTH, but only works on carriage return breaks.

Just for fun, let's rip through War And Peace using ALINES().

LOCAL x, i, y, z, k
LOCAL ARRAY laArray[1]
x = FILETOSTR("WarAndPeace.TXT")

t1 = SECONDS()
FOR i = 1 TO ALINES(laArray,x)   && 54,227 array elements
    y = laArray[i]
ENDFOR
t2 = SECONDS()
?"Using ALINES() and traverse", t2-t1, "seconds"  && 3.395 seconds

Considering we're creating a 54,337 element array from disk, then traversing all 54,337 elements, assigning each element's contents to a memory variable, and we're back in 3.4 seconds, that's very slick.

What about just creating the array of War And Peace:

LOCAL x, t1, t2
LOCAL ARRAY laArray[1]
x = FILETOSTR("WarAndPeace.TXT")
t1 = SECONDS()
?ALINES(laArray,x)     && 54,227 array elements
t2 = SECONDS()
?"Using ALINES() to load War and Peace", t2-t1, "seconds"  && 2.203 seconds

So, on my Pentium I 233 laptop using VFP 6, we can load War and Peace from disk into a 54,000-item array in 2.2 seconds. On my newer desktop machine, a Pentium II 500, this task is sub-second.

Traversing text word-by-word

You could recursively traverse a string word-by-word by using, among other things, the return value from AT(" ",x,n) and SUBSTRING(" ",,). However, if you are doing that, you're missing a great feature of VFP.

Two little-known functions are great for word-by-word text processing are the Words() and WordNum() functions. These functions, available by loading the FoxTools.FLL library, return the number of words and individual words respectively. These work much faster than anything you can write with AT(" ", x, n) and SUBSTRING(" ",,).

Let's see how they perform by first counting the words in War And Peace.

LOCAL x, t1, t2
LOCAL ARRAY laArray[1]
x = FILETOSTR("WarAndPeace.TXT")    && 3.3 mb in size
t1 = SECONDS()
?WORDS(x)   && 565412
t2 = SECONDS()
?"Using WORDS() on War and Peace", t2-t1, "seconds"  && 0.825 seconds

The Words() function is also useful for counting all sorts of tokens since you can pass the word delimiters in the second parameter. How many sentences are there in War And Peace?

LOCAL x, t1, t2
LOCAL ARRAY laArray[1]
x = FILETOSTR("WarAndPeace.TXT")    && 3.3 mb in size
t1 = SECONDS()
?WORDS(x, ".")   && (Note the ".")    26673
t2 = SECONDS()
?"Using WORDS() countING sentences in W&p", t2-t1, "seconds"  && 0.803 seconds

WordNum() returns a specific word from a string. What's the 666th word in War And Peace? What about the 500000th?

LOCAL x, t1, t2
x = FILETOSTR("WarAndPeace.TXT")    && 3.3 mb in size
t1 = SECONDS()
?WORDNUM(x, 666)  && Anna
t2 = SECONDS()
?"Finding the 666th word in W&P", t2-t1, "seconds"  && 0.381 seconds
t1 = SECONDS()
?WORDNUM(x, 500000)  && his
t2 = SECONDS()
?"Finding the 500000th word in W&P", t2-t1, "seconds"  && 1.001 seconds

Similarly to Words(), we can use WordNum() to returns a token from a string by specifying the delimiter. What's the 2000th sentence in War And Peace?

LOCAL x, t1, t2
x = FILETOSTR("WarAndPeace.TXT")    && 3.3 mb in size
t1 = SECONDS()
?WORDNUM(x, 2000, ".")
t2 = SECONDS()
?"Finding the 2000th sentence in W&P", t2-t1, "seconds"  && 0.391 seconds

Substituting text

VFP has a number of useful functions for substituting text. STRTRAN(), CHRTRAN(), CHRTRANC(), STUFF(), and STUFFC().

STRTRAN() replaces occurrences of a string with another. For example, let's change all occurrences of “Anna” to “the McBride twins” in War And Peace.

LOCAL t1, t2, x
x = FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size
? "Words in W&P:", WORDS(x)   && 565412 words
t1 = SECONDS()
x = STRTRAN(x,"Anna", "the McBride twins")
t2 = SECONDS()
? t2-t1, "seconds"    && 2.314 seconds
?Occurs("the McBride twins", x), "occurences"  && 293 Occurences
? "Words in W&P:", WORDS(x)   && 565412 words

That's over 125 replacements per second. What about removing strings?

LOCAL t1, t2, x
x = FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size
? "Words in W&P:", WORDS(x)      && 565412 words
t1 = SECONDS()
x = STRTRAN(x,"Anna")
t2 = SECONDS()
? t2-t1, "seconds"    && 2.293 seconds
? "Words in W&P:", WORDS(x)  && 565168 words

So it appears that STRTRAN() both adds and removes strings with equal aplomb. What of CHRTRAN(), which swaps characters? Let's, say, change all “s” to “ch” in War and Peace.

LOCAL t1, t2, x
x = FILETOSTR("WarAndPeace.TXT")   && 3.3 mb in size
t1 = SECONDS()
x = CHRTRAN(x,"s","ch")
t2 = SECONDS()
? t2-t1, "seconds"    && 0.521 seconds

Which isn't bad considering that there are 159,218 occurrences of character “s” in War And Peace.

However, don't try to use CHRTRAN() when the second parameter is an empty string. The performance of CHRTRAN() in these circumstances is terrible. If you need to suppress sub-strings, use STRTRAN() instead.

String Concatenation

The final thing I'd like to mention about strings and VFP is concatenation speed.

Since many common tasks, like building web pages, involves laying down strings one element at a time, you should know that string expressions of the form x=x+y are very fast in VFP.

LOCAL t1, t2, x
x = FILETOSTR("WarAndPeace.TXT")   && 3.2 mb in size
t1 = SECONDS()
x = x+ "<b>Drink Milk!</b>"
t2 = SECONDS()
? t2-t1, "seconds"    && 0.000 seconds

The same type of performance applies if you build strings small chunks at a time. This is a typical scenario in dynamic Web pages, whether a template engine or raw output is used.

LOCAL t1, t2, x, y, count
t1 = SECONDS()
x = ""
y = "VFP Strings are fast"

FOR count = 1 to 10000
    x = x + y
ENDFOR

t2 = SECONDS()
? t2-t1, "seconds"    && 0.030 seconds
? len(x)		    && 200,000 chars
RETURN

This full optimization occurs as long as the string is adding something to itself and as long as the string concatenated is stored in a variable. Using class properties is less efficient, but also realizes some of this performance gain. String optimization does not occur if the first expression on the right of the = sign is not the same as the string being concatenated. So:

x = "" + x  + y

is not optimized in this fashion. The above takes 25 slow seconds!

So appending strings to strings is blazingly fast in most common situations. The VFP development team worked to optimize this process, and the result is VFP can assemble web pages (among other things) very, very quickly.

Conclusion

Most of the performance measurements made in this article are borderline amazing especially given the less than current hardware that the test was run on. The Pentium 233 is a good baseline however for typical machines in use in desktop applications today. Server applications of course will likely use the latest hardware and can expect even better performance (often another 5-10 fold improvement) from the newer hardware.

Most of the operations I've talked about are sub-second when applied on a string the size of War And Peace. Maybe I'm easily amused, and easily amazed, but I'll say this: when it comes to working with text in an easy interactive environment, give me Visual FoxPro. Please.