Fast case insensitive search with SIMD

One of the goals for this IDE (Bold, you're on the website) is to handle large files. A person should be able to open a ~2GB log file or JSON file and search for a string without the editor locking up for multiple seconds. We ended up taking <120MS (your mileage may vary) by using SIMD, specifically AVX (not to be confused with AVX512). To keep this article straightforward some details are ignored and the code has been simplified. In this article, we talk about why we optimized, some details about SIMD and the code at a high level. You don't need prior SIMD knowledge. You can find the example source code here

It's not unusual for a programmer to deal with a 1+GB generated file. It could be a log file, json, some kind of text export (csv, SQL, etc), or if you're a compiler enthusiast it could be generated C code. Something you may want to do is search for a serial number, hex code, or some kind of ascii string. Because it's not unusual to be in this situation and because it isn't too difficult to implement, we used SIMD to optimize this case. There's a few cases we optimized but the code is similar for each of them. This article will cover a case insensitive search with 4+ ascii letters since that's the most complex variant.

High-Level Overview:

SIMD (single instruction, multiple data) are instructions a CPU can use to work on 16/32/64 bytes wide data (not bits, bytes). Without going into too many details you can think of a SIMD register as a tiny array where operations are applied to every element. Element sizes depend on the instruction you use. However, unlike normal code there are many missing instructions which makes programming with SIMD harder. A situation we'll run into is compare. You can do == and > but only those, and they must be signed values. In our situation we want to know if a byte has an ascii value >= 'A' and <= 'Z' but since >= and <= don't exist we need to write the check in a different way. Since compares are signed, an 8bit value of 128 will be treated as -128 which could get in the way if you want to compare values >= 128 but we won't need that in our situation. You can find intrinsics AVX documentation here We're using AVX because those instructions existed for 10+ years and are available on most machines.

Our goal is to do something like this. The next paragraph describes the psuedo code


	findInsensitive(const char*lowerUserText, const char*fileText) {
		assert(strlen(lowerUserText)>=4)
		firstLetter = lowerUserText[0]
		secondLetter = lowerUserText[1]
		thirdLetter = lowerUserText[2]
		for i = 0; i < fileText.size(); i += 32 {
			letters = load(&fileText[i])
			indexesWithUppercase = letters >= 'A' && letters <= 'Z'
			lowercaseLetters = Add(letters, 32, indexesWithUppercase)
			fistMask = lowercaseLetters == firstLetter
			secondMask = lowercaseLetters == secondLetter
			thirdMask = lowercaseLetters == thirdLetter
			//more later
		}
	}

To search without case sensitiveity we'll search letters where both the source and user text are lowercase. (A-Z becomes a-z). In our loop we increment i by 32 because that's how many bytes an AVX instruction can check in a single operation. letters are our SIMD variable that holds 32 chars at one time. Since there is no >= and <= we'll have to rewrite that line. The results of a comparison is -1/255 for true and 0 for false (the letters and mask are 8 bits, there will be 32 of these). This variable is often called a mask. Some instructions can take a mask value and only apply the operation when the mask has the slot as true/set. Add doesn't have that so we'll need to rewrite that line. We want to add '32' to all the uppercase letters. 'a' - 'A' == 32 so adding 32 to any uppercase (ascii) letter will result in the lowercase letter. Finally, there's first/second/third mask which we'll explain later.

The first problem we run into is AVX does not have a >= or <= instructions. Our only options are cmpeq (==) and cmpgt (>). Since we want to check a range cmpgt is our only option. cmpgt works with signed ints so 128 will be treated as -128. In C++ it's "undefined behavior" for signed integers to overflow but with SIMD they are allowed to wrap around. One solution we can use is add a value to each letter so 'A' to 'Z' will occupy the highest 26 values in a 8bit integer. If we do that any values larger than 'Z' will wrap to a negative value which allows us to not need a replacement for <=. To have A-Z the last 26 values we want 'Z' to map to 127. 'Z' is 90, 127-90 gets us 37 so we want to add 37 to each letter. 'A'+37 is 102 and we want to include it when we compare, we'll compare with > 101. Our new formula is


	indexesWithUppercase = (letters + 37) > 101

Our next issue is Add, since there isn't an add (or an OR) operation with a mask we can't use the mask value directly with add. Compare operations produce a value where true will produce 8 set bit (-1/255) or 8 clear bits for false. AVX allows us to use ANDs and ORs so what we can do is write 32 & Mask which will set all the true cases to 32 and false to 0. We can use that so only uppercase values will add 32 which will transform those letters to lowercase.


	some32s = indexesWithUppercase & _mm256_set1_epi8(32)
	lowercaseLetters = letters + some32s

Now that we have no uppercase we can search for the user's case insensitive string easier. The last part of our simd code is


	fistMask = lowercaseLetters == firstLetter
	secondMask = lowercaseLetters == secondLetter
	thirdMask = lowercaseLetters == thirdLetter

	bits = (as32bits(fistMask)<<2) & (as32bits(secondMask)<<1) & as32bits(thirdMask);

	while (bits) {
		i = tzcnt(bits) //described below
		if (InsensitiveCompare(file+i+1, needle+3, needleLen-3)) {
			results.push_back(i-2);
			ignoreBits = (1ULL<<(i+needleLen))-1;
			bits &= ~ignoreBits;
			continue;
		}
		bits &= bits-1; //clears lowest bit
	}

This part may be easier to understand in a debugger. The actual instruction for as32bits is _mm256_movemask_epi8. As an example if our SIMD register had the letters 'ABAA' and we wrote cmpeq 'A' the SIMD register (little endian) will have 11111111 11111111 000000000 11111111. Using _mm256_movemask_epi8(j) we will get 13 (1101) in an int and will allow us to use bitshifts and normal math. The input in our example source is "12ABC". We want to search "abc". movemask on first letter will give us 00100, second would give us 01000 and third is 10000. By shifting the results of the first letter by 2 and the second letter by 1 we'll have all 3 bits aligned on the last letter and we can use & to confirm all 3 letters were matched. Next is converting the set bit (16 or in binary 10000) to an index

Here we use tzcnt (trailing zero count). In the case of 16 (10000) there are 4 zeros before the first set bit, tzcnt will return 4. If you look at our input ("12ABC"), input[4] is 'C' which is correct because that's where we set the bit in our algorithm. tzcnt is a nice instruction that prevents us from writing a loop counting zeros. The simd code compares the first 3 letters so what we want to do is compare the 4th and the rest of the letters in the word to see if they match. Since tzcnt gives us the index of the third letter we use &file[i+1] to start comparing at the 4th letter.

Results: Our original code didn't use SIMD but checked if the first three letters matched before calling InsensitiveCompare. It took ~1second to search a 1GB json file. After using AVX it took <120MS. Searches feel much nicer after this change. SIMD may be a little difficult to work with from not having all the instructions you're used to, but once you have strategies on how to change your algorithm you can get a pretty large improvement, even if your data is coming from RAM.