beanz Magazine

Stop Words

Daniel Dionne on Flickr

A clever technique for speeding up database searches also turns out to be an interesting concept.

Imagine you have written search engine software. Now you want to speed up how fast your software searches the database. How do you do that? What are some trade-offs?

Let's say someone types this phrase in your search engine:

The rain falls mainly in the plain in Spain in the winter

Notice there are three instances of the word the in this phrase. What if you replaced the with an asterisk, like this:

* rain falls mainly in * plain in Spain in * winter

If the database behind your search results holds 10 million records, replacing each three-character the with a single character, an asterisk, could save a lot of space. It could also make searches faster, because there is less data to parse.

Now notice this search phrase also uses the word in repeatedly. Let's replace that with an asterisk as well:

* rain falls mainly * * plain * Spain * * winter

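Clearly this chunk of data now has fewer words to parse, and the ones left are the distinctive ones: rain, falls, mainly, plain, Spain, winter. In theory, removing the and in lets the search focus on the words that carry meaning, which should yield more accurate search results.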
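The two substitutions above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works. The function name mask_words is made up for this example.

```python
import re

def mask_words(phrase, words, mask="*"):
    """Replace each whole-word occurrence of the given words with a mask."""
    # \b marks a word boundary, so "in" is not matched inside "rain" or "Spain".
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(mask, phrase)

phrase = "The rain falls mainly in the plain in Spain in the winter"
masked_the = mask_words(phrase, ["the"])
masked_both = mask_words(phrase, ["the", "in"])

print(masked_the)   # * rain falls mainly in * plain in Spain in * winter
print(masked_both)  # * rain falls mainly * * plain * Spain * * winter
print(len(phrase), len(masked_the), len(masked_both))  # 57 51 48
```

The phrase shrinks from 57 characters to 48. That is a small saving for one phrase, but it adds up across millions of records.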

To search your database, you might run several queries: one for the word rain, another for falls, and so on for mainly, plain, Spain, and winter. Each of these searches would be faster for not having to parse the words the and in.
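One common way to make per-word searches fast is to build an index that maps each word to the records containing it. The sketch below assumes that approach. The records, the SKIPPED set, and the function names are all invented for illustration.

```python
from collections import defaultdict

SKIPPED = {"the", "in"}  # the two words removed in the example above

# Hypothetical records, invented for this example.
RECORDS = {
    1: "The rain falls mainly in the plain",
    2: "Winter in Spain",
    3: "Rain in the winter",
}

def build_index(records):
    """Map each kept word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            if word not in SKIPPED:
                index[word].add(record_id)
    return index

def search(index, phrase):
    """Run one lookup per kept word and return ids that match any of them."""
    matches = set()
    for word in phrase.lower().split():
        if word not in SKIPPED:
            matches |= index.get(word, set())
    return matches

index = build_index(RECORDS)
print(sorted(index))  # ['falls', 'mainly', 'plain', 'rain', 'spain', 'winter']
print(search(index, "rain in Spain"))  # {1, 2, 3}
```

Because the and in never enter the index, it stays smaller, and a query never spends a lookup on them.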

Words like the, in, at, that, which, and on are called stop words. The term is credited to Hans Peter Luhn, an early pioneer of information retrieval techniques. Stop words are so common they can be excluded from searches: parsing them adds work for the software while providing minimal benefit. People rarely search only for the word the, for example.

However, if you want to search for information about the band The Who, or for any phrase that includes a stop word, your search engine may not return accurate results. Stop words can accidentally prevent correct results. Removing the word which from your search database might not cause problems. Removing the word the probably will.
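Here is a small sketch of how that problem can show up. The stop list uses the words named in this article, and the two records are invented for the example. Real search engines use their own lists and far more careful matching.

```python
# Stop list built from the words named in this article (an assumption;
# real engines choose their own lists).
STOP_WORDS = {"the", "in", "at", "that", "which", "on"}

# Hypothetical records, invented for this example.
RECORDS = [
    "The Who released a new album",
    "Who won the match on Sunday",
]

def strip_stop_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

def search(query):
    """Return the records that contain every non-stop word of the query."""
    wanted = set(strip_stop_words(query))
    return [record for record in RECORDS if wanted <= set(strip_stop_words(record))]

results = search("The Who")
print(strip_stop_words("The Who"))  # ['who']
print(results)  # both records match, including the one about a match on Sunday
```

Once the is stripped, the query is just the word who, so the search can no longer tell the band apart from any other sentence that uses that word.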

One clever solution might be to mark the occurrence and position of stop words while also removing them from a database. In our example above, you might replace the instances of the word the with the number 1 and instances of the word in with the number 2, like this:

1 rain falls mainly 2 1 plain 2 Spain 2 1 winter

This combines the benefit of keeping stop words with the speed gained from removing them from the database. In a later step in your search results processing, you could restore the words the and in by translating each 1 back to the and each 2 back to in. Instead of a dumb asterisk, you use a single character of space in a more subtle and meaningful way.
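As a rough sketch, the numbering idea might look like this in Python. The CODES table and the encode and decode functions are hypothetical names chosen for this example.

```python
# Hypothetical code table: each stop word gets a one-character code.
CODES = {"the": "1", "in": "2"}
WORDS = {code: word for word, code in CODES.items()}

def encode(phrase):
    """Swap each stop word for its code, keeping its position in the phrase."""
    return " ".join(CODES.get(word.lower(), word) for word in phrase.split())

def decode(encoded):
    """Swap each code back to the stop word it stands for."""
    return " ".join(WORDS.get(token, token) for token in encoded.split())

encoded = encode("The rain falls mainly in the plain in Spain in the winter")
print(encoded)          # 1 rain falls mainly 2 1 plain 2 Spain 2 1 winter
print(decode(encoded))  # the rain falls mainly in the plain in Spain in the winter
```

This simple version has limits. The capital T in The is lost on the way back, and a record that really contains the number 1 or 2 would be decoded incorrectly, so a real system would need codes that cannot collide with ordinary text.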

Another solution for handling stop words has to do with how search terms are entered. Using double quotes around a phrase tells the search engine to treat the phrase as a single block. Your search engine code could look for double quotes and treat the words between them as a single block. So this search phrase would return accurate results even as it uses a stop word:

"The Who" song lyrics

If you substituted instances of the word the in your database with the number 1, your search might look for "1 Who" as a single block, along with one search for song and another for lyrics.
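A minimal sketch of that idea follows, assuming the query uses plain straight double quotes and the same hypothetical number codes as before. The function names are invented for this example.

```python
import re

# Hypothetical code table, matching the numbering example in this article.
CODES = {"the": "1", "in": "2"}

def parse_query(query):
    """Split a query into quoted phrases and the loose words around them."""
    phrases = re.findall(r'"([^"]+)"', query)
    # Remove the quoted parts, then split what is left into single words.
    loose_words = re.sub(r'"[^"]+"', " ", query).split()
    return phrases, loose_words

def encode(phrase):
    """Swap each stop word in a phrase for its code."""
    return " ".join(CODES.get(word.lower(), word) for word in phrase.split())

phrases, loose_words = parse_query('"The Who" song lyrics')
searches = [encode(phrase) for phrase in phrases] + loose_words
print(searches)  # ['1 Who', 'song', 'lyrics']
```

Each item in the list would become its own search: the quoted block is looked up whole, and the loose words are looked up one at a time.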

As with all the examples and possibilities in this article, how a real search engine is designed, built, and coded is extremely complex and hard to predict. These details are generalizations meant to explain the concept of stop words and how they impact search engines.

What search engines leave in and out of their databases depends on the informed opinions and experience of the programmers who design and create the engine. As with many parts of computing, there is no 100% best way to solve the problem of providing accurate search results quickly. Using stop words is simply one approach among many. Think about that the next time you type the or at into a search engine.

Learn More

Wikipedia: Stop Words

http://en.wikipedia.org/wiki/Stop_words


