beanz Magazine

Stop Words

Daniel Dionne on Flickr

A clever technique for speeding up database searches is also an interesting concept in its own right.

Imagine you have written search engine software. Now you want your software to search the database faster. How do you do that? What are some trade-offs?

Let's say someone types this phrase in your search engine:

The rain falls mainly in the plain in Spain in the winter

Notice there are three instances of the word the in this phrase. What if you replaced the with an asterisk, like this:

* rain falls mainly in * plain in Spain in * winter

If your search results come from a database of 10 million records, replacing the three characters of the word the with a single character, an asterisk, could save a lot of space in your database and make searches faster because there is less data to parse.
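To get a rough feel for the savings, here is a small Python sketch that counts the characters saved in the example phrase. It assumes, unrealistically, that every one of the 10 million records looks like this phrase and contains the word the three times. Real data would vary widely, so treat the total as an illustration rather than a measurement.

```python
# The example phrase before and after replacing "the" with an asterisk.
phrase = "The rain falls mainly in the plain in Spain in the winter"
masked = "* rain falls mainly in * plain in Spain in * winter"

# Each "the" (3 characters) becomes "*" (1 character), saving 2 characters.
saved_per_record = len(phrase) - len(masked)

# Assumption: all 10 million records save the same amount as this phrase.
records = 10_000_000
total_saved = saved_per_record * records

print(saved_per_record)  # 6
print(total_saved)       # 60000000
```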

Now notice this search phrase also uses the word in repeatedly. Let's replace that with an asterisk as well:

* rain falls mainly * * plain * Spain * * winter

What remains are the distinctive words in the phrase: rain, falls, mainly, plain, Spain, winter. In theory, removing the and in leaves the search engine with only the words that set this phrase apart, which should yield more accurate search results.
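The asterisk replacement described above can be sketched in a few lines of Python. The function name mask_stop_words and the two-word stop list are assumptions made for this illustration, not part of any real search engine.

```python
# Hypothetical stop word list: only the two words used in the example.
STOP_WORDS = {"the", "in"}

def mask_stop_words(phrase, stop_words=STOP_WORDS, marker="*"):
    """Replace every stop word in the phrase with a marker character."""
    words = phrase.split()
    # Compare in lowercase so "The" and "the" are treated the same.
    return " ".join(marker if w.lower() in stop_words else w for w in words)

phrase = "The rain falls mainly in the plain in Spain in the winter"

print(mask_stop_words(phrase, {"the"}))
# * rain falls mainly in * plain in Spain in * winter

print(mask_stop_words(phrase))
# * rain falls mainly * * plain * Spain * * winter
```

This sketch splits on spaces only, so punctuation attached to a word (such as "the,") would not be matched. A real system would need a more careful way to split text into words.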

To search your database, you might run several queries: one for the word rain, another for falls, another for mainly, another for plain, another for Spain, and another for winter. Each of these searches would be faster because it does not have to parse the words the and in.
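One common way to run a separate search for each word is to build an index that maps each word to the records containing it. The Python sketch below is a simplified illustration under stated assumptions: the three sample records are made up, and the names build_index and search are hypothetical. Stop words are skipped both when building the index and when reading the query.

```python
from collections import defaultdict

# Hypothetical stop word list: only the two words used in the example.
STOP_WORDS = {"the", "in"}

def build_index(records):
    """Map each non-stop word to the set of record numbers containing it."""
    index = defaultdict(set)
    for record_id, text in enumerate(records):
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(record_id)
    return index

def search(index, query):
    """Run one lookup per query word, ignoring stop words."""
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    return {term: sorted(index.get(term, set())) for term in terms}

# Made-up sample records, for illustration only.
records = [
    "The rain falls mainly in the plain in Spain in the winter",
    "Winter in the mountains",
    "The plain truth",
]

index = build_index(records)
results = search(index, "the rain in Spain")
print(results)  # {'rain': [0], 'spain': [0]}
```

Because the and in never enter the index, the index is smaller and the query needs two lookups instead of four.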

Words like the, in, at, that, which, and on are called stop words. The term is credited to Hans Peter Luhn, an early pioneer of information retrieval techniques. Stop words are so common they can be excluded from searches: parsing them adds work for the software while providing minimal benefit. People rarely search only for the word the, for example.

However, if you want to search for information about the band The Who, or for any phrase that includes a stop word, your search engine may not return accurate results. Removing stop words can accidentally prevent correct results. Removing the word which from your search database might not cause problems. Removing the word the probably will.

One clever solution might be to mark the occurrence and position of stop words while also removing them from the database. In our example above, you might replace the instances of the word the with the number 1 and instances of the word in with the number 2, like this:

1 rain falls mainly 2 1 plain 2 Spain 2 1 winter

This combines the benefit of keeping track of stop words with the speed gained from removing them from the database. In a later step in your search results processing, you could restore the words the and in by translating each 1 back to the and each 2 back to in. Instead of a dumb asterisk, you use a single character space in a more subtle and meaningful way.
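The numbering idea can be sketched in Python as a pair of functions, one to encode and one to decode. The names encode and decode and the code table are assumptions for this illustration.

```python
# Hypothetical code table: each stop word gets a one-character code.
STOP_WORD_CODES = {"the": "1", "in": "2"}

# Reverse table used to translate codes back into words.
CODE_TO_WORD = {code: word for word, code in STOP_WORD_CODES.items()}

def encode(phrase):
    """Replace each stop word with its code, keeping its position."""
    return " ".join(STOP_WORD_CODES.get(w.lower(), w) for w in phrase.split())

def decode(encoded):
    """Translate each code back into the stop word it stands for."""
    return " ".join(CODE_TO_WORD.get(t, t) for t in encoded.split())

phrase = "The rain falls mainly in the plain in Spain in the winter"
encoded = encode(phrase)

print(encoded)          # 1 rain falls mainly 2 1 plain 2 Spain 2 1 winter
print(decode(encoded))  # the rain falls mainly in the plain in Spain in the winter
```

Two limits of this sketch are worth noting. The capital T in The is lost after decoding, and a record that contains the actual number 1 or 2 would be confused with a code. A real system would need codes that cannot appear in ordinary text.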

Another solution for handling stop words has to do with how search terms are entered. Using double quotes around a phrase tells the search engine to treat the phrase as a single block. Your search engine code could look for double quotes and treat the text between them as one unit. So this search phrase would return accurate results even though it uses a stop word:

"The Who" song lyrics

If you substituted instances of the word the in your database with the number 1, your search engine might look for "1 Who" along with one search for song and another for lyrics.
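Here is one way the quote handling might look in Python. This is a sketch under assumptions: the function name parse_query, the "phrase" and "term" labels, and the code table are all made up for this example. Quoted text is kept as a single block with stop words swapped for their codes, while unquoted stop words are dropped.

```python
import re

# Hypothetical code table: each stop word gets a one-character code.
STOP_WORD_CODES = {"the": "1", "in": "2"}

def encode(text):
    """Replace each stop word with its code, keeping its position."""
    return " ".join(STOP_WORD_CODES.get(w.lower(), w) for w in text.split())

def parse_query(query):
    """Split a query into quoted phrases and single search terms."""
    searches = []
    # The pattern matches either "text in double quotes" or one bare word.
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            # Quoted text: keep it as one block, with stop words encoded.
            searches.append(("phrase", encode(phrase)))
        elif word.lower() not in STOP_WORD_CODES:
            # Unquoted word: search for it unless it is a stop word.
            searches.append(("term", word))
    return searches

print(parse_query('"The Who" song lyrics'))
# [('phrase', '1 Who'), ('term', 'song'), ('term', 'lyrics')]
```

The result is a list of three searches: one for the block "1 Who", one for song, and one for lyrics.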

As with all the examples and possibilities in this article, how a real search engine is actually designed, built, and coded is extremely complex and hard to predict. These details are generalizations meant to explain the concept of stop words and how they affect search engines.

What search engines leave in and out of their databases depends on the informed opinions and experience of the programmers who design and create the engine. As with many parts of computing, there is no 100% best way to solve the problem of providing accurate search results quickly. Stop words are simply one approach among many. Think about that the next time you type the or at into a search engine.

Learn More

Wikipedia: Stop Words

http://en.wikipedia.org/wiki/Stop_words


