Search upgraded (yet again)

Written By: - Date published: 12:47 am, March 15th, 2010 - 4 comments
Categories: admin - Tags:

My absence this weekend from moderation and posts has partially been because of going to the excellent Fabian seminar today. However, it has largely been due to trying to get a workable search system running.

I wanted one that didn’t bring the whole server crashing around my ears every few months, and that ended the problem of everyone having to stop receiving data for up to a minute whenever someone used search. It also had to allow comments to be searched as easily as posts, which Search Unleashed achieved (though it eventually failed because the re-index took too long). The new system looks like it is finally working correctly, but I’d like feedback on any errors that show up.

For the technically minded, this is running on the Sphinx search engine with the WordPress Sphinx Search Plugin, and some custom coding in the search sidebar and search page template for our new theme. It runs a delta update for changed items every 5 minutes, and a complete rebuild at 0300.
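For anyone curious how that schedule hangs together, here is a minimal Python sketch of the decision a scheduler would make each minute. It is not our actual cron setup: the index names `wp_main` and `wp_delta` are made up for illustration, and the only things taken from the real world are the 5 minute and 0300 timings above and the standard `indexer` options `--rotate` and `--all`.

```python
from datetime import datetime

# Hypothetical index names; the real ones would be whatever sphinx.conf defines.
MAIN_INDEX = "wp_main"
DELTA_INDEX = "wp_delta"


def indexer_command(now: datetime) -> list[str]:
    """Return the indexer command a scheduler might run at `now`.

    Assumes the schedule described in the post: a complete rebuild
    at 03:00 and a delta update every 5 minutes otherwise.
    An empty list means nothing is due this minute.
    """
    if now.hour == 3 and now.minute == 0:
        # Rebuild every index and have searchd rotate the new files in.
        return ["indexer", "--rotate", "--all"]
    if now.minute % 5 == 0:
        # Only re-index the (small) delta of changed items.
        return ["indexer", "--rotate", DELTA_INDEX]
    return []


if __name__ == "__main__":
    print(indexer_command(datetime(2010, 3, 15, 3, 0)))
    print(indexer_command(datetime(2010, 3, 15, 12, 5)))
```

The function only builds the command and never runs it, so it is safe to play with on a machine that has no Sphinx installed.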

This is the third time I’ve attempted to get this system running. Each of the previous attempts was stymied by Sphinx disliking our constrained memory during peak hours. But we’ve bumped the RAM up, and it also turns out that there were some conflicting permission issues (grrr) when running the cron jobs with an active website.

Normally I’d have consigned something as annoying to configure as this to the dustbin, but the engine is blindingly fast and perfect for our scaling needs with nearly 5000 posts and 150,000 comments currently onsite. Those numbers are expected to keep rising at rates similar to or higher than those we have seen so far.

Sphinx can reindex our entire database in about 30 seconds (whereas Search Unleashed required hours), and will do the delta every 5 minutes in well under a second. Almost all of the delay in presenting the search result is from assembling the display rather than the search. It is a seriously good engine and has been (so far) worth the effort.

The user interface is a little primitive at present, and I also need to store and retrieve your user-selected advanced settings in cookies. But it is good enough to look for loading problems over this week. It has a number of options that will be useful for advanced searching, especially for authors like r0b…

Sphinx extended query syntax allows the following special operators to be used:

  • AND: hello world (corrected from hello&world)
  • OR: hello | world
  • NOT: hello -world or hello !world
  • field search operator: @title hello @body world
  • phrase search operator: "hello world"
  • proximity search operator: "hello world"~10

Here’s an example query which uses all these operators:
"hello world" @title "example program"~5 @body python -(php|perl)

You can find more information about the extended syntax at Sphinx.
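If you would rather assemble queries like that one piece by piece, the operators compose quite naturally. The Python sketch below is purely illustrative: the helper names (`phrase`, `proximity`, `field`, `any_of`, `exclude`, `all_of`) are my own invention and not part of Sphinx or the WordPress plugin, and it only builds the query string. It does no escaping of special characters and never talks to the search engine.

```python
def phrase(words: str) -> str:
    """Phrase search: the words must appear together, in order."""
    return f'"{words}"'


def proximity(words: str, distance: int) -> str:
    """Proximity search: the words must appear within `distance` words."""
    return f'"{words}"~{distance}'


def field(name: str, expr: str) -> str:
    """Field search: restrict `expr` to one field, e.g. title or body."""
    return f"@{name} {expr}"


def any_of(*terms: str) -> str:
    """OR: match any one of the terms."""
    return "(" + "|".join(terms) + ")"


def exclude(expr: str) -> str:
    """NOT: drop results matching `expr`."""
    return f"-{expr}"


def all_of(*parts: str) -> str:
    """AND is implicit in the extended syntax, so parts are joined by spaces."""
    return " ".join(parts)


# Rebuild the example query from the post out of its parts.
query = all_of(
    phrase("hello world"),
    field("title", proximity("example program", 5)),
    field("body", all_of("python", exclude(any_of("php", "perl")))),
)

if __name__ == "__main__":
    print(query)
```

Note that the helpers emit plain straight double quotes, which is what you should type into the search box for phrase and proximity searches.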

The following field operators are available:

  • @title – search in post or page title
  • @author – search in post, page, comment author (still a bit flakey)
  • @body – search in post, page, comment body
  • @category – search in blog categories

For instance, I tested such things as lprent fabian (corrected from lprent&fabian), pigs-flying, and (supershitty|super-city|super city|supercity) franklin (corrected from a version with an & before franklin).

Using the author field, you can search for f*ck @author robinsod (that allows you to exclude the people swearing at him).

If you use the advanced options and turn pages and comments off, you can resolve a long-standing argument.

  • @title "helen clark" and @title "john key" will respectively tell you how often each was mentioned in the title of a post.
  • @body "helen clark" and @body "john key" will respectively tell you how often each was mentioned in the body of a post.
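For settling that sort of argument with more than two names, the four queries above follow an obvious pattern. Here is a small Python sketch that generates them; the function name `mention_queries` is made up for this example, and it only produces the query strings to paste into the search box. Counting the hits is still up to the search page.

```python
def mention_queries(names, fields=("title", "body")):
    """Build one field-restricted phrase query per (field, name) pair.

    Returns a dict keyed by (field, name) so the results can be
    lined up side by side afterwards.
    """
    return {
        (field_name, name): f'@{field_name} "{name}"'
        for name in names
        for field_name in fields
    }


if __name__ == "__main__":
    for key, query in mention_queries(["helen clark", "john key"]).items():
        print(key, "->", query)
```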

If I see that argument again…..

The time to run a search seems to be roughly the same regardless of how complex I make the query.

Anyway, let me know if you find any issues using the form in contact. I’ll be keeping an eye on it for the next 3 or 4 days for performance issues.

Hopefully I’ll now have more time to write posts and tackle the minor teething issues like the misbehaving Contribute Past captcha, and less time trying to maintain a fragile search mechanism that was slowly fading under increased loads.

BTW: For those of you who were at the Fabian seminar today, I was paying attention to the panel and discussion, but I was also coding during the seminar. This had been bugging me most of the weekend to the point where I had to go to the seminar, tether the iPhone to the net, and finish the damn search page interface.

Update: Been through the site settings for the first time since it last had a server move. You’ll see a significant improvement in performance in the search, and probably in just reading the site.

Update: Found a couple of errors in the post, which you will see corrected with the original in strikeout. There are a number of enhancements to the system (like @author). I added some additional examples while testing.

But I was paying attention to the seminar as well

4 comments on “Search upgraded (yet again) ”

  1. lprent 1

    Ummm, one thing it misses is the comment author name…. and post author name..

    Ok that is a problem in the query, which I’ll adjust in the sphinx.conf tomorrow sometime

    • mummybot 1.1

      Good on ya mate! 😉

      • lprent 1.1.1

        Did a burst of database optimization tonight as well. I’m pretty sure that the site will be significantly faster for reading and probably the same or slightly better for writing.

        Sphinx was loading down the system (as were some of the other queries), so I tuned up a couple of tables with indexes based on what was showing in the slow query log. Then I tuned MySQL to use more of the memory/disk/CPU tradeoff. Then I did some tweaks to the PHP settings based on its logs.

        I’d better give up some of these late night sessions.

      • lprent 1.1.2

        Just while I remember, the HTML ins and del need some css coloring and decoration. del should be coloured, and the insertions should be a different colour.
