Author Archive for ‘Eric Pugh’

Things I Learned About Last Week

Posted Tuesday, March 9th, 2010 by Eric Pugh

Last week was the crucial week on my current Lucene -> Solr project for making our goals.  A lot of work the previous couple of weeks came together.  I wanted to take a couple of minutes and just record some of the little things that I’ve been learning about:

Solr

Sunspot is the up-and-coming solution for integrating Solr into Ruby on Rails, and fortunately enough, the 1.0 release (followed quickly by 1.0.1!) came out just last week.  Between acts_as_solr and Sunspot, Sunspot wins hands down for its support of master/slave Solr configurations, embedded Solr for testing, richer indexing semantics, and not being tied to ActiveRecord.  The companion sunspot_rails gem does give wonderful ActiveRecord integration, however.

Solr cores are the bee’s knees!  We’ve built a simple RoR webapp using HTTParty and the Solr API that allows you to perform all the admin functions for cores, and allows you to quickly clone a core for your own nefarious purposes!  That simplifies hacking around with a new schema or configuration without having a local copy of Solr running, and it allows multiple QA environments to potentially share a single Solr infrastructure.
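
As a rough illustration of the kind of requests such a tool issues, here is a minimal sketch that only builds CoreAdmin URLs using Ruby's standard library. The base URL, the core names, and the helper names (core_admin_url, clone_core_url) are assumptions made up for this example, and the /admin/cores path depends on the adminPath configured in solr.xml. The real app used HTTParty to actually send the requests.

```ruby
require "uri"

# Assumed base URL: a local Solr on the stock Jetty port.
SOLR_BASE = "http://localhost:8983/solr"

# Build a CoreAdmin request URL for the given action and parameters.
def core_admin_url(action, params = {}, base = SOLR_BASE)
  query = URI.encode_www_form({ "action" => action }.merge(params))
  "#{base}/admin/cores?#{query}"
end

# "Cloning" in this sketch means: copy the source core's instance directory
# on disk (not shown), then register the copy as a new core with CREATE.
def clone_core_url(new_name, copied_instance_dir)
  core_admin_url("CREATE", "name" => new_name, "instanceDir" => copied_instance_dir)
end

puts core_admin_url("STATUS")
puts core_admin_url("RELOAD", "core" => "qa1")
puts clone_core_url("qa1_scratch", "cores/qa1_scratch")
```

Each URL can then be fetched with whatever HTTP client you prefer, for example HTTParty.get(url).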

Solr master and slave setup in a single VM.  While pointless from a scaling perspective, it’s a really great way to work out the kinks!  It’s funny to see a slave core polling the same Solr VM it’s in for updated segments!

JRuby

Doesn’t suck after all.  Actually, maybe I should say that JBoss, when combined with JRuby, means that JBoss doesn’t suck so much.  I had the aforementioned Solr core admin tool bundled up as a WAR file with JRuby, and was able to deploy it to an existing environment that had JBoss installed!  I didn’t have to install Ruby on the box (or JRuby for that matter!).  I just deployed the WAR file and bam, off to the races.  Ops folks get the JBoss they love, I get the Ruby on Rails that I love.

And on a related note, Warbler was the key to thinking JRuby is cool.  I’d never actually had to package up a RoR app, so Warbler came to the rescue.  And you know what?  It was nice to build a single file that I knew had everything that I needed in it that could be scp’ed around!  And thanks to some cool code in the environment.rb, my app was able to load up the right configuration file for the environment based on an environment variable set in JBoss.
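
The post doesn't show that environment.rb code, so here is only a minimal sketch of the general idea: pick a per-environment YAML file based on a variable. The variable name SOLR_ADMIN_ENV, the file layout, and both helper names are hypothetical, and the example writes a throwaway config file so that it runs on its own.

```ruby
require "yaml"
require "tmpdir"

# Hypothetical variable name; the original post does not say which one was used.
ENV_VAR = "SOLR_ADMIN_ENV"

# Pick the config file path for the current environment (default: development).
def config_path_for(config_dir, env = ENV)
  name = env.fetch(ENV_VAR, "development")
  File.join(config_dir, "#{name}.yml")
end

# Load the YAML settings for the current environment, failing loudly if missing.
def load_app_config(config_dir, env = ENV)
  path = config_path_for(config_dir, env)
  raise ArgumentError, "No configuration file at #{path}" unless File.exist?(path)
  YAML.load_file(path)
end

# Usage: a plain Hash stands in for ENV so the sketch is repeatable.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "qa.yml"), "solr_url: \"http://qa-solr.example:8983/solr\"\n")
  config = load_app_config(dir, { "SOLR_ADMIN_ENV" => "qa" })
  puts config["solr_url"]
end
```

Under JRuby inside JBoss the value might instead arrive as a Java system property, in which case the lookup would change but the file selection would stay the same.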

Virtual Machines

I recently migrated a Linux VPS-based RoR + Solr app (see a trend in tech choices ;-) ) to a Windows environment.  And to deliver the new Windows environment, I used VirtualBox to host the Windows Vista environment on my Mac laptop.

A couple of notes:

  • VirtualBox may not have all the snazzy integration points of Parallels with the host computer like seamless application sharing, but it seems to be much lighter weight.  Starts up quicker, and I don’t get the spinning beach ball of death as much.
  • If you are shipping an 11 GB file, you can’t use a 16 GB USB memory stick as-is…  It turns out the biggest single file it will take is 4 GB, presumably because the stick was formatted as FAT32.  (I never tried formatting the stick as NTFS; maybe that would have allowed a single 11 GB file?)
  • Uploading 11 GB to a remote server out on the internet will take a long, long, long time.  Even on a really fast network connection.
  • If you need to format an external USB hard drive as NTFS on a Mac, it is possible!  Just fire up your trusty Windows Vista image in Parallels, plug the USB drive in, download and install the correct USB drivers so the drive doesn’t show up as a network share mapped to the Mac, and then use the built in reformatting tools!  Warning: This will take a loooong time!
  • Lastly, if you are using VirtualBox, and you attempt to create a Windows XP machine, and attach a Windows Vista hard disk image to it, VirtualBox will let you!  And then Windows won’t start.  sigh.
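
For a feel of the numbers behind the file-size and upload notes above, here is a back-of-the-envelope sketch. It assumes the stick was FAT32 (whose per-file limit is 4 GiB minus one byte), treats "11 GB" as 11 GiB, and uses a made-up 5 Mbit/s uplink; none of those figures come from the actual setup.

```ruby
# FAT32 stores file sizes in a 32-bit field, so one file tops out at 4 GiB - 1 byte.
FAT32_MAX_FILE_BYTES = 2**32 - 1
GIB = 1024**3

# Number of pieces needed so that each piece fits under limit_bytes (ceiling division).
def pieces_needed(size_bytes, limit_bytes = FAT32_MAX_FILE_BYTES)
  (size_bytes + limit_bytes - 1) / limit_bytes
end

# Rough transfer time in hours at a given uplink speed, ignoring protocol overhead.
def upload_hours(size_bytes, megabits_per_second)
  size_bytes * 8 / (megabits_per_second * 1_000_000.0) / 3600
end

image_bytes = 11 * GIB
puts "Pieces needed on FAT32: #{pieces_needed(image_bytes)}"
puts "Hours to upload at 5 Mbit/s (assumed): #{upload_hours(image_bytes, 5).round(1)}"
```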

Notes from using LucidWorks for Solr Distro

Posted Thursday, January 28th, 2010 by Eric Pugh

I’ve been playing with the LucidWorks for Solr distribution of Solr 1.4, and wanted to share some of the things I had noticed about it. The LucidWorks distro is Solr 1.4 with patches and enhancements from Lucid added in.

Installer

The first thing you’ll notice is that an installer (and uninstaller) is provided that walks you through the basic steps of installing Solr. Now Solr itself is pretty darn simple to work with already, but you do need to compile the code, which means you need Ant to be installed. The Lucid installer avoids that need, and adds support for running Solr in Tomcat as well as Jetty. And, assuming you have a support agreement with Lucid, it supports downloading plugins from Lucid to extend your Solr platform. Right now the only free plugin is the Reference Guide PDF. Having an installer available definitely checks a box for the systems type folks who may be installing Solr, but it doesn’t really do anything crazy special. Also, one nit is that if you install into /opt/dirA, and then want to install into /opt/dirB, you have to delete the ~/.LucidWorks/ directory, as the install dir is cached!  But it does demonstrate what might be coming from Lucid in future updates!

Installer Targets Screen

Another enhancement from Lucid is a Tray Application for managing your Solr instances. However, this turns out to just be a basic (on OSX at least!) menubar application that allows you to start/stop a local Solr server. There don’t seem to be any options to stop and start remote servers, or to monitor the health of running Solrs, so I think this is something you use once and never again! Hey Lucid, it would be great though if the Tray App integrated stoplight monitoring of Solr instances and popped open web pages to admin pages to perform various tasks on your collection of Solr servers!

Directory Layout

The directory that you’ve installed Solr into should look very familiar. In fact, too familiar to me! I’ve gone back and forth on the way that Solr is distributed with source code as well as compiled jars. While Solr used to be a tool that only Java centric shops would look at, it’s now gone mainstream, to where many, if not most, organizations that use Solr are not traditional Java shops! I really wish I could download a version of Solr that didn’t have the src directory, was just a stripped down ready to go application. Admittedly, the example application that is part of the source functions as a template, but it has been bemoaned by myself and others that folks just use and abuse the configuration of what was meant as an example app, to their detriment!

So I was hoping that the LucidWorks distro’s installer would function as that smart template by walking me through including/excluding various extensions like DIH, Clustering, and Extraction. But at least in this first version, no such luck. The support for picking either Tomcat or Jetty as a container shows what could be in the offing, though!

While the LucidWorks distro still ships with the hoary old example directory, there is now a lucidworks directory as well. When you run the new top-level start.sh shell script, it starts Solr with solr.solr.home set to the lucidworks/solr directory. Something to note is that start.sh has complete paths from the installer hard-coded in it:

cd /Users/epugh/solr/solr2/LucidWorks/lucidworks/jetty/../

It really should at least have a single variable at the top that you can change depending on what environment you are in.

The lucidworks project is also set up as a single index project.  Since the future is multicore configurations, I’d like to see that as the default in more examples.  (The example app needs a bit of work as well to better show off multicore as a first class feature!)

solrconfig.xml

Doing a diff on the example and lucidworks versions of solrconfig.xml shows it’s pretty much the same as the one from the example app, but with the correct configurations for DataImportHandler and the Velocity based search UI called Solritas. Solritas is a nice tool for helping you “wedge” Solr into places by providing a simple Velocity template based translation layer, and even for building a GUI, within your Solr environment. Solritas hasn’t received a lot of buzz, so it’s nice seeing it turned on by default! The clustering functionality is also specified, but I’m not sure whether the solr.cluster.enabled=true startup parameter is actually required.

The other oddity is that the Lucid monitoring product for Solr, SolrGaze, isn’t enabled by default! Doesn’t seem like the most ringing endorsement for the software. I’m excited by the prospect of better visibility into the internals of Solr, so I enabled it.

schema.xml

Diffing the two schema.xml files reveals the addition of the Lucid KStemmer, com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory, for fast non-aggressive text stemming. According to Lucid:

Large field performance shows a 220% performance increase, while small fields show a 1140% increase compared to the original UMASS code.

SolrGaze

SolrGaze promises to make it easier to see what is going on inside of Solr. Anything that makes it simpler for operations folks instead of developers to manage Solr is good in my book. I ran into one nit: when I opened up SolrGaze using the URL http://localhost:8983/gaze/index.html, it barfed connecting to Solr to display gathered metrics, but if I used http://127.0.0.1:8983/gaze/index.html then everything was fine.

I haven’t had the chance to really play with Gaze yet, so I’ll post a more in-depth review soon.

Summary

All in all, the Lucid distro would be what I would recommend for a first timer to download, or someone doing a spike of development and needing a quick install of Solr.  Not requiring Ant to be installed is a wonderful thing, and being pre-configured for Clustering, DIH, and Solritas means you get to see a working Solr install, complete with a full featured GUI, right out of the box.  In terms of using it for a production deploy, there is less to recommend it, since you’re going to want to strip down to just the bits and bobs that you require for your specific needs.  I haven’t delved down into what SolrGaze provides, so that feature may be the tipping point for deciding to use the Lucid distribution.

Erik Hatcher, Solr Committer, reviews Solr 1.4 Enterprise Search Server

Posted Monday, January 11th, 2010 by Eric Pugh

When I first got involved in writing Solr 1.4 Enterprise Search Server, I knew that one of the folks I wanted to have review the book was Erik Hatcher, a Solr committer and the person who introduced me to the project.

He has written a very in-depth review that I’ll admit I was nervous to read! But he summed it up as:

Grand Finale
I spelled out a lot of fiddly feedback above, and I expect the great addendum wiki page will factor in any keepers from this review. Of course most of the review points out mistakes or differences of opinion, that’s what a review is for, though this is a solid, useful book. So, if you’re considering using Solr, this book is for you. If you’re already using Solr, you’ll likely pick up a useful trick or three. Go get it!

As you can see from the level of detail in his post, when we come out with a second version of the Solr book, updating it for changes between when we published it and the final release of Solr 1.4 will be very easy!

Streaming Index Progress Results to Browser

Posted Friday, December 11th, 2009 by Eric Pugh

I recently needed to index several thousand static webpages from a local filesystem into Solr. I was already using Ruby on Rails for the admin interface, so I quickly threw together an action to index the documents using Hpricot and RSolr. To monitor the progress I just output to standard out using puts:

def index_bulk_html
  solr = RSolr.connect :url => SOLR_URL
  count = 0
  files = Dir.glob("/Users/epugh/Documents/code/www.somesite.com/**/*.{html,htm}")
  files.each do |file|
    # Only index files that live under the mirrored site directory.
    path_ends_at = file.index("www.somesite.com")
    unless path_ends_at.nil?
      puts("<strong>Processed #{count} of #{files.size}</strong>") if count % 100 == 0

      # Rebuild the public URL from the part of the path starting at the host name.
      url = "http://#{file[path_ends_at, file.size]}"
      # parse_html is a helper (not shown in this post) that returns the page
      # title and body text; title and page_content are nil going in.
      title, page_content = parse_html(file, title, page_content)

      puts "Bad Content:#{page_content.blank?} #{url} #{title}"

      begin
        solr.add :id => url, :url => url, :mimeType => "text/html", :title => title, :docText => page_content
        solr.commit
        count += 1
      rescue RSolr::RequestError
        puts "<strong>Could not index #{file}</strong>"
      end
    end
  end
  puts "Imported #{count} webpages successfully."
  solr.optimize
  redirect_to root_path
end

This worked great, but I realized that indexing over 10,000 documents takes a long time, and meanwhile the user is staring at the browser slowly loading, wondering if things had frozen or not! So I wondered if I could somehow stream some info back to the user. Fortunately Rails has already solved that problem! ActionController has the ability to render as text a proc object, and stream the output:

  # Renders "Hello from code!"
  render :text => proc { |response, output| output.write("Hello from code!") }

So I quickly wrapped my existing code in a large proc, changed the puts to output.write, and now stream out to the browser constant progress reports:

def index_bulk_html
  solr = RSolr.connect :url => SOLR_URL
  count = 0
  files = Dir.glob("/Users/epugh/Documents/code/www.somesite.com/**/*.{html,htm}")
  # The proc runs while the response is being sent, so each write streams to the browser.
  render :text => proc { |response, output|
    files.each do |file|
      path_ends_at = file.index("www.somesite.com")
      unless path_ends_at.nil?
        output.write("<strong>Processed #{count} of #{files.size}</strong>") if count % 100 == 0

        url = "http://#{file[path_ends_at, file.size]}"
        # parse_html is a helper (not shown in this post) that returns the page
        # title and body text; title and page_content are nil going in.
        title, page_content = parse_html(file, title, page_content)

        output.write "Bad Content:#{page_content.blank?} #{url} #{title}"
        output.flush

        begin
          solr.add :id => url, :url => url, :mimeType => "text/html", :title => title, :docText => page_content
          solr.commit
          count += 1
        rescue RSolr::RequestError
          output.write "<strong>Could not index #{file}</strong>"
          output.flush
        end
      end
    end
    output.write "Imported #{count} webpages successfully."
    # Optimize inside the proc so it happens after indexing rather than before it.
    solr.optimize
  }
end
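
To see the proc contract in isolation, here is a small sketch that runs without Rails or Solr. It assumes only what the render example above shows, namely that the proc is called with a response and an output object that accepts write and flush. The progress_body helper, the file names, and the every-second-item reporting interval are made up for illustration, and a StringIO stands in for the browser connection.

```ruby
require "stringio"

# Stand-in for the indexing loop: reports progress through output,
# which only needs to respond to write and flush.
def progress_body(items)
  proc do |_response, output|
    items.each_with_index do |_item, i|
      output.write("Processed #{i} of #{items.size}\n") if i % 2 == 0
      output.flush
    end
    output.write("Imported #{items.size} items successfully.\n")
  end
end

# Rails would call the proc while sending the response; here we call it
# ourselves with a StringIO so the sketch runs without a web server.
buffer = StringIO.new
progress_body(%w[a.html b.html c.html]).call(nil, buffer)
puts buffer.string
```

Because nothing inside the proc runs until it is called, any work that must follow the loop (such as an optimize) has to live inside the proc as well.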

Thank you Rails, Hpricot, and RSolr for making life so simple!

James Bach, the bad boy of Testing?

Posted Monday, October 26th, 2009 by Eric Pugh

So, is James Bach (@jamesmarcusbach) the bad boy of testing?

I flew up to Boston on Monday to lead some workshops on Continuous Integration. I checked into my room at the Hyatt and then went downstairs to see who was around. I ran into a couple of speakers milling about, and eventually joined one of them, and we headed over to the MIT Press bookstore, me to look for my Solr book. I wasn’t too sure of the name of the other speaker I was with (I asked once, but couldn’t quite remember what it was…). So when we got to the book shop, I asked my fellow speaker again: James Bach. The name I was familiar with, but couldn’t quite place it… I ended up buying Parentonomics, and then we went for coffee.

So, over coffee, he asks me about what my topic is, and I gave him the brief summary of my two CI related workshops. Wow.. Little did I realize that I was sitting with the guy who rails against the “fetish” that Agile folks have for automated testing! That his entire approach to “testing” is to use skilled, motivated folks who do “sapient testing”. And I’m the guy who’s selling an approach that REQUIRES automated tests! That encourages expanding the use of automated testing!

He actually walked me through a process of talking about how to “think like a tester”, and it was a really great mini-workshop. He definitely subscribes to the Socratic approach, and believes in his message; I was sweating at the end of it! That chat over coffee probably sparked more ideas in less time than anything else this week. I also heard a lot of ideas and phrases that were echoed in Michael Bolton’s keynote later on in the week. Clearly a lot of collaboration between the two!

Probably the biggest idea that James and I chatted about was the idea that automated tests really aren’t automated tests, they are automated checks. They verify that the expected behavior of the code was met. His argument is that if you want to do testing, real testing, then computers and automated processes can’t meet that need, only people can.

Now, I don’t know if I believe that is completely true, but I am very aware that “manual testing”, where long test scripts written as Word documents are executed by hand by human beings, is really a waste of human potential. And that those test scripts are really, to use James’s terms, “check scripts” because the people are not using any creativity! In fact, a lot of my interest in CI comes from the idea that people should not do monkey testing, that machines can do it much better, and from my frustration with the perception that testing is a low value activity that can be easily shipped off to low skilled folks.

I think that this shift away from the term “test” for automated tests is actually happening in many places. In the Ruby world, we have libraries like Shoulda that are moving from using words like assert to other words like should. A Cucumber test really shows how controlled the space needs to be for a test to work well in an automated fashion:

Scenario: See all vendors
Given I am logged in as a user in the administrator role
And There are 3 vendors
When I go to the manage vendors page
Then I should see the first 3 vendor names

So I don’t know if I am bought in on the idea that only people can do “testing” and machines can only do “checking”. Tools like Heckle try to simulate aspects of what a human can do. While I’m not suggesting that we can automate the “does my website look okay after someone changed the CSS” type of work today, in the future our automated testing will be more capable than just “checking”, because we will move beyond the very constrained tests we have today to ones that mimic the richness of the simulators that airline pilots use. Instead of testing the training given to pilots, we’ll be testing the robustness of software via simulations!

At any rate, James Bach, while taking a rather provocative approach to sharing his ideas, does subscribe to my favorite bullet in the Agile Manifesto: Individuals and interactions over processes and tools.

Here he is giving a great presentation with the subversive title of “How to Fake a Project” that was incredibly entertaining, and also quite thought provoking:

James Bach talking about "Faking a Test Plan"



What do you think? Automated testing is a fetish of the Agile community?

Eric Pugh to speak on Solr at Shenandoah Ruby Users Group October 27th

Posted Tuesday, October 20th, 2009 by Eric Pugh

From the Meetup site:

We’ll look at the thriving Ruby ecosystem that has grown up around integrating with Solr. From Ruby gems that integrate with Solr like solrb and rsolr, to general search solutions like acts_as_solr and sunspot. We’ll also look at a complete “shrink wrapped” catalog solution for Solr using BlacklightOPAC.

You’ll learn the basics of getting started with Solr, and gain an understanding of what Ruby solutions are available to simplify adding great search to your site!

As usual, food and beverages will be provided.


OSC will attend and sponsor EdUI Conference 2009 in the University of Vir

Posted Sunday, August 23rd, 2009 by Eric Pugh

edUI Conf

OSC is proud to announce that we will attend and sponsor this year’s EdUI Conference 2009, which is being held at the University of Virginia on 21st-22nd September 2009. A number of folks from the OSC team will be attending, so stop by our booth in the Vendor Hall on the second day and introduce yourself!

EdUI 2009 boasts a powerhouse lineup of renowned and popular headliner speakers, most often found at the Web industry’s premier events. In addition to these, it features a series of presentations, selected through a proposal process, to allow peers, colleagues, and geek kindreds to enlighten one another with their expertise and ideas. Our very own Arin Sime will be speaking on The Facebook API: Thinking About UI in a Social Way.

Solr 1.4 Enterprise Search Server Book is Released!

Posted Wednesday, August 19th, 2009 by Eric Pugh

Solr 1.4 Enterprise Search Server Book Cover

I am very proud to announce that the first book on Solr has been published by Packt. This has been a labor of love for myself and my co-author David Smiley, and we are excited to see the book now “in the wild!”. Below is a copy of the email sent to the Solr community:

Fellow Solr users,

I’ve finally finished the book “Solr 1.4 Enterprise Search Server” with my co-author Eric. We are proud to present the first book on Solr and hope you find it a valuable resource. You can find full details about the book and purchase it here:
http://www.packtpub.com/solr-1-4-enterprise-search-server/book
It can be pre-ordered at a discount now and should be shipping within a week or two. The book is also available through Amazon. You can feel good about the purchase knowing that 5% of each sale goes to support the Apache Software Foundation. For a free sample, there is a portion of chapter 5 covering faceting available as an article online here:
http://www.packtpub.com/article/faceting-in-solr-1.4-enterprise-search-server

By the way, we realize Solr 1.4 isn’t out [quite] yet. It is feature-frozen however, and there’s little in the forthcoming release that isn’t covered in our book. About the only notable thing that comes to mind is the contrib module on search result clustering. However Eric plans to write a free online article available from Packt Publishing on that very subject.

“Solr 1.4 Enterprise Search Server” In Detail:

If you are a developer building a high-traffic web site, you need to have a terrific search engine. Sites like Netflix.com and Zappos.com employ Solr, an open source enterprise search server, which uses and extends the Lucene search library. This is the first book in the market on Solr and it will show you how to optimize your web site for high volume web traffic with full-text search capabilities along with loads of customization options. So, let your users gain a terrific search experience.

This book is a comprehensive reference guide for every feature Solr has to offer. It serves the reader right from initiation to development to deployment. It also comes with complete running examples to demonstrate its use and show how to integrate it with other languages and frameworks.

This book first gives you a quick overview of Solr, and then gradually takes you from basic to advanced features that enhance your search. It starts off by discussing Solr and helping you understand how it fits into your architecture—where all databases and document/web crawlers fall short, and Solr shines. The main part of the book is a thorough exploration of nearly every feature that Solr offers. To keep this interesting and realistic, we use a large open source set of metadata about artists, releases, and tracks courtesy of the MusicBrainz.org project. Using this data as a testing ground for Solr, you will learn how to import this data in various ways from CSV to XML to database access. You will then learn how to search this data in a myriad of ways, including Solr’s rich query syntax, “boosting” match scores based on record data and other means, about searching across multiple fields with different boosts, getting facets on the results, auto-complete user queries, spell-correcting searches, highlighting queried text in search results, and so on.

After this thorough tour, we’ll demonstrate working examples of integrating a variety of technologies with Solr such as Java, JavaScript, Drupal, Ruby, XSLT, PHP, and Python.

Finally, we’ll cover various deployment considerations to include indexing strategies and performance-oriented configuration that will enable you to scale Solr to meet the needs of a high-volume site.

Sincerely,

David Smiley (primary-author)
dsmiley@mitre.org
Eric Pugh (co-author)
epugh@opensourceconnections.com

A huge round of thanks goes to David for bringing me into this project and being such a great partner on it! With 5% of the proceeds going to the Apache Software Foundation, here’s hoping it’s a great success!