Streaming Index Progress Results to Browser

I recently needed to index several thousand static webpages from a local filesystem into Solr. I was already using Ruby on Rails for the admin interface, so I quickly threw together an action to index the documents using Hpricot and RSolr. To monitor progress, I just wrote to standard out using puts:

def index_bulk_html
  solr = RSolr.connect :url=>SOLR_URL
  count = 0
  files = Dir.glob("/Users/epugh/Documents/code/www.somesite.com/**/*.{html,htm}")
  files.each do |file|
    path_ends_at = file.index("www.somesite.com")
    unless path_ends_at.nil?
      puts("<strong>Processed #{count} of #{files.size}</strong>") if count % 100 == 0 

      url = "http://#{file[path_ends_at,file.size]}"
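      # parse_html is a helper defined elsewhere; it returns the page title and its text content.
      title, page_content = parse_html(file, title, page_content)

      puts "Bad Content:#{page_content.blank?} #{url} #{title}"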

      begin
        solr.add :id=> url, :url=>url, :mimeType=>"text/html", :title => title, :docText => page_content
        solr.commit
        count = count + 1
      rescue RSolr::RequestError
        puts "<strong>Could not index #{file}</strong>"
      end
    end
  end
  puts "Imported #{count} webpages successfully."
  solr.optimize
  redirect_to root_path

end
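
The one step in that loop that is easy to misread is how the URL is derived: everything in the file path from the host name onward is kept and prefixed with http://. Here is a minimal sketch of just that step, pulled out into a helper. The name url_from_path and its host parameter are my own illustration, not part of the action above.

```ruby
# Hypothetical helper (not in the original action): build the public URL for
# a local file by keeping everything from the host name to the end of the path.
def url_from_path(file, host)
  host_starts_at = file.index(host)
  return nil if host_starts_at.nil? # the file is not under the site's directory

  "http://#{file[host_starts_at..-1]}"
end

url_from_path("/Users/epugh/Documents/code/www.somesite.com/about/index.html",
              "www.somesite.com")
# => "http://www.somesite.com/about/index.html"
```

Returning nil for paths outside the site directory mirrors the unless path_ends_at.nil? guard in the action.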

This worked great, but indexing over 10,000 documents takes a long time, and meanwhile the user is staring at a slowly loading browser, wondering whether things have frozen. So I wondered if I could stream some progress information back to the user. Fortunately, Rails has already solved that problem! ActionController can render a proc object as text and stream its output:

  # Renders "Hello from code!"
  render :text => proc { |response, output| output.write("Hello from code!") }
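
One detail worth keeping in mind, as far as I understand the mechanism: render only stores the proc, and the framework calls it later, while the response is being sent, passing in an object that responds to write. The sketch below imitates that outside of Rails. The StringIO stands in for the real output object and the events array exists only to show the ordering; neither is Rails code.

```ruby
require "stringio"

events = []

# Build the body the way the action does: a proc taking (response, output).
body = proc do |_response, output|
  events << :proc_ran
  3.times { |i| output.write("Processed #{i}\n") }
end

# Defining the proc runs nothing; the "action" finishes first.
events << :action_returned

# Later, the framework calls the proc with something that responds to #write.
# A StringIO plays that role here.
output = StringIO.new
body.call(nil, output)

puts events.inspect # [:action_returned, :proc_ran]
puts output.string
```

Because the proc body runs after the action has returned, anything that has to happen once the work is finished belongs inside the proc rather than after the render call.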

So I quickly wrapped my existing code in a large proc, changed each puts to output.write, and now constant progress reports stream out to the browser:

def index_bulk_html
  solr = RSolr.connect :url => SOLR_URL
  count = 0
  files = Dir.glob("/Users/epugh/Documents/code/www.somesite.com/**/*.{html,htm}")
  # The proc runs while the response is being sent, after this action returns.
  render :text => proc { |response, output|
    files.each do |file|
      # Everything from the host name onward becomes the document's URL.
      path_ends_at = file.index("www.somesite.com")
      unless path_ends_at.nil?
        output.write("<strong>Processed #{count} of #{files.size}</strong>") if count % 100 == 0

        url = "http://#{file[path_ends_at, file.size]}"
        # parse_html is a helper defined elsewhere; it returns the page title and its text content.
        title, page_content = parse_html(file, title, page_content)

        output.write "Bad Content:#{page_content.blank?} #{url} #{title}"
        output.flush

        begin
          solr.add :id => url, :url => url, :mimeType => "text/html", :title => title, :docText => page_content
          solr.commit
          count = count + 1
        rescue RSolr::RequestError
          output.write "<strong>Could not index #{file}</strong>"
          output.flush
        end
      end
    end
    output.write "Imported #{count} webpages successfully."
    output.flush
    # Optimize inside the proc so it runs once indexing has finished, not before it starts.
    solr.optimize
  }
end

Thank you Rails, Hpricot, and RSolr for making life so simple!
