How to Integrate Apache Nutch With Solr Search Engine?

In a recent post, Bash Programming for Crawling with cURL, I mentioned that I was assigned to collect and index mass data from the internet and from specific sources at work. cURL did the job very well, but I wanted to build something more compact and scalable. Since my fundamental approach is "always avoid re-inventing the wheel", I consulted a senior colleague of mine, and he suggested integrating two beautiful Apache Foundation projects: Nutch (a crawler) and Solr (a search engine).

There are several posts online claiming to show how to integrate Nutch and Solr, but I still had a hard time getting it to work. That is why I'm writing this post.

About Apache Nutch

I'm not going to cover the whole concept of web crawling here, since I wrote about it in the previous post linked above. Apache Nutch is one of the best solutions available if you need a crawler: it is a highly extensible, highly scalable, well-matured, production-ready web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are great for batch processing.

Nutch is pluggable and modular, which brings some benefits. It provides extensible interfaces such as Parse, Index, and ScoringFilter for custom implementations, e.g. Apache Tika™ for parsing. Additionally, pluggable indexing exists for Apache Solr™, Elasticsearch, SolrCloud, etc.

The Nutch 2.x branch is an emerging alternative that takes direct inspiration from 1.x. It differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora™ to handle object-to-persistent-data-store mappings.[1]

In this post, I'm going to use version 1.7.

About Apache Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.[2]

Integrating Nutch with Solr

The following instructions start right after a fresh Ubuntu installation and end with the successful indexing of crawled data.

1) SYS SPECIFICATION
VM OS: Ubuntu 14.04 (LTS) Desktop AMD64
Memory: 4000 MB
HDD: 15 GB
Oracle VirtualBox version: 4.3.12 r93733, running on OS X v10.9.4
Hardware: MacBook Pro, Retina, 13 inch, Late 2013

2) Fixing screen resolution and system update
You might have an issue with the screen resolution, as I did. Stop wasting your time and install the following packages:
sudo apt-get install virtualbox-guest-utils virtualbox-guest-x11 virtualbox-guest-dkms
sudo apt-get update
sudo apt-get upgrade

3) Installing Java and Setting JAVA_HOME
sudo apt-get -y install openjdk-7-jdk

Type the sudo nano /etc/environment command and add the following line at the end of the file (be sure to use straight quotes, not curly ones):
JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"

After you save your changes and close the file, reload it with this command:
source /etc/environment

Now, let’s test if JAVA_HOME is set properly by this command:
echo $JAVA_HOME

Output should be like this:
/usr/lib/jvm/java-7-openjdk-amd64

4) Downloading Solr & Nutch and Extracting the Packages
mkdir nutch-solr-workspace
cd nutch-solr-workspace/
wget http://archive.apache.org/dist/lucene/solr/4.5.0/solr-4.5.0.tgz
wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz
tar -xzf apache-nutch-1.7-bin.tar.gz
tar -xzf solr-4.5.0.tgz

5) Configuration
5.1. Add the following property inside the <configuration> element of ${NUTCH_HOME}/conf/nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>spiderCrawler</value>
</property>
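For reference, a minimal nutch-site.xml after this change would look like the file below (spiderCrawler is just the example agent name used above; any descriptive name works):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>spiderCrawler</value>
  </property>
</configuration>
```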

5.2. cd ${NUTCH_HOME}
mkdir -p urls
touch urls/seed.txt

Add URLs to the seed.txt file, one per line. For instance:
http://nutch.apache.org/

5.3. Edit the URL filter conf/regex-urlfilter.txt to match the domain you are interested in.

Replace the default accept-anything rule at the end of the file (the line containing only +.) with a rule for your domain:

# accept URLs within the nutch.apache.org domain
+^http://([a-z0-9]*\.)*nutch\.apache\.org/

If you leave the +. line in place, every URL is accepted and the domain restriction has no effect.
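You can sanity-check the pattern outside of Nutch with grep, which understands the same extended-regex syntax for this expression (a standalone check only; Nutch itself applies the rule through its URL-filter plugin):

```shell
# Accepted: matched by the nutch.apache.org rule
echo "http://nutch.apache.org/about.html" | \
  grep -E '^http://([a-z0-9]*\.)*nutch\.apache\.org/' && echo accepted
# Rejected: a different apache.org subdomain
echo "http://wiki.apache.org/nutch/" | \
  grep -E '^http://([a-z0-9]*\.)*nutch\.apache\.org/' || echo rejected
```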

6) Nutch-Solr Integration
6.1. In ${SOLR_HOME}, move example/solr/collection1/conf/schema.xml to example/solr/collection1/conf/schema.xml.org
6.2. From ${NUTCH_HOME} copy conf/schema-solr4.xml to ${SOLR_HOME}/example/solr/collection1/conf/schema.xml
6.3. Add <field name="_version_" type="long" indexed="true" stored="true"/> at line 351, which is at the end of the field definitions in Solr's new schema.xml file. (Use straight quotes in the XML.)
6.4. Start (or restart) Solr and check that nothing is wrong.
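Steps 6.1 and 6.2 can be sketched as shell commands. The snippet below is a self-contained illustration: DEMO_ROOT and the placeholder schema files stand in for the real workspace from step 4, so against a real install you would run the same mv/cp inside your actual solr-4.5.0 and apache-nutch-1.7 directories.

```shell
# Stand-in layout mimicking the workspace from step 4
DEMO_ROOT=$(mktemp -d)
SOLR_CONF="$DEMO_ROOT/solr-4.5.0/example/solr/collection1/conf"
NUTCH_CONF="$DEMO_ROOT/apache-nutch-1.7/conf"
mkdir -p "$SOLR_CONF" "$NUTCH_CONF"
echo '<schema name="example"/>' > "$SOLR_CONF/schema.xml"        # placeholder files
echo '<schema name="nutch"/>'   > "$NUTCH_CONF/schema-solr4.xml"

# 6.1: keep the stock schema as a backup
mv "$SOLR_CONF/schema.xml" "$SOLR_CONF/schema.xml.org"
# 6.2: install Nutch's Solr 4 schema in its place
cp "$NUTCH_CONF/schema-solr4.xml" "$SOLR_CONF/schema.xml"

ls "$SOLR_CONF"
```

On a real setup, step 6.4 then amounts to starting Solr from ${SOLR_HOME}/example (for instance with java -jar start.jar) and opening the admin UI at http://localhost:8983/solr/.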

7) Crawl with Nutch and Index into Solr
Use crawl script to do the batch operation:
bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

For example:
bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

(Screenshot: the Solr admin dashboard, showing the collection1 core and its Num Docs count)

You can select your Solr core on the left side of the screen (in my case it was named collection1) and then see how many documents you processed in the Num Docs area.

The numberOfRounds argument of the terminal command indicates the depth of crawling. The larger this value, the longer the crawl takes. Normally 2 is recommended; try not to use a value larger than 3, otherwise the crawl will take too long. At the end of each round, Nutch indexes the crawled data into Solr. For now, just sit back and wait for the script to finish.

When the script ends, you will find a folder named TestCrawl (the crawlID you used for this crawl). Inside it lie the crawl database, the link database, and a set of segments.

[1] http://nutch.apache.org/

[2] http://lucene.apache.org/solr/
