Bash programming for Crawling with cURL

At work, I’m currently assigned to collect and index mass data from the internet and from specific sources. To achieve this, there is a must-know concept called a “crawler”. After I talked with a senior colleague, I was comfortable with the whole concept and the system architecture (or at least I knew more about which direction to go).

What is a crawler?

According to Wikipedia: “A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter.”

As always, the first thing on my mind was avoiding reinventing the wheel. There are several free and open source projects for this, but first of all I wanted to try a command line tool called cURL.

What is cURL?

As I mentioned, cURL is a command line tool that enables you to transfer data using various protocols such as FTP, HTTP, etc. Personally, I liked cURL. It’s a very easy tool to install and use, and it has some handy features. But first, let’s install it:

Since I was using Ubuntu, it was very easy for me: sudo apt-get install curl

But cURL runs on a wide variety of operating systems. You can check its web site: http://curl.haxx.se/
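
If you want to make sure the installation went fine, a quick check (nothing specific to crawling, just the standard version query) is:

# prints the installed version and the protocols it supports
curl --version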

In order to retrieve a simple web page, you can use this command in a terminal: curl http://www.demo.com

But if you want to save the result to a local file with the -o flag, you should type this:

curl -o LocalFileName.html http://www.demo.com

This will retrieve the URL and save it as LocalFileName.html.

To download output to a file that has the same name as on the system it originates from, use the -O flag, for example: curl -O http://www.demo.com/demo.html
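
To make the difference between the two flags concrete, here are both forms side by side (the demo.com URLs are just placeholders):

# -o lets you choose the local file name yourself
curl -o LocalFileName.html http://www.demo.com

# -O takes the name from the URL, so this one is saved as demo.html
curl -O http://www.demo.com/demo.html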

You can also use different flags for different requirements while calling cURL from the terminal.
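
For example, a few flags that I find generally handy when fetching many pages (just a sketch; check curl’s man page for the exact behavior on your version):

# -L follows redirects, -s silences the progress bar, -f makes cURL fail
# on HTTP errors, -m 30 gives up after 30 seconds, --retry 2 retries twice
curl -L -s -f -m 30 --retry 2 -o page.html http://www.demo.com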

How to use cURL for crawling?

I decided to make a humble beginning. I took some URLs from one of the APIs that I was going to use and wrote them to a .txt file, one per line. With the xargs curl < url-list.txt command I could display all the HTML content in my terminal, but I was unable to get the files onto my desktop. And with the curl -o myFile.html http://www.demo.com command I could download only one file, while in my case I’m supposed to download hundreds of thousands of pages per API source. What I needed was some sort of combination of those two commands. I don’t know if there is any way to do it by only calling cURL with some flags and other commands, but I managed to solve this problem by writing a simple loop in a scripting language called “Bash”.
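
To make that concrete, url-list.txt simply contains one URL per line (the URLs below are made-up placeholders), and the xargs attempt dumps everything to the terminal:

# url-list.txt
http://www.demo.com/page1.html
http://www.demo.com/page2.html

# fetches every URL in the list, but writes all the HTML to stdout
xargs curl < url-list.txt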

What is Bash?

Bash is a scripting language for Unix-like systems. It is part of the GNU project and comes pre-installed in most GNU/Linux distros. For detailed information, there is always Wikipedia: http://en.wikipedia.org/wiki/Bash_(Unix_shell)

Below you can check my script:


#!/bin/bash
file="url-list.txt"

# read the URL list line by line and fetch each page with cURL
while read -r line
do
  # use the last path segment of the URL as the local file name
  outfile=$(echo "$line" | awk 'BEGIN { FS = "/" } ; { print $NF }')
  curl -o "$outfile.html" "$line"
done < "$file"
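
Assuming the script is saved as something like crawl.sh next to url-list.txt (the file name is my choice, nothing special), running it is just:

chmod +x crawl.sh
./crawl.sh

Each page from the list ends up in the current directory, named after the last part of its URL.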
