Indexing and searching bookmarked links with HtDig

Most people use bookmarks, and most people have too many of them. Each link points to a web page we found interesting while surfing and bookmarked so that we could come back to it quickly. I wanted to be able to search the bookmarked pages for specific keywords, so I indexed them with htdig and then integrated the search into the Mozilla search sidebar.

Prerequisites: you must have htdig and a web server (Apache, for example) installed on your machine. To check for the web server, enter http://localhost in your web browser; if a web page appears, a web server is running on your machine. To check whether htdig is installed, log in as "root" and type htsearch; if the command runs, htdig is installed as well. Also check that you can reach htdig through the web server: enter http://localhost/cgi-bin/htsearch in your web browser and check that the output is correct.
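
The same two checks can be scripted. What follows is only a minimal sketch in Python: the helper names command_available and url_responds are made up for this illustration, and the sketch assumes that htsearch is on the search path of the user running it, which is not the case on every installation (it often lives only in the cgi-bin directory of the web server).

```python
import http.client
import shutil
import urllib.error
import urllib.request


def command_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


def url_responds(url, timeout=5.0):
    """Return True if fetching `url` succeeds with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, HTTP error status, timeout, or malformed URL.
        return False
```

Calling command_available("htsearch"), url_responds("http://localhost") and url_responds("http://localhost/cgi-bin/htsearch") then mirrors the three manual checks described above.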

Export your bookmarks to a file named "bookmarks.html", then copy this file to the document root of your web server (/var/www/html/, for example). Modify the htdig configuration file (normally /etc/htdig/htdig.conf) so that it looks like this one:
# directory where the database will be stored
database_dir:		/var/lib/htdig/db

# do not limit URLs 
limit_urls_to:		

# exclude URLs that match these patterns
exclude_urls:		/cgi-bin/ .cgi

# do not index the following files
bad_extensions:		.wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
		.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi

# e-mail of the administrator (indicated to the indexed sites)
maintainer:		elemento@lirmm.fr

# store at most the first 10000 characters of each page's text (used for excerpts)
max_head_length:	10000

# read at most 200 kB of each document (larger documents are truncated)
max_doc_size:		200000

# when no excerpt matches the query, show the top of the document instead
no_excerpt_show_top:	true

# use only the exact search algorithm (in my experience, the other algorithms do not work very well)
search_algorithm:	exact:1

# some images, which need to be accessible (make sure you have an htdig directory in the document root of your web server)
next_page_text:		<img src="/htdig/buttonr.gif" border="0" align="middle" width="30" height="30" alt="next">
no_next_page_text:
prev_page_text:		<img src="/htdig/buttonl.gif" border="0" align="middle" width="30" height="30" alt="prev">
no_prev_page_text:
page_number_text:	'<img src="/htdig/button1.gif" border="0" align="middle" width="30" height="30" alt="1">' \
			'<img src="/htdig/button2.gif" border="0" align="middle" width="30" height="30" alt="2">' \
			'<img src="/htdig/button3.gif" border="0" align="middle" width="30" height="30" alt="3">' \
			'<img src="/htdig/button4.gif" border="0" align="middle" width="30" height="30" alt="4">' \
			'<img src="/htdig/button5.gif" border="0" align="middle" width="30" height="30" alt="5">' \
			'<img src="/htdig/button6.gif" border="0" align="middle" width="30" height="30" alt="6">' \
			'<img src="/htdig/button7.gif" border="0" align="middle" width="30" height="30" alt="7">' \
			'<img src="/htdig/button8.gif" border="0" align="middle" width="30" height="30" alt="8">' \
			'<img src="/htdig/button9.gif" border="0" align="middle" width="30" height="30" alt="9">' \
			'<img src="/htdig/button10.gif" border="0" align="middle" width="30" height="30" alt="10">'
no_page_number_text:	'<img src="/htdig/button1.gif" border="2" align="middle" width="30" height="30" alt="1">' \
			'<img src="/htdig/button2.gif" border="2" align="middle" width="30" height="30" alt="2">' \
			'<img src="/htdig/button3.gif" border="2" align="middle" width="30" height="30" alt="3">' \
			'<img src="/htdig/button4.gif" border="2" align="middle" width="30" height="30" alt="4">' \
			'<img src="/htdig/button5.gif" border="2" align="middle" width="30" height="30" alt="5">' \
			'<img src="/htdig/button6.gif" border="2" align="middle" width="30" height="30" alt="6">' \
			'<img src="/htdig/button7.gif" border="2" align="middle" width="30" height="30" alt="7">' \
			'<img src="/htdig/button8.gif" border="2" align="middle" width="30" height="30" alt="8">' \
			'<img src="/htdig/button9.gif" border="2" align="middle" width="30" height="30" alt="9">' \
			'<img src="/htdig/button10.gif" border="2" align="middle" width="30" height="30" alt="10">'


# start the indexing at this page
start_url:	http://localhost/bookmarks.html

# tells htdig to index files which are not only on your server
local_urls_only: false

# tells htdig to index files which are at most one click away from the start page
max_hop_count: 1
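
With start_url pointing at bookmarks.html and max_hop_count set to 1, htdig fetches the bookmark file itself and then every page it links to, but nothing further. To get a rough idea of how many pages that is before launching the indexing, the links can be listed with a few lines of Python. This is only a sketch: the names LinkCollector and bookmarked_urls are made up for this illustration, and the sketch does not reproduce htdig's own filtering (exclude_urls, bad_extensions, robots.txt), so htdig may fetch fewer pages than are listed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the http(s) targets of all <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so the
        # upper-case <A HREF=...> of exported bookmark files is handled too.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def bookmarked_urls(html_text):
    """Return the http(s) links found in the text of a bookmark file."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

Reading /var/www/html/bookmarks.html (or wherever the file was copied), passing its contents to bookmarked_urls and taking the length of the result gives an upper bound on the number of pages fetched at hop count 1.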

Launch the indexing: type rundig (as root). This may take a few minutes, depending on the number of bookmarks and on the number of sites that do not respond.

Once it is done, type htsearch as root. The program asks you to enter a word; type something, "test" for example. htsearch should then come up with a list (in HTML format) of the web pages that contain the word "test".

The last step is to make the bookmark search available in Mozilla as a search plugin. Create the following file, named htdig.src:
<search 
   name="Bookmarks"
   description="Htdig Search"
   method="GET"
   action="http://localhost/cgi-bin/htsearch"
>

<input name="words" user>
</search>

Then copy the file to the Mozilla search plugin directory. On my machine, this directory is /usr/lib/mozilla-1.0.0/searchplugins/. Restart Mozilla, and the sidebar should let you search your bookmarked links.
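
The plugin file declares a GET request to the htsearch CGI, with the text typed by the user passed in the "words" parameter. The resulting URL can be reproduced by hand, which is handy for testing the search from a browser when the sidebar does not behave as expected. A minimal sketch in Python, where the function name search_url is made up for this illustration:

```python
from urllib.parse import urlencode

# Same address as the action attribute in htdig.src.
HTSEARCH_URL = "http://localhost/cgi-bin/htsearch"


def search_url(words, base=HTSEARCH_URL):
    """Build the GET URL that submits `words` to htsearch."""
    return base + "?" + urlencode({"words": words})
```

For example, search_url("test") gives http://localhost/cgi-bin/htsearch?words=test, which can be pasted into the browser's location bar.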

The author: Olivier Elemento is a PhD candidate in Computational Biology. He can be reached at elemento@lirmm.fr.