Smart Cache Manual - Chapter 1
Introduction
This manual is still in DRAFT state. Many sections are still
missing or inaccurate. If you are out of luck, see the comments in the sample
configuration files or consult the sources for more information.
This manual has been converted from the Smart Cache English homepage
to the debiandoc-sgml format, which allows many output formats
to be generated from one source.
After conversion, this manual was extended by Radim Kolar into its current
form and merged with the translated Czech documentation, which is no longer
maintained.
English is not my native language; if you see any errors, just
ignore them or mail me.
When all things began, the Word already was. The Word dwelt with God, and what
God was, the Word was. The Word, then, was with God at the beginning,
and through him all things came to be; no single thing was created
without him. All that came to be was alive with his life, and that life
was the light of men. The light shines on in the dark, and the darkness
never mastered it.
New Testament, The Gospel according to John, The coming of Christ.
After leaving my job, I started to use a modem connection to the Internet. It was
slow, but the biggest problem for me was the quite high prices paid to the monopoly
Czech telecommunication company SPT Telecom (now renamed Czech Telecom, because many people did not know what SPT was).
I found that I needed
a useful tool which would allow me to browse WWW pages off line.
I tried several methods (see Other off line browsing solutions, section 1.4) to achieve this goal, but all
of them have some limitations and I found them totally unusable for me.
These programs are not bad, they are just not optimal for what
I want.
- IBM Internet Connection Server 4.0
- This is a WWW server with a built-in proxy cache. The proxy cache uses a simple
CERN-like directory structure, so it was easy to find cached files. The proxy
cache also has a switch for off line mode, in which it returns only cached pages.
The biggest problem with this server was that it is based
on the original CERN HTTP daemon, which was not thread-safe. IBM ported this
daemon to OS/2, but they did not care about this and did not implement any
locking mechanism to protect thread-sensitive data or to synchronize
threads. The server complains very often about locked .cacheinfo
files, and loaded objects were not stored on disk. After
some time IBM made a new version, 4.1. This version introduced HTTP/1.1
support into the WWW server and proxy cache. The WWW server works with some occasional
crashes, but the proxy cache was totally broken. I never managed to get it running; they
probably did not test this part of their product. After some time IBM abandoned
this server and recommended that ICS users upgrade to Lotus Domino.
- Mailing pages to myself in Netscape
- I found that Netscape Navigator can email an
entire web page. So I started mailing interesting web pages to myself and
browsing them via the Sent Mail folder. This worked quite well;
off line browsing was possible (even with embedded pictures). But
Netscape does not save pictures into the Sent Mail folder; it saves
pictures only into its internal disk cache, so after the pictures expired,
I was unable to see them.
- Using Netscape's internal disk cache
The Netscape browser has a persistent disk cache. This disk cache
is able to cache web objects between sessions, and there are a couple
of programs, called Netscape disk cache explorers, which
allow the user to browse off line via Netscape's cache. But this also
has several limitations:
- Netscape does not cache web pages without a 'Last-Modified'
HTTP header. In fact it does: pages are stored on the disk, but never
read back, and Netscape deletes them on exit, so they are lost.
This is the biggest problem, because nowadays many web pages are generated on the
fly by the WWW server, so you have only images in the cache.
- The cache is very slow once it grows to 30-40 MB. This
is not a problem in the UNIX version of Netscape, but the OS/2 and Windows
versions have this problem.
- All information about the cache is stored in one file, index.db.
When this file gets corrupted (not so uncommon),
you lose everything.
- Garbage collection is very strange. I set the disk cache to 50 MB;
when it grew over 50 MB, Netscape deleted nearly all files and left
only 10 MB in the cache. Too bad.
- Using Microsoft Internet Explorer's disk cache
- I could not believe what I saw. This was much worse than Netscape.
MSIE 4 is stupid: it caches even a badly downloaded (too short) file.
This badly downloaded file displays as good, and if you request a reload
of that bad file, it does not get reloaded, only checked via an If-Modified-Since request. If you want to remove this bad file from the cache, you must
clear the entire cache. No MSIE, thank you.
- Using web grabbers
This looks very promising, but there are some problems:
- a web grabber downloads what it wants, so it normally downloads many
useless pages and not the pages which you want to see.
- a web grabber has very few configuration options. Even a very
good program such as wget is this stupid. This
does not apply to my newly developed web downloader with the working
name loader.
- the biggest problem is with refreshing web pages. You have only
three choices: refresh all (this normally downloads the entire set of pages
again), never refresh, or refresh manually via a WWW browser and Save as...
- Using Lotus Notes/Domino
- Lotus Notes can work with HTML documents the same way as it does
with its normal Document database. You may use any Notes features,
such as Agents or Scripts, on WWW documents. This is very good for writing
Internet or intranet applications, but not the best solution for
normal browsing. The built-in WWW browser is very limited, even when compared
to the old Netscape 2. It downloads only one WWW object at a time; a web page with
many pictures takes very long to load. Notes also requires too many
system resources, and you cannot run it on a 486 computer with only 20 MB of RAM.
- WWW Offline Explorer (wwwoffle)
- This program does basically the same thing for offline browsing as
Smart Cache. It is written in C and is available only for Unixes. I have
run webbench on both (SC and wwwoffle); with a small cache
(about 10 MB) the results are similar (wwwoffle is about 8% faster; in
the same benchmark as used in Smart Cache Performance, section 6.2, it achieves 984 pages/min on Linux),
but
with a large cache SC is much faster, because wwwoffle uses just one root directory level (SC uses 2) and no www directory level, so you end up with very
large directories, which are very slow to search (at least on my machine).
Also, WWWOffle's history is recorded as symlinks in a special directory, one symlink for each visited URL.
WWWOffle does not support old HTTP/0.9 clients.
Files stored by wwwoffle have the HTTP header inside and
use long hashed filenames, but if you use the HTML interface, the cache contents
can be browsed. wwwoffle has a nice built-in HTML interface and can
easily be controlled from a browser; it also allows marking pages for later batch downloading (a good thing) or update.
Batch downloading does not work well; when I tried it, it very often ended
in infinite loops on already downloaded URLs.
Summary: If you have a large network or if you want Squid, get Squid.
If you don't like SC, get wwwoffle; it will also do a good job, and its HTML
interface is a nice thing. If you don't like Squid, SC or wwwoffle, I have
no idea what to use.
After considering Other off line browsing solutions, section 1.4, I decided to write my
own program which solves all of these problems. I wrote down the
following design notes:
- Perfect off line browsing support. The user must not see any
difference between on line and off line browsing.
- Implement it as a proxy cache. It will be independent
of the browser used and fully transparent to the user.
- Use a CERN-httpd-like (not hashed like Apache, Squid or WWWOffle)
directory structure for the proxy cache, for easy locating of cached objects.
Try to use real file names like index.html and not obscure hashed names
like Q3E4R2T342XCV3F42G3H2323.
- For performance reasons, implement 2 swap directory levels (idea from Squid); see the first sketch after this list.
- For easy file access, do not store HTTP headers inside cached objects. When I tried
to extract binary files (pictures) from CERN, Squid or Apache caches
with a text editor that claims to support binary files (ViM), because I was too
lazy to write a special program, it still failed and the file got corrupted.
- Do not store all received HTTP headers, just the important ones; the second sketch after this list illustrates this.
- The program must be fully portable. I want to use it on OS/2, Linux
and Windows.
- The cache must be able to cache everything that other caches don't.
I do not want to write a well-behaved cache which respects the headers that webmasters
use to gain more hits. Modem lines are slow. In fact, after writing
Smart Cache I was surprised how much faster browsing can be if we cache
somewhat more than usual and kill some advertising banners. This really makes
a difference!
- The program must allow blocking of unwanted URLs. Yes, for killing advertising banners.
- The program must remain fast and simple.
- Extremely configurable and tunable garbage collection. I can't accept
the all-or-nothing design used in other caches. I want to control
what stays on my disk and for how long.
- The possibility to continue downloading an object even if the user
presses STOP in the browser (idea from Squid).
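
To make the directory-structure notes above concrete, here is a minimal
Java sketch of a CERN-style URL-to-path mapping with two extra swap
directory levels. It only illustrates the idea: the cache root, the
bucket counts and the hash choice are assumptions made for this example,
not Smart Cache's actual code.

    import java.net.MalformedURLException;
    import java.net.URL;

    public class CachePath {
        // Map http://host/dir/page.html to
        // <root>/<d1>/<d2>/host/dir/page.html, where d1 and d2 are two
        // small swap levels derived from a hash of the host name.
        static String pathFor(String root, String urlString)
                throws MalformedURLException {
            URL url = new URL(urlString);
            int h = url.getHost().hashCode() & 0x7fffffff; // non-negative hash
            int d1 = h % 16;          // first swap level: 16 buckets (assumed)
            int d2 = (h / 16) % 16;   // second swap level: 16 buckets (assumed)
            String file = url.getPath();
            if (file.isEmpty() || file.endsWith("/"))
                file = file + "index.html"; // directory request -> readable default
            return root + "/" + d1 + "/" + d2 + "/" + url.getHost() + file;
        }

        public static void main(String[] args) throws MalformedURLException {
            // Prints something like /var/cache/scache/3/7/www.example.com/docs/manual.html
            System.out.println(pathFor("/var/cache/scache",
                    "http://www.example.com/docs/manual.html"));
        }
    }

The point is that the final path still contains the host name and the
original file name, so a cached object can be located and opened by hand,
while the two numeric levels keep any single directory from growing too
large.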
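
Here is a similar sketch for the header-handling notes: the body is
written as a plain file with nothing prepended, and the few headers worth
keeping go into a separate sidecar file. The sidecar name and the list of
kept headers are again only assumptions for the example, not the real
on-disk format.

    import java.io.FileOutputStream;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Map;

    public class StoreObject {
        // Only a few headers are considered worth keeping (assumed list).
        static final String[] KEPT = { "Content-Type", "Last-Modified", "Location" };

        static void store(String path, byte[] body, Map<String, String> headers)
                throws IOException {
            // Body: exactly the bytes received, so any tool can open the file.
            try (FileOutputStream out = new FileOutputStream(path)) {
                out.write(body);
            }
            // Headers: only the important ones, in a sidecar next to the body.
            try (PrintWriter pw = new PrintWriter(new FileWriter(path + ".hdr"))) {
                for (String name : KEPT) {
                    String value = headers.get(name);
                    if (value != null)
                        pw.println(name + ": " + value);
                }
            }
        }
    }

Because the body file contains no HTTP header, a picture extracted from
the cache is just a picture; no editor tricks are needed.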
Smart Cache Manual
0.44
Radim Kolar hsn@cybermail.net