Internot: Offline World with ZIM Files

Wikipedia: Collaborative Human Knowledge

The past years I worked on a semantic analysis and indexing framework all on-premise, which meant to run Wikidata (Wikipedia for machines) instance locally, and hopefully Wikipedia, Wiktionary as well to lookup those as well – but how? To mirror Wikipedia.org would take a long time.

Well, the people at Kiwix.org did that for you using their mwoffliner, and addressed low bandwidth or offline world and still have access to Wikipedia, and invented a file-format called ZIM format:

one file
contains mixed media like text, images, audio, videos
optional compression for medias
fast access to the media
open definition
includes fulltext index

Fulltext Index: The Main Advantage

Nowadays (but not earlier) the ZIM files contain a fulltext- and often also title-index based on Xapian, which allows the ZIM Reader to provide fulltext search, and this is the main advantage over mirroring worthwhile web-sites yourself and open it with your browser as it will not give you an offline search engine otherwise.

Worthwhile Datasets

English Wikipedia: 19M articles, wikipedia_en_all_maxi.zim (76GB)
- common human knowledge base
Gutenberg Project: 1M books, gutenberg_en_all_2018-10.zim (54GB)
- public domain books (txt, html, pdf, epub)
StackOverflow: 71M articles, stackoverflow_en_all_2019-02.zim (136GB)
- people asking questions and people answering those

Screenshot from 2020-04-05 08-08-39 — zim/ZIM.pm: ZIM Catalog

Screenshot from 2020-04-09 13-45-44 — zim/ZIM.pm: Multi-Indices Search

Readers

Following platforms are supported so you can read your ZIM files:

Kiwix Android
Kiwix iOS
Kiwix Desktop (Windows, Mac, Linux), a bit clumsy UI
Local Web-Server:
- kiwix-tools/kiwix-serve supports multiple ZIM files aka library but lacks RESTful interface for fulltext search (download the binaries, compiling is tedious)
- zim/ZIM.pm my own solution written in Perl (2020/04), very experimental:
  - CLI & web-server, multiple ZIM files aka library support
  - RESTful API (fulltext search, catalog)

Create Your Own ZIM File

With zimwriterfs (also rather download the binary) you can create your own ZIM file:

use wget or httrack to mirror a web-site
use zimwriterfs to make a ZIM file

Mirror A Website

Either

% wget -m -k -L https://site-to-mirror.com/

% httrack https://site-to-mirror.com/ '-#L10000000'

after a while you might have the mirror of the site in hand as site-to-mirror.com/

wget: best include --wait=1 --random-wait as well to slow down and act gently to web-servers
- not usable to rerun or continue a failed mirroring
- tends to fail to mirror a site completely (missing parts)
httrack: works with multiple sites in the same directory
- good to continue failed or aborted mirroring
- introduces some inconsistencies e.g. I noticed html content is altered occasionally and changes <img src="test.gif"> to <img src="test.html"> – see httrack FAQ why; remedy:
  - use --http-10 to avoid such problems
- more complete mirrors (compared with wget)

Create ZIM File

zimwriterfs is a kind of tedious program as it wants a couple mandatory settings without falling back to common sense:

% zimwriterfs -w index.html -f favicon.ico -c "me" -p "me" \
-t "Title of the site" -d "Brief description" -l en -i \
site-to-mirror.com/ site-to-mirror.zim

Perhaps there is no favicon.ico but imgs/favicon.png, just look at the site-to-mirror.com/index.html and check for <link rel="shortcut icon" href=".. and use that reference for -f.

-w relative path to welcome page
-f relative path to favicon
-c creator of ZIM file (you)
-p publisher of the original content

The -i is powerful, it creates a fulltext index which will allow you to search the offline Website, even if the original site did not have such.

If you mirror a site which changes and a new version to be expected, use site-to-mirror-YYYY-MM-DD.zim and use --name=site-to-mirror with zimwriterfs, so the ZIM file reflects the date, but within the ZIM file the normalized name is also known.

Spiritdude's Public Notebook

Internot: Offline World with ZIM Files

Fulltext Index: The Main Advantage

Worthwhile Datasets

Readers

Create Your Own ZIM File

Mirror A Website

Create ZIM File

See Also

Leave a comment Cancel reply

Fulltext Index: The Main Advantage

Worthwhile Datasets

Readers

Create Your Own ZIM File

Mirror A Website

Create ZIM File

See Also

Share this:

Related

Leave a comment Cancel reply