Internot: Offline World with ZIM Files

657px-Wikipedia-logo-v2.svg
Wikipedia: Collaborative Human Knowledge

The past years I worked on a semantic analysis and indexing framework all on-premise, which meant to run Wikidata (Wikipedia for machines) instance locally, and hopefully Wikipedia, Wiktionary as well to lookup those as well – but how? To mirror Wikipedia.org would take a long time.

Well, the people at Kiwix.org did that for you using their mwoffliner, and addressed low bandwidth or offline world and still have access to Wikipedia, and invented a file-format called ZIM format:

  • one file
  • contains mixed media like text, images, audio, videos
  • optional compression for medias
  • fast access to the media
  • open definition
  • includes fulltext index

Fulltext Index: The Main Advantage

Nowadays (but not earlier) the ZIM files contain a fulltext- and often also title-index based on Xapian, which allows the ZIM Reader to provide fulltext search, and this is the main advantage over mirroring worthwhile web-sites yourself and open it with your browser as it will not give you an offline search engine otherwise.

Worthwhile Datasets

Screenshot from 2020-04-05 08-08-39
zim/ZIM.pm: ZIM Catalog
Screenshot from 2020-04-09 13-45-44
zim/ZIM.pm: Multi-Indices Search

Readers

Following platforms are supported so you can read your ZIM files:

  • Kiwix Android
  • Kiwix iOS
  • Kiwix Desktop (Windows, Mac, Linux), a bit clumsy UI
  • Local Web-Server:
    • kiwix-tools/kiwix-serve supports multiple ZIM files aka library but lacks RESTful interface for fulltext search (download the binaries, compiling is tedious)
    • zim/ZIM.pm my own solution written in Perl (2020/04), very experimental:
      • CLI & web-server, multiple ZIM files aka library support
      • RESTful API (fulltext search, catalog)

Create Your Own ZIM File

With zimwriterfs (also rather download the binary) you can create your own ZIM file:

  1. use wget or httrack to mirror a web-site
  2. use zimwriterfs to make a ZIM file

Mirror A Website

Either

% wget -m -k -L https://site-to-mirror.com/

or

% httrack https://site-to-mirror.com/ '-#L10000000'

after a while you might have the mirror of the site in hand as site-to-mirror.com/

  • wget: best include --wait=1 --random-wait  as well to slow down and act gently to web-servers
    • not usable to rerun or continue a failed mirroring
    • tends to fail to mirror a site completely (missing parts)
  • httrack: works with multiple sites in the same directory
    • good to continue failed or aborted mirroring
    • introduces some inconsistencies e.g. I noticed html content is altered occasionally and changes <img src="test.gif"> to <img src="test.html"> – see httrack FAQ why; remedy:
      • use --http-10 to avoid such problems
    • more complete mirrors (compared with wget)

Create ZIM File

zimwriterfs is a kind of tedious program as it wants a couple mandatory settings without falling back to common sense:

% zimwriterfs -w index.html -f favicon.ico -c "me" -p "me" \
-t "Title of the site" -d "Brief description" -l en -i \
site-to-mirror.com/ site-to-mirror.zim

Perhaps there is no favicon.ico but imgs/favicon.png, just look at the site-to-mirror.com/index.html and check for <link rel="shortcut icon" href=".. and use that reference for -f.

  • -w relative path to welcome page
  • -f relative path to favicon
  • -c creator of ZIM file (you)
  • -p publisher of the original content

The -i is powerful, it creates a fulltext index which will allow you to search the offline Website, even if the original site did not have such.

If you mirror a site which changes and a new version to be expected, use site-to-mirror-YYYY-MM-DD.zim and use --name=site-to-mirror with zimwriterfs, so the ZIM file reflects the date, but within the ZIM file the normalized name is also known.

See Also

Leave a comment