The past years I worked on a semantic analysis and indexing framework all on-premise, which meant to run Wikidata (Wikipedia for machines) instance locally, and hopefully Wikipedia, Wiktionary as well to lookup those as well – but how? To mirror Wikipedia.org would take a long time.
Well, the people at Kiwix.org did that for you using their mwoffliner, and addressed low bandwidth or offline world and still have access to Wikipedia, and invented a file-format called ZIM format:
- one file
- contains mixed media like text, images, audio, videos
- optional compression for medias
- fast access to the media
- open definition
- includes fulltext index
Fulltext Index: The Main Advantage
Nowadays (but not earlier) the ZIM files contain a fulltext- and often also title-index based on Xapian, which allows the ZIM Reader to provide fulltext search, and this is the main advantage over mirroring worthwhile web-sites yourself and open it with your browser as it will not give you an offline search engine otherwise.
Worthwhile Datasets
- English Wikipedia: 19M articles, wikipedia_en_all_maxi.zim (76GB)
- common human knowledge base
- Gutenberg Project: 1M books, gutenberg_en_all_2018-10.zim (54GB)
- public domain books (txt, html, pdf, epub)
- StackOverflow: 71M articles, stackoverflow_en_all_2019-02.zim (136GB)
- people asking questions and people answering those
Readers
Following platforms are supported so you can read your ZIM files:
- Kiwix Android
- Kiwix iOS
- Kiwix Desktop (Windows, Mac, Linux), a bit clumsy UI
- Local Web-Server:
kiwix-tools/kiwix-serve
supports multiple ZIM files aka library but lacks RESTful interface for fulltext search (download the binaries, compiling is tedious)zim/ZIM.pm
my own solution written in Perl (2020/04), very experimental:- CLI & web-server, multiple ZIM files aka library support
- RESTful API (fulltext search, catalog)
Create Your Own ZIM File
With zimwriterfs
(also rather download the binary) you can create your own ZIM file:
- use
wget
orhttrack
to mirror a web-site - use
zimwriterfs
to make a ZIM file
Mirror A Website
Either
% wget -m -k -L https://site-to-mirror.com/
or
% httrack https://site-to-mirror.com/ '-#L10000000'
after a while you might have the mirror of the site in hand as site-to-mirror.com
/
wget
: best include--wait=1 --random-wait
as well to slow down and act gently to web-servers- not usable to rerun or continue a failed mirroring
- tends to fail to mirror a site completely (missing parts)
httrack
: works with multiple sites in the same directory- good to continue failed or aborted mirroring
- introduces some inconsistencies e.g. I noticed html content is altered occasionally and changes
<img src="test.gif">
to<img src="test.html">
– see httrack FAQ why; remedy:- use
--http-10
to avoid such problems
- use
- more complete mirrors (compared with
wget
)
Create ZIM File
zimwriterfs
is a kind of tedious program as it wants a couple mandatory settings without falling back to common sense:
% zimwriterfs -w index.html -f favicon.ico -c "me" -p "me" \ -t "Title of the site" -d "Brief description" -l en -i \ site-to-mirror.com/ site-to-mirror.zim
Perhaps there is no favicon.ico
but imgs/favicon.png
, just look at the site-to-mirror.com/index.html
and check for <link rel="shortcut icon" href="..
and use that reference for -f
.
-w
relative path to welcome page-f
relative path to favicon-c
creator of ZIM file (you)-p
publisher of the original content
The -i
is powerful, it creates a fulltext index which will allow you to search the offline Website, even if the original site did not have such.
If you mirror a site which changes and a new version to be expected, use site-to-mirror-YYYY-MM-DD.zim
and use --name=site-to-mirror
with zimwriterfs
, so the ZIM file reflects the date, but within the ZIM file the normalized name is also known.
See Also
- Kiwix.org: ZIM Files: long list of multi-lingual ZIM files
- Archive.org: ZIM Archive: even longer list, older ZIM files