A Raspberry Pi and an old hard drive were gathering dust in my drawer until the Internet Archive hack made headlines. Now they’re the heart of my local web archiving system, preserving everything from favorite blog posts to open-source projects. In this article, I’ll walk you step by step through how I built a private internet archive with ArchiveBox and gained a measure of digital preservation independence.
Why I Chose to Self-Host a Private Internet Archive
The Internet Archive’s recent security breach hit the digital preservation community, and everyone who benefits from its work, like a thunderbolt. On October 9th, 2024, hackers compromised the site and stole a massive user authentication database containing 31 million records.
What made matters worse was that this wasn’t the end of the Archive’s troubles. Just when they managed to restore some services by October 21st, hackers gained access to their Zendesk support system, demonstrating that the vulnerability ran deeper than initially thought.
Though the Archive has since resumed operations, its future remains uncertain because security breaches aren’t the only threat to digital preservation. A recent federal appeals court ruling dealt another significant blow to the Internet Archive: it found that the Archive’s digital lending library wasn’t protected by the fair use doctrine, which means the Archive could be forced to remove a significant chunk of its content.
The implications are clear: the need for personal control over digital preservation has never been more apparent. The good news is that anyone can set up a private internet archive using a Raspberry Pi and ArchiveBox with ease.
My Recommended Raspberry Pi Archive Hardware Setup
If you’re ready to create your own private internet archive, then you’ll need some hardware.
First and foremost, you’ll need a Raspberry Pi. For the best experience, I highly recommend the latest Raspberry Pi 5 because its significantly improved performance means your archiving tasks will run smoother and faster, and you’ll have plenty of headroom for future expansion of your archive.
That said, don’t feel pressured if you already own a Raspberry Pi 4B with 4GB or 8GB of RAM. These models are perfectly capable of running a personal archive, and they actually have one interesting advantage over the Pi 5: hardware H.264 video encoding. This becomes particularly valuable if you plan to stream archived videos to your TV or other devices around your home.
Along with your Pi, here’s what else you’ll need:
- microSD card: A 32GB microSD card is sufficient to get ArchiveBox up and running. This microSD card will serve as the Pi’s main boot drive, so make sure to choose a quality, reliable one to prevent any data corruption issues in the future.
- External hard drive: For the actual archive storage, you’ll want a 3.5″ external hard drive with its own power supply. Why? Because 3.5″ drives offer better reliability over time compared to more modern SSDs, and reliability is what matters the most when it comes to archiving.
- Monitor, mouse, and keyboard: These are technically optional but can make setup easier, especially if you’re configuring the Pi for the first time. Alternatively, you can control it entirely remotely using tools like SSH (Secure Shell Protocol), VNC (Virtual Network Computing), or RDP (Remote Desktop Protocol).
Once you have all these items on hand, you’re ready to start setting up your self-hosted internet archive!
Preparing a Software Environment for Archiving
The first step is to get an operating system up and running on your Raspberry Pi. I personally recommend Raspberry Pi OS because, as the official OS for Raspberry Pi devices, it’s by far the most popular and supported option available. You can follow our Raspberry Pi OS installation guide if you don’t know how to put it on your microSD card.
And if you’re feeling adventurous, you might want to explore some of the alternative operating systems available for the Raspberry Pi.
Once you have the operating system installed, boot up your Pi and connect it to the internet (it doesn’t matter if you use a wired or wireless connection). Then launch Terminal and perform a system update with the command:
sudo apt update && sudo apt full-upgrade
When it comes to installing ArchiveBox, you have three options: Docker, an automatic setup script, or using your system’s package manager. I strongly recommend going with Docker. Not only does it provide the smoothest installation and update experience, but it also gives you the best security isolation and includes all the dependencies right out of the box.
Unfortunately, Docker isn’t pre-installed on Raspberry Pi OS, so we’ll need to set that up first (don’t forget to also perform the post-installation steps).
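If you’d rather not follow a separate guide, one common route is Docker’s convenience script. The sketch below is only one possible approach and assumes that curl and sudo are available on your Pi; the function name install_docker_if_missing is a hypothetical helper of my own, not something Docker or ArchiveBox ships.

```shell
# Sketch: install Docker with the convenience script from get.docker.com,
# but only when the docker command is not already available.
# install_docker_if_missing is a hypothetical helper name.
install_docker_if_missing() {
    if command -v docker >/dev/null 2>&1; then
        echo "Docker is already installed"
        return 0
    fi
    # Download the script to a file first so you can read it before running it.
    curl -fsSL https://get.docker.com -o get-docker.sh
    sudo sh get-docker.sh
    # Post-installation step: allow the current user to run docker without sudo.
    # Log out and back in afterwards for the group change to take effect.
    sudo usermod -aG docker "$USER"
}
```

After defining the function, run install_docker_if_missing once, log out and back in, and then check that docker compose version prints a version number.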
With Docker successfully installed, we’re ready to move on to installing ArchiveBox itself, which is going to be much simpler thanks to all the groundwork we’ve laid.
Installing and Running ArchiveBox
To install ArchiveBox using Docker, first create a directory where all your archived content will be stored. This will be your archive folder on the Raspberry Pi, so choose a location with ample storage, such as your external hard drive (you can navigate to it using the cd command):
mkdir -p archivebox/data && cd archivebox
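Before committing to a location, it can help to confirm that the directory really exists, is writable, and has room to grow. The following is a minimal sketch rather than an official ArchiveBox step; check_archive_dir is a hypothetical helper name.

```shell
# Sketch: confirm a directory exists and is writable before using it as the
# archive location, then report the free space on its filesystem.
# check_archive_dir is a hypothetical helper name.
check_archive_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir" >&2
        return 1
    fi
    # df -Pk prints one POSIX-format line per filesystem; column 4 is the
    # available space in 1024-byte blocks.
    df -Pk "$dir" | awk 'NR == 2 { printf "%d MiB free\n", $4 / 1024 }'
}
```

For example, check_archive_dir /mnt/external_drive would report the free space on the external drive if it is mounted at that path, and print an error if the path is missing.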
Next, download the official Docker Compose configuration file that defines how ArchiveBox should run:
curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
This configuration file is important because it sets up all the necessary components, including the web server and scheduled tasks. If you want to store your archive on an external drive instead of the Pi’s SD card (which is recommended), you’ll need to edit the “docker-compose.yml” file to point to your mounted drive location.
To do so, open the configuration file using any text editor, such as nano:
nano docker-compose.yml
Look for the volumes section under the archivebox service. By default, it looks something like this:
services:
    archivebox:
        ...
        volumes:
            - ./data:/data
We need to change ./data to reflect the full path to our external drive’s data directory. For example, if your drive is mounted at /mnt/external_drive, modify the line to look like this:
services:
    archivebox:
        ...
        volumes:
            - /mnt/external_drive/archivebox/data:/data
This tells Docker to store all ArchiveBox data in the “archivebox/data” directory on your external drive instead of using a relative path. Using the absolute path is important because it ensures Docker can always find your archive data, even if you run commands from different directories.
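If you prefer to script this edit instead of making it by hand, a sed one-liner can do the substitution. This is a sketch under the assumption that your file contains the default "- ./data:/data" mapping exactly as shown above; set_archive_volume is a hypothetical helper name, and the path you pass must not contain the characters | or &.

```shell
# Sketch: rewrite every default "./data:/data" volume mapping in a Compose
# file so it points at an absolute host path. A .bak copy of the original
# file is kept next to it. set_archive_volume is a hypothetical helper name.
set_archive_volume() {
    local compose_file="$1"
    local host_path="$2"
    # "|" is used as the sed delimiter because the paths contain slashes.
    sed -i.bak "s|- \./data:/data|- ${host_path}:/data|" "$compose_file"
}
```

Running set_archive_volume docker-compose.yml /mnt/external_drive/archivebox/data updates every service that uses the default mapping, so it is worth opening the file afterwards to confirm the result is what you expect.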
While you’re at it, you can also add the PUID and PGID environment variables to match your Pi’s user account. Find your user ID and group ID by running id -u and id -g, then add them to the environment section:
services:
    archivebox:
        ...
        environment:
            - PUID=1000  # replace with your user ID
            - PGID=1000  # replace with your group ID
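To avoid copying the numbers by hand, you can have the shell print ready-to-paste lines for the current user. This is just a convenience sketch; puid_pgid_lines is a hypothetical helper name, and the indentation in the output is an assumption that you may need to adjust to match your file.

```shell
# Sketch: print PUID/PGID entries for the current user, formatted as YAML
# list items for the environment section of docker-compose.yml.
# puid_pgid_lines is a hypothetical helper name.
puid_pgid_lines() {
    printf '            - PUID=%s\n' "$(id -u)"
    printf '            - PGID=%s\n' "$(id -g)"
}

puid_pgid_lines
```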
Finally, comment out or remove the sonic (faster and better searching for large collections) and novnc (allows you to set up a profile with logins to the sites you want to archive) services. The configuration of these optional services is beyond the scope of this guide, so I recommend you follow the official documentation if you’re interested in them.
Save the file and exit the editor. Now initialize your archive and create an admin user to access the web interface:
docker compose run archivebox init
docker compose run archivebox manage createsuperuser
Once the initialization is completed, you can start the ArchiveBox server:
docker compose up -d
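Because the -d flag starts the containers in the background and returns right away, the web server may need a little time before it answers requests, especially on a Pi. The sketch below polls a URL with curl until it responds; wait_for_http is a hypothetical helper name, and the default attempt count and delay are arbitrary values I picked for illustration.

```shell
# Sketch: poll a URL until it answers or the attempts run out.
# Usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
# wait_for_http is a hypothetical helper name.
wait_for_http() {
    local url="$1"
    local attempts="${2:-10}"
    local delay="${3:-3}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl return non-zero on HTTP errors; the body is discarded.
        if curl -fsS -o /dev/null "$url"; then
            echo "up after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "no response from $url" >&2
    return 1
}
```

For example, wait_for_http http://localhost:8000 tries up to ten times, three seconds apart. If it gives up, docker compose logs archivebox is a good place to look for the reason.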
You can now access your ArchiveBox instance by opening a web browser on the Pi and navigating to http://localhost:8000. From another device on your network, replace localhost with the Pi’s IP address.
Configuring and Using ArchiveBox
To customize ArchiveBox’s behavior, you don’t need to edit configuration files directly. Instead, use the config command to modify settings. For example, I always adjust timeouts and resource limits for better performance on the Raspberry Pi:
docker compose run archivebox config --set MEDIA_TIMEOUT=3600
docker compose run archivebox config --set TIMEOUT=60
docker compose run archivebox config --set MEDIA_MAX_SIZE=750m
You can also disable submitting to archive.org to speed up archiving:
docker compose run archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
All settings are automatically saved in the ArchiveBox.conf file in your data directory, and you can view current settings anytime by running:
docker compose run archivebox config
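If you find yourself changing several settings at once, a small wrapper saves some typing. The following is a sketch, not part of ArchiveBox; apply_settings is a hypothetical helper name, and it assumes you run it from the directory that contains your docker-compose.yml.

```shell
# Sketch: apply several ArchiveBox settings in one go.
# Usage: apply_settings KEY=VALUE [KEY=VALUE...]
# apply_settings is a hypothetical helper name.
apply_settings() {
    local setting
    for setting in "$@"; do
        # --rm removes the temporary container once the command finishes.
        docker compose run --rm archivebox config --set "$setting"
    done
}
```

For example, apply_settings TIMEOUT=60 MEDIA_TIMEOUT=3600 SAVE_ARCHIVE_DOT_ORG=False applies the three settings one after another.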
With the basic setup complete, you can start adding content to your archive. ArchiveBox supports multiple ways to add URLs. The most straightforward one is the web interface. You simply click the Add button, paste your URLs, and click the Add URLs and archive button.
In some situations, it can be more convenient to archive via the command line. For example, to archive a single webpage, you can run:
docker compose run archivebox add 'https://example.com'
Or to archive an entire list of URLs from a text file:
docker compose run -T archivebox add < urls.txt
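Hand-maintained URL lists tend to collect blank lines, notes, and duplicates over time. The sketch below tidies such a file before you feed it to ArchiveBox; clean_url_list is a hypothetical helper name, and it assumes one URL per line with no leading whitespace.

```shell
# Sketch: print the http(s) URLs from a file, one per line, without repeats.
# Usage: clean_url_list FILE
# clean_url_list is a hypothetical helper name.
clean_url_list() {
    # Strip carriage returns, keep only lines that start with http:// or
    # https://, and drop duplicates while preserving the original order.
    tr -d '\r' < "$1" | grep -E '^https?://' | awk '!seen[$0]++'
}
```

You could then pipe the result straight into the add command, for example clean_url_list urls.txt | docker compose run -T archivebox add, where -T stops Docker Compose from allocating a terminal so that the piped input gets through.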
Finally, you can import from various bookmark services, including Pocket, Pinboard, or Instapaper. Please check the official wiki for detailed instructions.
Remember that your archive is only as safe as the backups you maintain. I highly recommend setting up a reliable backup strategy, using dedicated Linux backup software, to guard the content you’re preserving against drive failures, power loss, and accidental deletions.
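If you don’t have backup software set up yet, even a dated tarball on a second drive is better than nothing. The sketch below uses plain tar rather than a dedicated backup tool; backup_archive is a hypothetical helper name, and for a consistent copy of the database I’d assume you stop the containers with docker compose stop before running it.

```shell
# Sketch: create a date-stamped, compressed copy of the archive data folder.
# Usage: backup_archive SRC_DIR DEST_DIR
# backup_archive is a hypothetical helper name.
backup_archive() {
    local src="$1"
    local dest="$2"
    local out
    out="$dest/archivebox-$(date +%Y-%m-%d).tar.gz"
    mkdir -p "$dest"
    # -C switches to the parent directory so the tarball stores relative paths.
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
    # Print the path of the tarball that was written.
    echo "$out"
}
```

A full tarball gets slow as the archive grows, so treat this as a starting point and move to an incremental tool once your collection reaches a serious size.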
article source: https://www.maketecheasier.com/turn-raspberry-pi-into-private-internet-archive/