A few months ago I saw some talk on the WordPress Slack of searching plugin code to check for usage of a particular function. That sounded cool so I thought I would try. It turns out people were using Mark Jaquith’s WordPress Plugin Directory Slurper or one of its many forks.
In short, the Directory Slurper is a PHP script which fetches a list of all plugins and then attempts to download the zip archive of each. The script performs the downloads one at a time, each being a lengthy HTTP call, so a full run could easily take the best part of a day. The local files can then be searched with tools like grep, ack, ag and ripgrep.
This process has served the WordPress community well for years but it felt prohibitively difficult. Not everyone can run PHP scripts locally or has enough SSD space (over 40GB required) for reasonable search times. Even with a good SSD and the fastest search tool (ripgrep) it still takes me between 15 and 40 minutes to do a search. This would be significantly longer with a laptop or on anything with a mechanical hard disk.
What if it could be better?
Two advantages of faster searching immediately stood out to me. First, it would significantly reduce the wait time when decisions rely on search results. Because so much happens through online communication and weekly meetings, waiting for search results has the power to significantly delay decision making. Second, reducing the barrier to entry could encourage other uses across the community, from plugin developers to security researchers.
There is also potential for improving usability: a web service would skip the need for local downloads and allow people to share search results easily.
How hard can it be?
Let us step back and take a look at the challenge ahead of us. We need to download all 64,000 plugins and 6,000 themes, then update our copy on changes (over 600 times per day). We then need to perform regular expression searches of text inside over 2,500,000 files.
Ideally we would do all of this from a central web service and much faster than is possible locally. How hard can it be?
The First Attempt
My first prototype aimed to significantly improve the speed and ease with which people could maintain a local copy of the files. This came in the form of WPDS (WordPress Directory Slurper), which was built with Go, providing easy concurrent downloads and cross-platform executables with no dependencies. It achieved its aims: it was easy to run on Windows, Mac or Linux machines, and downloading every plugin and theme took around 2-3 hours.
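To illustrate why Go makes concurrent downloads easy, here is a minimal sketch of a bounded worker pool. It is not the actual WPDS code: the names `fetchAll`, `fetchFunc` and `httpFetcher`, and the URL layout of `baseURL + name + ".zip"`, are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"sync"
)

// fetchFunc downloads one named item. It is passed in so the pool
// can be exercised without touching the network.
type fetchFunc func(name string) error

// fetchAll runs fetch for every name using a fixed number of worker
// goroutines and returns the failures keyed by name.
func fetchAll(names []string, workers int, fetch fetchFunc) map[string]error {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		failed = make(map[string]error)
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls names until the channel is closed.
			for name := range jobs {
				if err := fetch(name); err != nil {
					mu.Lock()
					failed[name] = err
					mu.Unlock()
				}
			}
		}()
	}
	for _, name := range names {
		jobs <- name
	}
	close(jobs)
	wg.Wait()
	return failed
}

// httpFetcher returns a fetchFunc that saves baseURL+name+".zip" into dir.
// The URL layout is hypothetical, chosen only for illustration.
func httpFetcher(baseURL, dir string) fetchFunc {
	return func(name string) error {
		resp, err := http.Get(baseURL + name + ".zip")
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("%s: unexpected status %s", name, resp.Status)
		}
		out, err := os.Create(filepath.Join(dir, name+".zip"))
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, resp.Body)
		return err
	}
}
```

Calling `fetchAll(names, 20, httpFetcher(base, dir))` would keep up to 20 downloads in flight at once, so the slow HTTP calls overlap instead of queueing behind one another. The worker count is a knob to tune against bandwidth and the remote server's tolerance.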
Something feels wrong
Despite the success of WPDS I was still dissatisfied. For all the improvements, users were still left with local searches taking 15-40 minutes on good hardware. This feeling built up until I decided to shelve WPDS while I researched alternative options.
While trawling the internet for answers I came across this article by Russ Cox on the history of Google codesearch and soon after found the etsy/hound project. Hound is fundamentally what I wanted to achieve; it just needed to scale massively and persist search results.
Google’s codesearch uses trigram indexing, splitting text into blocks of three characters. The linked article does a good job of explaining what trigram indexing is but the important result is that it can be used with regex to work out if a match is possible in a file. This allows you to avoid searching files where no match is possible, significantly speeding up most searches.
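As a rough illustration of the idea, the sketch below builds a tiny trigram index and uses it to narrow down which files could contain a literal string. It is a simplification and not how codesearch, hound or WPDirectory are implemented: a real implementation derives a trigram query from the regular expression itself, whereas this example only handles plain substrings. The names `trigrams`, `buildIndex` and `candidates` are made up for the example.

```go
package main

import "sort"

// trigrams returns the set of distinct 3-byte substrings of s.
func trigrams(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for i := 0; i+3 <= len(s); i++ {
		set[s[i:i+3]] = struct{}{}
	}
	return set
}

// index maps each trigram to the set of file names containing it.
type index map[string]map[string]struct{}

// buildIndex indexes every file's content by trigram.
func buildIndex(files map[string]string) index {
	idx := make(index)
	for name, content := range files {
		for t := range trigrams(content) {
			if idx[t] == nil {
				idx[t] = make(map[string]struct{})
			}
			idx[t][name] = struct{}{}
		}
	}
	return idx
}

// candidates returns the sorted names of files that contain every
// trigram of literal. ok is false when literal is shorter than three
// bytes, in which case the index cannot narrow the search at all.
func (idx index) candidates(literal string) (names []string, ok bool) {
	want := trigrams(literal)
	if len(want) == 0 {
		return nil, false
	}
	counts := make(map[string]int)
	for t := range want {
		for name := range idx[t] {
			counts[name]++
		}
	}
	for name, n := range counts {
		if n == len(want) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names, true
}
```

A file can contain all the trigrams without containing the string itself, so each candidate still has to be scanned with the real search. The gain comes from never opening the files that are ruled out.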
The Second Attempt
My next prototype, WPDirectory, worked much like etsy/hound but instead of indexing Git repositories it indexed all WordPress Plugins and Themes.
Early tests showed that indexing a single plugin resulted in a significant search time improvement, but I was not sure if that would scale to the whole WordPress Directory on reasonable hardware. Indexing required a lot of memory and the total index size for Plugins was over 15GB.
Once I got the prototype maintaining a copy of the Plugin files it was time to test. Simply downloading and indexing the plugins peaked RAM usage at 4-5GB, with the Go garbage collector taking time to release the memory. This meant it was out of reach of the small VPS I am used to deploying/testing web services on.
Therefore I ran the development version locally on my PC (i7 920 @ 4.4GHz, 12GB RAM, 56GB SSD). Searching worked well but I noticed a curious warmup effect: after the first search, subsequent searches were much faster (dropping from 60 seconds to 12 seconds).
Looking into the warmup effect highlighted the main bottleneck of the service: how quickly it could get the indexes to the CPU. In this case, once the OS had cached the index files in RAM, searches were much faster. Clorith kindly assisted performance testing with an NVMe SSD, which showed that it worked just as well with incredibly fast read speeds.
We have a winner
At this point I knew the idea was resource intensive but worth developing; it had potential. The next task was turning it into a usable public service, primarily by giving it a user interface. This came in the form of a React-based frontend and a REST API.
Fast forward a month of bug fixes and code refactoring, and you get what WPDirectory is today. I tried to keep the service as simple as possible to ensure it is easy for anyone else to run. It can be compiled for any OS and runs with zero dependencies. It uses a built-in database, boltdb, to store data in a local file and stores all indexes on disk too. It runs a webserver itself, automatically provisioning and refreshing SSL certificates using Let's Encrypt.
Now all it needed was somewhere to host it long-term and luckily enough DreamHost stepped in to help. They have provided an instance on their DreamCompute platform with plenty of resources, resulting in most searches taking less than 10 seconds. Remember, that is searching over 2 million files in under 10 seconds. That still impresses me.
I will likely be supporting, maintaining and improving WPDirectory for years to come. There are many ways I can improve it, and my main dissatisfaction is the search UI. I am certain that I can create a much more intuitive and powerful user interface by better utilising the power of React. There are also many suggestions for improving the search functionality through advanced options (file type, comments, etc).
I have loved every step of this process and I take great satisfaction in moving the process of searching the WordPress Directories forward. It is my way of paying forward what WordPress and its community have done for me.
I am very grateful for the support of those who have helped me to reach this point, like Marius ‘Clorith’ Jensen, and to DreamHost for providing a long-term home for the service. I also need to mention Mark Jaquith, Ipstenu and others for their excellent work on the original Directory Slurper scripts, without which I would never have stumbled onto this subject.