Adding A Search Engine to A Static Site
Tutorial
January 01, 2018
I have been working on getting a search engine implemented on my new Hexo-generated static website. It’s been a mission to say the least! Since I’m moving my site from the Jekyll generator to Hexo, I have a working example of what the end goal is. Here are the steps I took to translate the search function from Jekyll to Hexo, and what ultimately led to success.
Note: This walk-through assumes you already have a Hexo site set up and access to it from the command line. If that’s not the case, check out the Hexo docs, this intro to Hexo video series, or my other Hexo posts for more info about setting up a Hexo site.
Generating Search Data
This is the necessary first step for searching a static site. After all, we need something to search! For a static site, search data can be stored as a JSON file which contains documents for all of the content on your site. For example:
[ |
In real life all of these JSON documents would also contain fields for the post content, any tags or categories, etc, like in the first index. You get the idea!
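As a rough illustration of the shape such a file can take, here is a minimal sketch that builds a small search-data array and serializes it. The posts and the field names (title, path, text) are invented for the example rather than taken from a real site.

```javascript
// Hypothetical posts; titles, paths, and text are invented for illustration.
const posts = [
  { title: "Hello World", path: "2018/01/01/hello-world/", text: "My first post." },
  { title: "Second Post", path: "2018/01/02/second-post/", text: "More to say." }
];

// Serialize to the JSON string that would be saved as the search data file.
const searchData = JSON.stringify(posts, null, 2);

// Parsing it back gives an array of plain objects, one per post.
const parsed = JSON.parse(searchData);
console.log(parsed.length);   // number of documents
console.log(parsed[0].title); // title of the first document
```

Each object in the array is one "document" as far as the search engine is concerned.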
Jekyll uses Liquid as a templating engine, and Jekyll will parse Liquid tags to create a JSON file containing all of this post data. But Hexo doesn’t have the same functionality with its templating engine (EJS), so we’ll need to generate the JSON in some other way.
Thankfully there are some Hexo plugins built to do just this. First I tried hexo-generator-search (link) but found that this plugin is optimized for outputting XML data; although it can make a JSON file, the output wasn’t clean and had lots of newline characters (\n) and other code remnants in the text. I also tried hexo-generator-json-content (link), which seemed more promising: it’s more customizable as far as which fields are included in the JSON file, and the miscellaneous characters occurred less frequently.
To install this plugin, navigate to your site’s directory in the command line and install the plugin with npm:
npm i --save hexo-generator-json-content
Add your personal configuration to your site’s _config.yml file (more on this below, or in the documentation), and the next time you run hexo generate or hexo server, a new file called content.json will appear in your site’s root folder.
So now that we’ve got the data, let’s search it!
Writing A Search Engine
Looking through a lot of Hexo theme repos, it seemed like writing a full search engine is how most of them handle search, if they have it. But since these engines are tied in with each individual theme, it was at times difficult to read through their code and pull apart the pieces that would apply to my own custom theme. I found this post about how to do it in your own theme, which was a start, but it was this more detailed post with code that laid it out more clearly (note: the post is in Chinese…thanks Google Translate!). Unfortunately, as awesome as Google Translate is, it also translated some of the code…hah!
Ultimately though, this code kept throwing errors, and in fixing it, I realized that this search engine does way more than I even need. My goal is to generate a list of posts that contain the search term. I won’t be displaying the full results or highlighting the search terms or even creating a results page. All of the examples and tutorials I found were doing this, so rather than getting this code to work with my site, I opted to look for something more to the point of what I needed.
Implementing Lunr
Lunr is a lightweight JavaScript search engine built to work with static sites. My old Jekyll site uses Lunr to search its JSON file. First I tried basically copying the code from the Jekyll site:
jQuery(function () { |
But this code doesn’t work with the Hexo configuration…in fact it doesn’t seem to work at all. Turns out Lunr went through a major upgrade, and this code no longer works with the current version of Lunr. So now I get to start from scratch! The Lunr docs are pretty helpful, so I went through them step by step.
Step 1: Confirm It’s Working At All
Step one was getting anything to show up in the search results. The main change from the code above to the newer version is that the add method must be called while the index is being created. I tried it with my JSON file first but it didn’t work. So following the documentation, I added a data object directly into the add method. I also simplified the result display to see what it actually gives back:
jQuery(function () { |
This gave the following result…not exactly diamonds, but at least we know it’s searching!
[{ |
Next I tried adding a second document into the add function in order to try more search terms. The result is that it could only search the first document, and would otherwise return an empty array. So let’s get that working next…
Step 2: Search Multiple Documents
The reason only the first document was searchable is that the add method only adds one document at a time. So we need to include a forEach function to add each document into the index. We also need to add a forEach to the display_search_results function to list each result:
jQuery(function () { |
Hooray! Now when we search, a result comes up for each document. If the term is in both (like the word love), two results are listed. And if we search for a term with no results (like the word alien), we are helpfully told as much.
But we’re still seeing completely unhelpful results which basically spit back what we searched for. So now let’s display something useful!
Step 3: Display Useful Results
So Lunr provides search results with a ref number, starting at the number 1:
[{ |
The data is stored in an array, and these ref numbers reflect the position in the array, although they’re off by one because array positions start at 0. So we can create an index variable to link ref with the original data, and then call whatever fields we want to display on the page from the data in the forEach loop within our display_search_results function:
results.forEach(function (result) { |
And we have a winner! Now when we search any term within either data item, its title is returned and displayed as a list item in the search results.
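The off-by-one lookup can be sketched in isolation. This assumes each document’s id starts at 1 and was used as the ref; the sample data and the titles_for helper name are invented for the example, and the real results come from Lunr rather than a hand-written array.

```javascript
// Sample documents; ids start at 1, so document n sits at array index n - 1.
const documents = [
  { id: 1, title: "First Post" },
  { id: 2, title: "Second Post" },
  { id: 3, title: "Third Post" }
];

// Stand-in for search results: objects carrying a ref.
// Refs are written as strings here; Number() copes with numbers too.
const sample_results = [{ ref: "2" }, { ref: "1" }];

// Map each result back to its document and collect the titles.
function titles_for(results, docs) {
  return results.map(function (result) {
    const index = Number(result.ref) - 1; // correct the off-by-one
    return docs[index].title;
  });
}

const titles = titles_for(sample_results, documents);
console.log(titles);
```

This only holds while the ref values line up with array positions, which is why it stops working later on once a different field is used as the ref.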
Step 4: Bringing In External JSON
Now that we know the search engine displays the results we want, let’s make the data live outside of this search function. After all, a new JSON file will be generated each time a new post is published to the site, not to mention it will be a pretty large file with new posts being added on a daily basis (sometimes as long as this one!). To start let’s use the same simple data but save it to a new file in the root directory of the site; I added a third document for testing purposes:
## /test.JSON |
The Jekyll site used a jQuery method to load the external JSON file:
var data = $.getJSON("/test.json");
However when I load this into the current file, it throws an error: data.forEach is not a function. By logging the value of data I see why: this jQuery method returns an object rather than just the array of data contained in the file; and the forEach method only works on arrays, not objects. So we need to find a way to access the array. It’s also necessary to recognize that the data may not be fully loaded as soon as the .getJSON method is called. This was the result the first time I tried it; the data were loaded, but not in time for the search function to run, and it threw data is not defined.
A bit of stack overflowing and documentation reading confirmed that indeed, the getJSON method returns a promise to get the JSON, but does not actually complete it at the time the promise is made. Since the request is asynchronous, JavaScript keeps processing the rest of the code (keeping that promise in its back pocket!) and the data isn’t actually there when we need it. To get around this we need to make sure that any functions which rely on the data are only called once the data have been fully loaded. So we can put the whole index builder function inside a callback which will only execute once the data have been loaded:
var idx; |
Notice that we also take idx out and declare it as a global variable; this is to ensure that it’s available to the .search method that will be run later as part of the search field’s event listener. We can confirm that the data and idx variables are logging the same values as they did when we had the data locally.
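To see the timing issue on its own, here is a small sketch that swaps the real request for a stand-in. fake_get_json is a hypothetical helper that mimics an asynchronous load with setTimeout; it is not jQuery’s API, and the sample document is invented.

```javascript
// Hypothetical stand-in for an asynchronous JSON request (not jQuery's API).
function fake_get_json(url, callback) {
  setTimeout(function () {
    // Pretend this array was fetched from `url`.
    callback([{ id: 1, title: "Sample post" }]);
  }, 0);
}

const events = [];
let loaded; // stays undefined until the callback runs

fake_get_json("/test.json", function (data) {
  loaded = data;
  events.push("callback ran with " + data.length + " document(s)");
  // Anything that needs the data (such as building the index) belongs here.
});

// This line runs first, before the callback has had a chance to fire.
events.push("code after the call");
console.log(loaded); // undefined at this point
```

The callback only fires after the current run of code has finished, which is why anything that depends on the data has to live inside it.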
But there’s still one step to go. We also need to load the data within our display_search_results function. We can wrap the existing function components within a similar AJAX callback to achieve the same results:
function display_search_results(results) { |
And voila! We have a working search engine accessing data from an external JSON file.
Step 5: Searching Blog Data
Now that we know everything is working as it should, it’s time to try searching with the JSON file which contains the blog data. If you can recall from waaaay at the beginning of this post, we generated a JSON file using the hexo-generator-json-content Hexo plugin. It gives us a format exactly like our test file, but with much, much more data. To start things off, let’s search only a few fields. We can turn fields on and off by adding rules to the site’s _config.yml file:
jsonContent: |
We also need to edit the index builder function to take the new key-value pairs into account:
var idx; |
Notice that we use this.ref instead of this.field for the path field. This tells Lunr to treat the post’s filepath as the results reference, rather than the arbitrary number we used with the dummy data earlier. Using path is ideal because unlike title it’s guaranteed to be unique; we can use this to build out the links on the results list later too. With this setup, our result output is slightly different than it was before:
// result |
Since we no longer have integers as ref values, we can’t use the same index lookup that we used with the dummy data. Instead we’ll need to pull out any matching documents from the original data, which we can do with the ref value and an object filter:
function filter_results(data, results) { |
This function has two steps: first we take the results (which is an array of objects including our ref values) and create a new array which only contains the search results’ path names. Then we take the data array (another array of objects) and filter it by those path names; indexOf is a JavaScript array method which returns -1 if a search element (the object path in this case) doesn’t exist in the array. The result is a new array containing only the posts relevant from the search; we store this in the variable matches.
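Going by the description above, the two steps might look roughly like this. It is a sketch written from the prose rather than a copy of the original function, and the sample data is invented.

```javascript
// Keep only the documents whose path matches one of the result refs.
function filter_results(data, results) {
  // Step 1: pull the ref (the post path) out of each search result.
  const paths = results.map(function (result) {
    return result.ref;
  });
  // Step 2: keep documents whose path appears in that list.
  // indexOf returns -1 when the value is not in the array.
  return data.filter(function (doc) {
    return paths.indexOf(doc.path) !== -1;
  });
}

// Invented sample data and results for illustration.
const sample_data = [
  { title: "First Post", path: "2018/01/01/first-post/" },
  { title: "Second Post", path: "2018/01/02/second-post/" },
  { title: "Third Post", path: "2018/01/03/third-post/" }
];
const sample_results = [
  { ref: "2018/01/03/third-post/" },
  { ref: "2018/01/01/first-post/" }
];

const matches = filter_results(sample_data, sample_results);
console.log(matches.map(function (doc) { return doc.title; }));
```

One thing to be aware of: filtering this way returns the matches in the order of the original data, not in the order of the search results.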
We can update our display_search_results function to display results based on the matches rather than the results:
function display_search_results(results) { |
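Since path doubles as the link target, the list markup can be built straight from the matches. The sketch below returns an HTML string rather than touching the page; render_results and escape_html are hypothetical helper names, the "no results" wording is made up, and it assumes each path is relative to the site root.

```javascript
// Escape characters that have special meaning in HTML.
function escape_html(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turn an array of matching posts into list items that link to each post.
function render_results(matches) {
  if (matches.length === 0) {
    return "<li>No results found</li>";
  }
  return matches.map(function (post) {
    return '<li><a href="/' + escape_html(post.path) + '">' +
      escape_html(post.title) + "</a></li>";
  }).join("\n");
}

const html = render_results([{ title: "Tips & Tricks", path: "2018/01/02/tips/" }]);
console.log(html);
```

The returned string could then be dropped into whatever element holds the results list.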
I think we all deserve a pat on the back, as we now have a fully operational search engine on our Hexo site 😎.
Wrapping Up
From here there are a few more customizations to do depending on your individual preferences. For example, if you have custom fields in your posts or want to include your tags, categories, etc., you can add them by editing your _config.yml file. Don’t forget to include these fields in the Lunr idx function too! You can also edit _config.yml to include pages in your search data; just note that the structure of objects in your content.json file will change as a result, and you’ll need to account for this in the index builder and display_search_results functions. Happy coding!