The Two Formats
When I launched the EWWW Image Optimizer plugin six years ago, there were three image formats we were concerned with: JPG, PNG and GIF. The JPG format was ideal for photos, PNG was for decoration, and the only thing left for GIFs was animation. Most images on the web were JPGs, and so compressing your JPGs was the quickest way to make your site faster. As with any area of computing, folks had been trying to improve upon the compression offered by the JPG format. Some worked within the JPG format itself, and others decided it was time for a new format. Along came Google, and they decided to make an attempt with a format they called WebP. Although WebP can now replace PNG and GIF, we’ll focus on the WebP and JPG formats.
Somewhere in the second year of development (of EWWW IO), someone brought this new contender to my attention. It was new, it was small, and it made your site faster with a minimal loss in quality. Since I was a sucker for new and shiny things, I made EWWW IO the first plugin to add support for WebP, in version 2.0. All you had to do was run a bulk optimize, add the rewriting rules, and your site loaded faster than ever before. Who would want to keep using JPG when WebP was smaller and faster?
The Two Methods
The main problem, then and now, is that not all web browsers support the WebP format. The original solution was to rewrite URLs at the server level, which was cache-friendly and simple. That’s what I thought, anyway. We soon found that server-level rewriting only worked if there were no proxy servers between the origin and the user. Thus, a new method was born, and I called it Alternative WebP Rewriting. Clever, right?
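To make the server-level idea concrete, here is a minimal sketch in Python. It is not the set of rewrite rules EWWW IO ships; the function name `choose_image` and the convention of appending `.webp` to the original filename are assumptions made for illustration. The server looks at the browser's `Accept` header and only swaps in the WebP copy when the browser says it can handle one.

```python
def choose_image(requested_path, accept_header, existing_files):
    """Return the path the server should send for this request.

    requested_path: the URL path the browser asked for, e.g. "/img/a.jpg"
    accept_header:  the raw value of the request's Accept header
    existing_files: a set of paths known to exist on disk (hypothetical)
    """
    # Browsers that can display WebP advertise "image/webp" in Accept.
    # This is a plain substring check; it ignores quality values (q=...).
    accepts_webp = "image/webp" in accept_header.lower()

    # Assumed naming convention: the WebP copy sits next to the original.
    webp_path = requested_path + ".webp"

    if accepts_webp and webp_path in existing_files:
        return webp_path
    return requested_path


files = {"/img/a.jpg", "/img/a.jpg.webp"}
print(choose_image("/img/a.jpg", "image/webp,image/*", files))
print(choose_image("/img/a.jpg", "image/png,image/*", files))
```

Notice that the same URL can now return two different files. That is a plausible reason a proxy in the middle causes trouble: if it caches one response per URL and ignores the `Accept` header, it may hand the WebP copy to a browser that cannot display it.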
The new method meant finding the images in the HTML of every page, and parsing HTML code is not unlike trying to herd cats. I was familiar with regular expressions, which are a clever way to search any given text for any given pattern. It’s a very powerful language, and I thought certainly it could handle searching HTML for images. However, everywhere I looked, everyone said the same thing: use the built-in parser for your language. That sounded great, and I proceeded to discover the DomDocument class in PHP.
The Two Problems
So what went wrong? The problems stemmed from the HTML standard(s). In theory, everyone follows the standard, and life is easy-peasy. In reality, there is a lot of non-standard code, and the standards have also changed significantly over the last few years.
The first problem was what happened when the DomDocument parser found invalid HTML. It doesn’t just leave it alone; it tries to fix the code. This means any theme or plugin that doesn’t strictly follow the standard can break when the page is parsed with DomDocument. Now, I don’t blame the developers, not too much… The real blame belongs at the feet of the browsers: Internet Explorer, Chrome, Firefox, etc.
In most coding languages, if you don’t follow the syntax, your program breaks. It stops running, and tells you what you did wrong (or tries to). When it comes to HTML and web pages, the browser tries to display the page anyway, quietly correcting the code if it’s invalid. But this auto-correction is a double-edged sword. It makes life easy for you, by fixing your problems, but it allows you to write broken code without ever knowing about it. Sure, if your code is broken enough, you’ll notice, and fix it. In (too) many cases, though, the result is plugins and themes that ship with broken code.
The strict nature of DomDocument introduces a second problem. As I mentioned, the standards have changed dramatically over the last few years. This is great for front-end developers, as the changes open up a whole new world. This is bad news for anyone trying to develop a plugin used on hundreds of thousands of websites. As the standards have changed, libxml (the library that powers DomDocument) has been updated to recognize code built on those new standards, but a site only benefits if it runs a recent enough libxml. A common problem we’ve seen is plugins that use <noscript> tags within the <head> portion of a web page. If that website uses a libxml version older than 2.8, DomDocument will move a huge chunk of code from the head into the body of the page.
Okay, sit down, and brace yourself… Version 2.8 of libxml was released six years ago, before the very first release of EWWW IO. That means this bug should be long gone, right? Based on the usage data folks have submitted over the last year, fully 25% of websites are using a libxml version older than 2.8. Needless to say, I’ve been looking for another way to parse HTML for a long time. Needless? Well, I said it anyway…
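If you ever need to guard against an old libxml yourself, the check is simple but easy to get wrong. Here is a minimal sketch in Python; the function names are hypothetical, and it assumes the version string is purely numeric, like the dotted version PHP exposes in its `LIBXML_DOTTED_VERSION` constant. The trap is comparing versions as text, where "2.10" sorts before "2.8". Comparing the numeric parts avoids that.

```python
def version_tuple(dotted):
    """Turn a dotted version such as "2.7.8" into (2, 7, 8).

    Assumes every part is a plain integer (no "-rc1" style suffixes).
    """
    return tuple(int(part) for part in dotted.split("."))


def noscript_in_head_is_safe(libxml_version):
    """True when libxml is 2.8 or newer, the cutoff named above."""
    return version_tuple(libxml_version) >= (2, 8)


for version in ("2.7.8", "2.8.0", "2.10.0"):
    print(version, noscript_in_head_is_safe(version))
```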
Rethinking WebP Rewriting
Last year, as some of you know, we released a new service called ExactDN. The code for ExactDN is derived from the Photon module of the popular Jetpack plugin. When I first started digging into the code, I discovered something surprising: it used Regex to find the images in the page. Sound familiar? Oh yeah, that’s what I originally was going to use for Alt WebP mode, until everyone said “Use the built-in parser!”
Now don’t get me wrong, the DomDocument parser can be amazing, in the right circumstances. It can save you a lot of time and effort, and if you are doing more than searching for images in a page: Use the built-in parser! The Regex language is full of tricks, and even the patterns used by Photon and ExactDN weren’t perfect. Don’t worry, I fixed them!
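To give a feel for the regex approach, here is a minimal sketch in Python. It is not the pattern used by Photon, ExactDN or EWWW IO; the function name `add_webp_hints`, the `data-webp` attribute, the `has_webp` callback and the `.webp` suffix are all made up for illustration. The point is that it only touches the image tags it recognizes, and leaves everything else in the page exactly as it found it, invalid markup included.

```python
import re

# Matches a whole <img ...> tag. Assumes no attribute value contains ">".
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

# Matches a quoted src attribute, but not data-src or similar.
# Group 1 is the quote character, group 2 is the URL.
SRC_ATTR = re.compile(
    r"""(?<![\w-])src\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def add_webp_hints(html, has_webp):
    """Add a (hypothetical) data-webp attribute to JPG image tags.

    has_webp: a callable that takes an image URL and returns True
              when a WebP copy of that image is available.
    """
    def rewrite_tag(match):
        tag = match.group(0)
        src = SRC_ATTR.search(tag)
        if src is None:
            return tag  # no quoted src attribute, leave the tag alone
        quote, url = src.group(1), src.group(2)
        if not url.lower().endswith((".jpg", ".jpeg")) or not has_webp(url):
            return tag
        # Insert the new attribute right after "<img", reusing the tag's
        # own quote style so the URL cannot end the attribute early.
        return tag[:4] + " data-webp=" + quote + url + ".webp" + quote + tag[4:]

    return IMG_TAG.sub(rewrite_tag, html)


page = '<p>unclosed <img class="x" src="/a.jpg"><img src=\'/b.png\'>'
print(add_webp_hints(page, lambda url: True))
```

A script in the page could then swap in the `data-webp` URL for browsers that support WebP. The sketch also shows why such patterns are rarely perfect on the first try: unquoted attributes and a `>` inside an attribute value are both cases this one does not handle.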
So, the latest version of EWWW IO now has a completely rewritten WebP parser. Beyond ripping out DomDocument, the new parser has been rewritten in an object-oriented style. This allowed me to make the parser much more consistent and reliable. The rewrite also led to a breakthrough in WebP support for ExactDN. The absolute simplest way to enable WebP on your site is to enable ExactDN and Alternative WebP Rewriting. Two check-boxes, and your site is faster than ever before.
Of course, you can still use WebP without ExactDN. Turn on the WebP conversion option, run a bulk optimize, and enable Alt WebP mode. It’ll take longer, and use more disk space, but it works just fine. In fact, you can even use this method for free. Yeah, it doesn’t get more awesome than faster websites for free! I’m super excited about this new WebP function, and I can’t wait for you to try it out. So go for it! Grab the latest version of EWWW IO, and see what it can do for your site.