WebCopy 1.8 - JavaScript Support

One of the long-standing requests/complaints is for WebCopy to support JavaScript-enabled websites, e.g. modern SPAs where JavaScript is used to build the page. Traditionally this is something I have always put onto the furthest of back burners, as in order to support this natively I'd essentially have to write half a browser, something that would be a full-time job and a half and not something I'm interested in doing. Other solutions did exist but I never really looked into them.

It recently occurred to me, however, that I'd put into place all the building blocks I needed to have WebCopy support JavaScript execution (in a limited fashion, more on this later) using Internet Explorer. And it was easy; in fact, the hardest part was sorting out threading issues. Although WebCopy currently only crawls on a single thread, it runs on a different thread to the UI so as not to freeze it, and COM can have a problem with that.
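For readers unfamiliar with this kind of threading problem, here is a minimal Python sketch of one common way to deal with a component that must only be touched by the thread that created it. It is not WebCopy's code and not necessarily how the issue was solved there: `ThreadBoundBrowser` and `BrowserHost` are hypothetical names, and the real complications of COM (apartments, message pumping) are not modelled.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadBoundBrowser:
    """Hypothetical stand-in for a component that may only be used on
    the thread that created it (an assumption made for illustration)."""

    def __init__(self):
        self._owner = threading.get_ident()

    def load(self, url):
        if threading.get_ident() != self._owner:
            raise RuntimeError("used from the wrong thread")
        return "loaded " + url


class BrowserHost:
    """Owns a single worker thread and runs every browser call on it."""

    def __init__(self):
        # With max_workers=1 every submitted call runs on the same thread
        self._executor = ThreadPoolExecutor(max_workers=1)
        # Create the browser on the worker thread, not the caller's thread
        self._browser = self._executor.submit(ThreadBoundBrowser).result()

    def load(self, url):
        # Hand the call over to the owning thread and wait for the result
        return self._executor.submit(self._browser.load, url).result()

    def close(self):
        self._executor.shutdown()


host = BrowserHost()
results = []

# Simulate a crawler thread that is neither the UI thread nor the owner
crawler = threading.Thread(target=lambda: results.append(host.load("dom.php")))
crawler.start()
crawler.join()
host.close()
print(results)
```

The point of the sketch is only that callers never touch the thread-bound object directly; they ask the owning thread to do the work on their behalf.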

That is a big warning message!

The end result? A new Use Web Browser option can be found in the Project Properties dialog. When set, WebCopy will do its own downloading and remapping of content, but it will use an embedded Internet Explorer session to do the crawling.

The current version of WebCopy can't detect links generated via JavaScript

The screenshot above shows a scan of the WebCopy demonstration site. The page dom.php has a few lines of JavaScript to build a list of links. As seen above, previous versions of WebCopy are completely oblivious to these extra links.
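To show why a crawler that only parses the downloaded HTML misses such links, here is a small Python sketch. Everything in it is made up for illustration: the page source is not the real dom.php, the "rendered" markup is written by hand rather than produced by a browser, and `LinkCollector` and `extract_links` are hypothetical names rather than anything from WebCopy.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, as a static crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page source as downloaded: one static link plus a script
# that would add two more links when run in a browser.
SOURCE_HTML = """
<html><body>
<a href="index.php">Home</a>
<ul id="list"></ul>
<script>
  var pages = ["one", "two"];
  var list = document.getElementById("list");
  for (var i = 0; i < pages.length; i++) {
    var a = document.createElement("a");
    a.href = pages[i] + ".php";
    list.appendChild(a);
  }
</script>
</body></html>
"""

# Hypothetical DOM after a browser has executed the script (hand-written)
RENDERED_HTML = """
<html><body>
<a href="index.php">Home</a>
<ul id="list"><a href="one.php"></a><a href="two.php"></a></ul>
</body></html>
"""

static_links = extract_links(SOURCE_HTML)
rendered_links = extract_links(RENDERED_HTML)

# Links that only exist once the script has run
missed = [link for link in rendered_links if link not in static_links]
print(static_links, missed)
```

The parser never runs the script, so the two generated links simply do not exist as far as the static scan is concerned.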

A small step for software applications, a giant leap for WebCopy

The image above is the same website scanned using WebCopy 1.8 with the new option enabled. You can see how it has detected additional links because JavaScript was allowed to execute. If you peer hard enough you will also see that the scan was significantly slower using this technique.

Listing the cons

Although I'm pleased to be able to finally offer this functionality, there are a few caveats.

This functionality is very new, and very experimental. It is by no means certain that I have ironed out all the potential issues. Caveat Emptor!

  • Crawling may be substantially slower. HTML documents will be downloaded twice, and the headless web browsing will also add significant overhead
  • JavaScript is being executed. This can lead to your sessions being fingerprinted or tracked, malicious content being downloaded, any number of things
  • This functionality currently uses the latest version of Internet Explorer that is installed on your computer. Not all websites play nicely with IE
  • Keeping with the Internet Explorer theme, it will share and use global cookies
  • Some options won't apply - for example the user agent. If a website is particularly unfriendly, it may serve different content to WebCopy than it does to the hosted Internet Explorer session
  • WebCopy will remap only the original document it downloads, not the JavaScript-executed version. I don't plan on changing this behaviour
  • This system only supports non-interactive scripts, e.g. JavaScript that executes when the page loads. I have no intention of supporting scripts that normally require user interaction to run, e.g. clicking a button or scrolling a window
  • It occurs to me as I write this post that I have no idea what will happen if the scripts try to open a popup window. Probably nothing good!
  • Potentially more issues. Experimental code!

I don't want to use Internet Explorer, can't I use Chrome or Firefox?

Neither do I. Microsoft have dropped the ball so many times with web browsers I'm amazed they are still in the game. Although I wish they'd just decoupled Edge from the OS and updated it more frequently rather than giving in to Google and adopting Chromium. But I've probably stated this before, plus, as usual, I digress.

To get back to the point, I expect future versions of WebCopy will support both Firefox and Chromium. However, as these browsers are several times larger than WebCopy, they won't be included by default. So I also need to have a nice system so that you can easily add extra browser engines to WebCopy from within the application and without needing to install anything.

I'm also considering supporting Edge, as Microsoft appear to be adding support for this to .NET, as long as you're on the latest Windows 10. However, given that it's probably "old" Edge, this may not happen: adding support for two obsolete browsers, one of them only available to a fraction of users, would be a waste of time I simply don't have.

I'll have more to write about this in future I'm sure!

Update History

  • 2019-06-29 - First published
  • 2020-11-23 - Updated formatting

Comments

Takashi

Thanks!

Nola

Thank you very much for your work. This software works great. Hope to add multiple languages. Or built-in translatable files. So that users can translate on their own.

Richard Moss

Hello,

Thanks for the comment. The software does support multiple languages, but I added multilingual support to my code base years after WebCopy was released so it will be a significant job to rework all the text (and there's a lot!) to allow this. It is on the list of things to do though, when time allows.

Regards;
Richard Moss

Neiz

Wouldn't it be possible to implement this feature by using something like puppeteer ?

Richard Moss

Hello,

Probably, yes. I expect it is similar to how the IE mode currently works and how future modes will work when I have more time and inclination to look into this feature. However, I won't make it a condition that a user is expected to install Node and Chrome/Chromium, so that probably means I won't be using existing solutions like puppeteer.

Regards;
Richard Moss