Julius Černiauskas, CEO and Founder of Oxylabs

How businesses can realise the full potential of data through web scraping

Julius Černiauskas, CEO of Oxylabs, a company providing premium proxies and public web data gathering solutions, shares his insight into how these tools are changing the way we do business.

Web scraping, the process of automating public data collection online, and proxies are intimately intertwined. Together, these tools have transformed the business world, from enabling entirely new business models to optimizing existing ones.

Let’s start from the beginning. What have proxies and web data gathering solutions brought to the business world?
These two solutions are closely linked. Proxies, by themselves, are simply intermediary IP addresses on various machines. While they have a multitude of uses, they are primarily employed for web scraping, alongside a few other use cases.

Without proxies, web scraping would be impossible. A lot of content, such as pricing, is country-specific and cannot be collected without a proxy located in that geolocation. Additionally, there are cases where an IP address might fail to access a website entirely; a proxy can help circumvent these issues.
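To make this concrete, here is a minimal Python sketch of routing a request through a geo-located proxy using only the standard library. The gateway address and credentials are purely illustrative placeholders, not a real provider endpoint.

```python
import urllib.request

def fetch_via_proxy(url: str, proxy_url: str, timeout: float = 10.0) -> bytes:
    """Route a GET request through the given proxy, so the target site
    sees the proxy's IP address and serves that region's content."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()

# Hypothetical gateway address; real providers supply their own
# hostnames, ports, and credentials.
GERMAN_PROXY = "http://user:pass@de.example-proxy.com:8080"

# Fetching a page through the (hypothetical) German exit point would
# return the pricing shown to visitors from that region:
# page = fetch_via_proxy("https://example.com/pricing", GERMAN_PROXY)
```

In practice, scraping at scale also rotates through many such proxy addresses so that no single IP sends an unusual volume of requests.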

In other words, the existence of proxies has enabled automated public data collection in its various forms, primarily web scraping. Its impact on a global scale, however, is significantly greater than one might expect.

Price comparison services, travel and hotel fare aggregation, some Data as a Service companies, and many other businesses rely upon web scraping as the core of their operations. They scrape various websites, collect a large number of data points, and present the resulting insights as their product.

Web scraping has entrenched itself within other industries as well. Ecommerce has been particularly quick to adopt web scraping for business intelligence purposes. Financial services companies use web scraping to extract signals from various publicly available sources, enabling them to create completely novel investment strategies.

What do you think lies ahead for web scraping?
We’re still at a midway point. I wouldn’t say it’s in the early stages of development and application, as there are enough companies built around web scraping to form an entirely new industry. Data acquisition, in general, has progressed light years in the past decade.

Unfortunately, it still hasn’t reached maturity. We, the industry as a whole, have progressed at an incredible rate, which hasn’t allowed supporting areas to catch up. As I’ve said many times, many questions remain unanswered, particularly surrounding the legal aspects of web scraping.

Despite the lack of widespread web scraping regulation, adoption is growing at an immense pace. Our recent survey on how data is managed and acquired in the finance sectors of the UK and US was illuminating.

In those sectors, web scraping is second only to internal data collection. Additionally, 80% of respondents believe their focus will shift further towards web scraping in the coming year, indicating a trend of immense growth.

Are there any technical developments that you foresee could change web scraping?
Of course. We have always been pushing ethical web scraping on one hand and innovation on the other. There is plenty of room for technological progress, which would have a dramatic impact on the process at large.

One of the key areas of development is machine learning (ML) and artificial intelligence (AI). We have already shown through our own solutions that ML offers plenty of room for operational improvement. Some processes, such as parsing, are extremely resource-intensive yet rather repetitive, making them a perfect target for machine learning.
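As an illustration of why parsing is repetitive, here is a toy Python parser built on the standard library. The `class="price"` markup is a hypothetical example; in practice, every target site needs its own hand-written rules like these, and every site redesign breaks them, which is precisely the kind of work ML-based adaptive parsers aim to take over.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of elements tagged with class="price"."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())

html = ('<ul><li><span class="price">$19.99</span></li>'
        '<li><span class="price">$4.50</span></li></ul>')
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # ['$19.99', '$4.50']
```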

Other processes we’ve been thinking about involve big-picture data collection. There’s a phase in data collection, which we have started calling “exploration”, that is highly complicated in its current state. For example, a client might want to collect pricing data for a specific category of products from a certain region.

It may seem simple, but we must first figure out where such data is located. As such, we are thinking about ways to optimize the discovery and exploration phases to reduce the lag between task formulation and data collection.
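A simplified sketch of what one exploration step might look like, assuming the candidate URLs come from a sitemap and that category pages are identifiable by a URL pattern (both assumptions are for illustration only; real discovery pipelines are far more involved):

```python
import re

def filter_category_urls(urls, category_pattern):
    """Toy 'exploration' step: given candidate URLs (e.g. from a
    sitemap), keep only those that look like pages in the target
    category, so that only those pages are scraped for pricing."""
    rx = re.compile(category_pattern)
    return [u for u in urls if rx.search(u)]

sitemap = [
    "https://shop.example.com/electronics/phone-123",
    "https://shop.example.com/garden/hose-9",
    "https://shop.example.com/electronics/laptop-7",
]
print(filter_category_urls(sitemap, r"/electronics/"))
# ['https://shop.example.com/electronics/phone-123',
#  'https://shop.example.com/electronics/laptop-7']
```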

All of these technical developments matter beyond competitive and innovative instincts. Any reduction in data collection costs makes web scraping accessible to smaller companies. Our goal has always been to shift web scraping towards something someone could do at home rather than something done only by large corporations.

Which industries do you think would benefit the most from investing more in web scraping?
It’s somewhat ironic, but I think ecommerce and financial services could do much more with web scraping, even though they are the sectors that have pushed automated data collection the most. There is still so much data left on the table that these industries have barely scratched the surface of its potential.

Among those yet to utilize web scraping more widely, I think the public sector and academia are only starting to get used to it. Unfortunately, the slow adoption may have a lot to do with the unregulated nature of the process.

As part of our initiative, however, we have been working with public sector partners across the world who benefit from web scraping solutions. Web scraping has immense potential to further the greater good. We hope that offering our tools and expertise on a pro bono basis will help legitimize web scraping and pull it out of the shadows, so that everyone can benefit from automated data collection.

We’ve had a few great successes already. Our partnership with RRT, a public sector organization in Lithuania, has allowed us to develop a tool that automatically scans the IP address space for illegal imagery (e.g. child abuse material) and reports offending content to the relevant authorities.

So, if you’re interested in partnering with us to further the greater good, drop us a line via partnerships@oxylabs.io or fill in the partnership contact form, and we will get back to you within a few days.

 
