How does it work?

Web Data For Machines

Many applications today rely on quality web data for their success.
But traditional web crawling methods can’t keep up with the exponential growth of the web, its complexity, the need for constant updates, and the demand for customized web sources at scale.
More than ever, businesses need web data they can consume on demand, whether for media monitoring, financial analysis, market research, domain protection, or data breach detection and mitigation.

That’s why Webz.io delivers clean, organized, and structured web data in a machine-readable format, so you can consume all the open, deep, and dark web data that your business needs and focus on your product.

How do we do this?

Webz.io structures web data with extracted, inferred, and enriched fields. Every source we crawl is identified as a “post,” an indexed record matching a specific news article, blog post, or online discussion post or comment.
We then extract standard fields common to these source types, including URL, title, body text, and external links.
Here’s a breakdown of the different types of fields and examples of each:

  • Extracted - Standard elements in most web pages like title, body text, and URL.
  • Inferred - Information that is not explicitly included in the raw data, like language, country, author, and publication date.
  • Enriched - These fields have a deeper layer of meaning and need more processing power. For example, how do we know whether the word “fox” refers to the animal, the entertainment company, or Michael J. Fox?
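To make the three field types concrete, here is a minimal sketch of how a single "post" record could be modeled. The class name, field names, the `kind` metadata tag, and the sample values are all illustrative assumptions, not the actual Webz.io schema.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class Post:
    """Hypothetical post record; names are assumptions, not the Webz.io schema."""

    # Extracted: standard elements read directly from the page
    url: str = field(metadata={"kind": "extracted"})
    title: str = field(metadata={"kind": "extracted"})
    text: str = field(metadata={"kind": "extracted"})
    external_links: List[str] = field(
        default_factory=list, metadata={"kind": "extracted"}
    )

    # Inferred: not explicitly present in the raw data
    language: Optional[str] = field(default=None, metadata={"kind": "inferred"})
    country: Optional[str] = field(default=None, metadata={"kind": "inferred"})
    author: Optional[str] = field(default=None, metadata={"kind": "inferred"})
    published: Optional[str] = field(default=None, metadata={"kind": "inferred"})

    # Enriched: needs deeper processing, e.g. deciding what "fox" refers to
    entities: List[Dict[str, str]] = field(
        default_factory=list, metadata={"kind": "enriched"}
    )


def fields_by_kind(post: Post) -> Dict[str, Dict[str, object]]:
    """Group a post's values under 'extracted', 'inferred', and 'enriched'."""
    grouped: Dict[str, Dict[str, object]] = {}
    for f in fields(post):
        grouped.setdefault(f.metadata["kind"], {})[f.name] = getattr(post, f.name)
    return grouped


# Made-up sample record for illustration only
example = Post(
    url="https://example.com/news/1",
    title="Fox spotted downtown",
    text="A red fox was seen crossing Main Street.",
    language="english",
    entities=[{"name": "fox", "type": "animal"}],
)
grouped = fields_by_kind(example)
```

Here `grouped["enriched"]["entities"]` carries the disambiguated meaning of "fox", while `grouped["extracted"]` holds only what the page itself states.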

How is Webz.io identified on the web?

To differentiate between crawling operations and maintain ethical web scraping practices, Webz.io uses two primary User Agents (UAs).

  • webzio [Full String: webzio (+https://webz.io/bot.html)] - This User Agent is used by hundreds of search engines developed for social listening and intelligence platforms.
  • webzio-extended [Full String: webzio-extended (+https://webz.io/bot.html)] - This User Agent is dedicated to determining whether the collected data is permissible for AI use cases.

Both UAs include a direct link to a publicly accessible webpage (https://webz.io/bot.html) for enhanced transparency and ethical crawling.
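As an illustration of how a site owner might address the two UAs separately, the sketch below checks a hypothetical robots.txt with Python's standard `urllib.robotparser`. The rules, the example.com URL, and the helper name `allowed_agents` are assumptions for this example; it also assumes both crawlers honor robots.txt groups addressed to these tokens, which the text above does not spell out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: permit general crawling, opt out of AI use cases.
# The standard-library parser matches agent names by substring and applies the
# first matching group, so the more specific "webzio-extended" group is first.
ROBOTS_TXT = """\
User-agent: webzio-extended
Disallow: /

User-agent: webzio
Allow: /
"""


def allowed_agents(robots_txt, url, agents=("webzio", "webzio-extended")):
    """Return {agent: bool} showing which agents may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, "https://example.com/article")
```

With these rules, `result` maps `webzio` to `True` and `webzio-extended` to `False`. Reversing the order of the two groups would cause this particular parser to apply the `webzio` group to both agents, so group order matters when testing rules this way.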