Web Data For Machines
Many applications today rely on quality web data for their success.
The exponential expansion of the web, its complexity, the requirement for frequent updates, and the need for customized web sources at scale are all too much for traditional web crawling techniques to handle.
Businesses today need web data that they can consume on-demand more than ever - whether for media monitoring, financial analysis, market research or for domain protection and data breach detection & mitigation.
That’s why Webz.io delivers clean, organized, and structured web data in a machine-readable format, so you can consume all the open, deep, and dark web data that your business needs and focus on your product.
How do we do this?
Webz.io structures web data with extracted, inferred, and enriched fields. Every source we crawl is identified as a “post,” an indexed record matching a specific news article, blog post, or online discussion post or comment.
We then extract standard fields common to these source types, including URL, title, body text, and external links.
Here’s a breakdown of the different types of fields and examples of each:
- Extracted - Standard elements in most web pages like title, body text, and URL.
- Inferred - This is information that is not explicitly included in the raw data, like language, country, author, and publication date.
- Enriched - These fields have a deeper layer of meaning and need more processing power. For example, how do we know whether the word “fox” refers to the animal, the entertainment company, or Michael J. Fox?
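To make the three field types concrete, here is a minimal sketch of how a single "post" record could be modeled. The class name and every field name below are hypothetical, chosen only to mirror the examples in the list above; they are not the actual Webz.io schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Post:
    """Illustrative post record; all field names are assumptions."""

    # Extracted: standard elements present in most web pages.
    url: str
    title: str
    text: str
    external_links: List[str] = field(default_factory=list)

    # Inferred: not explicit in the raw data, so each may be missing.
    language: Optional[str] = None
    country: Optional[str] = None
    author: Optional[str] = None
    published: Optional[str] = None  # e.g. an ISO 8601 date string

    # Enriched: meaning resolved by further processing, e.g. which
    # "fox" a mention refers to (hypothetical mention -> type mapping).
    entities: Dict[str, str] = field(default_factory=dict)


post = Post(
    url="https://example.com/articles/fox",
    title="A fox was seen downtown",
    text="Residents spotted a fox near the station.",
    language="english",
    entities={"fox": "animal"},
)
```

Keeping the inferred fields optional reflects that they are derived rather than read directly from the page, so a consumer should be prepared for them to be absent.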
How is Webz.io identified on the web?
To differentiate between crawling operations and maintain ethical web scraping practices, we utilize two user agents.
- webzio [Full String: webzio (+https://webz.io/bot.html)] - This user agent is used by hundreds of search engines, and its goal is to conduct social listening on intelligence platforms.
- webzio-extended [Full String: webzio-extended (+https://webz.io/bot.html)] - This user agent is dedicated to determining if the data collected is permissible for AI use cases.
Both user agents include a direct link to a publicly accessible webpage (https://webz.io/bot.html) for enhanced transparency and ethical crawling.
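For site owners who want to recognize these crawlers in their logs, the following is a minimal sketch that matches a User-Agent header against the two full strings quoted above. The function name is hypothetical, and the sketch assumes the header contains exactly one of those full strings with nothing else added; real requests may differ.

```python
import re
from typing import Optional

# Matches only the two full strings quoted in the text above.
# Assumption: the header carries no extra tokens before or after.
_WEBZ_UA_PATTERN = re.compile(
    r"^(webzio(?:-extended)?) \(\+https://webz\.io/bot\.html\)$"
)


def identify_webz_agent(user_agent: str) -> Optional[str]:
    """Return 'webzio', 'webzio-extended', or None for any other agent."""
    match = _WEBZ_UA_PATTERN.match(user_agent.strip())
    return match.group(1) if match else None


print(identify_webz_agent("webzio (+https://webz.io/bot.html)"))
print(identify_webz_agent("webzio-extended (+https://webz.io/bot.html)"))
print(identify_webz_agent("Mozilla/5.0"))
```

The pattern is anchored at both ends so that an unrelated agent that merely mentions "webzio" somewhere in its string is not mistaken for either crawler.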
How does Webz.io classify URLs by site type?
Webz has defined guidelines to distinguish between the following site types: BLOGS | DISCUSSIONS | NEWS | REVIEWS.
No approach can classify a URL's site type with 100% accuracy, but to normalize the process and be transparent with our customers, Webz publishes the following guidelines, which we apply automatically:
Blogs
If the URL contains the word ‘blog’, then it's a BLOG
OR
If the URL is part of a major blog platform (BlogSpot, WordPress, Medium, Ghost, and more), then it's a BLOG
News
If the URL's domain is defined as ‘news’ in our local DB of news domains, then it's NEWS.
If our AI solution defines most of the articles in the domain's biggest section as news, then it's NEWS.
Discussions
If the URL's domain isn't classified as news or blog, then it's DISCUSSIONS
Reviews
Manually defined for ecommerce sites
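The guidelines above can be sketched as a small rule-based function. This is an illustration under stated assumptions, not Webz.io's implementation: the domain sets are hypothetical stand-ins for the blog platform list, the local DB of news domains, and the manually defined ecommerce sites; the `mostly_news` flag stands in for the result of the AI solution; and the order in which the rules are checked (reviews first, then blogs, then news) is an assumption, since the text does not specify it.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for data the guidelines refer to.
BLOG_PLATFORM_DOMAINS = {"blogspot.com", "wordpress.com", "medium.com", "ghost.io"}
NEWS_DOMAINS = {"example-news.com"}    # stands in for the local DB of news domains
REVIEW_DOMAINS = {"example-shop.com"}  # stands in for manually defined ecommerce sites


def _matches(host: str, domains: set) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify_site_type(url: str, mostly_news: bool = False) -> str:
    """Apply the guidelines in an assumed order and return the site type.

    mostly_news is a hypothetical input: whether most articles in the
    domain's biggest section were judged to be news.
    """
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, REVIEW_DOMAINS):
        return "REVIEWS"
    if "blog" in url.lower() or _matches(host, BLOG_PLATFORM_DOMAINS):
        return "BLOGS"
    if _matches(host, NEWS_DOMAINS) or mostly_news:
        return "NEWS"
    # Anything not classified as news or blog falls through to discussions.
    return "DISCUSSIONS"


print(classify_site_type("https://example.com/blog/post-1"))
print(classify_site_type("https://www.example-news.com/world/story"))
print(classify_site_type("https://forum.example.org/thread/42"))
```

Because DISCUSSIONS is the fallback, any URL that none of the earlier rules recognize ends up there, which is one reason no such scheme can be fully accurate.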