Do you provide historical data?

Yes. The archive provides access to data older than 30 days.

What is the scope of your language and geographic coverage?

Webz.io supports 150+ languages across every geographic territory with online access.

Is it possible to query for posts in multiple languages?

Yes. Use a simple OR Boolean query.

For example:

(language:german OR language:chinese)

This query matches posts written in either German or Chinese.
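If you build such queries in code, a small helper can assemble the OR clause for any number of languages. This is a minimal sketch; the function name languages_query is hypothetical and not part of the API.

```python
def languages_query(languages):
    """Build an OR clause that matches posts in any of the given languages."""
    if not languages:
        raise ValueError("at least one language is required")
    clauses = [f"language:{name}" for name in languages]
    return "(" + " OR ".join(clauses) + ")"


print(languages_query(["german", "chinese"]))
# prints: (language:german OR language:chinese)
```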

How are results sorted?

By default (when the sort parameter isn't specified), results are sorted by crawl date, which is the recommended order. You can, however, change the sort order by using one of the following values:

relevancy
published
social.facebook.likes
social.facebook.shares
social.facebook.comments
social.gplus.shares
social.pinterest.shares
social.linkedin.shares
social.stumbledupon.shares
social.vk.shares
replies_count
participants_count
performance_score
domain_rank
ord_in_thread
rating

For example, the following call will return posts ordered by the number of likes:

https://api.webz.io/filterWebContent?token=XXXX-XXXX-XXXX&format=json&q=*&sort=social.facebook.likes
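The same call can be assembled programmatically. The sketch below uses only the endpoint, parameters, and sort values listed above; the helper name build_search_url is hypothetical, and the token shown is a placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.webz.io/filterWebContent"

# Sort values copied from the list above.
SORT_VALUES = {
    "relevancy", "published",
    "social.facebook.likes", "social.facebook.shares", "social.facebook.comments",
    "social.gplus.shares", "social.pinterest.shares", "social.linkedin.shares",
    "social.stumbledupon.shares", "social.vk.shares",
    "replies_count", "participants_count", "performance_score",
    "domain_rank", "ord_in_thread", "rating",
}


def build_search_url(token, query, sort=None, fmt="json"):
    """Return a request URL; omit sort to keep the default crawl-date order."""
    if sort is not None and sort not in SORT_VALUES:
        raise ValueError(f"unknown sort value: {sort}")
    params = {"token": token, "format": fmt, "q": query}
    if sort is not None:
        params["sort"] = sort
    # urlencode percent-encodes reserved characters in the values (e.g. * in q).
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("XXXX-XXXX-XXXX", "*", sort="social.facebook.likes"))
```

Validating the sort value locally catches typos before a request is spent on them.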

Why do the thread and post URLs go through Omgili.com?

On the free plan, URLs for posts and threads redirect through Omgili.com with a 5-second redirect lag. This shows site owners that Webz.io is a significant source of referral traffic.

My result set shows the same article link multiple times - don't you filter out duplicates?

We do filter out duplicates. However, you may see the same article link multiple times if your query matches multiple comments on the same article. Webz.io searches at the post level, so results include every post that matched your query. Each post also carries information about its containing thread, and one of the thread's properties is the article link. That is why the same link can appear more than once.

To search only for the first post (i.e., only the article and no comments), add is_first:true to your query. For example:

opera is_first:true

This returns only articles (i.e., no comments) containing the word "opera".
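If you do want comments in the results but need each article link only once, you can collapse the links on the client side. The sketch below assumes each post is a dictionary with a nested "thread" object whose "url" key holds the article link; that shape is an assumption for illustration, so check it against an actual response.

```python
def unique_article_links(posts):
    """Return article links in first-seen order, one per thread."""
    seen = set()
    links = []
    for post in posts:
        link = post["thread"]["url"]  # assumed response shape
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


# Hypothetical sample: two comments and one article from two threads.
sample = [
    {"thread": {"url": "http://example.com/a"}},
    {"thread": {"url": "http://example.com/a"}},
    {"thread": {"url": "http://example.com/b"}},
]
print(unique_article_links(sample))
# prints: ['http://example.com/a', 'http://example.com/b']
```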

How many keywords can we track per month?

You can enter any Boolean query with no set limit to the number of tracked keywords. The plan limit refers to the number of monthly requests, which you can upgrade at any time.

How many sources do you crawl? / Can you share your complete list of sources on your crawling cycle?

It is impossible to provide an up-to-date list of crawled sites. Our site database is dynamic by nature and continuously aggregates new sources. We can tell you, however, that the number of crawled sites runs into the millions, with over 10 million posts indexed daily.

We pride ourselves on our ability to add new sources quickly, typically within a few hours of detection.

Moreover, you can use the API playground (https://webz.io/web-content-api) to confirm coverage for a particular source. Our users frequently send us source requests (often including a long list of sources). If you send us a list of sources, we will include them in our coverage and send you a confirmation within a few days.

Does your search support entity extraction (like people, companies, locations)?

Yes. You can search by person, location, or organization on news or blog posts in English. For example, organization:apple will return news or blog posts mentioning Apple the company, and not the fruit.

My search yielded more than 100 results. How do I retrieve the rest of the data?

Page through the results by calling the URL in the "next" field of each response until you have retrieved all the data matching your search query.
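A paging loop might look like the sketch below. Only the "next" field comes from the answer above; the "posts" key, the stop-on-empty-page rule, and the fetch_json callable (any function that takes a URL and returns the decoded JSON response) are assumptions for illustration.

```python
def iter_posts(first_url, fetch_json, max_pages=100):
    """Yield posts page by page, following the "next" URL of each response."""
    url = first_url
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        page = fetch_json(url)
        posts = page.get("posts", [])  # assumed key for the result list
        if not posts:
            return
        yield from posts
        url = page.get("next")
        if not url:
            return


# Usage with a stand-in fetcher; a real one would issue an HTTP GET.
fake_pages = {
    "page1": {"posts": [1, 2], "next": "page2"},
    "page2": {"posts": [3], "next": "page3"},
    "page3": {"posts": [], "next": "page4"},
}
print(list(iter_posts("page1", fake_pages.__getitem__)))
# prints: [1, 2, 3]
```

Passing the fetcher in as a parameter keeps the loop independent of any particular HTTP library.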

How can I get all the posts of a thread?

To extract an entire thread, use the "thread.url" filter. This will return all the posts belonging to the thread URL provided. Example:

thread.url:http://domain.com/param=val

(Note that you must escape the http:// part of the URL like so: http\:\/\/.)
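The escaping can be automated. The sketch below assumes the filter follows Elasticsearch query string conventions, where a backslash precedes reserved characters such as the colon and forward slash; it escapes every colon and slash in the URL, not only the scheme, which is also an assumption. The helper name thread_url_filter is hypothetical.

```python
def thread_url_filter(url):
    """Return a thread.url filter with ':' and '/' backslash-escaped."""
    escaped = (
        url.replace("\\", "\\\\")  # escape existing backslashes first
        .replace(":", "\\:")
        .replace("/", "\\/")
    )
    return "thread.url:" + escaped


print(thread_url_filter("http://domain.com/param=val"))
# prints: thread.url:http\:\/\/domain.com\/param=val
```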

Can I get the highlighted fragments that matched my query?

Yes. Just add highlight=true as a parameter to your call.

Does the API support nested boolean expressions as well?

Boolean expressions can be nested in as many levels as you want.

For example: (exp1 AND exp2 AND exp3) OR (exp4 AND (exp5 OR exp6)) -(exp7 AND (exp8 OR exp9))
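When queries are generated by code, small helpers keep the parentheses balanced. This is a sketch with hypothetical helper names (all_of, any_of, exclude); it produces an expression equivalent to the example above, with one extra pair of outer parentheses.

```python
def all_of(*exprs):
    """Join expressions with AND inside parentheses."""
    return "(" + " AND ".join(exprs) + ")"


def any_of(*exprs):
    """Join expressions with OR inside parentheses."""
    return "(" + " OR ".join(exprs) + ")"


def exclude(expr):
    """Negate an expression with a leading minus sign."""
    return "-" + expr


query = " ".join([
    any_of(
        all_of("exp1", "exp2", "exp3"),
        all_of("exp4", any_of("exp5", "exp6")),
    ),
    exclude(all_of("exp7", any_of("exp8", "exp9"))),
])
print(query)
# prints: ((exp1 AND exp2 AND exp3) OR (exp4 AND (exp5 OR exp6))) -(exp7 AND (exp8 OR exp9))
```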

Do you limit the length of the query or the maximum number of Boolean clauses that I can use?

The maximum length of a query is 4,000 characters.
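Long generated queries can be checked before they are sent. This sketch assumes the limit is counted on the raw query text before URL encoding, which the answer above does not specify; the helper name check_query_length is hypothetical.

```python
MAX_QUERY_LENGTH = 4000  # limit stated above


def check_query_length(query):
    """Return the query unchanged, or raise if it exceeds the limit."""
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(
            f"query is {len(query)} characters; the limit is {MAX_QUERY_LENGTH}"
        )
    return query


check_query_length("opera is_first:true")  # passes silently
```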

Does the API support wildcard expressions as the query?

The query syntax is based on Elasticsearch query string syntax, which means you can use wildcards.

Do you rate limit API calls?

Rate limits are applied per access token. You can make one request per second. Exceeding the rate limit results in an HTTP 429 error.
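A client can stay under the limit by spacing its calls. The sketch below is one possible approach, not an official client: the Throttle class is hypothetical, and it only enforces a minimum gap between calls made with one token from one process.

```python
import time


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)
throttle.wait()  # returns immediately on the first call
# ... issue the API request here, then call throttle.wait() before the next one
```

If a 429 still comes back (for example, when several processes share a token), pausing before retrying is a reasonable fallback.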

Can I disable stemming when searching for an exact term?

Yes. Stemming is enabled by default. To disable it, append a dollar sign ($) to the end of the keyword. For example, searching for the keyword "simplivity" will also return hits for the word "simple", since we index the stemmed version of the word. If you want to find documents that contain "simplivity" and nothing else, search for "simplivity$".
Stemmed searches are currently supported for English, Spanish, Arabic and Russian.

Can you share the list of sites you are crawling?

Webz.io doesn't rely on a whitelist to crawl the web. Our crawlers find new sites and new content dynamically, so any list we sent would quickly become out of date and misleading. If you want to know whether we crawl a source, you can either use the "site:" filter or email [email protected] with the list of sites you want to check.

Do you have any filter that will return data only from the top sites you crawl?

Yes. There are multiple ways to get higher-quality posts, whether from popular websites or from popular posts themselves. The first is the domain_rank filter, which indicates how popular a domain is (by monthly traffic). To search for posts from the top 1,000 sites, use the following:

domain_rank:<1000
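The filter can be combined with keywords in the same way as other filters, separated by a space. In the sketch below the helper name top_sites_query is hypothetical; note that characters such as < and : need URL encoding when the query is placed in the q parameter.

```python
from urllib.parse import urlencode


def top_sites_query(keywords, max_rank=1000):
    """Append a domain_rank filter to a keyword query."""
    return f"{keywords} domain_rank:<{max_rank}"


query = top_sites_query("opera")
print(query)
# prints: opera domain_rank:<1000

# urlencode takes care of the space, colon, and less-than sign.
print(urlencode({"q": query}))
```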

The second option is to use performance_score or social signals to filter for posts that went viral or were shared or liked many times on social networks.