FAQs

Do you provide historical data?

Yes. The archive gives you access to data older than 30 days.

What is your language and geographic coverage?

Webhose.io supports more than 150 languages and covers every geographic territory with online access.

Is it possible to query for reviews in multiple languages?

Yes. Use a simple OR Boolean query. For example:

(language:german OR language:chinese)

This searches for reviews written in either German or Chinese.
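As a rough illustration, the OR clause can be assembled from a list of language names. This is a minimal sketch; the helper name language_clause is hypothetical and is not part of any webhose.io client library.

```python
def language_clause(languages):
    """Join language names into an OR clause, e.g. (language:german OR language:chinese)."""
    if not languages:
        raise ValueError("at least one language is required")
    return "(" + " OR ".join(f"language:{name}" for name in languages) + ")"


print(language_clause(["german", "chinese"]))
# prints: (language:german OR language:chinese)
```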

How are results sorted?

By default (when the sort parameter isn't specified), results are sorted by crawl date, which is the recommended order. You can, however, change the sort order by setting the sort parameter to one of the following values:

  • relevancy
  • reviews_count
  • reviewers_count
  • spam_score
  • domain_rank
  • ord_in_thread
  • rating

For example, the following call returns reviews ordered by the number of reviews for an item:

https://webhose.io/filterWebContent?token=XXXX-XXXX-XXXX&format=json&q=*&sort=reviews_count
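The same call can be built programmatically. The sketch below uses only the endpoint and parameters shown in the example URL above; the function name build_url and the client-side check of the sort value are illustrative assumptions. Note that urlencode percent-encodes the "*" query as %2A, which decodes to the same value on the server side.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://webhose.io/filterWebContent"

# Sort values listed in this FAQ.
SORT_VALUES = {
    "relevancy", "reviews_count", "reviewers_count", "spam_score",
    "domain_rank", "ord_in_thread", "rating",
}


def build_url(token, query, sort=None):
    """Build a request URL; omit sort to keep the default crawl-date order."""
    if sort is not None and sort not in SORT_VALUES:
        raise ValueError(f"unsupported sort value: {sort!r}")
    params = {"token": token, "format": "json", "q": query}
    if sort is not None:
        params["sort"] = sort
    return f"{BASE_URL}?{urlencode(params)}"


url = build_url("XXXX-XXXX-XXXX", "*", sort="reviews_count")
print(url)
# Decoding the query string recovers the original parameter values.
print(parse_qs(urlparse(url).query))
```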

Why do the thread and post URLs go through Omgili.com?

On the free plan, URLs for posts and threads redirect through Omgili.com with a 5-second delay. This shows site owners that webhose.io is a significant source of referral traffic.

Do you filter out spam?

Each thread is given a spam score, ranging from 0 to 1, that indicates how spammy the text is. For example, you can filter out threads with a spam score higher than 0.5 by adding the term "spam_score:<=0.5" to the search query.

My result set shows the same link multiple times - don't you filter out duplicates?

We do filter out duplicates. You may still get the same item link multiple times if your query matches multiple reviews for the same item. Webhose.io searches at the review level, so the results include every review that matched your query. Each review also carries information about its containing item, and one of the item's properties is its link. That is why you might see the same link multiple times. If you want to search only first reviews, add is_first:true to your query. For example:

apple is_first:true

This returns only the first reviews containing the word "apple".

How many keywords can we track per month?

You can enter any Boolean query with no set limit to the number of tracked keywords. The plan limit refers to the number of monthly requests, which you can upgrade at any time.

How many sources do you crawl? / Can you share your complete list of sources on your crawling cycle?

It is impossible to provide an up-to-date list of crawled sites. Our site database is dynamic by nature and continuously adds new sources. We can tell you, however, that the number of crawled sites runs into the millions, with over 10 million posts indexed daily.

We pride ourselves on our ability to add new sources quickly, typically within a few hours of detection.

Moreover, you can use the API playground (https://webhose.io/web-content-api) to confirm coverage for a particular source. Our users frequently send us source requests (often including a long list of sources). If you send us a list of sources, we will include them in our coverage and send you confirmation within a few days.

The pricing section says '100 results per request'. Does that mean we get only 100 results?

No. If your query produces more than 100 results, call the URL that appears under the "next" key in the result set to receive the next page, which holds the next set of up to 100 reviews.
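Following the "next" key can be sketched as a small generator. Only the "next" key comes from the answer above. The "reviews" key for the result list, the handling of a relative "next" URL, stopping on an empty page, and the helper names fetch_json and iter_reviews are assumptions made for illustration. The one-second pause reflects the rate limit described elsewhere in this FAQ.

```python
import json
import time
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_reviews(first_url, fetch=fetch_json, items_key="reviews", delay=1.0):
    """Yield every item from every page, following the "next" key.

    items_key is an assumed name for the list of results in each page.
    """
    url = first_url
    while url:
        page = fetch(url)
        items = page.get(items_key) or []
        if not items:
            break  # assumption: an empty page means there is nothing left
        yield from items
        next_url = page.get("next")
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        url = urljoin(url, next_url) if next_url else None
        if url and delay:
            time.sleep(delay)  # stay under one request per second
```

A call such as `for review in iter_reviews(first_url): ...` would then walk the whole result set, where first_url is the initial request URL.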

How can I get all the reviews of an item?

To extract all the reviews of an item, use the "item.url" filter. It returns all the reviews belonging to the item URL provided. Example:

item.url:http\:\/\/domain.com/param=val

(Note that you must escape the http:// part of the URL like so: http\:\/\/.)
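The escaping can be automated. The sketch below assumes that escaping means placing a backslash before the colon and each slash of the scheme separator, which is how reserved characters are escaped in Elasticsearch query string syntax. The helper names escape_scheme and item_url_filter are hypothetical.

```python
def escape_scheme(url):
    # Assumption: only the "://" after the scheme needs escaping,
    # and each of its three characters is prefixed with a backslash.
    scheme, separator, rest = url.partition("://")
    if not separator:
        return url  # no scheme present, nothing to escape
    return scheme + "\\:\\/\\/" + rest


def item_url_filter(url):
    # Build the item.url filter term for the given item URL.
    return "item.url:" + escape_scheme(url)


print(item_url_filter("http://domain.com/param=val"))
# prints: item.url:http\:\/\/domain.com/param=val
```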

Can I get the highlighted fragments that matched my query?

Yes. Just add highlight=true as a parameter to your call.

Does the API support nested Boolean expressions as well?

Yes. Boolean expressions can be nested to as many levels as you want.

For example: (exp1 AND exp2 AND exp3) OR (exp4 AND (exp5 OR exp6)) -(exp7 AND (exp8 OR exp9))
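Nested expressions like the one above are easier to get right when they are composed from small helpers rather than typed by hand. This is a minimal sketch; all_of, any_of and exclude are hypothetical names, and exp1 to exp9 stand in for real query terms.

```python
def all_of(*expressions):
    """Parenthesized AND of the given expressions."""
    return "(" + " AND ".join(expressions) + ")"


def any_of(*expressions):
    """Parenthesized OR of the given expressions."""
    return "(" + " OR ".join(expressions) + ")"


def exclude(expression):
    """Negate an expression with a leading minus sign."""
    return "-" + expression


query = (
    all_of("exp1", "exp2", "exp3")
    + " OR "
    + all_of("exp4", any_of("exp5", "exp6"))
    + " "
    + exclude(all_of("exp7", any_of("exp8", "exp9")))
)
print(query)
# prints: (exp1 AND exp2 AND exp3) OR (exp4 AND (exp5 OR exp6)) -(exp7 AND (exp8 OR exp9))
```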

Do you limit the length of the query, or the maximum number of Boolean clauses I can use?

The maximum length of a query is 4,000 characters.
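A generated query can be checked against this limit before it is sent. The sketch assumes the limit counts the characters of the query before URL encoding, which the answer above does not specify. The name check_query_length is hypothetical.

```python
MAX_QUERY_LENGTH = 4000  # limit stated in this FAQ


def check_query_length(query):
    """Return the query unchanged, or raise if it exceeds the limit."""
    # Assumption: the limit applies to the raw query, not its URL-encoded form.
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(
            f"query is {len(query)} characters; the limit is {MAX_QUERY_LENGTH}"
        )
    return query
```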

Does the API support wildcard expressions as the query?

Yes. The query syntax is Elasticsearch query string syntax, which supports wildcards.

Do you rate limit API calls?

Rate limiting is applied per access token. You can make one request per second. Exceeding the rate limit results in a 429 HTTP error.
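One way to stay within the limit is to space calls at least one second apart on the client and to retry when a 429 still comes back. The sketch below is an illustration under assumptions: Throttle and call_with_retry are hypothetical names, send stands for any function that performs one request and returns a (status, body) pair, and the retry count is arbitrary.

```python
import time


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def call_with_retry(send, throttle, max_attempts=3):
    """Call send() through the throttle, retrying while it reports HTTP 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        throttle.wait()
        status, body = send()
        if status != 429:
            break
    return status, body
```

Sharing one Throttle instance between all calls that use the same token keeps them collectively under one request per second. The clock and sleep arguments exist so the timing logic can be exercised without real delays.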

Can I disable stemming when searching for an exact term?

Yes. Just append a dollar sign ($) to the end of the keyword. For example, searching for the keyword "simplivity" also returns documents containing the word "simple", because we index the stemmed version of each word. If you want to find documents that contain "simplivity" and nothing else, search for "simplivity$".