Use the following filters to focus only on the data you need.
Escaping reserved characters
If a character that normally functions as an operator needs to appear literally in your query (and not as an operator), escape it with a leading backslash. For instance, to search for external_links:https://www.linkedin.com/*, you would need to write your query as
external_links:https\:\/\/www.linkedin.com\/*
The reserved characters are: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /
Failing to escape these special characters correctly could lead to a syntax error which prevents your query from running.
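
The escaping rule above can be automated. The following is a minimal sketch, not part of the official API: the helper name `escape_query_value` is hypothetical, and it assumes that `&&` and `||` can be handled by escaping each `&` and `|` individually (as Lucene-style escaping does).

```python
# Reserved characters from the list above. "&&" and "||" are covered by
# escaping every "&" and "|" on its own (an assumption, see the lead-in).
RESERVED_CHARS = set('+-=&|><!(){}[]^"~*?:\\/')


def escape_query_value(value: str) -> str:
    """Prefix every reserved character in `value` with a backslash."""
    return "".join("\\" + ch if ch in RESERVED_CHARS else ch for ch in value)


# Escape the literal part of the URL, then append an unescaped "*"
# so that it still acts as a wildcard operator.
query = "external_links:" + escape_query_value("https://www.linkedin.com/") + "*"
print(query)  # external_links:https\:\/\/www.linkedin.com\/*
```

Only the literal value is passed through the helper; operators you want to keep active (such as the trailing wildcard) are added afterwards.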
Post Content Filters
Parameter | Description | Example |
---|---|---|
language | The language of the post. The default is any. | Find posts in French or Italian: (language:french OR language:italian) See Supported Languages under 'References' |
author | Return posts written by a specific author | Find posts written by Admin: author:Admin |
text | A textual Boolean query describing the keywords that should (or should not) appear in the post text. | text:(apple OR android) |
has_video | A Boolean parameter that searches only for posts that contain a video. Set has_video:false to exclude posts with videos. Without this filter, the existence of videos is ignored. | has_video:true |
external_links | Search for posts that include links to another site. | Search for posts that link to LinkedIn (note that both the slashes and the colon are preceded by a backslash): external_links:https\:\/\/www.linkedin.com\/* |
is_first | A Boolean parameter that restricts the search to the first post of a thread (excluding the comments). | is_first:true |
published | A timestamp (in milliseconds) for filtering posts that were published before or after a certain date/time. A timestamp/date converter can help produce the value. | Return posts published after Thu, 30 Mar 2017 09:16:28 GMT: published:>1490865388000 Relative values are also accepted: published:>now-1h |
num_chars | Search for posts with a minimum or maximum number of characters in the text. Not supported for Japanese, Korean, and Chinese. | Return posts with 1800 or more characters in the text field: num_chars:>=1800 |
sentiment | Filter news articles by their emotional tone, categorized as positive, negative, or neutral. Supported for English, Spanish, French, Italian, Catalan and Portuguese news articles. | sentiment:positive will return only positive news articles |
category | Refine the news feed using a predetermined list of thematic categories. The category values are based on the top IPTC categories: Arts, Culture, and Entertainment; Crime, Law and Justice; Disaster and Accident; Economy, Business and Finance; Education; Environment; Health; Human Interest; Labor; Lifestyle and Leisure; Politics; Religion and Belief; Science and Technology; Social Issue; Sport; War, Conflict and Unrest; Weather. Supported for English, Spanish, French, Italian, Catalan and Portuguese news articles. | category:sport will return only news articles dealing with sports |
webz_reporter | Return online news articles generated with large language model (LLM) technology from factual information extracted from selected news websites. Setting webz_reporter:true limits the search to articles generated using the WebzReporter feature. Note that to access the WebzReporter feature, the webz_reporter GET parameter must also be set to True, as it is False by default. | webz_reporter:true |
ai_allow | If true, returns articles that may be used for any purpose, including LLM training. If false, returns articles that may not be used for LLM training (all other uses are allowed). On average, 98% of Webz's open-web data is allowed for any purpose, while 2% is disallowed only for LLM training. | ai_allow:true |
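
The published filter expects a timestamp in milliseconds. As an alternative to an online converter, the value can be computed locally. This is a minimal sketch; the helper name `to_epoch_millis` is hypothetical and only the standard library is used.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


# Thu, 30 Mar 2017 09:16:28 GMT, the date used in the table above.
after = datetime(2017, 3, 30, 9, 16, 28, tzinfo=timezone.utc)
query = f"published:>{to_epoch_millis(after)}"
print(query)  # published:>1490865388000
```

The same helper applies to any other filter in this document that takes a millisecond timestamp.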
Site Filters
Parameter | Description | Example |
---|---|---|
site_type | The type of sites to search in (the default is any). Available site types are: News, Blogs, Discussions. Without this filter, all site types are included. | Only news: site_type:news News & Blogs: (site_type:news OR site_type:blogs) |
site | Limit the results to a specific site or sites. | Limit the results to posts from Yahoo or CNN: (site:yahoo.com OR site:cnn.com) |
thread.country | The article's country of origin is determined by its domain, subdomain, or site section. This is established through the country indicators in the web address (e.g. *.co.fr) or by analyzing the country that generates the most traffic. | Return posts from sites from Hong Kong: thread.country:HK To get a full country code list, visit CountryCode.org |
site_suffix | Limit the results to a specific site suffix | Return posts from sites where their top level domain (TLD) ends with .fr: site_suffix:fr |
site_full | Filter sites based on the domain and, optionally, the sub-domain | Return posts from Yahoo answers: site_full:answers.yahoo.com |
site_category | Limit the results to posts originating from sites in one or more of the given categories. This filter is also used to filter top news for 59 countries. See List Site Category Values. | Return posts from sites categorized as sports or games related: (site_category:sports OR site_category:games) |
site_section | Get all the posts of a specific site section (note that you must escape the http:// part of the URL like this: http\:\/\/). | site_section:https\:\/\/finance.yahoo.com\/ |
domain_rank | A rank that specifies how popular a domain is | Search for posts from the top 1,000 sites: domain_rank:<1000 |
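
Several of the examples above combine alternatives with OR inside parentheses. The pattern can be sketched as a small helper; the name `any_of` is hypothetical, and the values are assumed to contain no reserved characters that would need escaping.

```python
from typing import Sequence


def any_of(field: str, values: Sequence[str]) -> str:
    """Build "(field:a OR field:b ...)" from a list of values for one field."""
    if not values:
        raise ValueError("at least one value is required")
    clauses = [f"{field}:{value}" for value in values]
    if len(clauses) == 1:
        return clauses[0]  # a single clause needs no parentheses
    return "(" + " OR ".join(clauses) + ")"


sites = any_of("site", ["yahoo.com", "cnn.com"])
types = any_of("site_type", ["news", "blogs"])
print(sites)  # (site:yahoo.com OR site:cnn.com)

# Combine the groups with an explicit AND and a rank filter.
query = " AND ".join([sites, types, "domain_rank:<1000"])
print(query)
```

Wrapping each OR group in parentheses keeps it intact when it is combined with other filters.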
Thread Filters
A thread contains global information about the whole page and its content. A thread can contain multiple posts grouped together.
Parameter | Description | Example |
---|---|---|
thread.title | A textual Boolean query describing the keywords that should (or should not) appear in the thread title. | Search for posts containing the word "glass" and not "metal" in their title: thread.title:glass -thread.title:metal |
thread.section_title | A textual Boolean query describing the keywords that should (or should not) appear in a site’s section where the post was published | Search for the posts containing the word food only under sections with a title that contains the word "restaurants": food AND thread.section_title:restaurants |
thread.url | Get all the posts of a specific thread (note that you must escape the http:// part of the URL like this: http\:\/\/). | thread.url:"https\:\/\/www.rt.com\/news\/487006-lavrov-who-finance-stop-unfair\/" |
thread.published | A timestamp (in milliseconds) for filtering threads that were published before or after a certain date/time. A timestamp/date converter can help produce the value. | Return threads published after Thu, 30 Mar 2017 09:16:28 GMT: thread.published:>1490865388000 |
crawled | A timestamp (in milliseconds) for filtering posts that were crawled before or after a certain date/time. A timestamp/date converter can help produce the value. | Return posts crawled after Thu, 30 Mar 2017 09:16:28 GMT: crawled:>1490865388000 |
image_text | Find threads containing images that include the requested text. Partial values are allowed. | image_text:cola returns threads that contain images with the word "cola" inside the image |
image_label | Find posts containing images that show a certain object. The object is represented by a label such as person, Wedding, or Car. | image_label:person |
Social Filters
Parameter | Description | Example |
---|---|---|
performance_score | A virality score for news and blog posts only. The score ranges from 0 to 10. A score of 0 means that the post didn't do well: it was rarely or never shared. A score of 10 means that the post was "on fire", being shared thousands of times on Facebook. | Search for news or blog posts with a performance score higher than 8 (highly viral): apple performance_score:>8 |
social.facebook.likes | Return posts filtered by the number of Facebook likes. | Return posts with more than 10 Facebook likes: social.facebook.likes:>10 |
social.facebook.shares | Return posts filtered by the number of Facebook shares. | Return posts with more than 10 Facebook shares: social.facebook.shares:>10 |
social.facebook.comments | Return posts filtered by the number of Facebook comments. | Return posts with more than 10 Facebook comments: social.facebook.comments:>10 |
social.vk.shares | Return posts filtered by the number of VK shares. | Return posts with more than 10 VK shares: social.vk.shares:>10 |
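
The numeric filters in this table all follow the same field:operator-value shape. A minimal sketch of a builder is shown below; the helper name `numeric_filter` is hypothetical, and it only accepts the comparison operators that appear in the examples in this document.

```python
# Operators that appear in the examples in this document; support for
# any other operator is not assumed here.
SHOWN_OPERATORS = (">", ">=", "<")


def numeric_filter(field: str, operator: str, value: int) -> str:
    """Build a numeric comparison such as "social.facebook.likes:>10"."""
    if operator not in SHOWN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f"{field}:{operator}{value}"


# A keyword plus a virality threshold, as in the performance_score example.
viral = "apple " + numeric_filter("performance_score", ">", 8)
liked = numeric_filter("social.facebook.likes", ">", 10)
print(viral)  # apple performance_score:>8
print(liked)  # social.facebook.likes:>10
```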
Entities & Entity Sentiment Filters
We extract entities such as Persons, Organizations and Locations from all the English news and blog posts that we crawl. We detect the sentiment attached to Persons and Organizations (not Locations) from the top news outlets.
Parameter | Description | Example |
---|---|---|
person | Filter by person name. You should use this filter only for disambiguation, otherwise you should use a simple keyword search. | person:"barack obama" |
organization | Filter by organization/company name. You should use this filter only for disambiguation, otherwise you should use a simple keyword search. | organization:"apple" |
location | Filter by location name. Important: Do not confuse this with the country filter. If you want to search for sites from a specific country, use the thread.country parameter (explained above). | location:"germany" |
entity.sentiment | Find an entity with a sentiment context attached to it. | person.positive:"obama" organization.negative:"apple" organization.neutral:"google" |
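
The entity sentiment examples above follow the pattern type.sentiment:"name". The sketch below builds such a clause; the helper name `entity_sentiment` is hypothetical. It accepts only persons and organizations, since sentiment is not detected for locations, and it escapes backslashes and double quotes inside the name as an assumption about how quoted values should be protected.

```python
ENTITY_TYPES = ("person", "organization")  # sentiment is not detected for locations
SENTIMENTS = ("positive", "negative", "neutral")


def entity_sentiment(entity_type: str, sentiment: str, name: str) -> str:
    """Build a clause such as person.positive:"obama"."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unsupported entity type: {entity_type!r}")
    if sentiment not in SENTIMENTS:
        raise ValueError(f"unsupported sentiment: {sentiment!r}")
    # Assumption: protect backslashes and quotes inside the quoted name.
    safe_name = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'{entity_type}.{sentiment}:"{safe_name}"'


print(entity_sentiment("person", "positive", "obama"))        # person.positive:"obama"
print(entity_sentiment("organization", "negative", "apple"))  # organization.negative:"apple"
```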
Syndication (grouping) Filters
Syndicated content is content that is classified as similar to other content in a group and is therefore tagged as such.
Each syndicated cluster is tagged with a unique ID and is based on the SimHash algorithm.
Parameter | Description | Example |
---|---|---|
syndication.syndicated | If syndication.syndicated:true, finds all the similar/duplicated content | syndication.syndicated:true |
syndication.syndicate_id | Filter by unique ID of the group (generated by the system) | syndication.syndicate_id:7f3de401c5c89f3ac579918dd20b1f05ba6051b3 |