Firehose Output

The Firehose output contains the following fields:

Field | Type | Description | Details
type | String | Source type code name | forums for online discussions, message boards, and reviews; mainstream for news outlets; blogs for blogs
forum | String | URL of the section in the site where the thread was created
forum_title | String | Title of the section in the site where the thread was created
discussion_title | String | Title of the thread
language | String | Language in lower case, e.g. "english", "spanish" | The exception is "chineset" for Taiwanese Mandarin.
gmt_offset | Int | Time difference between the publication's time zone (the time zone of the site) and GMT | Webz does its best to automatically detect the time zone of sites. For example, <gmt_offset>-7</gmt_offset> means the site is in GMT-7:00.
topic_url | String | Thread URL
topic_title | String | Main post title
topic_text | String | Main post text | An empty text field is a valid value, since a post can consist of only a title, or its text can consist only of emojis or images.
post_num | Int | Position of the post in the thread | The main post is number 1, the first reply is number 2, and so on.
post_id | String | Post ID | Typically derived by truncating the post URL string to extract the anchor fragment: http://www...#post-ID
post_url | String | Post URL | Typically topic_url followed by #post-ID. To avoid duplication, always check that post_url ends with the post_id above.
post_date | String | Date in YYYYMMDD format
post_time | String | Time in HHMM format | Time zones are not adjusted in the Firehose output; the date and time of each post are shown as they appear on the site.
username | String | Author name
post_title | String | Post title
post | String | Post content | An empty text field is a valid value, since a post can consist of only a title, or its text can consist only of emojis or images.
signature | String | Author's closing message
external_links | String | external_links
country | String | 2-letter ISO country code | To get a full country code list, visit our country codes index.
main_image | String | URL of image
rating | Float | Rating | Discussions that include ratings from 1 through 5
is_lang_certain | Bool | Detected language certainty
seen_before | Bool | False if the post is newly crawled, true if it was seen before on a thread | The seen_before field can help with deduplicating content. If you exclude posts with <seen_before>true</seen_before>, you avoid ingesting the same posts again on your end.
accuracy_confidence | Bool | Data extraction accuracy
domain_rank | Int | Domain rank
site_categories | String | Category
reviewed_url | String | URL of the reviewed entity | Exists only in the reviews feed. In most cases the reviewed_url is the same as the topic_url without the anchor fragment.
entities | String | persons - a list of identified people names found in the post; locations - a list of identified locations found in the post; organizations - a list of identified companies/organizations found in the post | Currently available for news and blog posts in English (excluding the comments)
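
As a rough illustration of how a consumer might use seen_before, post_url, post_date, post_time and gmt_offset together, here is a minimal Python sketch. It is not an official client: the <posts> and <post> wrapper elements, the sample values, and the helper names to_utc and new_posts are assumptions made for this example; only the field names come from the table above. It also assumes gmt_offset is a whole number of hours and, since the offset is auto-detected, the resulting UTC time should be treated as approximate.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from urllib.parse import urldefrag

# Hypothetical sample payload: the wrapper elements and all values are
# invented for illustration; only the field names are from the field table.
SAMPLE = """
<posts>
  <post>
    <post_url>http://example.com/thread/42#post-1001</post_url>
    <post_num>1</post_num>
    <post_date>20240315</post_date>
    <post_time>0930</post_time>
    <gmt_offset>-7</gmt_offset>
    <seen_before>false</seen_before>
  </post>
  <post>
    <post_url>http://example.com/thread/42#post-1002</post_url>
    <post_num>2</post_num>
    <post_date>20240315</post_date>
    <post_time>1015</post_time>
    <gmt_offset>-7</gmt_offset>
    <seen_before>true</seen_before>
  </post>
</posts>
"""


def to_utc(post_date, post_time, gmt_offset):
    """Combine site-local YYYYMMDD + HHMM with gmt_offset (hours) into UTC."""
    local = datetime.strptime(post_date + post_time, "%Y%m%d%H%M")
    site_tz = timezone(timedelta(hours=int(gmt_offset)))
    return local.replace(tzinfo=site_tz).astimezone(timezone.utc)


def new_posts(xml_text):
    """Yield one dict per post whose seen_before flag is not true."""
    for post in ET.fromstring(xml_text.strip()).iter("post"):
        fields = {child.tag: (child.text or "").strip() for child in post}
        if fields.get("seen_before", "").lower() == "true":
            continue  # already delivered earlier, skip to avoid duplicates
        # The anchor fragment is the part of post_url after '#'.
        fields["anchor"] = urldefrag(fields["post_url"]).fragment
        fields["utc"] = to_utc(
            fields["post_date"], fields["post_time"], fields["gmt_offset"]
        )
        yield fields


for item in new_posts(SAMPLE):
    print(item["anchor"], item["utc"].isoformat())
```

Because the Firehose output does not adjust time zones, the conversion above is something you would do on your side if you need comparable timestamps across sites.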