Firehose Output

The Firehose output contains the following fields:

Field

Type

Description

Details

type

String

Source type code name

  • **forums** for online discussions, message boards, and reviews
  • **mainstream** for news outlets
  • **blogs** for blogs

forum

String

URL of the section in the site where the thread was created

forum_title

String

Title of the section in the site where the thread was created

discussion_title

String

Title of the thread

language

String

Language name in lower case, e.g. "english", "spanish", etc.

The exception is "chineset" for Taiwanese Mandarin.

gmt_offset

Int

Time difference between the publication's time zone and GMT (the time zone of the site)

Webz does its best to automatically detect the time zone of each site. For example, <gmt_offset>-7</gmt_offset> means the site is in GMT-7:00.

topic_url

String

Thread URL

topic_title

String

Main post title

topic_text

String

Main post text

An empty text field is a valid value: a post may consist of a title only, or its text may contain nothing but emojis or images.

post_num

Int

Numbered order of post in the thread

The main post is number 1, the first reply is number 2, and so on.
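As a minimal sketch of how post_num can be used, the snippet below rebuilds the reading order of a thread from records that arrive out of order. It assumes each record has already been parsed into a Python dict keyed by field name; the sample records and the `order_thread` helper are hypothetical and not part of the Firehose itself.

```python
def order_thread(posts):
    """Return the posts of one thread sorted by post_num (main post first)."""
    # post_num may arrive as text, so convert it before comparing.
    return sorted(posts, key=lambda p: int(p["post_num"]))


# Hypothetical records from a single thread, received out of order.
posts = [
    {"post_num": "3", "post": "second reply"},
    {"post_num": "1", "post": "main post"},
    {"post_num": "2", "post": "first reply"},
]

ordered = order_thread(posts)
main_post, replies = ordered[0], ordered[1:]
```

Converting to `int` matters here because sorting the raw strings would place "10" before "2".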

post_id

String

Post ID

Typically derived from the post URL by extracting the anchor fragment: http://www...#post-ID

post_url

String

Post URL

Typically topic_url followed by #post-ID. To avoid duplicates, always make sure that post_url ends with the post_id described above.
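The relationship between post_url and post_id can be sketched as follows. This is an illustration only: it assumes records are available as Python dicts, and the `dedup_key` helper and the example.com URL are hypothetical. It uses `urldefrag` from the standard library to separate the anchor fragment from the rest of the URL.

```python
from urllib.parse import urldefrag


def dedup_key(record):
    """Build a stable key of the form '<thread URL>#<post_id>'."""
    base, fragment = urldefrag(record["post_url"])
    post_id = record.get("post_id", "")
    if not post_id or fragment == post_id:
        # post_url already ends with the post ID (or no ID is available).
        return record["post_url"]
    # Otherwise append the post ID so each post in a thread gets its own key.
    return f"{base}#{post_id}"


# Hypothetical records for illustration.
with_anchor = {"post_url": "http://example.com/thread/42#post-1001",
               "post_id": "post-1001"}
without_anchor = {"post_url": "http://example.com/thread/42",
                  "post_id": "post-1002"}

key_a = dedup_key(with_anchor)
key_b = dedup_key(without_anchor)
```

Keys built this way can be stored in a set or a unique database column to detect posts that have already been ingested.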

post_date

String

Date in YYYYMMDD format

post_time

String

Time in HHMM format

The Firehose output does not adjust time zones: the date and time of each post are given as they appear on the site.
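Since post_date and post_time are reported in the site's local time, one possible way to normalize them is to combine them with gmt_offset. The sketch below assumes that gmt_offset is a whole number of hours and that it was detected correctly for the site; the `post_datetime` helper and the sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def post_datetime(post_date, post_time, gmt_offset):
    """Combine YYYYMMDD and HHMM strings into a timezone-aware datetime."""
    naive = datetime.strptime(post_date + post_time, "%Y%m%d%H%M")
    # Assumption: gmt_offset is the site's offset from GMT in whole hours.
    site_tz = timezone(timedelta(hours=int(gmt_offset)))
    return naive.replace(tzinfo=site_tz)


# Hypothetical values: a post published at 09:30 on a site in GMT-7:00.
local = post_datetime("20240315", "0930", -7)
utc = local.astimezone(timezone.utc)
```

Because the offset is detected automatically, treat the converted value as a best estimate and keep the original strings alongside it.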

username

String

Author name

post_title

String

Post title

post

String

Post content

Note: an empty text field is a valid value, since a post may consist of a title only, or its text may contain nothing but emojis or images.

signature

String

Author’s closing message

external_links

String

External links found in the post

country

String

Two-letter ISO country code

To get a full country code list, visit our country codes index.

main_image

String

URL of image

rating

Float

Rating

Present for discussions that include a rating on a scale of 1 through 5

is_lang_certain

Bool

Detected language certainty

seen_before

Bool

false if the post is newly crawled, true if it has been seen before on a thread

The "seen before" field can help with deduplicating the content.
If you exclude the posts with <seen_before>true</seen_before> you will avoid ingesting the same posts on your end.
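The filtering described above might look like the following sketch, which uses Python's standard `xml.etree.ElementTree`. The `<results>` and `<item>` wrapper elements are assumptions made for illustration, since the exact envelope of the Firehose output is not described here; only the `<seen_before>` and `<post_id>` field names come from the field list.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: the wrapper element names are assumptions.
SAMPLE = """<results>
  <item><post_id>post-1</post_id><seen_before>false</seen_before></item>
  <item><post_id>post-2</post_id><seen_before>true</seen_before></item>
  <item><post_id>post-3</post_id><seen_before>false</seen_before></item>
</results>"""


def new_posts(xml_text):
    """Return only the records whose seen_before flag is not 'true'."""
    root = ET.fromstring(xml_text)
    fresh = []
    for item in root.findall("item"):
        flag = (item.findtext("seen_before") or "").strip().lower()
        if flag != "true":
            fresh.append(item)
    return fresh


fresh_ids = [item.findtext("post_id") for item in new_posts(SAMPLE)]
```

A record with a missing flag is kept in this sketch, on the assumption that ingesting a possible duplicate is cheaper than dropping a new post.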

accuracy_confidence

Bool

Data extraction accuracy

domain_rank

Int

Domain rank

site_categories

String

Category

reviewed_url

String

Contains the URL of the reviewed entity

Exists only in the reviews feed. In most cases the reviewed_url will be the same as the topic_url without the anchor fragment.

entities

String

persons - a list of names of people identified in the post

locations - a list of identified locations found in the post

organizations - a list of identified companies/organizations found in the post

Currently available for news and blog posts in English (excluding comments).