Here's a list of the fields in the Firehose output:
Field | Type | Description | Details |
---|---|---|---|
type | String | Source type code name | "forums" for online discussions, message boards, and reviews; "mainstream" for news outlets; "blogs" for blogs |
forum | String | URL of the section in the site where the thread was created | |
forum_title | String | Title of the section in the site where the thread was created | |
discussion_title | String | Title of the thread | |
language | String | Language in lower case, e.g. "english", "spanish", etc. | The exception is "chineset" for Taiwanese Mandarin. |
gmt_offset | Int | Time difference between the site's time zone and GMT | Webz does its best to automatically detect the time zone of sites. For example, <gmt_offset>-7</gmt_offset> means the site is in GMT-7:00 |
topic_url | String | Thread URL | |
topic_title | String | Main post title | |
topic_text | String | Main post text | An empty text field is a valid value, since a post may consist of only a title, or its text may contain only emojis or images. |
post_num | Int | Numbered order of post in the thread | The main post is number 1, the first reply is number 2, and so on. |
post_id | String | Post ID | Typically derived from the anchor fragment of the post URL: http://www...#post-ID |
post_url | String | Post URL | Typically, topic_url followed by #post-ID. To avoid duplication, always make sure post_url ends with the post_id above. |
post_date | String | Date in YYYYMMDD format | |
post_time | String | Time in HHMM format | We do not adjust the time zones on the Firehose output and display the date and time of each post as it appears on the site. |
username | String | Author name | |
post_title | String | Post title | |
post | String | Post content | An empty text field is a valid value, since a post may consist of only a title, or its text may contain only emojis or images. |
signature | String | Author’s closing message | |
external_links | String | External links found in the post | |
country | String | Two-letter ISO country code | To get a full country code list, visit our country codes index. |
main_image | String | URL of image | |
rating | Float | Rating | Present for discussions that include a rating from 1 through 5. |
is_lang_certain | Bool | Detected language certainty | |
seen_before | Bool | False if the post is newly-crawled or true if seen before on a thread | The "seen before" field can help with deduplicating the content. If you exclude the posts with <seen_before>true</seen_before> you will avoid ingesting the same posts on your end. |
accuracy_confidence | Bool | Data extraction accuracy | |
domain_rank | Int | Domain rank | |
site_categories | String | Category | |
reviewed_url | String | Contains the URL of the reviewed entity | Exists only in the reviews feed. In most cases the reviewed_url will be the same as the topic_url without the anchor fragment. |
entities | String | persons: a list of people's names identified in the post; locations: a list of locations identified in the post; organizations: a list of companies/organizations identified in the post | Currently available for news and blog posts in English (excluding comments) |
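
The following is a minimal sketch of how the seen_before, post_date, post_time, and gmt_offset fields described in the table might be used together. It assumes each post has already been parsed into a Python dict of string values keyed by field name (the parsing step and the sample records are hypothetical, not part of the Firehose output), and that gmt_offset is a whole number of hours.

```python
from datetime import datetime, timedelta, timezone


def post_datetime(post):
    """Combine post_date (YYYYMMDD) and post_time (HHMM) with gmt_offset.

    The Firehose output keeps the date and time as shown on the site, so the
    site's offset is attached here to get a timezone-aware datetime.
    Assumption: gmt_offset is a whole number of hours.
    """
    naive = datetime.strptime(post["post_date"] + post["post_time"], "%Y%m%d%H%M")
    site_tz = timezone(timedelta(hours=int(post["gmt_offset"])))
    return naive.replace(tzinfo=site_tz)


def new_posts(posts):
    """Keep only the posts that are not flagged as seen before."""
    return [p for p in posts if str(p.get("seen_before", "false")).lower() != "true"]


# Hypothetical sample records, already parsed into dicts of strings.
sample = [
    {"post_id": "post-1", "post_date": "20240115", "post_time": "0930",
     "gmt_offset": "-7", "seen_before": "false"},
    {"post_id": "post-2", "post_date": "20240115", "post_time": "1005",
     "gmt_offset": "-7", "seen_before": "true"},
]

fresh = new_posts(sample)
utc_time = post_datetime(fresh[0]).astimezone(timezone.utc)
print(fresh[0]["post_id"], utc_time.isoformat())
# prints: post-1 2024-01-15T16:30:00+00:00
```

Filtering on seen_before before any further processing keeps previously ingested posts out of the pipeline, and converting to UTC only at the end leaves the site-local time available if it is needed.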