# Data Support

# Data Ingestion

# Streaming Ingestion

Types of streaming ingestion:

  • Clickstream
  • Change Data Capture
  • Live Video

Clickstream Schema

  • datetime
  • user agent
  • user id (if logged in)
  • request details (get/details/video id)
  • session id (from a cookie; ties clickstream entries together)
  • ip address
  • referrer (google, facebook, etc.)
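A single clickstream entry with the schema above might look like the following sketch; the field names and values are illustrative, not a fixed standard.

```python
from datetime import datetime, timezone

# A hypothetical clickstream event matching the schema above.
event = {
    "datetime": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "user_id": "u-12345",          # None if the user is not logged in
    "request": {"method": "GET", "path": "/details", "video_id": "v-987"},
    "session_id": "c0ffee",        # from a cookie; ties entries together
    "ip": "203.0.113.7",
    "referrer": "google",
}

print(event["request"]["video_id"])
```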

Clickstream tools: with Kafka, clickstream logs are sent to producers, which publish them to topics. Topics are partitioned across brokers.

*Figure: Kafka clickstream flow*
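The "topics are partitioned" idea can be sketched without a real Kafka cluster: partitioning by a key (here, the session id) keeps one session's events on one partition, and hence in order. The hash scheme and partition count below are illustrative, not Kafka's actual partitioner.

```python
import hashlib

NUM_PARTITIONS = 6  # assumed partition count for the clickstream topic

def partition_for(session_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Key-based partitioning: the same session id always maps to the
    same partition, so a session's events stay ordered."""
    digest = hashlib.md5(session_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Events from one session all land on a single partition.
events = [{"session_id": "abc123", "path": "/video/1"},
          {"session_id": "abc123", "path": "/video/2"}]
partitions = {partition_for(e["session_id"]) for e in events}
print(partitions)  # a single partition for the whole session
```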

# Change Data Capture

  • the user database only has the current state; you can't use it to understand why a user left
  • CDC keeps a historical record of every change as a log

*Figure: Kafka CDC flow*
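The point of CDC can be shown in a few lines: keep every change event rather than only the current row, so history can be replayed. The event shape below is illustrative (loosely modeled on CDC change events), not the output of a real connector.

```python
# Change log for one user: create, update, delete.
change_log = [
    {"op": "c", "id": 1, "after": {"plan": "free"}},
    {"op": "u", "id": 1, "after": {"plan": "premium"}},
    {"op": "d", "id": 1, "after": None},  # user left; the log shows what preceded it
]

def replay(events):
    """Rebuild state from the change log; the log itself is the history."""
    state = {}
    for e in events:
        if e["op"] == "d":
            state.pop(e["id"], None)
        else:
            state[e["id"]] = e["after"]
    return state

print(replay(change_log))       # {} -- the current-state view: user is gone
print(replay(change_log[:2]))   # the log still shows the upgrade before churn
```

A current-state database would only ever hold the last line's result; the log preserves the path that led there.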

# Live Video

  • ingesting video content from traffic cameras, security cameras, video streaming services
  • Live Video Broker

Cameras send video to a collector. The collector compresses the video, breaks out frames, and sends them to a broker.

*Figure: live video broker*
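The collector step above can be sketched as: split the incoming byte stream into frames, compress each one, and hand them to a broker. This is a toy model (tiny frames, `zlib` instead of a video codec, a list standing in for the broker), not a real pipeline.

```python
import zlib

FRAME_SIZE = 4  # bytes per "frame" in this toy example

def collect(stream: bytes, frame_size: int = FRAME_SIZE):
    """Break the incoming stream into frames and compress each one,
    as a collector might before handing frames to the broker."""
    frames = [stream[i:i + frame_size] for i in range(0, len(stream), frame_size)]
    return [zlib.compress(f) for f in frames]

def publish(compressed_frames, broker):
    """Stand-in for sending to a broker: here the broker is just a list."""
    broker.extend(compressed_frames)

broker = []
publish(collect(b"\x00\x01\x02\x03\x04\x05\x06\x07"), broker)
print(len(broker))  # 2 frames published
```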

# Batch Ingestion

  • periodic database snapshot
  • useful when onboarding a new database to be ingested
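A database snapshot for batch ingestion is just a full copy taken at a point in time. The sketch below uses in-memory SQLite so it is self-contained; the table and data are made up.

```python
import sqlite3

# Source database being onboarded into the pipeline.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER, plan TEXT)")
source.executemany("INSERT INTO users VALUES (?, ?)", [(1, "free"), (2, "premium")])
source.commit()

# Periodic snapshot: a full copy of the database at this moment.
snapshot = sqlite3.connect(":memory:")
source.backup(snapshot)

rows = snapshot.execute("SELECT * FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'free'), (2, 'premium')]
```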

# Data Storage

# HDFS

*Figure: HDFS architecture*

HDFS has an active and a passive NameNode.

By default, Hadoop has a replication factor of 3 for each block.

Hadoop 3 can use Reed-Solomon (RS) erasure coding to encode files so it can recover lost blocks.
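The practical difference between the two schemes is storage overhead. With 3x replication, every block is stored three times; with the RS(6,3) policy (6 data blocks plus 3 parity blocks, the scheme Hadoop 3's erasure coding is built around), the overhead drops to 1.5x:

```python
# Storage overhead: 3x replication vs. Reed-Solomon RS(6,3) erasure coding.
def replication_overhead(factor: int) -> float:
    """Raw bytes stored per logical byte with N-way replication."""
    return float(factor)

def rs_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte with RS(data, parity) encoding."""
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))  # 3.0
print(rs_overhead(6, 3))        # 1.5 -- half the raw storage cost
```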

# Avro vs Parquet

Avro:

  • row oriented
  • good for queries that need all columns
  • good for heavy write loads
  • JSON-defined schema supports schema evolution

Parquet:

  • column oriented
  • good for heavy read loads where only some columns are needed
  • good for sparse data
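The row-vs-column distinction can be sketched with plain Python structures; the records below are made up. In a row layout each record is stored contiguously (cheap writes, good when every column is needed); in a column layout each column is contiguous, so a query touching one column reads only that column.

```python
records = [
    {"user": "a", "plan": "free",    "watch_min": 10},
    {"user": "b", "plan": "premium", "watch_min": 95},
]

# Row layout (Avro-like): each record stored whole.
row_store = list(records)

# Column layout (Parquet-like): each column stored together.
col_store = {k: [r[k] for r in records] for k in records[0]}

# Aggregating one column never touches `user` or `plan` in the column layout.
print(sum(col_store["watch_min"]))  # 105
```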

# Data processing

Apache Spark and Apache YARN

  1. YARN (Yet Another Resource Negotiator)
  • the scheduler allocates cluster resources
  • the application manager accepts jobs to be run on the cluster
  2. Node Manager (per node)
  • negotiates with the resource manager for resources requested by the Application Master (AM)
  3. Application Master
  • negotiates with the scheduler for containers
  4. Containers
  • an abstraction representing resources (RAM, CPU, disk)
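The negotiation above can be sketched as a toy model: an Application Master asks a scheduler for containers, and the scheduler grants them while capacity lasts. Class and field names are illustrative, not YARN's actual API.

```python
from dataclasses import dataclass

@dataclass
class Container:
    """Abstraction over a slice of cluster resources (RAM, CPU)."""
    ram_mb: int
    vcores: int

class Scheduler:
    def __init__(self, ram_mb: int, vcores: int):
        self.ram_mb, self.vcores = ram_mb, vcores

    def allocate(self, ram_mb: int, vcores: int):
        """Grant a container if the cluster still has capacity."""
        if ram_mb <= self.ram_mb and vcores <= self.vcores:
            self.ram_mb -= ram_mb
            self.vcores -= vcores
            return Container(ram_mb, vcores)
        return None  # the AM must wait or renegotiate

scheduler = Scheduler(ram_mb=8192, vcores=4)
granted = scheduler.allocate(2048, 1)  # the Application Master's request
print(granted, scheduler.ram_mb)
```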

# Data Orchestration

*Figure: Airflow architecture*

DAGs are stored in S3. Workers listen to a Celery queue, which is usually backed by Redis or RabbitMQ.
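At its core, what the orchestrator does with a DAG is run tasks in dependency order. The sketch below uses the stdlib `graphlib` rather than Airflow itself, and the task names are made up; in real Airflow the DAG is a Python file and the Celery workers pick tasks off the queue.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "load": {"extract"},
    "transform": {"load"},
    "report": {"transform"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'load', 'transform', 'report']
```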