Real-time Cassandra
- 1. Real-time Cassandra. Richard Low, richard@acunu.com, @richardalow
- 2. Outline • What is real-time? • How do databases implement real-time queries? • Why is Cassandra ideal for real-time applications? • Writing real-time applications with Cassandra
- 3. What is real-time?
- 4. “Of or relating to a system in which input data is processed within milliseconds” (dictionary.com) “Occurring immediately” (webopedia) “...the most important requirement of a real-time system is predictability and not performance” (wikipedia) “...a time frame that is very brief, appearing to be immediate.” (wisegeek.com) “Often real-time response times are understood to be in the order of milliseconds and sometimes microseconds” (wikipedia)
- 5. Real-time queries • ‘Give me X’ • ‘How many Y?’ • ‘What is the top K?’ • ‘How many distinct Z from P?’
- 6. Real-time definition • Definition: a query is processed in real-time if the time to get the answer is at most a constant times the transfer time plus the round-trip time: $t_{\text{response}} \le C\,(t_{\text{transfer}} + t_{\text{ping}})$
- 7. Real-time definition • The more you ask for, the longer it takes • For small queries, the request is dominated by round-trip time • No query can take less time than the time to receive it
- 8. Real-time definition • Users on faster networks expect a faster response • What we mean by real-time is getting faster
- 9. Implications • What does this mean for the database? • Use Google Analytics example • Simple query: ‘How many page views have there been from France in the last 24 hours?’
- 10. Requirement • Response is one number • With overhead, say ~1KB • Ping time 1ms • 10Mbit connection => 1KB in ~1ms • 2ms total
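The arithmetic on this slide can be checked with a short sketch. The function names are hypothetical, and the constant C from the real-time definition is assumed to be 1 here purely for illustration.

```python
def transfer_time_s(payload_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Time to push the payload over the link, ignoring protocol overhead."""
    return payload_bytes * 8 / bandwidth_bits_per_s


def real_time_budget_s(payload_bytes: int, bandwidth_bits_per_s: float,
                       ping_s: float, c: float = 1.0) -> float:
    """Upper bound C * (t_transfer + t_ping) on the response time."""
    return c * (transfer_time_s(payload_bytes, bandwidth_bits_per_s) + ping_s)


# Figures from the slide: ~1KB response, 10Mbit link, 1ms ping.
budget = real_time_budget_s(1_000, 10_000_000, 0.001)
print(f"{budget * 1000:.1f} ms")  # prints "1.8 ms", i.e. roughly the 2ms quoted
```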
- 11. Solution 1 • grep '\.fr' /var/log/apache2/*.log • Suppose you have 1M hits an hour => 7GB of logs a day • A single disk would take 70s to read them • Need a beefy server to do this • Needs to grow as your audience grows
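A rough reconstruction of these numbers, as a sketch. The slide does not give a log line size or a disk speed; about 300 bytes per line and about 100 MB/s of sequential read are assumptions chosen because they reproduce the quoted figures.

```python
HITS_PER_HOUR = 1_000_000
BYTES_PER_LOG_LINE = 300             # assumption: not stated on the slide
SEQ_READ_BYTES_PER_S = 100_000_000   # assumption: ~100 MB/s sequential read

log_bytes_per_day = HITS_PER_HOUR * 24 * BYTES_PER_LOG_LINE
scan_time_s = log_bytes_per_day / SEQ_READ_BYTES_PER_S

print(f"{log_bytes_per_day / 1e9:.1f} GB of logs per day")  # 7.2 GB
print(f"{scan_time_s:.0f} s to scan one day on one disk")   # 72 s
```

Both results grow linearly with traffic, which is why this approach needs ever more hardware as the audience grows.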
- 12. Solution 2 • Maintain a counter for each country • Increment the counter on each hit • On query just read the counter • Maybe it is on disk - 5ms seek • No need to scale speed with traffic
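A minimal in-memory sketch of the counter approach. The class and method names are hypothetical, and the "last 24 hours" window is left out to keep the idea visible; a real system would also bucket by time and persist the counts.

```python
from collections import Counter


class CountryHitCounter:
    """Precomputes the answer to 'how many hits from country X?'."""

    def __init__(self) -> None:
        self._hits: Counter = Counter()

    def record_hit(self, country: str) -> None:
        # Work is done at write time: one increment per hit.
        self._hits[country] += 1

    def hits_from(self, country: str) -> int:
        # Query cost is a single lookup, independent of traffic volume.
        return self._hits[country]


counter = CountryHitCounter()
for country in ["fr", "uk", "fr", "de", "fr"]:
    counter.record_hit(country)
print(counter.hits_from("fr"))  # 3
```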
- 13. Implications • Real-time queries can only read about as much data as they send to the requester • Need to precompute answers • Store data in a query-centric rather than data-centric view
- 14. Age of data • A real-time query will often need to query new data • But not necessarily • Could run a batch process to precompute answers
- 15. Solutions
- 16. Solutions • How do you make sure you don’t read any more than you have to? • Denormalisation • Organisation of data • Counters
- 17. Denormalisation • Hard drive performance constraints: • Sequential IO at 100s of MB/s • Seeks at 100 IO/s • Avoid random IO • Effective block size 1MB
- 18. Denormalisation • Store items accessed at similar times near to each other • Involves copying • Copying isn’t bad • Storage costs <$100 per TB
- 19. Organisation of data • If you read 100 items off disk, ensure they are next to each other • Saves reading extra data around them and index lookups
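The "effective block size 1MB" figure and the benefit of keeping items adjacent both follow from the disk numbers quoted in the talk. The sketch below assumes exactly 100 MB/s sequential throughput and 100 seeks per second, with a hypothetical item size of 1KB.

```python
SEQ_READ_BYTES_PER_S = 100e6  # assumed: 100 MB/s sequential IO
SEEKS_PER_S = 100             # assumed: 100 random IOs per second

# Bytes that could have been streamed in the time one seek takes.
EFFECTIVE_BLOCK_BYTES = SEQ_READ_BYTES_PER_S / SEEKS_PER_S  # 1e6 bytes = 1MB


def read_time_s(num_items: int, item_bytes: int, contiguous: bool) -> float:
    """One seek if the items are adjacent on disk, otherwise one seek per item."""
    seeks = 1 if contiguous else num_items
    transfer = num_items * item_bytes / SEQ_READ_BYTES_PER_S
    return seeks / SEEKS_PER_S + transfer


adjacent = read_time_s(100, 1_000, contiguous=True)    # about 0.011 s
scattered = read_time_s(100, 1_000, contiguous=False)  # about 1.001 s
print(f"adjacent: {adjacent:.3f} s, scattered: {scattered:.3f} s")
```

Under these assumptions the scattered layout is roughly 90 times slower, almost all of it seek time.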
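- 20. Fast range queries • Get me all keys in the range E to I • [Figure: sorted keys A, F, H, I, M, X with the range [E, I] highlighted]
- 21. Fast range queries • What happens when you insert? • [Figure: keys G and Q inserted into A, F, H, I, M, X; two layouts compared for the range [E, I]]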
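A small sketch of why sorted storage makes range queries cheap, using the keys from the slides. It uses an in-memory sorted list and Python's `bisect` module as a stand-in for an on-disk sorted structure; `range_query` is a hypothetical helper name.

```python
import bisect


def range_query(sorted_keys: list, low: str, high: str) -> list:
    """Return all keys k with low <= k <= high from an already sorted list."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end]  # one contiguous slice, no scan of the rest


keys = ["A", "F", "H", "I", "M", "X"]
print(range_query(keys, "E", "I"))  # ['F', 'H', 'I']

# Inserts must keep the keys sorted so the result stays contiguous.
for new_key in ["G", "Q"]:
    bisect.insort(keys, new_key)
print(range_query(keys, "E", "I"))  # ['F', 'G', 'H', 'I']
```

If new keys were simply appended instead, the answer to the same range query would be spread across two places and need an extra lookup.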
- 22. Counters • For queries that simply count, increment the counter • Implement inc, dec, get • Store multiple counts e.g. week, day, hour
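One way to read "store multiple counts" is to update one counter per time granularity on every event. The sketch below is an in-memory illustration with hypothetical names and bucket formats, not a Cassandra data model.

```python
from collections import defaultdict
from datetime import datetime


def bucket_keys(ts: datetime) -> list:
    """One (granularity, bucket) pair per granularity the event falls into."""
    iso_year, iso_week, _ = ts.isocalendar()
    return [
        ("hour", ts.strftime("%Y-%m-%dT%H")),
        ("day", ts.strftime("%Y-%m-%d")),
        ("week", f"{iso_year}-W{iso_week:02d}"),
    ]


class BucketedCounter:
    def __init__(self) -> None:
        self._counts = defaultdict(int)

    def inc(self, name: str, ts: datetime, amount: int = 1) -> None:
        # Each event updates every granularity, so any of them can be read directly.
        for granularity, bucket in bucket_keys(ts):
            self._counts[(name, granularity, bucket)] += amount

    def dec(self, name: str, ts: datetime, amount: int = 1) -> None:
        self.inc(name, ts, -amount)

    def get(self, name: str, granularity: str, bucket: str) -> int:
        return self._counts.get((name, granularity, bucket), 0)


views = BucketedCounter()
views.inc("views:fr", datetime(2013, 3, 5, 14, 30))
views.inc("views:fr", datetime(2013, 3, 5, 15, 10))
print(views.get("views:fr", "hour", "2013-03-05T14"))  # 1
print(views.get("views:fr", "day", "2013-03-05"))      # 2
```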
- 23. Cassandra and real-time • Write optimised • Fast merging • Distributed counters
- 24. Write optimised • All writes are sequential on disk • Each write is written multiple times during compactions
- 25. Fast merging • How do you get from this to this? • [Figure: new keys G, Q plus existing sorted keys A, F, H, I, M, X become the single sorted run A, F, G, H, I, M, Q, X]
- 26. Fast merging • Write out new ordered SSTable • When big enough, merge with existing
- 27. [Figure: step-by-step merge of sorted SSTables over the keys A, B, F, G, H, I, K, M, Q, X, Z]
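The merge step can be sketched with a standard k-way merge of sorted runs. This is a simplified illustration using Python's `heapq.merge`, not Cassandra's compaction code; `merge_sstables` is a hypothetical name, and letting the newest run win on duplicate keys is an assumption made for the example.

```python
import heapq


def merge_sstables(*runs_oldest_first):
    """Merge sorted (key, value) runs into one sorted run; newest value wins."""
    streams = [
        # Tag each entry so that, for equal keys, the newest run sorts first.
        [(key, -generation, value) for key, value in run]
        for generation, run in enumerate(runs_oldest_first)
    ]
    merged = []
    for key, _, value in heapq.merge(*streams):
        if merged and merged[-1][0] == key:
            continue  # an entry from a newer run was already emitted
        merged.append((key, value))
    return merged


existing = [(k, "old") for k in ["A", "F", "H", "I", "M", "X"]]
new_run = sorted({"G": "new", "Q": "new", "H": "new"}.items())  # flushed in key order
result = merge_sstables(existing, new_run)
print([k for k, _ in result])  # ['A', 'F', 'G', 'H', 'I', 'M', 'Q', 'X']
```

Every input is read once from start to finish and the output is written once in order, so the merge involves only sequential IO.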
- 28. How fast?
- 29. Distributed counters • Distributed, fault tolerant replicated counters • No need for distributed locks • Super fast
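To illustrate how a replicated counter can avoid distributed locks, here is a sketch of a generic PN-counter, in which each replica only ever updates its own slot and replicas converge by merging. This is a textbook technique shown for intuition only; it is not presented as how Cassandra implements its counters, and all names are hypothetical.

```python
class ReplicaCounter:
    """PN-counter: per-replica increment and decrement totals, merged by max."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.increments = {}  # replica_id -> total increments seen from it
        self.decrements = {}  # replica_id -> total decrements seen from it

    def inc(self, amount: int = 1) -> None:
        # A replica writes only its own slot, so no coordination is needed.
        self.increments[self.replica_id] = self.increments.get(self.replica_id, 0) + amount

    def dec(self, amount: int = 1) -> None:
        self.decrements[self.replica_id] = self.decrements.get(self.replica_id, 0) + amount

    def merge(self, other: "ReplicaCounter") -> None:
        # Per-slot max is commutative and idempotent, so merges can arrive in any order.
        for mine, theirs in ((self.increments, other.increments),
                             (self.decrements, other.decrements)):
            for rid, total in theirs.items():
                mine[rid] = max(mine.get(rid, 0), total)

    def value(self) -> int:
        return sum(self.increments.values()) - sum(self.decrements.values())


a, b = ReplicaCounter("a"), ReplicaCounter("b")
a.inc(3)
b.inc(2)
b.dec(1)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 4 4
```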
- 30. Other requirements
- 31. What else do we need? • Real-time analytics • High value in getting a quick response • High cost if service is down • Need high availability
- 32. What else do we need? • Real-time analytics • High value in getting a quick response • Need low latency • Need data geographically close
- 33. Cassandra and HA • No SPOF • Choose point on consistency and availability curve • Tuneable consistency • Replication • Multi data-centre support
- 34. Cassandra and low latency • Can configure caches • Can parallelise reads • Multi-DC support enables world-wide replication • Can choose lower consistency to avoid round-trips to other DCs
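The consistency and availability trade-off can be made concrete with the usual replica-overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number written exceeds the replication factor. The helper names below are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))  # True: 2 + 2 > 3
print(reads_see_latest_write(rf, 1, 1))                    # False: 1 + 1 <= 3
```

Reading and writing at a single replica gives the lowest latency and tolerates the most failures, at the cost of possibly stale reads; quorum reads and writes trade some of that back for consistency.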
- 35. Writing real-time apps with Cassandra
- 36. Real-time apps • Need to write code using a client library • Design data-model • If queries change, code changes
- 37. Acunu Analytics • Provides simple RESTful interface to Cassandra counters • Push processing into ingest phase • [Figure: event, AA, Cassandra counter updates]
- 38. Acunu Analytics • Event template, e.g., select : ["COUNT", "AVG(loadTime)"], type : { time : [TIME(HOUR; MIN; SEC), ?, 0], page : PATH(/), loadTime : [LONG, 0, 0] } • Specifies “blow-up” strategy according to supported queries • Need to know the basics of the query in advance, but not the whole thing
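The following sketch shows one possible reading of the "blow-up" idea: at ingest, each event updates one aggregate per combination of time bucket and path prefix, so later queries are single lookups. It is an illustration only, not Acunu Analytics code or its API; every function name, the bucket formats, and the storage layout are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from itertools import product

# Hypothetical bucket formats mirroring TIME(HOUR; MIN; SEC) in the template.
TIME_FORMATS = ["%Y-%m-%dT%H", "%Y-%m-%dT%H:%M", "%Y-%m-%dT%H:%M:%S"]


def time_buckets(ts: datetime) -> list:
    return [ts.strftime(fmt) for fmt in TIME_FORMATS]


def path_prefixes(path: str) -> list:
    """'/blog/a' -> ['/', '/blog', '/blog/a'], mirroring PATH(/)."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def ingest(store, event: dict) -> None:
    # All the work happens here, once per event, for every supported query shape.
    for bucket, prefix in product(time_buckets(event["time"]), path_prefixes(event["page"])):
        cell = store[(bucket, prefix)]
        cell["count"] += 1                            # COUNT
        cell["load_time_sum"] += event["loadTime"]    # numerator of AVG(loadTime)


def avg_load_time(store, bucket: str, prefix: str):
    cell = store.get((bucket, prefix))
    return cell["load_time_sum"] / cell["count"] if cell else None


store = defaultdict(lambda: {"count": 0, "load_time_sum": 0})
ingest(store, {"time": datetime(2013, 3, 5, 14, 30, 1), "page": "/blog/a", "loadTime": 100})
ingest(store, {"time": datetime(2013, 3, 5, 14, 45, 9), "page": "/blog/b", "loadTime": 300})
print(avg_load_time(store, "2013-03-05T14", "/blog"))  # 200.0
```

The write amplification is the product of the number of time buckets and path prefixes per event, which is the price paid for constant-time reads.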
- 39. Features • Simple, real-time, incremental analytics • Work done on ingest • sum, count, distinct, avg, stddev, min-max etc. • Time + hierarchy bucketing • Efficient ‘group’ semantics • Works with Apache Cassandra
- 40. Summary • Formalised what real-time means • Deduced how data must be stored • Explored how Cassandra has these properties • Discussed how Acunu Analytics helps when writing real-time apps