Performance Engineering
Faster systems = Better business
We performance engineer software systems to accelerate your business
Web and mobile apps
App user experience is tuned for both perceived and real performance. Assets and media are served via CDN and downloaded asynchronously. HTML, JS and CSS are minified, merged and compressed. Data is cached on the client and the server so pages are served fast. An e-commerce client's homepage always rendered in under 2s.
API servers and realtime apps
For user-facing APIs and realtime systems, every aspect is considered, from programming language, libraries, algorithms and data structures to caching, disk and network I/O, to ensure fast response times at scale. A PCI DSS solution for credit cards consistently served API calls in 10-15 milliseconds.
Streaming data pipelines
Data pipelines for streaming and file sources are tuned to keep I/O parallel to processing, so that the read loop never stops. Batching, parallel and async processing, and bulk writes are used to achieve high throughput. We tuned Kafka to consume at 250MBps from a single node for a high-volume log management product.
Databases and warehouses
RDBMS and NoSQL databases become bottlenecks when they are not tuned or are loaded beyond capacity. We tune both the apps and the databases: correct usage, data models, indices and query optimization, plus removal of unnecessary constraints, triggers and slow operations such as "update on unique key error".
Cloud deployments
Cloud services, though similar, operate differently across providers and are tuned with billing and cost optimization as drivers. Serverless and cloud workers, and services for compute, data, messaging, security and observability are configured for maximum performance at optimal cost.
Snowflake
Snowflake is a cloud data warehouse whose unique "storage separate from compute" architecture can lead to escalating costs if not used correctly. We have deep technical expertise with it and tuned one of our client deployments for 10X throughput while reducing the bill by 4X.
“Faster systems keep users happy, are easy on ops, economical and environment friendly”
To make your systems fast, connect with us
Simple & effective engineering
We favour simple architectures: stateless services, peer-to-peer over master/slave, and storage kept separate from compute
Our performance engineering principles
Max CPU usage
CPU is the costliest resource in a datacenter, so we provision it right and then maximize its usage. Data and requests are buffered and tasks are split across cores. Service instances are scaled on demand to match incoming work.
I/O parallel to CPU
The CPU should not wait for data to process. Data is read from and written to disk, network, message queues and databases on separate threads, in parallel with processing. This keeps CPU cycles fully used.
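A minimal sketch of the idea in Python, assuming a single reader thread and a small bounded buffer. The names prefetch and process and the buffer depth are illustrative, not taken from any specific system. The reader keeps fetching the next chunk while the main thread works on the current one; blocking I/O releases the interpreter lock, so the two overlap.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def prefetch(source, depth=4):
    """Yield items from `source` while a background thread reads ahead."""
    buf = queue.Queue(maxsize=depth)  # bounded, so the reader cannot run away

    def reader():
        try:
            for item in source:
                buf.put(item)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)  # always signal the end so the consumer never hangs

    threading.Thread(target=reader, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def process(chunk):
    # Stand-in for the CPU-bound work done on one chunk of data.
    return sum(chunk)


# Stand-in for chunks read from a file, socket or message queue.
chunks = ([i] * 1000 for i in range(10))
totals = [process(c) for c in prefetch(chunks)]
```

The bounded buffer matters: it lets the reader stay a few chunks ahead without letting memory grow when processing is the slower side. This sketch silently ends the stream if the source raises; a fuller version would pass the error through to the consumer.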
Non core work under 10%
An app or server does two kinds of work: the core business transaction, and the non-core work needed to get to it, such as serialization and deserialization (serde), decryption, and queue or file reads. The target is to keep non-core work under 10%.
I/O in bulk and in parallel
Disk and network I/O usually have the lowest throughput in the stack. This is tuned by using multiple disks and making network calls in parallel. Caching is used where possible. We prefer bulk I/O into databases, message brokers and APIs.
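A small sketch of bulk writes issued in parallel, in Python. Here bulk_write is a hypothetical stand-in for one bulk insert or one batched API call, and the batch size of 500 and the 4 workers are arbitrary assumptions to be tuned per target system.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(records, size):
    """Split an iterable into lists of at most `size` items."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch


def bulk_write(batch):
    # Hypothetical stand-in: one round trip that writes the whole batch.
    return len(batch)


records = range(1050)
with ThreadPoolExecutor(max_workers=4) as pool:
    # Each batch is one call; up to 4 calls are in flight at a time.
    written = list(pool.map(bulk_write, batched(records, 500)))
```

With 1050 records this makes 3 calls instead of 1050, and the calls overlap instead of running one after another.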
Avoid json
JSON is a popular data format; unfortunately it is built for human readability and is quite heavy on the CPU. Binary formats such as Protobuf, Avro and Parquet are preferred. Where JSON is unavoidable, payloads are parsed with SIMD for 10X faster parsing.
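To illustrate the size difference only, the sketch below encodes the same record as JSON and as a fixed binary layout using Python's standard struct module. struct is a stand-in here, not one of the formats named above, and the record fields are made up for the example.

```python
import json
import struct

# Hypothetical record used only for this comparison.
reading = {"sensor_id": 42, "timestamp": 1700000000, "value": 21.5}

as_json = json.dumps(reading).encode("utf-8")

# Fixed little-endian layout: 32-bit unsigned id, 64-bit unsigned timestamp,
# 64-bit float value. Field names live in the schema, not in every message.
LAYOUT = struct.Struct("<IQd")
as_binary = LAYOUT.pack(reading["sensor_id"], reading["timestamp"], reading["value"])

# Decoding is a fixed-offset read, with no text scanning or number parsing.
sensor_id, timestamp, value = LAYOUT.unpack(as_binary)
```

The binary form is 20 bytes, while the JSON form repeats every field name as text in each message. Schema-based formats add versioning and tooling on top of the same basic idea.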
Protect the process
A process should always be protected from overload. Keeping heavy and light workloads apart, bounding request queues and checking internal capacity before accepting work all help to distribute work among multiple instances.
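One way to sketch a bounded request queue with a capacity check in Python. BoundedInbox and its capacity of 2 are hypothetical names and values for illustration; the point is that a full instance says no immediately instead of queueing without limit.

```python
import queue


class BoundedInbox:
    """Accept work only while there is spare capacity; reject the rest."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def submit(self, request):
        try:
            self._queue.put_nowait(request)
            return True  # accepted
        except queue.Full:
            return False  # rejected: the caller can route it to another instance

    def take(self):
        # Called by the worker; raises queue.Empty when there is nothing to do.
        return self._queue.get_nowait()


inbox = BoundedInbox(capacity=2)
results = [inbox.submit(n) for n in range(3)]  # third request exceeds capacity
```

A rejected request is cheap to handle upstream, for example by retrying against another instance, whereas an overloaded process slows down every request it has already accepted.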
Performance in numbers
1M RPS on a JVM with 2 CPUs and 1GB RAM
10MBps logs processed per core
4X Snowflake cost reduced
10ms P95 API latency for a PCI DSS store
Outcome focused engineering
1. Defining the problem
The requirements for latency, throughput and cost are established first. Latency and throughput requirements are easily confused, even when one is clearly preferred over the other. For instance, a PnL calculation for 1M entries in 10 seconds is a throughput requirement; it does not necessarily mean a latency requirement of 10 microseconds per record.
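The arithmetic behind the PnL example can be sketched in Python. The worker counts are assumptions chosen for illustration: the throughput target fixes how many records must complete per second, while the time each record may take depends on how many are processed at once.

```python
def required_throughput(records, seconds):
    """Records per second needed to finish the whole job in time."""
    return records / seconds


def per_record_budget_us(records, seconds, workers=1):
    """Microseconds each record may take with `workers` records in flight."""
    return seconds * workers * 1_000_000 / records


throughput = required_throughput(1_000_000, 10)               # 100,000 records/s
serial_budget = per_record_budget_us(1_000_000, 10)           # 10 microseconds
parallel_budget = per_record_budget_us(1_000_000, 10, 100)    # 1,000 microseconds
```

Processed one at a time, each record gets 10 microseconds. With 100 records in flight, each may take a full millisecond and the job still finishes in 10 seconds, which is why the throughput target alone does not set a per-record latency.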
2. A performance audit
The system is audited, covering its architecture, code, and cost and performance in production, to understand what is working well and where the bottlenecks are. Recommendations are identified and an action plan is prepared to tune the system. The audit takes 2-4 weeks.
3. Tuning - tactical and strategic
Tactical fixes are applied first to provide immediate relief, typically in 2-4 weeks. Tuning is done at 4 levels: hardware and software config, non-intrusive code changes, code refactoring and design refactoring, to ensure systems perform and scale for the long term.
“From 15 seconds to 3.5 seconds, 91social tuned our online store in 2 weeks”
- theorganicworld.com
Client stories
E-commerce firm sails through the holiday season taking 3X load with zero dropped orders
A San Francisco-based e-commerce firm that makes high-quality fashion affordable was looking to improve its application readiness and performance for the holiday season. Roadrunnr helped the company optimize Shopify platform spending, improved application responsiveness by 10X, and scaled the platform to take up to 3X load without dropping a single order.