This is part of a multi-part series.
Especially early in product development, small custom-built datasets are the order of the day. Jerry inserts a couple of rows into a database, or loads a video into S3, and you can at least get some testing & prototyping done. In businesses where you work with sensitive data, it’s often difficult to find a “real” dataset, since the data is owned by your customers. It’s tempting to grab a dataset you found while onsite, but then one day you’re bound to forget, demo the data in front of their competitor and land yourself in court. Bad idea.
So you work day in and day out on your module or service with some comfortable demo data, and it’s fast, and your unit tests run in a reasonable period of time. But one day your sales guy returns fuming from his big ACME onsite meeting cursing about how slow the product is. What happened? Yep — ACME data looks different than your demo dataset.
We worked on a project dealing with large volumetric data for seismic processing and interpretation and kept being given a sample seismic dataset from the mid-1990s. It was a few hundred megabytes; the system we were designing was meant to work on multi-terabyte datasets.
You see where this is going. We watched developers do their visualization and processing work on the small dataset for weeks at a time, then when it came time to demo the real dataset, a sense of panic would ensue. Effects that weren’t obvious at 200MB became glaring at 2,000,000MB. Parts of the application that were perfectly fast were suddenly slow. Our design was sound because we’d kept the larger datasets in mind, and the application performed almost magically by release. But development would still have been more efficient if programmers had done more of their day-in day-out work on the larger data. The big dataset wasn’t available early on for several good (non-technical) reasons that aren’t worth elaborating on here.
As difficult as it may be to find a large or appropriately complex dataset, it’s important. For one client, we suggested using a stochastic process to generate a very large dataset with similar characteristics to known customer datasets. In this way they could run automated and ad hoc tests against a dataset that would stress their infrastructure without allowing sensitive customer data to leak into their development organization. For a great example of this, see what Titan did to test their scalability: they built a Twitter clone populated by lots of random robots posting and reposting tweets. (Come to think of it, isn’t that what Twitter is?)
To do this successfully, we work with the client to gauge the relevant characteristics of the data (cardinality, connectedness, size, and so on), then build a program to generate the dataset. To keep the generator useful, we continuously incorporate input from the field as performance results come in. And it can be fun: seed the stochastic generator with the text of Moby Dick or Hamlet. Or Dr. Seuss.
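As a minimal sketch of what such a seeded generator might look like, here is one way to do it in Python: build a word-level Markov chain from a seed text, then draw as many synthetic rows from it as your test needs. The function names, row format, and parameters are all illustrative, not from any particular client project; in practice you would tune the generator to match the cardinality and size profile of real customer data.

```python
import random
from collections import defaultdict

def build_chain(seed_text):
    """Map each word in the seed text to the words that follow it,
    so sampled output shares the seed's vocabulary and local word statistics."""
    words = seed_text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate_rows(chain, n_rows, words_per_row=8, seed=42):
    """Yield n_rows of (id, text) records drawn from the chain.
    A fixed RNG seed keeps the synthetic dataset reproducible across runs."""
    rng = random.Random(seed)
    states = list(chain)
    for i in range(n_rows):
        word = rng.choice(states)
        row = [word]
        for _ in range(words_per_row - 1):
            followers = chain.get(word)
            if not followers:            # dead end: restart from a random state
                word = rng.choice(states)
            else:
                word = rng.choice(followers)
            row.append(word)
        yield (i, " ".join(row))

seed_text = ("Call me Ishmael. Some years ago, never mind how long precisely, "
             "having little or no money in my purse, I thought I would sail about.")
chain = build_chain(seed_text)
rows = list(generate_rows(chain, n_rows=5))
```

Cranking `n_rows` up to the millions (and streaming the rows into your database or object store instead of a list) gives you a dataset as large as you like, with no customer data anywhere in sight.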