Big data is an extremely broad domain, typically addressed by a hybrid team of data scientists, software engineers, and statisticians. Finding a single individual knowledgeable in the entire breadth of this domain is therefore rare. More likely, one will be searching for multiple individuals with specific sub-areas of expertise. This guide is therefore divided at a high level into two sections.

This guide highlights questions related to key concepts, paradigms, and technologies in which a big data expert can be expected to have proficiency. Bear in mind, though, that not every “A” candidate will be able to answer them all, nor does answering them all guarantee an “A” candidate. Ultimately, effective interviewing and hiring is as much of an art as it is a science.


Big Data Algorithms, Techniques, and Approaches

When it comes to big data, fundamental knowledge of relevant algorithms, techniques, and approaches is essential. Generally speaking, mastering these areas requires more time and skill than becoming an expert with a specific set of software languages or tools. As such, software engineers who do have expertise in these areas are both hard to find and extremely valuable to your team. The questions that follow can be helpful in gauging such expertise.

Q: Given a stream of data of unknown length, and a requirement to create a sample of a fixed size, how might you perform a simple random sample across the entire dataset? (i.e., given N elements in a data stream, how can you produce a sample of k elements, where N > k, whereby every element has an equal k/N chance of being included in the sample?)

One effective algorithm for addressing this is known as Reservoir Sampling.

The basic procedure is as follows:

  1. Create an array of size k.
  2. Fill the array with the first k elements from the stream.
  3. For each subsequent element E (with zero-based index i) read from the stream, generate a random integer j between 0 and i, inclusive. If j is less than k, replace the jth element in the array with E.

This approach gives each element in the stream the same probability, k/N, of appearing in the output sample.
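As an illustration, a minimal Python sketch of the procedure might look like the following (the function name `reservoir_sample` is an illustrative choice; the answer above describes the algorithm only in prose):

```python
import random

def reservoir_sample(stream, k):
    """Return a simple random sample of k elements from an iterable of unknown length."""
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            # Steps 1-2: fill the array with the first k elements.
            reservoir.append(element)
        else:
            # Step 3: pick a random index j in [0, i]; if it lands inside
            # the reservoir, the new element replaces the one at that index.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = element
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 10)` yields 10 elements, each selected with probability 10/1,000,000, without ever holding the full stream in memory.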

Q: Describe and compare some of the more common algorithms and techniques for cluster analysis.

Cluster analysis is a common unsupervised learning technique used in many fields. It has a huge range of applications both in science and in business. A few examples include:

  • _Bioinformatics_: Organizing genes into clusters by analyzing similarity of gene expression patterns.
  • _Marketing_: Discovering distinct groups of customers and then using this knowledge to structure a campaign that targets the right marketing segments.
  • _Insurance_: Identifying categories of insurance holders that have a high average claim cost.
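A full comparison of clustering algorithms is beyond this excerpt, but as a quick illustration, a minimal sketch of k-means (one of the most widely used such algorithms) might look like the following in Python. The function and parameter names are illustrative, and a real project would more likely reach for a library such as scikit-learn:

```python
import random

def k_means(points, k, iterations=100):
    """Cluster a list of 2-D (x, y) points into k groups: repeatedly assign
    each point to its nearest centroid, then move each centroid to the
    mean of its cluster."""
    # Illustrative sketch only: initialize centroids from k random points.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its assigned points.
        for c, cluster in enumerate(clusters):
            if cluster:  # keep the previous centroid if the cluster is empty
                centroids[c] = (
                    sum(x for x, _ in cluster) / len(cluster),
                    sum(y for _, y in cluster) / len(cluster),
                )
    return centroids, clusters
```

For instance, running `k_means` over a list of (age, annual_spend) pairs would approximate the kind of customer segmentation described in the marketing example above.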
