Setting Up Analytics for Your Tech Startup to Scale

Note: I wrote this article back when I was heading up growth at 500px. It’s been years since I left 500px but the material is still relevant so enjoy!

I’m a huge fan of Brian Balfour (ex-VP of Growth at Hubspot), who maintains that growth is about process first and tactics second. Tactics change and tactics fail. But if you have a process, a “machine” that lets you test and evaluate tactics to reach specific goals, then you’re paving the way to success, and ensuring that the team learns. What many neglect to realize is that this approach of focusing on process applies entirely to building out analytics as well (and by extension, company growth).

I recently joined 500px as head of growth & analytics. As I was considering joining the company, I came across this article, written by Samson Hu, that offers an in-depth look at both the technical and business challenges of building out analytics at 500px, from infrastructure and ETL considerations, to metrics, reporting, and data accuracy problems. I saw the article as a great preview of what I’d be working with if I joined.

In March I did join, and since then a number of people have approached me to say they’ve found the article to be a great resource for their startups. But at 500px we’ve moved from a basic setup to using analytics to scale the company, and I think the work we’ve done is worth sharing so that you can follow along with the evolution. When it comes to scale, there are entirely new considerations and challenges.

The Problem

The problem is well characterized by Summer Hu, data scientist extraordinaire at 500px. When your company is small you have this:

[Diagram: a two-person team exchanging information directly]

Information is exchanged easily and freely because there are only a few people involved. As new initiatives launch, whether product features or email programs, the results are typically analyzed by the one or two people directly involved, decisions are made based on results, and that’s that. The feedback loop that is initiative > measurement > analysis > optimization is easy to manage at this point.

Then, as your company grows, initiatives start becoming cross-functional. Instead of just one or two people managing whole projects from beginning to end, you get a team of 3–5:

[Diagram: a cross-functional team of 3–5 people]

At this point, one person may start handling tracking and analysis more exclusively across initiatives — this person probably knows more about conversion tagging, analytics tools, or cohort analysis than others, and starts becoming the analytics subject matter expert. Still, coordination of tracking and insights communication is relatively easy due to the small group size.

But eventually you hit this (if things go well), and this is what 500px is facing:

[Diagram: many people spread across multiple teams]

At this point, the company has grown to a stage where analytics management has vastly increased in complexity. Initiatives now happen without every team member knowing about all the details. Analytics knowledge will also vary across the company; while some are running in-depth statistical analyses on logs, others don’t know what a pageview is. Whoever has subject matter expertise probably can’t support the entire company’s needs because it’s not their main role.

As a result, it simply isn’t enough to be doing analytics piecemeal anymore — somebody needs to set a strategic direction and make sure data engineering happens to serve the company’s needs. A lack of a good analytics program can leave you with a whole slew of problems, including:

  • Projects launching without tracking in place to measure success.
  • People trying to do their own analysis but spending most of their time deciphering what the data represents, which values map to which indicators, QAing warehouse tables and conversion pixels, etc.
  • KPI charts with peaks and valleys and no known causes.
  • Hours spent reconciling reports produced by different teams (e.g. one team has Daily Active Users going up, the other has it going down).
  • Panic when data isn’t available, because systems now rely on the data pipelines you’ve built, but reliability was never built in.
  • General knowledge gaps as the team expands (roles change, interns move in and out, and suddenly nobody knows what the event tags in Google Analytics mean anymore).

Some of these problems may sound mundane, but trust me, they are more than just annoyances when you want things to scale. If you’re trying to build your company’s growth machine, you can’t experiment properly if you have all these issues getting in your way.

A lot of startups spend a considerable amount of energy on picking the right analytics tools and technology, but even the best technology won’t help unless you lay the groundwork for success.

So What Is The Solution?

The Framework

There are probably many ways to tackle this, but here I’ll delve into how we’ve approached it at 500px, which roughly follows this framework:

[Diagram: the three-stage analytics framework]

The basics were already set up before I arrived — KPIs, reports, and data infrastructure. The next step is to scale that out, which requires additional investments in the team and tools, leveraging data modeling for deeper insights, and most importantly, implementing standards and process to deal with the problems I mentioned above. Once you have a program that can scale, you can then focus on the gold — leveraging data in your products, making automated decisions (spend allocation in marketing campaigns or personalizing your homepage, for example), and setting up the infrastructure so that data is delivered reliably.

Considerations

Of course, how the framework applies to you can depend on many factors. If data isn’t a strategic priority for your company, or if you don’t want to monetize your data, you may not need to take things further than Stage 1 or Stage 2.

Also, what are the team’s composition and capabilities? Do you have a lot of data scientists, engineers, or analysts who want access to raw(ish) data and who have the capabilities and desire to do data wrangling / modeling? If so, consider decentralizing more of your data and analytics work. If not, you should probably centralize more, so that a few key experts can create dashboards that serve the rest of the company.

At 500px we’ve taken a hybrid model. The analytics team handles the heavier, analytics-specific tasks: modeling, tracking, data product management, and data engineering, but always with the end goal of empowering others in the company to do their own analysis.

The key is having clear ownership of analytics within the company. When you’ve hit 500px’s size the time is probably right to have some dedicated resources for owning the analytics program, especially if data is a strategic priority for the company.

Scaling @500px

With that background, here is more detail on what we’ve done at 500px to solve our analytics scaling problems so far (moving from Stage 1 to Stage 2 in the framework above):

[Diagram: the five-part analytics scaling process]

  1. Project Launch Process: When new projects launch, we ensure that analytics is kept top of mind by following a project management process that was designed by our PMO, Aneta Afflick. In our product and tech plan documents, there are analytics sections that require sign-off, forcing product managers and engineers alike to set goals and ensure tracking for new features.
  2. Designated Owners: Setting clear owners for analytics is key to scaling and growth. We have designated owners who set the analytics vision and roadmap, and these folks are held accountable for enabling the rest of the company to make data-informed decisions. As you hit Stage 3, treat your data infrastructure as a product in its own right, and assign product management and dev resources accordingly.
  3. SLAs: If the team is to depend on the data infrastructure and build solutions around the pipelines, then the data has to be reliably delivered. Our analytics team has committed to putting increasing amounts of our data under SLAs for this reason, so that data up-time becomes just as important as our platform up-time (a sketch of the kind of freshness check this implies follows this list). This way, not only do internal users know that they can monitor our programs on a given schedule, but they can also build external products on the data pipelines to monetize our information.
  4. Onboarding: New hires at 500px go through three days of onboarding to get up to speed, during which we spend 30 minutes on web analytics 101 and show these newbies what tools and resources they have at their disposal. We emphasize how every team member is crucial to the success of our analytics program — not only do we need everybody contributing to our processes and documents, but we expect new members to adopt our data-driven culture. More on those documents next.
  5. Documentation: Last but not least, the analytics team maintains a set of master documents, now living on our internal Wiki.
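Before getting into those documents, here’s what point 3 might look like in practice. This is a minimal sketch of an SLA freshness check, assuming a hypothetical daily_active_users table on a daily load cadence (the table name, window, and alerting are invented for illustration, not our actual tooling):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA: the daily_active_users table must have been
# refreshed within the last 26 hours (a 24h load cadence plus 2h grace).
SLA_WINDOW = timedelta(hours=26)

def is_fresh(latest_load_time: datetime) -> bool:
    """True if the table's most recent load falls within the SLA window."""
    return datetime.now(timezone.utc) - latest_load_time <= SLA_WINDOW

# In practice latest_load_time would come from warehouse metadata
# (e.g. SELECT MAX(loaded_at) FROM daily_active_users), and a failed
# check would alert the analytics team rather than print.
latest_load = datetime(2016, 2, 28, 6, 0, tzinfo=timezone.utc)  # made-up value
if not is_fresh(latest_load):
    print("SLA breach: daily_active_users is stale")
```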

5a) Standards and Definitions. This document gives detailed descriptions of commonly used metrics / KPIs, and for each metric tells employees:

  • What data sources are involved in calculating the metric
  • How the metric is actually calculated, with links to the ETL code in GitHub
  • How to query the metric in the data warehouse, with example queries
  • Any limitations or gotchas associated with the data

Here’s an example of what it might look like (a spin on what the Engaged Users metric might look like at 500px):

[Example: a Standards and Definitions entry for an Engaged Users metric]
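And the “how to query” portion of such an entry might include a sample warehouse query like this (a minimal sketch; the table and column names, and the 3+ actions threshold, are hypothetical):

```python
# A hypothetical "how to query" snippet from a Standards and Definitions
# entry. The table and column names are invented; a real entry would
# reference the actual warehouse schema and link to the ETL code behind it.
ENGAGED_USERS_QUERY = """
SELECT
    activity_date,
    COUNT(DISTINCT user_id) AS engaged_users
FROM user_daily_activity
WHERE action_count >= 3  -- "engaged" = 3+ qualifying actions that day
  AND activity_date BETWEEN %(start_date)s AND %(end_date)s
GROUP BY activity_date
ORDER BY activity_date;
"""
```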

Putting the document together requires some effort, as there are usually varying opinions on how things should be measured, but doing the work up front is well worth it! Less time is wasted on reconciling reports that use different definitions, and on re-inventing queries that have already been written. The work is never done, since definitions will evolve as the company changes, but at least with this document there is no confusion about what metrics mean and how measurements are actually made.

5b) Event Calendar. This document is a simple spreadsheet that logs any key event that impacted our metrics. This includes product launches, marketing campaign launches, data anomalies, fraud, system outages, sales events, external events, and changes in tracking. For each of these events, we log the date, a quick description, and then the important part — the impact and links to analysis details.

Here is an example of what it might look like:

[Example: an event calendar spreadsheet]

In this example, anybody analyzing metrics that include Feb 29 will quickly realize that increases may be due to some fraudulent users that hit the app that day. When seeing a spike that day, no one needs to waste any more time wondering whether the data is correct, or trying to find out what happened, and can focus on their original analysis.

Conversely, any business or financial analyst looking to assess how our programs have fared can get that information at a glance, and can use it to model future company projections and targets.
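To illustrate, here’s a minimal sketch of how an analyst might put the calendar to work programmatically, assuming a hypothetical CSV export of the spreadsheet (file name, columns, and metric values are all invented):

```python
import csv
from collections import defaultdict

# Load the event calendar (a hypothetical CSV export of the spreadsheet,
# with columns: date, description, impact, analysis_link).
events_by_date = defaultdict(list)
with open("event_calendar.csv") as f:
    for row in csv.DictReader(f):
        events_by_date[row["date"]].append(row["description"])

# Annotate a daily KPI series so anomalies come pre-explained.
daily_active_users = {"2016-02-28": 41200, "2016-02-29": 58900}  # made-up values
for date in sorted(daily_active_users):
    notes = "; ".join(events_by_date.get(date, [])) or "no known events"
    print(f"{date}: {daily_active_users[date]:,} DAU ({notes})")
```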

5c) Warehouse / Database Schema Docs. We have one for each datastore that our internal users query. These documents are exactly what you’d expect — we log table descriptions, column descriptions, data types, etc. The point, though, is to log things that the user may not already know, to speed up their analysis work and to prevent people from digging into the same issues over and over again, especially if analysis work is going to be distributed across the wider team. One key thing to map out is enumerated values: if 1 means Active and 2 means Canceled for user subscription status, then make sure to log those! There isn’t a place I’ve worked where this hasn’t been a problem; the person setting the mapping hardly ever thinks about how others will know what the values mean.

Also key: notes on when the data in particular tables became available, whether there are any gaps in the data, and example data values.
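As a sketch of the subscription-status example above (the statuses and integer codes are just the ones from that example; yours will differ), the mapping can also be mirrored in code so analysis scripts never deal in bare integers:

```python
from enum import IntEnum

class SubscriptionStatus(IntEnum):
    """Mirror of the subscription_status column's enumerated values.

    Kept in sync with the schema doc so nobody has to reverse-engineer
    what the raw integers in the warehouse mean.
    """
    ACTIVE = 1
    CANCELED = 2

raw_value = 2  # as it might come back from a warehouse query
print(SubscriptionStatus(raw_value).name)  # -> CANCELED
```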

5d) Data Flow Diagrams. The last documents that complete the picture are data flow diagrams that show data sources (Google Analytics, log servers, CRMs, etc), data destinations (warehouses, visualization tools, etc), and how the data gets from sources to destinations for reporting. Why is this important? Once users see how data moves, they will also understand what the limitations are. Maybe data is refreshed only once a day, and because of the different sources, timestamps are in different time zones. Again, the point is to give the user information up front that impacts how they do analysis, so that less time is wasted on data manipulation and more time is spent on interpretation.

5e) Google Analytics UTM Mapping. This is the classic document that most marketing and product teams create to make sure their Google Analytics tags are consistent and readable. There are plenty of link tagging tools out there (like this one), so I won’t go into too much detail here, but at 500px we have tried to extend this concept to any other tagging tool we use (e.g. Mixpanel, Segment, Amplitude, Snowplow, or whatever tool you use).
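To illustrate the consistency point, here’s a minimal sketch of a link builder that enforces a mapping document’s conventions (the allowed values are invented examples, and the URL is just for demonstration):

```python
from urllib.parse import urlencode

# The allowed values would come straight from the UTM mapping document,
# so nobody invents "Newsletter" vs "newsletter" vs "email-news" ad hoc.
ALLOWED_SOURCES = {"newsletter", "facebook", "twitter"}
ALLOWED_MEDIUMS = {"email", "social", "cpc"}

def build_campaign_url(base_url: str, source: str, medium: str, campaign: str) -> str:
    """Build a tagged URL, rejecting values not in the mapping document."""
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown utm_source: {source!r}")
    if medium not in ALLOWED_MEDIUMS:
        raise ValueError(f"unknown utm_medium: {medium!r}")
    params = urlencode(
        {"utm_source": source, "utm_medium": medium, "utm_campaign": campaign}
    )
    return f"{base_url}?{params}"

# build_campaign_url("https://500px.com/editors", "newsletter", "email", "spring_launch")
# -> https://500px.com/editors?utm_source=newsletter&utm_medium=email&utm_campaign=spring_launch
```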

The Framework Revisited

The thing is, keeping up with the above items requires… process! Process to:

  1. Keep the documents up to date.
  2. Make sure KPIs are appropriately set for new projects.
  3. Communicate insights. Whenever a team member has a new learning, make sure there’s a process for communicating it out so that others can leverage that insight. At 500px, informal learnings are communicated through Slack. More formal ones are communicated through brown bags and scheduled presentations. Key projects are reviewed at every exec staff meeting to track how they’re progressing against KPIs.

ALL this to get to the theoretical ideal of “centralized decentralization”, where our analytics program is as efficient and effective as possible, through streamlined communication and information access:

[Diagram: streamlined communication and information access across the whole team]

Measuring Success

The question, of course, is whether there is real impact to the company as a result. That is really a question of how you measure your analytics team’s success — a topic that I will save for another post.

How do you do analytics?

I would love to hear from you on what your challenges have been and how you’ve approached the solutions.

Thanks to Summer Hu for her diagrams above, and to Kevin Martin and Colin Sloss for their data engineering prowess!
