The new data analytics pipeline at Werkspot

Werkspot is the easiest and most reliable way to arrange home services. With the objective of beating “word of mouth” and becoming the primary channel for hiring service professionals, we at Werkspot have been building a scalable platform that provides a user-friendly experience to customers and service professionals alike.

To achieve this objective, business strategy plays a key role alongside software development in bringing value to the customers and service professionals who want to engage with each other. Data is what helps our management team and product owners devise the best strategies: it provides the feedback loop that allows managers not only to verify the success or failure of a strategy, but also to polish future goals, set expectations and improve strategies for better results. To enable this, it is important not only to present the data consistently, but also to surface the relevant facts to end business users without them having to dig through the entire dataset.

Currently, we use Looker as our BI tool to visualize insights from different sources of data. Important information related to service requests and service professionals, advertising costs and their associated ROI, sales team activities, and financial data is made accessible to end business users through Looker views. The Data Engineering team is responsible for taking the data from the relevant sources and building reports in Looker that cater to the needs of our business users.

One of the advantages Looker provides is that the data engineering team can build analytics in Looker without having to set up an underlying architecture to support the analytics computation. Looker is also a development-driven environment with version control integration, so data engineers can collaborate on analytics projects in a distributed manner, each working on their own Looker branch, which is a working copy of production.

However, there are always two sides to a coin. With all the analytics living in Looker, heavy computation is required. Looker Explores are collections of related facts about a similar object, created by joining various relevant tables to a base table on a join parameter. These joins are expensive and therefore use up a lot of database capacity. Looker also imposes an architectural dependency: business-related facts can only be presented by following its modelling guidelines. The result is higher query times before the relevant information reaches the end business user, and it becomes difficult to apply new changes to the analytics model, even though the needs of business users are expected to change over time as they focus on different KPIs to improve their business processes.

To remove this overhead and achieve higher database efficiency, we are moving the analytics modelling out of Looker into a separate layer. Fig 1 shows the new data analytics pipeline that we at Werkspot are moving towards.


Fig 1 – New data analytics pipeline

The first layer in the pipeline is the EL (Extraction and Loading) layer. Raw data from various sources is extracted and stored in our data warehouse using Stitch, ready to be ingested by the analytics layer. Stitch is an enterprise tool that allows users to extract data from relevant sources and load it directly into their data warehouse without having to build the pipelines themselves. This layer doesn’t allow any queries to be run by end business users; high priority is given to automating the entire process and achieving robustness with fewer developer hours. In the analytics layer, an open-source tool called DBT (data build tool) will be used to transform the raw data into the facts sought by business users. Finally, the transformed data, or facts, will be presented to our end users through Looker. With this new analytics model, we are essentially moving to an ELT-based approach, where data from various sources is first loaded into the data warehouse and transformations are then run on the data warehouse directly. There are various advantages to this model:

  1. The entire analytics layer has been moved out of Looker into an underlying layer, which will result in higher Looker efficiency.
  2. DBT enhances the quality of the data: it allows us to write both SQL and business-related tests to validate the data we provide to our end users.
  3. Fewer scripts to maintain, and higher maintainability, due to the shift from ETL to ELT.
  4. With the analytics in an independent layer, and therefore free of architectural impediments, dimensional modelling will be used to achieve higher predictability, maintainability and scalability of the analytics layer.

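To make the ELT flow and the DBT-style testing concrete, here is a minimal sketch in Python, using an in-memory SQLite database as a stand-in for the data warehouse. In reality, Stitch performs the extraction and loading and DBT manages the transformations and tests as SQL models; all table names, columns and sample rows below are invented for illustration only.

```python
import sqlite3

# Stand-in for the data warehouse; in practice Stitch loads raw source data here.
warehouse = sqlite3.connect(":memory:")

# EL step: raw rows land in the warehouse untransformed.
warehouse.execute(
    "CREATE TABLE raw_proposals (id INTEGER, pro_id INTEGER, amount_cents INTEGER, created_at TEXT)"
)
warehouse.executemany(
    "INSERT INTO raw_proposals VALUES (?, ?, ?, ?)",
    [(1, 10, 25000, "2019-06-01"), (2, 10, 40000, "2019-06-01"), (3, 11, 15000, "2019-06-02")],
)

# T step: the transformation runs inside the warehouse itself (the kind of
# SQL model DBT would manage), rather than in an external ETL script.
warehouse.execute("""
    CREATE TABLE proposals_per_day AS
    SELECT created_at AS day, COUNT(*) AS n_proposals, SUM(amount_cents) AS total_cents
    FROM raw_proposals
    GROUP BY created_at
""")

# DBT-style tests: each query selects offending rows, and a test passes when
# it returns nothing (this mirrors how DBT evaluates tests on a model).
tests = {
    "id_not_null": "SELECT * FROM raw_proposals WHERE id IS NULL",
    "id_unique": "SELECT id FROM raw_proposals GROUP BY id HAVING COUNT(*) > 1",
    # Business-related rule: a proposal must have a positive amount.
    "positive_amount": "SELECT * FROM raw_proposals WHERE amount_cents <= 0",
}
results = {name: len(warehouse.execute(sql).fetchall()) == 0 for name, sql in tests.items()}

print(list(warehouse.execute("SELECT * FROM proposals_per_day ORDER BY day")))
# [('2019-06-01', 2, 65000), ('2019-06-02', 1, 15000)]
print(results)
# {'id_not_null': True, 'id_unique': True, 'positive_amount': True}
```

The key property of ELT shown here is that the transformation is just a query executed where the data already lives, so there is no separate pipeline script to maintain between extraction and loading.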
Data warehouse modelling at Werkspot started with the objective of building a dimensional model. However, because the entire analytics was built into Looker, and because Looker has its own architectural dependency, the current analytics is a complex mix of entity-relationship and dimensional modelling. With the ongoing project described above, we aim to bring the model back to a pure dimensional form. Ralph Kimball developed this modelling concept, which involves segregating the data warehouse into dimensions and facts. Facts are the end KPIs that are of interest to the business; for our platform, one example could be the number of proposals made by service professionals on a particular day. Dimensions are the contexts used to build the facts. Carrying on with the example above, the dimensions associated with that fact could be:

  • Service requests associated with the proposals
  • Service professionals making the proposals
  • The time of the proposal, etc.

Facts and dimensions in dimensional modelling are represented through a star schema, where the fact table sits at the centre and contains foreign keys to the dimension tables. Ralph Kimball and Margy Ross provide a nice visualization of the star schema using the example of retail sales, shown in Fig 2.


Fig 2 – Dimensional model visualization for retail sales
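
As a toy illustration of a star schema, using our proposals example rather than retail sales, the sketch below builds a central fact table surrounded by dimension tables in SQLite and answers a KPI question with a single star-shaped join. All table names, columns and rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Dimension tables: the contexts around the fact.
    CREATE TABLE dim_professional (pro_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT);
    -- Fact table: one row per proposal, holding only foreign keys and measures.
    CREATE TABLE fact_proposal (
        pro_key INTEGER REFERENCES dim_professional(pro_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        amount_cents INTEGER
    );
""")
db.executemany("INSERT INTO dim_professional VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
db.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, "2019-06-01"), (2, "2019-06-02")])
db.executemany("INSERT INTO fact_proposal VALUES (?, ?, ?)",
               [(1, 1, 25000), (2, 1, 40000), (1, 2, 15000)])

# A KPI query joins the fact directly to each needed dimension in a "star" shape.
rows = db.execute("""
    SELECT d.day, COUNT(*) AS n_proposals
    FROM fact_proposal f
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.day ORDER BY d.day
""").fetchall()
print(rows)  # [('2019-06-01', 2), ('2019-06-02', 1)]
```

Because every dimension hangs directly off the fact table, each KPI needs at most one join per dimension it uses, never a chain of views joined to other views.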

There are various advantages to using dimensional modelling for our data warehouse. It brings extra simplicity and makes the data easier for business users to understand: in dimensional modelling, if two facts carry the same name, they have to mean the same thing. This brings more coherence to the data warehouse and a shared understanding of KPIs among all our end business users. Our present analytics in Looker can be rebuilt with fewer joins, which will decrease the computation load and greatly enhance data warehouse performance.

Dimensional modelling is also more convenient to scale up during business expansion. Dimensions and facts are defined at a particular granularity, and ideally that granularity is atomic, meaning you cannot drill down to anything more granular. If scaling up then requires new dimensions and facts, the change can be made without impacting the whole data warehouse. With dimensions and measures separated out, data reusability also increases to a great extent. Currently, with our legacy analytics model in Looker, many views use dimensions and measures that depend on other views, so an Explore ends up performing far more SQL joins than it requires.
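
To illustrate why atomic granularity helps reusability, the toy sketch below keeps one fact row per proposal (hypothetical sample data and field names) and derives two different KPIs from the same rows, with no extra tables or joins:

```python
from collections import Counter

# Atomic-grain fact rows: one row per proposal (hypothetical sample data).
# Because the grain is atomic, any coarser KPI can be derived from these rows.
facts = [
    {"pro": "Alice", "day": "2019-06-01", "amount_cents": 25000},
    {"pro": "Bob",   "day": "2019-06-01", "amount_cents": 40000},
    {"pro": "Alice", "day": "2019-06-02", "amount_cents": 15000},
]

# The same fact rows serve two different KPIs, with no new tables or joins:
proposals_per_day = dict(Counter(f["day"] for f in facts))
proposals_per_pro = dict(Counter(f["pro"] for f in facts))

print(proposals_per_day)  # {'2019-06-01': 2, '2019-06-02': 1}
print(proposals_per_pro)  # {'Alice': 2, 'Bob': 1}
```

Had the facts been stored pre-aggregated per day instead, the per-professional KPI would have required a new table; keeping the grain atomic makes such additions cheap.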

Though the advantages stated above are theoretical for now, they are significant enough to justify the change. A detailed analysis will still need to be carried out once the transition is complete. We are hopeful that the new analytics model will benefit the entire company in terms of data delivery and computational efficiency.

Hoping for the best with fingers crossed!