Building successful Business Intelligence solutions is a well-documented process, with many successful (and unsuccessful) projects to learn from. The traditional BI/DW model has always been challenging, but many good practices and patterns have emerged over the years that BI professionals can leverage.
A net-new BI solution, or migration of an existing on-prem BI solution into the cloud, creates a different set of challenges to be addressed. I wanted to come up with a top-5 list of considerations that may help you plan your cloud BI project. I've been focused on building analytics, BI and Big Data solutions in Azure for the past 2 years, so I'm going to share a few of my findings here.
1. Loading data into the Cloud
In my experience, this is where you will spend the bulk of your time. Getting access to data sources and loading large initial data sets into the cloud, not to mention building out a cloud-based ETL infrastructure, are challenging. You will have to address connectivity issues for hybrid scenarios, push large amounts of data into the cloud and devise an archival strategy with cloud storage, which is quite different from a classic on-prem SAN appliance approach. If you are building a SQL Server-based BI solution in Azure IaaS VMs, use tools like AzCopy or Azure Data Factory to load your initial data sets into Azure Storage. After the initial data load, delta data loads will be a much-preferred approach, and you can use tools like Azure Data Factory, SSIS, Attunity or Informatica for that purpose. When archiving data and storing backups, use less expensive standard Azure storage and consider DRaaS tools like Azure Site Recovery (ASR). You will want to use Premium disks for running VM workloads in Azure, but you can use Standard storage for backups, which you can also geo-replicate for data protection. If you are using shared, managed public services in Azure like SQL DB, SQL DW, Azure Analysis Services and Power BI, you will find native, built-in cloud adapters and capabilities that allow you to connect to data in Azure Storage, Azure Data Lake, HDInsight and other cloud-native sources. But in most cases, unless you are working on a net-new application for a new business, you will need to deploy a hybrid architecture that also brings in data from on-prem data sources. In that case, you will need to install local data gateways, which act as proxies for data movement and for Power BI.
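The delta-load pattern mentioned above can be sketched in a few lines of Python; the row structure and `modified` timestamp column are illustrative stand-ins for whatever change tracking your source system actually exposes:

```python
from datetime import datetime

def delta_rows(rows, last_watermark):
    """Return only the rows modified since the stored watermark,
    plus the new watermark to persist for the next run."""
    changed = [r for r in rows if r["modified"] > last_watermark]
    new_watermark = max((r["modified"] for r in changed), default=last_watermark)
    return changed, new_watermark

# Example: only the second row is newer than the last load.
rows = [
    {"id": 1, "modified": datetime(2017, 5, 1)},
    {"id": 2, "modified": datetime(2017, 5, 3)},
]
changed, wm = delta_rows(rows, datetime(2017, 5, 2))
```

In practice, the persisted watermark would live in a control table that your ADF or SSIS pipeline reads at the start of each run and updates at the end.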
This is a complex and lengthy topic that requires different tools for different architectural approaches in Azure. I review these options in more detail in this story here.
2. Connectivity
You will want to understand your users' latency, response-time and geo-location requirements before planning your cloud BI deployment. When connecting to a public cloud, be aware that general cloud services are reached via public IP addresses, which in many cases require firewall whitelisting for controlled access. You can also configure virtual networks and VPNs to connect your corporate network to many, but not all, of the services in Azure. And if you require low-latency, high-availability connectivity to Azure, consider purchasing an ExpressRoute circuit for direct connectivity. Just be sure to set aside, early in your project cycle, the time and tasks necessary to establish connectivity requirements.
Public IPs, firewalls, load balancers, VNets and VPNs will all still need to be configured even though you are leveraging a shared cloud platform. If you are going to use Power BI, which is an Office 365 public cloud service, there is a new offering being released called Power BI Premium that can provide dedicated capacity for larger Power BI implementations. Take a look at this whitepaper for more on PBI Premium.
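As a toy illustration of the firewall whitelisting mentioned above, the check a rule performs can be sketched with Python's standard `ipaddress` module; the addresses and CIDR ranges below are made-up examples, not real corporate ranges:

```python
import ipaddress

def is_whitelisted(client_ip, allowed_ranges):
    """Return True if the client address falls inside any allowed CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_ranges)

# Hypothetical corporate ranges you might whitelist on a cloud firewall rule.
allowed = ["203.0.113.0/24", "198.51.100.0/24"]
print(is_whitelisted("203.0.113.42", allowed))   # inside the first range
print(is_whitelisted("192.0.2.1", allowed))      # outside both ranges
```

The real work, of course, is operational: cataloging which corporate egress IPs need access to which Azure services, and keeping those rules current.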
3. Data Governance
Data governance is an important aspect of any analytics project. When starting a BI solution in the cloud, you need to be even more concerned with data lineage, chain of custody and quality. That's because you are very likely to bring together disparate sources in cloud BI, perhaps more so than with an on-prem BI solution. In the cloud, you tend to spend an inordinate amount of time on data lineage and chain of custody due to regulatory requirements, hybrid environments and the movement of data around different cloud provider data centers and regions. In Azure, for example, you may use Azure Data Lake Store for raw data in the East US region, but have your Office 365 data in the Central US region and SQL Server running in VMs or Azure SQL DBs in the West US region. In that case, your data lineage becomes tricky: you may have business or technical requirements around data traceability and auditing, but your data is moving between data centers in different geographic regions.
Keep this in mind, along with the egress costs you pay when moving data out of Azure regions. Consider a tool like Azure Data Catalog as a way to catalog, index and discover enterprise data to help with data governance.
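As a back-of-the-envelope illustration of those egress costs, here is a small sketch; the free-tier allowance and per-GB rate are placeholders, not actual Azure prices, so check the pricing page for your region:

```python
def egress_cost(gb_per_month, rate_per_gb, free_gb=5):
    """Estimate monthly egress charges: the first few GB are free,
    the remainder is billed per GB. All rates here are hypothetical."""
    billable = max(gb_per_month - free_gb, 0)
    return billable * rate_per_gb

# Moving 500 GB/month between regions at a made-up $0.087/GB:
print(round(egress_cost(500, 0.087), 2))
```

Even rough numbers like these are worth putting in front of your architecture team before you spread raw, cooked and reporting data across multiple regions.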
4. Operations and Maintenance
Cloud tools for monitoring your new cloud BI solution will differ in many cases from your traditional on-prem monitoring tools. But any good DBA and operations team will have a set of tools for monitoring and management that includes baselining performance metrics, tracking changes and the standard deviation of those changes over time, remote patching and VM/server maintenance. If you leverage PaaS services in Azure, you can take advantage of fully or partially managed services that eliminate parts of the operations and maintenance lifecycle, but you will still want to monitor for alerts and performance. These cloud-first data services in Azure include SQL DB, SQL DW, HDInsight, Power BI, ADF and others. It is critical to monitor and set alerts, whether with a cloud-first app like OMS (see below), a VM-based out-of-band monitoring tool or an on-prem tool, in order to successfully operationalize your cloud BI solution.
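The baseline-plus-standard-deviation check described above can be sketched with Python's standard `statistics` module; the query-duration samples and the 3-sigma threshold are illustrative, not a recommendation:

```python
import statistics

def breaches_baseline(history, latest, num_stddevs=3):
    """Flag a metric reading that deviates from the historical baseline
    by more than num_stddevs standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(latest - mean) > num_stddevs * stdev

# Hypothetical query-duration samples (ms) and new readings to check.
history = [120, 125, 118, 122, 130, 121, 119]
print(breaches_baseline(history, 400))  # spike well outside the baseline
print(breaches_baseline(history, 124))  # within normal range
```

A monitoring tool would run a check like this per metric and raise an alert on the breach, rather than printing a boolean.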
5. Cost Management
The cost model of a cloud BI solution will likely vary dramatically, and be much more flexible, compared with what you are used to in a traditional on-prem BI solution. With cloud costs, you need to think of your solution as a consumer of utility services: like using energy from a power company, or phone service from a phone carrier. In Azure, you will need to monitor how many hours you are consuming and how much data you are storing. This model allows you to quickly and inexpensively spin up and tear down environments for development and staging in the cloud. But you need to shift your budget planning and strategy from the traditional up-front perpetual license, annualized maintenance cost model to a model that requires you to monitor usage regularly. Many Azure services, including VMs and SQL DW, allow you to pause your service so that you only pay for compute when you are actually performing work on your data.
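The pay-for-what-you-run model comes down to simple arithmetic; the $1.25/hour rate below is hypothetical, not an actual Azure price:

```python
def monthly_compute_cost(rate_per_hour, hours_per_day, days):
    """Pay only for the hours the service runs; a paused service accrues
    no compute charges (storage is billed separately)."""
    return rate_per_hour * hours_per_day * days

# A hypothetical $1.25/hour warehouse: running 24x7 versus paused
# outside a 10-hour business day, over a 30-day month.
always_on = monthly_compute_cost(1.25, 24, 30)  # 900.0
paused    = monthly_compute_cost(1.25, 10, 30)  # 375.0
```

Pausing outside business hours cuts the compute bill by more than half in this toy example, which is exactly the kind of saving that has no equivalent in a perpetual-license model.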
Cloud databases like Azure SQL DB and Cosmos DB allow you to adjust capacity throughout the day to maximize your spend based upon activities. For example, increase capacity during large nightly data loads and early-morning report execution, then lower your capacity, and thereby control your spend, on weekends or after business hours. If you are cooking Big Data for ELT in a data lake, consider taking advantage of Azure Data Lake Analytics, where you only pay for compute based on the parallelism of jobs while they execute against your data in the lake. ADLA offers pay-as-you-go or discounted commitment packages, a common billing mechanism that you will find across cloud and Azure solutions.
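The scale-up/scale-down rhythm above can be sketched as a simple schedule; the tier names (loosely borrowed from SQL DB's S/P tiers) and the hour windows are hypothetical choices, not recommendations:

```python
def tier_for_hour(hour, weekday=True):
    """Pick a hypothetical capacity tier by time of day: scale up for
    nightly ETL loads and the morning reporting rush, scale down on
    weekends and outside business hours."""
    if not weekday:
        return "S1"   # minimal weekend capacity
    if 1 <= hour < 5:
        return "P2"   # nightly data loads
    if 7 <= hour < 10:
        return "P1"   # early-morning report execution
    if 10 <= hour < 19:
        return "S3"   # normal business hours
    return "S1"       # overnight / evening

print(tier_for_hour(3))   # nightly-load window
```

A scheduled job (or an Azure Automation runbook) would call logic like this and issue the actual scale request against the database.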