Analyze Disk Utilization

Use this analysis application to determine how you filled your hard disk

Download the Code

Most users who know about analysis applications recognize their power and value but don't realize how broad a range of situations they apply to. Probably 80 percent of OLAP product demonstrations use sales data. The sample database that SQL Server 2000 Analysis Services offers, FoodMart 2000, is a case in point. Demonstrations frequently use sales information because most people can relate to sales as a common business problem, not because of any OLAP limitation. In my July 2001 column, "Tracking Performance," and in this column, I discuss two distinct types of analysis applications that demonstrate the vastly different problems you can use OLAP technology to analyze. Last month, I wrote about an analysis application for software development. This month, I describe an application that's relevant to anyone who uses a computer: monitoring disk utilization.

If you're like me, your computer's hard disk is nearly full. Occasionally, I clean out directories that contain unused files and uninstall infrequently used programs, but each time I perform such maintenance, I free up less disk space than the time before. To help solve my problem, I wrote a small program that scans my hard disk, collects information about files and subdirectories, and stores this information in a set of comma-delimited tables in a star schema format. Next, I used Microsoft Access to import this data to Access tables. Then, I used Analysis Services' Analysis Manager to create an OLAP cube. Figure 1 shows this process.

Design and Build

Before you start any design process, you need to identify your objectives. An easy way to lay out an analysis application's objectives is to list a sampling of the questions that the analyst wants to answer. With such a list in mind, you can work backward to determine a flexible cube structure that will answer these questions. Then you can take the next step: determining the structure of a star schema that supports the building of this cube. In my disk-utilization problem, I wanted to answer the following questions:

  1. Which directories utilize the most disk space?
  2. What's the average age of the files in a directory?
  3. How many files reside in a directory (and all levels below the directory)?
  4. What's a particular directory's growth trend over time?
  5. Which file types (i.e., extensions) consume the most space?

To answer these questions, the cube needs three measures:

  1. Size—consumed disk space
  2. Age—average file age, or number of days since modification
  3. Count—number of files

I identified four useful dimensions (including the Measures dimension):

  1. Files—a dimension containing the disk's file structure
  2. Extensions—a dimension containing all the disk's possible file extensions
  3. Time—a dimension containing all the time periods during which the software took snapshots of the file system
  4. Measures—a dimension containing the cube's measures

The two most important dimensions in a cube are Time and Measures. These dimensions are essential because many functions in Analysis Services' MDX query language depend on them. If you don't structure these dimensions correctly, you'll eliminate at least half of MDX's usefulness. Of course, poorly selected measures will make a cube useless because measures determine every numeric value in a cube.

For this hard disk application, you can set up the Time dimension in several ways. The most obvious method is to use either the file's creation date or its modification date. By using one of these dates, you can answer questions such as, How much space do files created (or modified) during this period consume? That question is mildly interesting but not as useful as the ability to see growth trends over time. Growth trends show you where on the disk activity occurred during a given period. For example, you might discover that a particular directory tripled in size last May. You could then use the Extensions dimension to determine which types of files grew the most in that directory. Comparing the various measures would tell you whether the growth occurred because files increased in number or in size.
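
For example, an MDX query along the following lines would show snapshot-over-snapshot growth in consumed space for one directory. This is only a sketch: the cube name (DiskUsage), the Time level name (Snapshot), and the member paths are my assumptions, not names from this application.

  WITH MEMBER [Measures].[Size Growth] AS
    '[Measures].[Size] - ([Measures].[Size], [Time].CurrentMember.PrevMember)'
  SELECT { [Measures].[Size], [Measures].[Size Growth] } ON COLUMNS,
         { [Time].[Snapshot].Members } ON ROWS
  FROM [DiskUsage]
  WHERE ( [Files].[C:].[Program Files] )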

Now that you've determined the cube's structure, you can design a star schema from which to load it. The star schema must contain all the files and directories with their sizes and file extensions. To let you compare multiple snapshots of the file system over time, the star schema must also contain multiple snapshots of the files. Figure 2 shows the structure I chose to support these requirements. The fact table (FileFact) contains a record for each file and directory, specific to each snapshot date. The structure includes dimension tables for time (FileTime), files (FileDir), and file extensions (FileExt).
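
In comma-delimited form, the four tables might carry columns along these lines. Treat this as a sketch; Figure 2 shows the authoritative layout:

  FileTime.csv:  TimeID, SnapshotDate
  FileDir.csv:   FileID, ParentID, Name
  FileExt.csv:   ExtID, Extension
  FileFact.csv:  TimeID, FileID, ExtID, Size, AgeInDays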

The file system's organization is naturally hierarchical, and I mirrored this structure in the FileDir dimension table. The structure is unbalanced because not all branches of the hierarchy contain the same number of levels. Because of this imbalance, I used a parent-child dimension type for the Files dimension. Parent-child dimension tables use a self-referencing foreign key to describe the relationships between parent and child dimension records. In non-parent-child dimension tables, columns represent dimension levels. The Extensions dimension doesn't have a natural hierarchy, so I used a flat, one-level dimension (below the All member) for it.
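
A few made-up FileDir rows show how the self-referencing key works: each record's ParentID holds the FileID of its containing directory, and the root points at nothing (0 here):

  FileID, ParentID, Name
  1,      0,        C:\
  2,      1,        Program Files
  3,      2,        Common Files
  4,      1,        WINNT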

Next, I needed to decide how to load the desired measures (i.e., Age, Size, and Count) from the fact table. This process can be tricky because the fact table might contain multiple records for a single file. Although each of those records represents the same file, the size and age might differ between records because each record comes from a different snapshot. To simplify the process, try imagining how the aggregations in the time hierarchy need to work. Figure 3 shows a simple representation. If the fact table contains a snapshot of a file's age and size on each day of the week, how do you determine totals for the week? One answer is to use the age and size values from the last snapshot in the time period. Similarly, in inventory analysis, you always roll up values (e.g., units, dollars) from the last day of a time period.
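
To make that rollup concrete with made-up numbers: if a file's Size snapshots for Monday through Friday are 10MB, 10MB, 12MB, 12MB, and 15MB, the correct value for the week is Friday's 15MB; summing the five snapshots to 59MB would be meaningless.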

Analysis Services offers no native aggregation types that support this "inventory"-style aggregation, so I set up a custom aggregation. I could have used calculated members or calculated cells to solve this custom aggregation problem, but I decided that using custom rollup formulas was the most straightforward approach. Custom rollup formulas let you control how Analysis Services aggregates values for individual dimension levels. In this case, I wanted to override the Time dimension's aggregations for all but the lowest hierarchy level. The formula I wanted to use is Time.CurrentMember.LastChild.

The Age measure requires some extra work because it's aggregated as an average, for which Analysis Services also offers no native aggregation type. To determine the average, you can combine Analysis Services' built-in Sum and Count aggregation types. I already needed a Count measure, so half of my problem was solved. But I needed to create another measure in Analysis Manager: a Sum of the Age values. The average Age measure then becomes a calculated measure that divides the Age Sum by the Count. Because the Age Sum alone doesn't make sense, I set its Visibility property to false, which hides it from client applications. Hiding measures that store intermediate calculations is a common technique in Analysis Services applications.
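
In MDX, the calculated measure is just the ratio of the two physical measures. Here's the idea in query form; the cube name (DiskUsage) is my assumption:

  WITH MEMBER [Measures].[Avg Age] AS
    '[Measures].[Age Sum] / [Measures].[Count]'
  SELECT { [Measures].[Avg Age] } ON COLUMNS,
         { [Extensions].Members } ON ROWS
  FROM [DiskUsage]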

Now that I'd designed the star schema and the cube, my final step was to design a process to load the star schema with records. Because this application was a side project rather than a production application, I first tried the easiest method I knew: loading the list of files into Microsoft Excel for manipulation. However, the first snapshot exceeded Excel's capacity; I was amazed to discover that I have 90,000 files on my C partition. I ended up writing a small program that scans the hard disk and creates a set of comma-separated value (CSV) files that match the star schema format that Figure 2 shows. When you run the program a second time, it reads in the comma-delimited dimension tables and adds new fact table records that reference the previously created dimension table records.
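
To show the shape of such a program, here's a minimal sketch in Python. It's an illustration of the approach, not the downloadable diskdata.exe, and the CSV layouts are my assumptions. It walks the directory tree, reuses dimension IDs from earlier runs so fact records stay consistent, and appends one fact record per file:

  # A sketch of the scan-and-snapshot idea -- not the article's diskdata.exe.
  import csv
  import os
  import time

  ROOT = "C:\\"                          # partition to scan
  SNAPSHOT = time.strftime("%Y-%m-%d")   # one Time member per run

  def load_filedir(path="FileDir.csv"):
      """Reload the parent-child Files dimension so IDs stay stable."""
      ids, parents = {}, {}
      if os.path.exists(path):
          with open(path, newline="") as f:
              for did, pid, name in csv.reader(f):
                  ids[name] = int(did)
                  parents[int(did)] = int(pid)
      return ids, parents

  def load_fileext(path="FileExt.csv"):
      """Reload the Extensions dimension."""
      ids = {}
      if os.path.exists(path):
          with open(path, newline="") as f:
              for eid, ext in csv.reader(f):
                  ids[ext] = int(eid)
      return ids

  dirs, parents = load_filedir()
  exts = load_fileext()

  def dir_id(path):
      """Get or create a directory's ID, recording its parent's ID (0 = root)."""
      if path in dirs:
          return dirs[path]
      parent = os.path.dirname(path)
      pid = dir_id(parent) if parent and parent != path else 0
      did = len(dirs) + 1
      dirs[path] = did
      parents[did] = pid
      return did

  facts, now = [], time.time()
  for dirpath, _, filenames in os.walk(ROOT):
      for name in filenames:
          try:
              st = os.stat(os.path.join(dirpath, name))
          except OSError:
              continue                   # skip files we can't read
          ext = os.path.splitext(name)[1].lower() or "(none)"
          eid = exts.setdefault(ext, len(exts) + 1)
          age = int((now - st.st_mtime) / 86400)   # days since modification
          facts.append([SNAPSHOT, dir_id(dirpath), eid, st.st_size, age])

  # Rewrite the dimension tables and append this snapshot's fact records.
  # (A FileTime table could be derived from the distinct snapshot dates.)
  with open("FileDir.csv", "w", newline="") as f:
      csv.writer(f).writerows(sorted([d, parents[d], p] for p, d in dirs.items()))
  with open("FileExt.csv", "w", newline="") as f:
      csv.writer(f).writerows(sorted([e, x] for x, e in exts.items()))
  with open("FileFact.csv", "a", newline="") as f:
      csv.writer(f).writerows(facts)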

After the program runs, you can use the comma-delimited files to build the cube. I imported the comma-delimited files into Access tables, then built the cube from the tables. As an alternative, you can obtain an OLE DB driver that supports text files. If you have this driver, you can build the cube directly from the CSV files.
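
One such driver is the Microsoft Jet OLE DB provider, which can treat a folder of text files as a database. A connection string along these lines points it at the folder that holds the CSV files; the path C:\DiskData is a placeholder, and you should adjust HDR to match whether your CSV files include header rows:

  Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\DiskData;
  Extended Properties="text;HDR=No;FMT=Delimited"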

If you're interested in using this application on your hard disk, you can download the .zip file at the top of this page. The diskdata.exe program will create the CSV files in the same directory to which you download the .zip file.

Gauging Usefulness

Was the resulting cube useful? While browsing the cube, I learned several things about my hard disk that I never knew before. For example, I discovered that a third (roughly 30,000) of all the files on my hard disk resided in the Internet Cache directory. (I now regularly flush that browser cache.) Although I haven't seen enough snapshots to spot a trend, I expect to see continual growth in the \system and \system32 directories.

The number of possible analysis applications is limitless. Business analysis and other analysis applications probably have a bigger potential market than Excel does. Whether analysis technologies will grow to meet even a fraction of these possible applications in the next 5 to 10 years is anyone's guess. I hope this file-analysis application leads you to imagine other OLAP applications that can give you a deeper understanding of your activities and help your business be more successful. For another example of a file-analysis application that you can build, see "August MDX Puzzle." (For the answer to last month's puzzle, see "July MDX Puzzle Solution Revealed.")
