Skip navigation

The Case of the Good Data Gone Bad

I had just solved "The Case of the Missing Link" (http://www.windowsitpro.com/SQLServer/Article/ArticleID/47439/SQLServer_47439.html). Next was supposed to come my favorite time of day-the time when the adoring client and I retire to the local watering hole to celebrate-the time when someone else was buying the drinks. But it was not to be. Just as the thankful client was about to extend an invitation, my cell went off. And my cell was more insistent.

The voice on the other end belonged to one Conrad Spiracy Theory. An unusual name, but not nearly as unusual as the man it identified. Theory was the Chief Information Officer for a local manufacturing concern (and with Theory as the CIO, there was cause for concern). It seemed some of Theory's data had gone missing and I was his only hope for finding it. I've always been a sucker for cases in which some poor, innocent data has been taken from where it's supposed to be. I told Theory I'd be right over.

I was met in the lobby by Mr. Theory's administrative assistant, Ms. Wattsnectz. "I'm so glad you came, Mr. Powers," she led off. "Mr. Theory has just been beside himself since he discovered that the data was gone."

"I only hope I'm not too late, Ms. Wattsnectz," I replied. "How long has the data been gone?"

"Please, call me Wanda," she instructed. "The absence was discovered this morning, but the Integration Services package that imported the data ran last week."

"Last week, eh? That makes for a pretty cold trail," I noted. "SQL Server Integration Services was responsible for the data load? SSIS doesn't usually lose data unless there's a good reason. One good reason is that it's just plain bad data. Was it bad data, Ms. Wattsnectz?"

She hesitated for a moment, as if there was something she wanted to say, but couldn't. "You'll have to discuss that with Mr. Theory," was all she volunteered.

Ms. Wattsnectz opened a door and ushered me into a dimly lit office. There, behind the desk, was Mr. Theory. "I'm Conrad Spiracy Theory," he said as he rose to greet me, "but you can call me Con-everyone does. You've been briefed about the missing data, I presume."

"Yes, Ms. Wattsnectz filled me in on the way to your office."

"So what's your opinion about the reason for the disappearance?" he asked.

"I thought perhaps it was just bad data," I said. "I started to ask Ms. Wattsnectz about that, but she said I should take it up with you."

Con laughed. "That's just how they want it to look, I suppose. No, Mr. Powers, it's not bad data. My personnel in the regional offices-Boston, Los Angeles, Georgia-have all assured me that the data they're sending is completely valid."

"Georgia?" I queried. "Is that office in Atlanta?"

"No, Mr. Powers," he corrected, "we have a regional office in the former Soviet Socialist Republic, now the country of Georgia. We've been doing business with them for several months now. At first, we were hand-entering their invoice data. Now, we're importing that data into our system in the same text-file format that out state-side regional offices use. Anyway, I believe that the Russian mafia is behind this."

"The Russian mafia?" I asked in disbelief.

"They're the most likely culprits, considering our Georgian connections," he continued. "Of course, it could be the men in the black helicopters behind the whole thing. Then again, with any mysterious disappearance, alien abduction can't be ruled out."

"Perhaps it was Big Foot?" I offered.

"No, he's on Linux. We're strictly a Microsoft shop."

It was then that the pieces fell into place: the reluctant administrative assistant, the Georgian connection. I needed one piece of evidence to wrap up this case like a present for the holidays. I asked Con if I might take a look at the SSIS package in question. While I did, I requested that he ask Ms. Wattsnectz to join us.

The package was a simple one. There was a Flat File Data Source item that extracted data from a text file containing comma-separated values. The Flat File Data Source item fed its data flow to an OLE DB Destination item that dumped data into a SQL Server 2000 database table.

I examined the Error Output tab of the Flat File Data Source item and found that all of the error handling was set to Ignore errors. Someone had expected trouble, but didn't know how to deal with it! If the error handling had been at the default setting of Fail component, someone would have been alerted when the package failed. Instead, the package was ignoring any bad data it encountered, which only masked the problem and made things worse. By ignoring errors, the package would let bad data disappear without so much as a whimper, simply tossed away like yesterday's news.

I decided to use some SSIS features to lay a trap for what I suspected were bad data records lurking in the CSV source file. I changed all the error-handling settings to Redirect row. Then, I added a Flat File Destination item and linked the error flow from the Flat File Data Source item to the Flat File Destination item. (Figure 1) Now, if any bad data departed from the data flow, it would leave footprints for us to follow.

I executed my revised package just as Ms. Wattsnectz entered the office looking rather flustered. As the package ran, my trap was sprung, and I caught my quarry. Most of the data flowed to the OLE DB Destination and into the waiting SQL Server table, but as I suspected, some data went into the error flow to my newly created flat file. When the package completed, I opened the ErrorOut.txt file. (Figure 2) The file showed the raw data from each row that had caused an error. Underneath the raw data was a set of error codes. We could have looked up the error codes, but what was going on was obvious. I highlighted one of the fields and turned to the now quaking administrative assistant.

"Ms. Wattsnectz, would you like to explain to Mr. Theory what we have here?" I asked.

"I don't know," she protested feebly.

"But I think you do," I contradicted. "I think you knew exactly what we were going to find in this file even before I opened it. And I think you know why it's there."

"All right, I do know," she confessed. "What you highlighted is a date that uses Georgian date formatting. I was the one who hand-entered the invoices from Georgia. I knew they formatted their dates with periods instead of slashes. I knew that they reversed the month and day."

"To the Georgians, these dates are good data," I added, "but to the SSIS package, they appear invalid."

Con turned on his administrative assistant. "But why didn't you say something when the data was missing?"

Ms. Wattsnectz shook her head. "I didn't want you to think I was part of a conspiracy to undermine the Georgian regional office," she sobbed.

"There's a conspiracy to undermine the Georgian regional office?" asked Con with alarm.

"No," I assured him.

"That's a relief," he said with a sigh. "But we still have a problem. We're going to have to get the Georgians to change the way they enter dates into the CSV file. They'll have to relearn…"

I cut him off. "Con, if there's one thing I've learned in my time as a consulting detective, it's that it is easier to modify a computer program than it is to change the way human beings operate. I think I have a better solution to the disappearing data problem."

To prove my point, I took some action. I modified the column definitions in the Invoice CSV File connection manager, making the InvoiceDate column a string data type rather than a date. (Figure 3) This change let the Integration Services package read both the U.S. dates and the Georgian dates from the file without causing an error. Then, I opened the Flat File Data Source item, which automatically refreshed the metadata it derives from the connection manager.

Next, I added a Derived Column item. The new column created by the Derived Column item in the data flow used an expression to convert any date string containing a period to the U.S. date format. (Figure 4) Finally, I mapped my derived column to the InvoiceDate field in the database table, in place of the date that the table had previously read directly from the text file. (Figure 5) When the package ran, the missing data reappeared right where it was supposed to be.

No Russian mafia, no black helicopters, no space aliens; just the mystery of some good data gone bad. And solving mysteries is part of the job. But in my next case, the job was part of the mystery.

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish