Identifying Duplicate .jpg Files


I've created a table in SQL Server 2000 that stores .jpg files in an image column. Unfortunately, the table includes many duplicate images. Can I compare the image value in one row with the image contents of another row to determine whether the contents are identical?

I'm sure you've tried using the equal to (=) operator to compare the columns through a JOIN clause. Unfortunately, you can't use image data types with the = operator. However, a simple workaround might help you deal with this limitation.

Listing 1 shows a code example based on the Employees table in the Northwind sample database. Imagine that you want to eliminate duplicate employee photos from your database. Because you want to find employees who have the same picture in the same table, you use a self-join to compare rows in the Employees table. You alias the Employees table as E1 and E2. You can't use an = operator to join image columns, but you can use the Substring() function that Listing 1 shows to compare chunks of data. This example will work as written if none of your image columns are more than 8000 bytes. If your image columns contain more than 8000 bytes, your SQL code will be more complex. For example, I've used the Datalength() function in the following command to determine that the longest photo value in the Employees table is 21,722 bytes:

SELECT max(datalength(photo)) FROM Employees

Each Substring() call can return only 8000 bytes, so you need three separate clauses in the JOIN statement to accommodate the existing data, as Listing 2 shows. Keep in mind that a query such as the one that Listing 2 shows can take a long time to run on a large table that has large image columns.

Hide comments


  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.