JSI Tip 3538. Amended batch file to remove duplicate records.


Phil Robyn, at Berkeley, pointed out that the batch file I had scripted at tip 3530 did not handle blank records and it did not preserve leading space characters in the records. Phil submitted the following batch:

@echo off
setlocal
if {%1} EQU {} goto syntax
if not exist %1 goto syntax
set infile=%1
if {%2} EQU {} goto syntax
set outfile=%2
type nul > %outfile%
for /f "tokens=1* delims=:" %%a in (
  'type %infile%
  ^| sort
  ^| findstr /n /v /c:"CoLoRlEsS gReEn IdEaS"'
) do call :dedup %%a "%%b"
endlocal&goto :EOF
:syntax
@echo **************************************
@echo Syntax: SortDup Input_File Output_File
@echo **************************************
endlocal&goto :EOF
:dedup
set curr_rec=%2
if [%curr_rec%]==[""] set curr_rec=$$$blankline$$$
rem Strip the quotes that were wrapped around the record by the caller.
set curr_rec=%curr_rec:"=%
if not defined prev_rec goto :write
if "%curr_rec%" EQU "%prev_rec%" goto :EOF
:write
if "%curr_rec%" EQU "$$$blankline$$$" (
  echo.>>%outfile%
) else (
  echo>>%outfile% %curr_rec%
)
set prev_rec=%curr_rec%
goto :EOF

Borrowing Phil's findstr idea, I countered with the following amendment:

@echo off
setlocal
if {%1} EQU {} goto syntax
if not exist %1 goto syntax
set file=%1
set file="%file:"=%"
set work=%~pd1\%~nx1.tmp
set work="%work:"=%"
set work=%work:\\=\%
sort %file% /O %work%
del /f /q %file%
for /f "Tokens=1* Delims=:" %%s in ('findstr /n /v /c:"dO nOt FiNd" %work%') do set record=###%%t###&call :output 
REM if exist %work% del /q %work%
endlocal
goto :EOF
:syntax
@echo ***************************
@echo Syntax: SortDup Input_File 
@echo ***************************
goto :EOF
:output
if not defined prev_rec goto :write
if "%record%" EQU "%prev_rec%" goto :EOF
:write
set prev_rec=%record%
set record=%record:###=%
if "%record%" EQU "" goto :blknul
if "%record%" GTR " " @echo>>%file% %record%&goto :EOF
:blknul
if defined bn_rec goto :EOF
set bn_rec=Y
@echo.>>%file%

NOTE: Neither script gracefully handles records that contain batch control characters, such as &, |, and >. Nor do they distinguish between blank records of differing lengths, or null records. I elected to handle multiple blank records and null records by outputting a single blank record. If you don't want to output any blank records, remove the last line (@echo.>>%file%).

NOTE: Phil's script pipes the output of the sort command directly into the findstr command, while my script lets the sort write an output file (%work%). Phil's script runs faster on very small files, while mine is twice as fast when sorting larger files.

NOTE: Phil's script uses an Input_File and an Output_File, while I elected to return the results in the Input_File. I don't delete the sort output file, which I create in the same folder as Input_File. If you wish to delete it, remove the REM from REM if exist %work% del /q %work%.
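
NOTE: Both scripts run the records through findstr /n /v with a search string that is not expected to occur in the data. The /v switch passes every non-matching line, and /n prefixes each with its line number and a colon. That prefix matters because for /f skips empty lines; once numbered, a blank record is no longer empty. The following is only a minimal sketch of that behaviour, not part of either script. It assumes a hypothetical file named sample.txt, in the current folder, that contains at least one blank line:

```batch
@echo off
rem Sketch only: sample.txt is a hypothetical input file with a blank line in it.

echo --- plain for /f: blank lines are skipped ---
for /f "delims=" %%a in (sample.txt) do echo [%%a]

echo --- numbered by findstr /n: blank lines survive ---
rem %%a receives the line number, %%b receives the rest of the record.
for /f "tokens=1* delims=:" %%a in ('findstr /n /v /c:"dO nOt FiNd" sample.txt') do echo [%%b]
```

In the second listing a blank record appears as [], while the first listing omits it. Because the colon is the delimiter, a record that itself begins with a colon will probably lose its leading colons when split this way.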


