Thursday, May 13, 2021

Batch script commands to split a big text file in Windows, with an efficiency comparison (Command Prompt: splitting very large files so they can be opened)

When we try to open a big file (e.g. a server log) in a text editor, we may get an error.

Notepad: File is too large for Notepad.

Notepad++: File is too big to be opened.

At this point, the only option is to split the file with a program. A "close to the metal" language like C++ is always a good choice. However, you may not want to install a compiler, an SDK, and so on.

Is there a more convenient way? Yes, a batch script is always an option.
But remember that it may take more than half an hour to split a 1 GB file into 100 small files.

Code in Batch .bat script (Method 1 Faster)

@echo off
setlocal EnableDelayedExpansion
REM Rows per output file. Batch has no # comments, so the note must be a REM line.
set "limit=50000"
REM The input can have any extension (e.g. .csv), as long as it is a text file
set "file=YourFileName.txt"
set "lineCounter=1"
set "filenameCounter=1"

REM Split the input filename into base name and extension
set "name="
set "extension="
for %%a in ("%file%") do (
    set "name=%%~na"
    set "extension=%%~xa"
)

REM Caveats: for /f skips empty lines, and delayed expansion removes exclamation marks from the data
for /f "usebackq delims=" %%a in ("%file%") do (
    if !lineCounter! gtr !limit! (
        set /a filenameCounter+=1
        set "lineCounter=1"
        echo Created !splitFile!.
    )
    REM Output filename pattern: YourFileName-part1.txt, YourFileName-part2.txt
    set "splitFile=!name!-part!filenameCounter!!extension!"
    echo %%a>>"!splitFile!"

    set /a lineCounter+=1
)


Code in Batch .bat script (Method 2 Slower)

@echo off
setlocal enableextensions disabledelayedexpansion
set "STARTTIME=%TIME%"
REM Rows in each file
set "nLines=50000"
set "line=0"
REM The input can have any extension (e.g. .csv), as long as it is a text file

for /f "usebackq delims=" %%a in ("InputFileName.txt") do (
    set /a "file=line/%nLines%", "line+=1"
    setlocal enabledelayedexpansion
    for %%b in (!file!) do (
        endlocal
        >>"OutputName_%%b.txt" echo %%a
        REM Output filename pattern: OutputName_0.txt, OutputName_1.txt
        REM The filename prefix does NOT follow the input filename in this method.
    )
)
echo Started at %STARTTIME%, finished at %TIME%


  • Efficiency comparison

            How the script is written can double the processing time.

  • Test with 100K rows, >18 MB file

           For example: split an 18.7 MB file into 3 files (each with at most 31670 rows).

           Method 1 takes 23 s, while method 2 takes 46 s.

  • Test with 5000K rows, >1 GB file

           With method 1, splitting a 1.17 GB file (1230188 KB, around 5000K rows) into 100 small files (50K rows per file) takes 30m42s.

            With method 2, believe me, you don't want to try.

  • PC configuration reference

           16 GB RAM, with 8 x QuaCores, i5 CPU.


Conclusion
- For files over 1 GB, it is better to split them with C++. ;)
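
As a rough illustration of the C++ route suggested here, the following is a minimal sketch, not a benchmarked implementation. The function name `split_file` and its parameters are hypothetical and do not come from the scripts in this post. The output naming is assumed to mirror method 1 (`<prefix>1<ext>`, `<prefix>2<ext>`, ...). Unlike `for /f`, `std::getline` also keeps empty lines.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper (an assumption, not part of the original scripts):
// copies `input` into files named <prefix>1<ext>, <prefix>2<ext>, ...
// with at most `linesPerFile` lines each.
// Returns the number of output files written (0 if the input cannot be opened).
std::size_t split_file(const std::string& input,
                       const std::string& prefix,
                       const std::string& ext,
                       std::size_t linesPerFile) {
    std::ifstream in(input);
    if (!in || linesPerFile == 0) return 0;

    std::ofstream out;
    std::string line;
    std::size_t linesInCurrent = 0;
    std::size_t fileCount = 0;

    while (std::getline(in, line)) {
        // Start a new part for the first line, or when the current part is full.
        if (fileCount == 0 || linesInCurrent == linesPerFile) {
            if (out.is_open()) out.close();
            ++fileCount;
            out.open(prefix + std::to_string(fileCount) + ext);
            if (!out) return fileCount - 1;  // could not create the next part
            linesInCurrent = 0;
        }
        out << line << '\n';
        ++linesInCurrent;
    }
    return fileCount;
}
```

For example, `split_file("YourFileName.txt", "YourFileName-part", ".txt", 50000)` would write YourFileName-part1.txt, YourFileName-part2.txt and so on. No timing is claimed for this sketch; it would have to be measured on the same machine to compare it with the batch results.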

 
