Thursday, May 13, 2021

Batch script commands to split a big text file in Windows, with efficiency comparison (Command Prompt: split and open very large files)

When we try to open a big file (e.g. a server log) in a text editor, we may get an error.

Notepad: File is too large for Notepad.

Notepad++: File is too big to be opened.

At this point, the only way is to split the file with a program. It is always good to use a "close to the metal" language like C++. However, you may not want to install a compiler, SDK, etc.

Is there a more convenient way? Yes, a batch script is always an option.
But always remember: it may take you more than half an hour to split a 1 GB file into 100 small files.

Code in Batch .bat script (Method 1 Faster)

@echo off
setlocal EnableDelayedExpansion
REM Rows per output file
set "limit=50000"
REM The input can have any extension (e.g. csv), as long as it is a text file
set "file=YourFileName.txt"
set lineCounter=1
set filenameCounter=1

REM Split the input file name into base name and extension
set "name="
set "extension="
for %%a in ("%file%") do (
    set "name=%%~na"
    set "extension=%%~xa"
)

REM Note: for /f skips empty lines, and delayed expansion drops "!" from the data
for /f "usebackq delims=" %%a in ("%file%") do (
    if !lineCounter! gtr !limit! (
        set /a filenameCounter+=1
        set lineCounter=1
        echo Created !splitFile!.
    )
    REM Output filename pattern: YourFileName-part1.txt, YourFileName-part2.txt
    set "splitFile=!name!-part!filenameCounter!!extension!"
    >>"!splitFile!" echo(%%a
    set /a lineCounter+=1
)


Code in Batch .bat script (Method 2 Slower)

@echo off
setlocal enableextensions disabledelayedexpansion
set "STARTTIME=%TIME%"
REM Rows in each output file
set "nLines=50000"
set "line=0"
REM The input can have any extension (e.g. csv), as long as it is a text file
REM Output filename pattern: OutputName_0.txt, OutputName_1.txt, ...
REM The output filename prefix does NOT follow the input file name in this method.

for /f "usebackq delims=" %%a in ("InputFileName.txt") do (
    set /a "file=line/%nLines%", "line+=1"
    setlocal enabledelayedexpansion
    for %%b in (!file!) do (
        endlocal
        >>"OutputName_%%b.txt" echo(%%a
    )
)
REM Report the start and end time of the run
echo Started %STARTTIME%, finished %TIME%


  • Efficiency comparison

           How the script is written can double the processing time.

  • Test with 100K rows, >18 MB file

           Example: split an 18.7 MB file into 3 files (each file with at most 31670 rows).

           Method 1 takes 23 s, while method 2 takes 46 s.

  • Test with 5000K rows, >1 GB file

           With method 1, splitting a 1.17 GB file (1230188 KB, around 5000K rows) into 100 small files (50K rows per file) takes 30m42s.

           With method 2, believe me, you don't want to try.

  • PC configuration for reference

           16 GB RAM, i5 CPU, 8 x QuaCores.


Conclusion
- Better to use C++ to split files over 1 GB. ;)
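
For anyone who does have a compiler at hand, here is a minimal sketch of what the C++ route could look like. It is not code from the tests above and I have not timed it against the batch numbers. The function name split_by_lines is made up for illustration, and it assumes a C++17 compiler (for std::filesystem). It follows the same output pattern as Method 1: YourFileName-part1.txt, YourFileName-part2.txt, and so on.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: splits `input` into files of at most `limit` lines,
// named <stem>-part<N><ext> in the same folder as the input.
// Returns the number of part files written; 0 if the input cannot be opened
// or `limit` is 0. Every output line ends with '\n', so a final line without
// a newline in the input gains one.
std::size_t split_by_lines(const fs::path& input, std::size_t limit) {
    // Binary mode keeps a '\r' from CRLF files inside each line, so Windows
    // line endings survive the round trip.
    std::ifstream in(input, std::ios::binary);
    if (!in || limit == 0) return 0;

    std::ofstream out;
    std::size_t parts = 0;          // part files opened so far
    std::size_t lines_in_part = 0;  // lines written to the current part
    std::string line;

    while (std::getline(in, line)) {
        // Open the next part only when there is a line to put in it.
        if (!out.is_open() || lines_in_part == limit) {
            if (out.is_open()) out.close();
            ++parts;
            const fs::path part = input.parent_path() /
                (input.stem().string() + "-part" + std::to_string(parts) +
                 input.extension().string());
            out.open(part, std::ios::binary | std::ios::trunc);
            if (!out) return parts - 1;  // could not create this part
            lines_in_part = 0;
        }
        out << line << '\n';
        ++lines_in_part;
    }
    return parts;
}
```

Calling split_by_lines("YourFileName.txt", 50000) from main() would mirror the settings used in Method 1. Unlike the batch scripts, this sketch keeps empty lines and lines containing "!", because it copies each line as-is.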

 
