I have a PHP script that frequently needs to email me the last few lines of a log file. I can’t afford to exec() a binary tail process, so the solution has to be in pure PHP.
Originally, the files in question never exceeded a few thousand lines. Unfortunately, I am now encountering cases where the files are occasionally 50,000 lines or longer. This causes PHP’s memory consumption to explode.
Note: The code snippets provided here are not fully functional standalone scripts. The scripts I ran to benchmark the algorithms contain some rudimentary setup logic that is not important here, and so has not been included.
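For the curious, something along these lines is all the snippets below really need — $fname and $lines are the only variables they depend on (this is a sketch, not the actual harness I used):

[code]
#!/usr/bin/env php
<?php
// Sketch of the omitted setup: take the log file from the command line
// and decide how many lines to keep. Not the actual benchmark harness.
$fname = isset($argv[1]) ? $argv[1] : 'a.log';
$lines = 20;

// ... one of the tail implementations below goes here, filling $buf ...

echo $buf;
[/code]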
My original method:
[code]
// tail-file.php
$arr = @file( $fname, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
$arr = array_slice($arr, -$lines);
$buf = implode("\n", $arr);
[/code]
This is easy to understand and is pretty fast, all things considered. Unfortunately, the memory footprint of loading an entire file into an array is obscene. Loading a 4,400-line log file this way can consume more than 17mb of ram, and 50,000-line files easily stress the 256mb limit I am able to give the process.
So, the obvious solution to the memory consumption is to avoid loading the entire file at once. What if we kept a rotating list of lines in the file?
[code]
// tail-array.php
$arr = array_fill( 0, $lines+1, "\n" );
$fp = fopen($fname, "r");
while( !feof($fp) ) {
    $line = fgets($fp, 4096);
    $arr[] = $line;   // faster than array_push()
    array_shift($arr);
}
fclose($fp);
$buf = implode("", $arr);
[/code]
This method works by keeping only the most recent $lines lines of the file in an array as it reads. Memory consumption remains sane, but the performance hit from doing so many array pushes and shifts is bad. Really bad. With small files I can’t notice any difference between this method and the file() method… but with longer files, it adds up quickly.
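As an aside, the shifting itself could be avoided by treating the array as a fixed-size ring buffer and overwriting slots in place — not what I benchmarked below, and it still has to read the whole file, but it sidesteps the constant re-indexing that array_shift() causes:

[code]
// Sketch: same rotating-lines idea, but with a ring buffer instead of
// push + shift. Still reads the entire file, so it only softens the array cost.
$ring = array_fill(0, $lines, "");
$count = 0;
$fp = fopen($fname, "r");
while( ($line = fgets($fp, 4096)) !== false ) {
    $ring[$count % $lines] = $line;
    $count++;
}
fclose($fp);

// Reassemble the kept lines in their original order.
$buf = "";
for( $i = max(0, $count - $lines); $i < $count; $i++ ) {
    $buf .= $ring[$i % $lines];
}
[/code]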
Given a 51 line, 4kb file, an average execution ($lines = 20) might look like this:
[code]
ammon@zapp:~$ time ./tail-file.php a.log >/dev/null
real 0m0.015s
user 0m0.009s
sys 0m0.007s
ammon@zapp:~$ time ./tail-array.php a.log >/dev/null
real 0m0.016s
user 0m0.010s
sys 0m0.006s
[/code]
Comparable enough. But given a 50,004 line (3.3mb) log file:
[code]
ammon@zapp:~$ time ./tail-file.php b.log >/dev/null
real 0m0.079s
user 0m0.058s
sys 0m0.021s

ammon@zapp:~$ time ./tail-array.php b.log >/dev/null
real 0m0.119s
user 0m0.112s
sys 0m0.007s
[/code]
The difference becomes quite clear. However… what if my log file grows obscenely large? I’ve got a 9 million line log file (1.6gb) lying around to test with…
[code]
ammon@zapp:~$ time ./tail-file.php c.log >/dev/null
real 0m0.015s
user 0m0.008s
sys 0m0.008s

ammon@zapp:~$ time ./tail-array.php c.log >/dev/null
real 0m19.351s
user 0m18.545s
sys 0m0.803s
[/code]
The file() method crashes because it can’t allocate enough ram to hold a 9-million-element array, and the array method takes almost 20 seconds to execute. It’s slow… but at least it works.
Of course, there are other methods. The one I finally settled on is this:
[code]
// tail-seek.php
$fp = fopen($fname, "r");
$lines_read = 0;
if( $fp !== FALSE ) {
    fseek( $fp, 0, SEEK_END );
    $pos = $eof = ftell($fp);
    do {
        --$pos;
        fseek($fp, $pos);
        $c = fgetc($fp);
        if( $c == "\n" )
            $lines_read++;
    } while( $pos > 0 && $lines_read <= $lines );
    $buf = fread($fp, $eof - $pos);
}
fclose($fp);
[/code]
This method doesn’t waste time reading the bulk of the file. It jumps to the end and scans backward until enough newlines have been located. The only problem here is that your average filesystem isn’t optimized for reading backwards… but since we’re not really reading very much data, it doesn’t much matter.
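If the character-by-character backward reads ever did become a problem, the same trick works in bigger steps: seek back a block at a time, count the newlines in each block with substr_count(), and stop once enough have been seen. Something like this (a sketch, not the version I benchmarked):

[code]
// Sketch: backward scan in 4kb blocks instead of single characters.
$block = 4096;
$fp = fopen($fname, "r");
fseek($fp, 0, SEEK_END);
$pos = ftell($fp);
$buf = "";
$newlines = 0;
while( $pos > 0 && $newlines <= $lines ) {
    $read = min($block, $pos);
    $pos -= $read;
    fseek($fp, $pos);
    $chunk = fread($fp, $read);
    $newlines += substr_count($chunk, "\n");
    $buf = $chunk . $buf;
}
fclose($fp);

// We may have pulled in a little too much; keep only the last $lines lines.
$parts = explode("\n", rtrim($buf, "\n"));
$buf = implode("\n", array_slice($parts, -$lines)) . "\n";
[/code]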
[code]
ammon@zapp:~$ time ./tail-seek.php a.log >/dev/null
real 0m0.017s
user 0m0.009s
sys 0m0.008s

ammon@zapp:~$ time ./tail-seek.php b.log >/dev/null
real 0m0.017s
user 0m0.008s
sys 0m0.010s

ammon@zapp:~$ time ./tail-seek.php c.log >/dev/null
real 0m0.023s
user 0m0.015s
sys 0m0.008s
[/code]
Performance is a trifle slower on small files, but it’s astronomically better on long ones. This is similar to the method used by most Unix ‘tail’ commands, and is the clear winner for actual use in my application.
Of course, it needs a bit of cleanup from the state I’ve provided it in, and isn’t appropriate for all environments… but it’s a trifle better than requiring 20 seconds and 20gb of ram to execute 😉
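For anyone who wants to borrow it, that cleanup might end up looking something like this — a sketch of the seek method wrapped in a function, not the exact code running in my app:

[code]
// Sketch: the seek method wrapped up with a bit of error handling.
// Returns the last $lines lines of $fname, or false if it can't be opened.
function tail_lines($fname, $lines = 20)
{
    $fp = @fopen($fname, "r");
    if( $fp === false )
        return false;

    fseek($fp, 0, SEEK_END);
    $eof = $pos = ftell($fp);
    $lines_read = 0;

    while( $pos > 0 && $lines_read <= $lines ) {
        --$pos;
        fseek($fp, $pos);
        if( fgetc($fp) == "\n" )
            $lines_read++;
    }

    // If we found enough newlines, start just past the last one we saw;
    // otherwise we hit the top of the file and want everything.
    $start = ($lines_read > $lines) ? $pos + 1 : 0;
    fseek($fp, $start);
    $buf = ($eof > $start) ? fread($fp, $eof - $start) : "";
    fclose($fp);

    return $buf;
}
[/code]

Calling tail_lines($fname, 20) then hands back the same $buf the snippet above produced.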