Resolving a mysterious problem with find
The Endeavour 2024-11-12
Suppose you want to write a shell script that searches the current directory for files that have a keyword in the name of the file or in its contents. Here’s a first attempt.
find . -name '*.py' -type f -print0 | grep -i "$1"
find . -name '*.py' -type f -print0 | xargs -0 grep -il "$1"
This works well for searching file contents but behaves unexpectedly when searching for file names.
If I have a file named frodo.py in the directory, the script will return
grep: (standard input): binary file matches
Binary file matches?! I wasn’t searching binary files. I was searching files with names consisting entirely of ASCII characters. Where is a binary file coming from?
If we cut off the pipe at the end of the first line of the script and run
find . -name '*.py' -type f -print0
we get something like
./elwing.py./frodo.py./gandalf.py
with no apparent non-ASCII characters. But if we pipe the output through xxd to see a hex dump, we see that there are invisible null characters after each file name. grep classifies input containing null bytes as binary, which explains the message.
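The null bytes can be made visible without xxd as well. The following is a minimal sketch, assuming a scratch directory and two made-up file names; it uses od -c, which is specified by POSIX, in place of xxd, and counts the null bytes with tr and wc.

```shell
# Build a scratch directory with two hypothetical file names.
tmpdir=$(mktemp -d)
touch "$tmpdir/frodo.py" "$tmpdir/gandalf.py"

# od -c prints every byte; the NUL separators appear as \0.
dump=$(cd "$tmpdir" && find . -name '*.py' -type f -print0 | od -c)
printf '%s\n' "$dump"

# Delete everything except NUL bytes and count what is left:
# one NUL per file that find reported.
nul_count=$(cd "$tmpdir" && find . -name '*.py' -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')
echo "NUL bytes: $nul_count"

rm -rf "$tmpdir"
```

Each file name in the dump is followed by a \0 entry, one for every path that find printed.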
One way to fix our script would be to add a -a option to the call to grep, telling it to treat the input as text. But this will return the same output as above. The output of find contains no newlines, so it is treated as one long line, and that whole line matches the regular expression.
Another possibility would be to add a -o flag to direct grep to return just the match. But this is less than ideal as well. If you were looking for file names containing a Q, for example, you’d get Q as your output, which doesn’t tell you the full file name.
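Both behaviors are easy to reproduce. This is a rough sketch with made-up file names in a scratch directory; it assumes -o is combined with -a so that grep does not stop at the binary-file message, and it pipes through tr only to turn the null bytes into visible spaces.

```shell
# Scratch directory with two hypothetical file names.
tmpdir=$(mktemp -d)
touch "$tmpdir/frodo.py" "$tmpdir/gandalf.py"

# -a: the NUL-separated list is a single "line", so a match on frodo
# brings back every name, not just the matching one.
with_a=$(cd "$tmpdir" && find . -name '*.py' -type f -print0 | grep -a -i 'frodo' | tr '\0' ' ')
echo "$with_a"

# -a -o: only the matched text is printed, not the file name around it.
with_o=$(cd "$tmpdir" && find . -name '*.py' -type f -print0 | grep -a -o -i 'frodo')
echo "$with_o"

rm -rf "$tmpdir"
```

The first result lists gandalf.py even though only frodo was searched for, and the second is the bare word frodo.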
There may be better solutions [1], but my solution was to insert a call to strings in the pipeline:
find . -name '*.py' -type f -print0 | strings | grep -i "$1"
This will extract the ASCII strings out of the input it receives, which has the effect of splitting the string of file names into individual names.
By default the strings command defines a string to be a run of 4 or more consecutive printable characters. A file with anything before the .py extension will necessarily have at least four characters. The analogous script to search C source files could in principle overlook a name as short as x.c, although find started from . prints it as ./x.c, which is five characters and so clears the default. If your paths could be shorter than four characters, you could use strings -n 3 to find sequences of three or more printable characters.
If you don’t have the strings command installed, you could use sed to replace the null characters with newlines.
find . -name '*.py' -type f -print0 | sed 's/\x0/\n/g' | grep -i "$1"
Note that the null character is denoted \x0 rather than simply \0.
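The \x escape is, as far as I know, a GNU extension, so other versions of sed may not understand it. A variation that avoids the question is tr, which accepts the octal escape \0. The sketch below swaps tr in for sed; the function name search_names and the file names are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper (not from the post): print the paths of .py files
# under the current directory whose path matches $1, ignoring case.
search_names() {
    # tr turns each NUL separator into a newline so grep sees one path per line.
    find . -name '*.py' -type f -print0 | tr '\0' '\n' | grep -i -- "$1"
}

# Try it in a scratch directory with made-up file names.
tmpdir=$(mktemp -d)
touch "$tmpdir/frodo.py" "$tmpdir/gandalf.py"
found=$(cd "$tmpdir" && search_names 'FRODO')
echo "$found"
rm -rf "$tmpdir"
```

Like the sed version, this gives up the protection that -print0 offers against file names containing newlines, which is acceptable when the names are known to be ordinary.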
[1] See the comments for better solutions. I really appreciate your feedback. I’ve learned a lot over the years from reader comments.
The post Resolving a mysterious problem with find first appeared on John D. Cook.