Tag Archives: grep

Searching for apples and oranges, using grep.

Not using grep? Its a step up from sorting and cutting and pasting in spreadsheets. You will feel, briefly, omniscient, when you use it to solve some problem that’s been bugging you. Here’s my latest:

You care about two keywords in a file- apples and oranges, and you also care about about their relative positions, for whatever reason. So grepping for each, separately, is nice, but you’d really like to grep for one OR the other.

Did I mention this was grep?

grep -i ‘apple\|orange’ *filename.ext*

The -i makes it case-insensitive, just like you’d want on a first pass. The “|” vertical bar is a familiar OR operator, and the only tricky parts are to a) put the whole thing in a single set of single quotes- the two words and the operator are a single syntactic unit, and b) use a back-slash to mark the vertical bar as an operator and not just a literal vertical bar.

I used apple and orange in the title because they are canonically “unrelated” things, but where this technique is really useful is when the unrelated things are in orthagonal kinds: fruits and deserts. If you’ve got your recipes filed or a cookbook on line, grep -i ‘pie\|apple’ will produce all the refernces to either. Pies involving apples will be found where ‘apple’ has ‘pie’ both above and below… As a human, you have a right to do that last bit in your head, the sorting out that we gatherer-hunters are bred for.

Cool interview questions

Q – What 3 Unix utilities would you take to a desert island, and why?

A: Find, Grep, hmm, gnuplot. Find and Grep are very capable, deep, and reliable. It would be easier not to have to re-do them myself, whereas sort, ls, df, even ps aren’t nearly as difficult to re-write myself.

Q: You have two strings, needle and haystack. Both are very long- haystack is at least as long as needle, could be longer. Potentially, millions of characters.

Q1: What’s the fastest way to determine whether all the characters in needle are found in haystack? Order doesn’t have to be the same, but if there are 27 ‘A’ characters in needle, we want to know whether there are at least 27 in haystack. And on through the rest of the, say, 8 bit character set.

A: Obviously, the trick here is to NOT do an N * N solution where each character in needle results in a read (or even a partial read) of haystack. Yet, one can’t avoid reading needle at least once. So: make an array with 256 elements, coresponding to character values 0 to 255. Initialize all array members to 0. Read through needle, incrementing the count for each character value when that character is encountered. For example: needle is “needle”. Character counts are:

d: 1
e: 3
l: 1
n: 1

Once needle has been completely read and counted, start reading haystack, and *decrement* the count (but stop at 0)  for each character value whenever one of those characters appears. The read through haystack ends when either all the character counters are “0”, which is success, or when haystack is exhausted AND at least one character counter is greater than zero.

This solution is on the order of 2N + some constant, much better than the N * N obvious technique.

Q2: The A1 technique requires 256 counters of 20+ bits – most likely 32 bit integers. So it requires 1K bytes. How can the job be done with the very least amount of memory, but as long a run time as it takes?

A2: 2, 32 bit integers, and a character, are enough. One integer is an index (indice) into needle or haystack, one is a counter. The character is incremented through all possible character values. For each character value, the count is zeroed, then needle is stepped through, adding one to the counter for every instance of the target character code. After needle is completely counted, the same index is used to step through the locations in haystack, and the integer in the counter is decremented (provided it started off positive) any time the target character is found. But not below 0. If any trial gets to the end of haystack with a non-zero counter value, return a fail and give up. If the counter gets to zero while decrementing, look no further, increment the character, start on the next character.