
Project 29 File-Content Tips

“Is there an easy way to format the contents of text files?”

This project gives you tips for detecting the type of content a file contains and introduces some handy text-processing utilities.

Determine File Content

The file command reports the type of content a file contains.

$ file *
   about-html.txt:  ASCII text
   fake.html:       empty
   index.html:      ASCII HTML document text
   letter.doc:      ASCII English text
   nodif:           a /bin/tcsh script text executable
   smtp-auth-plain: a /usr/bin/perl script text executable
   unix2mac:        a /bin/bash script text executable
   week1:           directory
   week1.tar:       POSIX tar archive
   week1.tbz2:      bzip2 compressed data, block size = 900k

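Specify option -i if you would like the file type displayed in MIME format. (On some systems, including recent releases of macOS, the MIME option is -I or --mime instead, and -i has a different meaning; check man file if the output does not look like the example.)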

$ file -i *
   about-html.txt:  text/plain; charset=us-ascii
   fake.html:       application/x-empty
   index.html:      text/html; charset=us-ascii
   letter.doc:      text/plain, English; charset=us-ascii
   nodif:           application/x-shellscript
   smtp-auth-plain: application/x-perl
   unix2mac:        application/x-shellscript
   week1:           application/x-not-regular-file
   week1.tar:       application/x-tar, POSIX
   week1.tbz2:      application/octet-stream
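
In a script, the MIME form is easier to test than the descriptive text. The following is a minimal sketch, not part of the original project: the function name mime_of is hypothetical, and it assumes a version of file that supports the -b (omit the filename) and --mime-type options, as current macOS and Linux versions do.

```shell
# Hypothetical helper: print only the MIME type of one file.
# Assumes file(1) supports -b (omit the filename) and --mime-type.
mime_of() {
    file -b --mime-type "$1"
}

# Example use: branch on the reported type.
# case "$(mime_of index.html)" in
#     text/html) echo "HTML document" ;;
#     text/*)    echo "some other kind of text" ;;
#     *)         echo "not text" ;;
# esac
```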

Search for Files with a Specific Type of Content

We can pipe the results from file to grep to look for files with specific content.

$ file * | grep -i html
   about-html.txt: ASCII text
   fake.html:      empty
   index.html:     ASCII HTML document text

This simple approach has a problem: if the filename contains the search term, the line matches regardless of the file's content. To avoid this, make the regular expression first consume everything from the beginning of the line up to the colon after the filename, using “^.*:”, and only then search for html.

$ file * | grep -i "^.*:.*html"
   index.html:      ASCII HTML document text

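The regular expression matches, from the start of a line (^), anything (.*) followed by a colon and then anything followed by html. Option -i makes the match case insensitive, which is why html also matches HTML in the output of file.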
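
The same filter can be written so that the filename and the description are separated explicitly instead of relying on the regular expression. This is only a sketch: match_type is a hypothetical name, and it assumes that filenames contain no colon, so that the first colon on each line marks the end of the filename.

```shell
# Hypothetical helper: read "name: description" lines (the output of file)
# on standard input and print those whose description contains the term.
# Assumes filenames contain no colon; the first colon ends the filename.
match_type() {
    awk -v term="$1" '
    {
        colon = index($0, ":")              # position of the first colon
        desc  = substr($0, colon + 1)       # everything after it
        if (colon > 0 && index(tolower(desc), tolower(term)) > 0)
            print
    }'
}

# Usage: file * | match_type html
```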

Process Files with a Specific Content Type

It’s easy to extend the pipeline example given above, making it pass the list of filenames to a command like Apple’s TextEdit.

To do this, we use awk to pass on just the filename, which is the first whitespace-separated field of the line.

$ file * | grep -i "^.*:.*html" | awk '{print $1}'
   index.html:

Then we use sed to chop off the colon.

$ file * | grep -i "^.*:.*html" | awk '{print $1}' \
   | sed 's/://'
   index.html

Finally, we use xargs to form a command line from the list of files.

$ file * | grep -i "^.*:.*html" | awk '{print $1}' \
   | sed 's/://' | xargs open -a textedit

In this example, the command line will be

open -a textedit index.html

The open -a command launches the specified GUI application, so TextEdit opens index.html.

An alternative approach uses option -F, which tells file to separate the filename from the content type with a space and a colon instead of just a colon. Consequently, the first field seen by awk is the filename without the colon, and the sed step is no longer needed.

$ file -F " :" * | grep -i "^.*:.*html" \
   | awk '{print $1}' | xargs open -a textedit
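
Both pipelines cut a filename at its first space, because awk and xargs split on whitespace. The sketch below shows one possible way around that; it is not from the original text. The name names_only is hypothetical, and the approach assumes that no filename contains a colon followed by a space and that xargs supports -0, as the macOS and GNU versions do.

```shell
# Hypothetical helper: reduce each "name: description" line to the name.
# Removes everything from the first colon that is followed by whitespace,
# so names containing spaces survive intact.
names_only() {
    sed 's/:[[:space:]].*$//'
}

# Usage on macOS (tr and xargs -0 keep each name as a single argument):
# file * | grep -i "^.*:.*html" | names_only | tr '\n' '\0' | xargs -0 open -a TextEdit
```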

Search Compressed Files

Option -z tells file to look inside compressed files. Compare the output of the next two examples.

$ file week1.tbz2
   week1.tbz2: bzip2 compressed data, block size = 900k
$ file -z week1.tbz2
   week1.tbz2: POSIX tar archive (bzip2 compressed data,
   block size = 900k)

Expand and Unexpand Tabs

The expand command replaces tab characters with the appropriate number of spaces (tab stops are every eight columns by default), and unexpand does the reverse. By default, unexpand converts only leading spaces; pass option -a to convert runs of spaces throughout each line.
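
A short round trip illustrates both commands. This is a sketch under stated assumptions: the sample data is made up, and option -t, which sets the tab-stop spacing, is not mentioned in the text above.

```shell
# Sample line that starts with one tab (temporary file, made-up data).
sample=$(mktemp)
printf '\tindented\n' > "$sample"

# Tabs to spaces, with tab stops every 4 columns instead of the default 8.
spaces=$(expand -t 4 "$sample")

# Spaces back to tabs; -a converts blanks anywhere in the line, not just leading ones.
tabs=$(printf '%s\n' "$spaces" | unexpand -a -t 4)

printf '%s\n' "$spaces" | od -c | head -n 2    # shows four spaces, no \t
printf '%s\n' "$tabs"   | od -c | head -n 2    # shows \t again
rm -f "$sample"
```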

Fold Long Lines

Long lines can be broken into shorter lines by the fold command. In this example, the output has lines of no more than 40 characters. Output is displayed on the terminal screen; to save the results, simply redirect output to a file by using > name-of-output-file.

$ cat longlines
   this is a file with one very long line and no linefeeds in
   it to demonstrate the use of fold to break long lines into
   the specified width
$ fold -w40 longlines
   this is a file with one very long line a
   nd no linefeeds in it to demonstrate the
   use of fold to break long lines into th
   e specified width

The fmt command is more sophisticated and breaks lines at spaces instead of midword.

$ fmt -40 longlines
   this is a file with one very long line
   and no linefeeds in it to demonstrate
   the use of fold to break long lines into
   the specified width
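
The fold command can also break at blanks when given option -s, which the text above does not cover. The following sketch recreates the sample file in a scratch directory, saves the folded output by redirection, and compares the two modes; the directory and output filenames are hypothetical.

```shell
workdir=$(mktemp -d)
text='this is a file with one very long line and no linefeeds in it to demonstrate the use of fold to break long lines into the specified width'
printf '%s\n' "$text" > "$workdir/longlines"

# Break at exactly 40 columns, even in the middle of a word, and save the result.
fold -w 40 "$workdir/longlines" > "$workdir/folded"

# With -s, break after the last blank that fits within 40 columns.
fold -s -w 40 "$workdir/longlines" > "$workdir/folded-at-blanks"

cat "$workdir/folded-at-blanks"
# Remove the scratch directory when finished: rm -r "$workdir"
```

Unlike fmt, fold -s only inserts line breaks; it never joins short lines or adjusts spacing.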

Split Large Files

Use the split command to split a large file into many smaller files. By default, each piece is 1,000 lines long and the pieces are named xaa, xab, and so on. Specify option -l to change the number of lines in each piece.
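
As a small illustration, the sketch below splits a five-line file into pieces of two lines each. The scratch directory, the sample data, and the output prefix part. are all hypothetical; split appends the suffixes aa, ab, ac, and so on to whatever prefix is given.

```shell
splitdir=$(mktemp -d)
printf '%s\n' one two three four five > "$splitdir/numbers"

# Two lines per piece; "part." is a made-up prefix for the output names.
split -l 2 "$splitdir/numbers" "$splitdir/part."

ls "$splitdir"
# Concatenating the pieces in name order reproduces the original file.
cat "$splitdir"/part.* | cmp - "$splitdir/numbers" && echo "pieces match the original"
```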
