Unix Command-line Text-Manipulation


Viewing 1 post (of 1 total)

  • DevynCJohnson

    The Linux command line has many uses, and both the shell and shell scripts can manipulate text, including text stored in files. This introduction covers many of the command-line tools for text manipulation, which is useful for anyone who wants a better experience with the Linux operating system.

    awk

    Awk is a small, C-like, Turing-complete language that is interpreted by the "awk" command. In general, awk is considered faster than sed, although the difference depends on the task and the implementation, and some users find awk harder to use. Unlike grep, awk can search for certain hex values. Awk also supports conditional statements and loops.

    Awk scripts are written like shell scripts, but they contain awk commands and begin with the awk hashpling (shebang) line "#!/usr/bin/awk -f". Awk scripts may use the "*.awk" file extension.
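
    As a minimal sketch of the conditional statements and loops mentioned above, the following shell snippet runs a short awk program over one line of made-up numbers. The function name "awk_demo" and the sample input are assumptions chosen only for illustration.

```shell
# Classify each number with if/else and add them up with a for loop.
# awk_demo is a hypothetical helper; "3 8 15" is made-up sample input.
awk_demo() {
    printf '%s\n' "$1" | awk '{
        total = 0
        for (i = 1; i <= NF; i++) {    # loop over every field on the line
            if ($i % 2 == 0)           # conditional: even or odd
                print $i, "even"
            else
                print $i, "odd"
            total += $i
        }
        print "total", total
    }'
}

awk_demo "3 8 15"
```

    The same awk program could be saved in its own "*.awk" file with the awk hashpling as its first line.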

    One-liners

    • Domain Expiration - whois dcjtech.info | awk '/Registry Expiry Date:/ {print $4}'
    • List Httpd 404 Errors - awk '$9 == 404' /var/log/httpd/access.log
    • List Usernames and UIDs - awk -F":" '{print $1 " " $3}' /etc/passwd
    • List Users - awk -F':' '{print $1}' /etc/passwd
    • List Users (Alphabetically) - awk -F':' '{print $1}' /etc/passwd | sort
    • Remove Duplicate Lines - awk '!x[$0]++' FILE.txt > NEW_FILE.txt
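
    To try two of these one-liners without touching real system files, the sketch below runs them on a temporary file. The sample records are made up (they only imitate the /etc/passwd layout), and the variable names are assumptions. The expression '!x[$0]++' uses each whole line as an array key: the counter is zero (so the test is true and the line is printed) only the first time a line is seen.

```shell
# Made-up sample data in the colon-separated /etc/passwd layout.
sample=$(mktemp)
printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'alice:x:1000:1000:Alice:/home/alice:/bin/bash' \
    'root:x:0:0:root:/root:/bin/bash' > "$sample"

# Field 1 is the username and field 3 is the UID.
users=$(awk -F':' '{print $1 " " $3}' "$sample")

# Print a line only the first time it appears, keeping the original order.
deduped=$(awk '!x[$0]++' "$sample")

printf '%s\n' "$users"
printf '%s\n' "$deduped"
rm -f "$sample"
```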

    Generate Random Numbers

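    #!/usr/bin/awk -f
    # Print ten pseudo-random numbers between 0 and 1.
    # Everything runs in BEGIN, so the script does not wait for input.
    BEGIN {
        srand()    # seed the generator from the current time
        for (i = 1; i <= 10; i++)
            print rand()
    }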

    Various implementations of awk are available

    • gawk - GNU awk is based on the POSIX awk standard and has additional features
    • jawk - Java awk (http://sourceforge.net/projects/jawk/) is an awk implementation written in Java.
    • mawk - Modified awk (http://invisible-island.net/mawk/mawk.html) is smaller and faster than Gawk.
    • nawk - New awk is AT&T's updated version of awk, and the POSIX awk standard is based on it.
    • oawk - Old Awk is the original awk. The name "oawk" is used for compatibility.

    sed

    The Stream Editor (sed) (https://www.gnu.org/software/sed/) is a Unix utility that manipulates text based on special commands written in the "sed" language. Both the command and the language it uses are called "sed". The language is simple and Turing-complete, and many users say it is easier to learn than awk. However, awk is generally considered faster than sed. The sed language can be used in sed scripts, which use the "#!/bin/sed -f" hashpling and may use the "*.sed" file extension. Sed is commonly used for finding and replacing text.

    One-liners

    • Count Lines in File - sed -n '$=' FILE.txt
    • Double-space a File - sed G FILE.txt > NEW_FILE.txt
    • Find and Replace - sed 's/FIND/REPLACE/g' FILE.txt > NEW_FILE.txt
    • Find and Replace (Case-insensitive) - sed 's/FIND/REPLACE/gI' FILE.txt > NEW_FILE.txt (the "I" flag is a GNU sed extension; the "-i" parameter would instead edit the file in place)
    • Remove Trailing Whitespace (Each Line of File) - sed 's/[[:space:]]*$//' FILE.txt > NEW_FILE.txt
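
    The one-liners above can be tried on piped input instead of real files. This is only a sketch: the sample strings and variable names are made up for illustration.

```shell
# "$=" prints the number of the last line, which is the line count.
count=$(printf 'a\nb\nc\n' | sed -n '$=')

# Replace every occurrence of "foo" on each line.
replaced=$(printf 'foo and foo\n' | sed 's/foo/bar/g')

# Strip trailing spaces and tabs from each line.
trimmed=$(printf 'text \t \n' | sed 's/[[:space:]]*$//')

# G appends a newline and the (empty) hold space, double-spacing the text.
doubled_lines=$(printf 'a\nb\n' | sed G | wc -l | tr -d ' ')

printf '%s\n' "$count" "$replaced" "$trimmed" "$doubled_lines"
```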

    ssed

    Super-sed (http://sed.sourceforge.net/grabbag/ssed/) is an enhanced version of sed that is generally faster than the original sed.

    Perl

    Perl (https://www.perl.org/) is a Turing-complete scripting language that is commonly used for advanced text manipulation (among other uses). Perl can also be used as an alternative to PHP on dynamic servers. Perl can be used in the command-line or in Perl scripts, which contain the "#!/usr/bin/perl" hashpling and may use the "*.pl" file extension.

    Perl can be used as a substitute for the "sed" command. For example, sed 's/FIND/REPLACE/g' is equivalent to perl -pe 's/FIND/REPLACE/g'. Perl supports a substitution syntax (s/FIND/REPLACE/) similar to the one used by sed, although Perl uses its own regular-expression dialect. Perl is also an excellent replacement for other text-manipulation tools such as awk, cut, and uniq.

    grep

    Grep (http://www.gnu.org/software/grep/) is a Unix utility used to search plain text. Grep also supports regular expressions (regex), which are patterns that describe the text to match and work like more powerful "wildcards".

    Example Commands

    • Case-insensitive Search - grep -i -e "FIND" FILE.txt
    • Count Instances Found - grep -c -e "FIND" FILE.txt
    • Display Line Number with Output - grep -n -e "FIND" FILE.txt
    • Invert Match - grep -v -e "FIND" FILE.txt
    • Search Files in Directory Recursively - grep -r -e "FIND" /DIRECTORY/
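
    The parameters above can be compared side by side on a small temporary file. This is a sketch with made-up sample lines and assumed variable names.

```shell
sample=$(mktemp)
printf '%s\n' 'find me' 'FIND me too' 'nothing here' > "$sample"

matches=$(grep -c -e 'FIND' "$sample")        # case-sensitive count
any_case=$(grep -c -i -e 'FIND' "$sample")    # case-insensitive count
numbered=$(grep -n -e 'FIND' "$sample")       # prefix matches with line numbers
inverted=$(grep -v -i -e 'FIND' "$sample")    # lines that do NOT match

printf '%s\n' "$matches" "$any_case" "$numbered" "$inverted"
rm -f "$sample"
```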

    Grep Variants

    • agrep - Approximate grep is a proprietary utility that supports many search algorithms, especially "fuzzy string searching".
    • egrep - Extended grep supports extended regular expressions and is equivalent to grep -E.
    • fgrep - Fixed grep searches for fixed strings instead of regex (equivalent to grep -F) and uses the Aho–Corasick string matching algorithm.
    • pgrep - Process grep searches process names for a given string and then returns the process ID (PID).
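
    The behavior of egrep and fgrep is also available through the "-E" and "-F" parameters of grep itself. The sketch below, using made-up sample lines and assumed variable names, shows how the same pattern is treated as a regex or as a fixed string.

```shell
lines=$(printf '%s\n' 'a.c' 'abc' 'cat' 'dog')

# Extended regex: "|" means "either pattern".
extended=$(printf '%s\n' "$lines" | grep -E 'cat|dog')

# Basic regex: "." matches any single character, so "abc" matches too.
as_regex=$(printf '%s\n' "$lines" | grep 'a.c')

# Fixed string: "." is a literal dot, so only "a.c" matches.
as_fixed=$(printf '%s\n' "$lines" | grep -F 'a.c')

printf '%s\n' "$extended" "$as_regex" "$as_fixed"
```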

    cut

    The "cut" command (http://linux.die.net/man/1/cut) can extract bytes, characters, and fields from each line of a file. Various parameters specify which part or parts of each line are displayed. The "cut" command writes its results to standard output, thus leaving the original file unchanged.

    • Display First Five Characters of Each Line - cut -c1-5 FILE.txt
    • Display Third Character of Each Line - cut -c3 FILE.txt
    • List User Homes (Alphabetically) - cut -d':' -f1,6 /etc/passwd | sort
    • List Users - cut -d':' -f1 /etc/passwd
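
    As a small illustration of character and field selection, the sketch below applies the commands above to a single made-up record in the /etc/passwd layout. The variable names are assumptions.

```shell
line='alice:x:1000:1000:Alice:/home/alice:/bin/bash'

first_five=$(printf '%s\n' "$line" | cut -c1-5)         # characters 1 to 5
third_char=$(printf '%s\n' "$line" | cut -c3)           # third character only
name_home=$(printf '%s\n' "$line" | cut -d':' -f1,6)    # fields 1 and 6

printf '%s\n' "$first_five" "$third_char" "$name_home"
```

    When several fields are selected, cut joins them with the same delimiter that was given with "-d".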

    sort

    The "sort" command (http://linux.die.net/man/1/sort) is used to sort the lines of a text file. By default, "sort" sorts alphabetically. However, the "-n" parameter can be used to sort numerically. The "sort" command outputs the sorted results to standard output, thus leaving the original file unchanged. The "-t" parameter specifies the field delimiter, such as the "pipe" character (-t'|').

    FUN FACT: sort -u FILE.txt achieves the same results as sort FILE.txt | uniq.

    • Sort by the Third Column - sort -k3,3 FILE.txt (plain -k3 uses everything from the third column to the end of the line as the sort key)
    • Sort by the Third Column (Reversed) - sort -k3,3 -r FILE.txt
    • Sort by the Third Column (Save Results) - sort -k3,3 FILE.txt > NEW_FILE.txt
    • Sort Files by Size - ls -al | sort -r -n -k5
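
    The difference between alphabetical and numeric sorting, and the "sort -u" shortcut, can be seen on a few made-up pipe-delimited records. This is only a sketch, and the variable names are assumptions.

```shell
data=$(printf '%s\n' 'pear|10' 'apple|9' 'pear|10' 'fig|100')

# Both forms sort the lines and drop the repeated "pear|10".
unique_a=$(printf '%s\n' "$data" | sort -u)
unique_b=$(printf '%s\n' "$data" | sort | uniq)

# Second field compared as text: "10" and "100" come before "9".
lexical=$(printf '%s\n' "$data" | sort -t'|' -k2,2)

# Second field compared as numbers: 9, 10, 10, 100.
numeric=$(printf '%s\n' "$data" | sort -t'|' -k2,2 -n)

printf '%s\n' "$unique_a" "$lexical" "$numeric"
```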

    uniq

    The "uniq" command (http://www.computerhope.com/unix/uuniq.htm) removes duplicate lines that are adjacent to each other. This means identical lines must be grouped together, as they are in a sorted file, for "uniq" to find and remove them. For this reason, the "sort" command is typically used with the "uniq" command.

    • Count Occurrences of Each Line - uniq -c FILE.txt
    • Display Unique Lines - uniq -u FILE.txt
    • List Duplicate Lines - uniq -d FILE.txt
    • Remove Duplicate Lines - uniq FILE.txt
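
    The sketch below shows why "uniq" is usually paired with "sort". The word list and variable names are made up for illustration.

```shell
words=$(printf '%s\n' b a b c a b)

# Without sorting, no duplicates are adjacent, so nothing is removed.
unsorted_lines=$(printf '%s\n' "$words" | uniq | wc -l | tr -d ' ')

# Sort first, count each line, then list the most frequent line first.
counts=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn)

only_once=$(printf '%s\n' "$words" | sort | uniq -u)    # lines seen exactly once
repeated=$(printf '%s\n' "$words" | sort | uniq -d)     # lines seen more than once

printf '%s\n' "$unsorted_lines" "$counts" "$only_once" "$repeated"
```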

    replace

    "replace" (http://www.computerhope.com/unix/replace.htm) is a utility (shipped with MariaDB and older MySQL releases) that finds and replaces text. The general syntax is replace FIND REPLACE -- LIST_OF_FILE_PATHS, which modifies the listed files in place. For illustration, to replace "NIX" with "Unix" in a text file, type replace "NIX" "Unix" -- FILE.txt.
