Monday, February 2, 2009

Extract and merge pages from PDF document

I had taken great care in typesetting my thesis: informative running headers (the current chapter or section name at the top of the page), correct margins for one-sided and two-sided printing, and other fine points.

The LaTeX class file that I wrote for my thesis is based on the book class and typesets page numbers at the top outer side. In keeping with typographical conventions, and also because it looks nicer, the first page of each major sectioning unit (chapter or part) is handled differently: its page number goes at the bottom center.

Now grad studies insists that the page numbers be typeset uniformly throughout the thesis: either all at the top or all at the bottom. It was not that hard to change the LaTeX class to put all the page numbers at the top outer side. But I had already printed my thesis on bond paper, and there was no point in incurring the expenditure again. So all I needed was to extract the chapter and part pages from the "corrected" PDF file, print them, and replace the faulty ones in the printed copy. Hence the need to extract and merge pages from a PDF file.

Well, ghostscript (gs) is our friend and will do the job nicely. In a nutshell, the following command extracts pages <first_page> through <last_page> of <input_file> into <output_file>:

 gs -dBATCH -dNOPAUSE -dFirstPage=<first_page> -dLastPage=<last_page> -sDEVICE=pdfwrite -sOutputFile=<output_file> <input_file>
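As a concrete illustration, the extraction command can be wrapped in a small shell function. This is only a sketch: the name extract_pages and the file names in the usage note are made up for the example, and gs is assumed to be on the PATH.

```shell
#!/bin/bash
# extract_pages is a hypothetical helper, not part of ghostscript.
# usage: extract_pages <first_page> <last_page> <input_file> <output_file>
extract_pages() {
    local first=$1 last=$2 input=$3 output=$4
    # Page numbers are positions in the PDF file, counted from 1,
    # not the numbers printed on the pages.
    gs -dBATCH -dNOPAUSE \
       -dFirstPage="$first" -dLastPage="$last" \
       -sDEVICE=pdfwrite -sOutputFile="$output" \
       "$input"
}
```

For example, calling extract_pages 18 18 thesis.pdf pp-6.pdf would write the 18th page of thesis.pdf to pp-6.pdf.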

while the following merges the PDF files in the space-delimited list <input_files> into a single file:

 gs  -dBATCH -dNOPAUSE  -sDEVICE=pdfwrite -sOutputFile=<output_file> <input_files>
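The merge can be wrapped the same way. Again a sketch under assumptions: merge_pdfs is a made-up name, and passing the output file first is a choice made here so that any number of input files can follow it.

```shell
#!/bin/bash
# merge_pdfs is a hypothetical helper, not part of ghostscript.
# usage: merge_pdfs <output_file> <input_file>...
merge_pdfs() {
    local output=$1
    shift                      # the remaining arguments are the input files
    if [ "$#" -eq 0 ]; then
        echo "merge_pdfs: no input files given" >&2
        return 1
    fi
    # gs concatenates the inputs in the order they are listed
    gs -dBATCH -dNOPAUSE \
       -sDEVICE=pdfwrite -sOutputFile="$output" \
       "$@"
}
```

For example, merge_pdfs chapter_pages.pdf pp-1.pdf pp-6.pdf would produce a two-page chapter_pages.pdf.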

Here's the shell script that I used to extract the chapter pages and merge them into one PDF file:

 #!/bin/bash

 declare -rx SCRIPT=${0##*/}
 declare -i extract_flag=0
 declare -i merge_flag=0

 # printed page numbers of the chapter and part opening pages
 declare -a p_num_all=( \
 1 6 7 11 21 33 42 44   \
 54 55 61 73 87 95 109 111 \
 118 119 125 138 147 163 165  \
 170 172    \
 )

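 declare -i num_pages=${#p_num_all[*]}
 # shift from printed page numbers to page positions in the PDF
 # (12 PDF pages come before printed page 1)
 declare -i offset=12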
 declare i
 declare f_name f_name_all
 declare out_file=chapter_pages.pdf

 # mode: 0 = extract only, 1 = merge only, 2 = extract and merge
 case $1 in
 0) extract_flag=1
    ;;
 1) merge_flag=1
    ;;
 2) extract_flag=1
    merge_flag=1
    ;;
 *) printf "usage: %s <0|1|2> <input_file>\n" "$SCRIPT" >&2
    exit 1
    ;;
 esac


 for (( i=1; i<=num_pages; i++ )) ; do

     p_num=${p_num_all[i-1]}
     p_num_off=$(( p_num + offset ))   # page position inside the PDF

     f_name="pp-$p_num.pdf"
     f_name_all="$f_name_all $f_name"

     if [ $extract_flag -eq 1 ] ; then
         printf "extracting page %d to file %s\n" "$p_num" "$f_name"
         gs  -dBATCH -dNOPAUSE       \
             -dFirstPage=$p_num_off -dLastPage=$p_num_off   \
             -sDEVICE=pdfwrite -sOutputFile="$f_name"  \
             "$2"
     fi
 done
    

 if [ $merge_flag -eq 1 ] ; then
     # $f_name_all is left unquoted on purpose: it must split into one argument per file
     gs  -dBATCH -dNOPAUSE      \
         -sDEVICE=pdfwrite -sOutputFile="$out_file" \
         $f_name_all
 fi
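
The only arithmetic in the script is the shift from printed page numbers to page positions in the PDF. Here is a minimal sketch of that mapping, using the same offset of 12 and the same pp-<page>.pdf naming as the script. The function names pdf_index and piece_name are made up for illustration.

```shell
#!/bin/bash
offset=12   # number of PDF pages before printed page 1, as in the script

# pdf_index (hypothetical helper): printed page number -> page position in the PDF
pdf_index() {
    echo $(( $1 + offset ))
}

# piece_name (hypothetical helper): printed page number -> per-page file name
piece_name() {
    echo "pp-$1.pdf"
}

# Show the mapping for the first few chapter pages from the list
for p in 1 6 7; do
    printf "printed page %d is PDF page %d, saved as %s\n" \
        "$p" "$(pdf_index "$p")" "$(piece_name "$p")"
done
```

So printed page 6 is the 18th page of the PDF, which is the number that has to be passed to -dFirstPage and -dLastPage.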
