DC-X batch import – Davidsson

Batch importing content into DC-X usually consists of two steps:

1. Creating DC-XML files containing the metadata used to populate the documents.
2. Batch importing the DC-XML files together with the original files.

1. Creating DC-XML

This step is not mandatory. Image files, for example, often carry a lot of metadata in EXIF, IPTC or XMP inside the file itself. DC-X reads this information on import, so there is no need to create a sidecar file. But if the filename contains publication information or other relevant data, it can be a good idea to build a sidecar file.
For newspaper migrations the publication information is often part of the filename, and for product-related imports the filename often contains the product ID.
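To illustrate how publication data can be read from a filename, here is a minimal shell sketch. It assumes the page-first layout 24.BA111016.pdf (page number, title code, then day, month and two-digit year) that the PHP script further down treats as its default case. The function name parse_page_file is made up for this example, and the year is passed in separately, as the PHP script does.

```shell
# Hypothetical helper: derive date and page from a name like 24.BA111016.pdf.
# Assumed layout: PP.BADDMMYY.pdf; the four-digit year is supplied by the caller.
parse_page_file() {
    year=$1
    name=$2
    page=$(printf '%s' "$name" | cut -c1-2)
    day=$(printf '%s' "$name" | cut -c6-7)
    month=$(printf '%s' "$name" | cut -c8-9)
    # Refuse names that do not yield purely numeric parts
    case "$page$day$month" in
        *[!0-9]*|"") echo "unexpected filename: $name" >&2; return 1 ;;
    esac
    printf '%s-%s-%s page %s\n' "$year" "$month" "$day" "$page"
}

parse_page_file 2016 24.BA111016.pdf   # prints: 2016-10-11 page 24
```

Checking a handful of real filenames this way before writing any XML makes it easier to spot a second naming scheme early.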

The PHP file below is an example of a newspaper migration script which pulls publication information from the filename and creates a DC-XML sidecar file. It scans the folder defined on row 12 under the path defined on row 9. In this case we skip a lot of file types (see row 21), so that we don’t have to clean up every folder and subfolder before running the migration. That is useful, as migrations tend to involve very large numbers of files.

The script digs down a couple of folder levels (see row 26 and row 34). This is obviously something you need to change to match your own folder structure.
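Before running the script it can help to see what such a scan would pick up. A minimal sketch, assuming a hypothetical data root ./data: it lists only PDF files, at most a chosen number of folder levels down, and ignores every other file type. The helper name list_pdfs is an assumption for this example.

```shell
# Hypothetical helper: list PDFs under a root, limited to a folder depth.
# -maxdepth and -iname are supported by GNU and BSD find, but not required by POSIX.
list_pdfs() {
    root=$1
    depth=$2
    find "$root" -maxdepth "$depth" -type f -iname '*.pdf' | sort
}

# Example: count candidates up to three levels below ./data (assumed path)
if [ -d ./data ]; then
    list_pdfs ./data 3 | wc -l
fi
```

Comparing this count with the number of sidecar files created later is a cheap sanity check.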

On row 52 you see the function buildXML, which builds the actual XML files. I recommend limiting the first tests by removing the comment marker (//) on line 28, so that you can test and validate the XML manually. I recommend xmllint --format for this.
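The output of a first test run can also be checked in one go. The sketch below assumes xmllint (part of libxml2) is installed and that the sidecar files end in .pdf.xml, as the script produces them; check_sidecars is a made-up name. It only tests well-formedness, so reading a file with xmllint --format is still worthwhile.

```shell
# Hypothetical helper: report sidecar files in one folder that are not well-formed XML.
# Returns success only if every file passed.
check_sidecars() {
    dir=$1
    bad=0
    for f in "$dir"/*.pdf.xml; do
        [ -e "$f" ] || continue            # no matches in this folder
        if ! xmllint --noout "$f" 2>/dev/null; then
            echo "not well-formed: $f"
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]
}
```

A call such as check_sidecars ./2016/10 (an assumed folder) prints the offending files, if any, and returns a non-zero status when at least one file failed.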

 ".$folder_file.").\n"; exit; }
				if(substr(strtolower($folder_dest_file), -3)!="pdf") {
					echo "\nError: We should really only see PDF files now (".$year. ").\n";
					exit;
					} else {
					buildXML($year, $folder_dest_file, $folder_file, $scanPathDest);
					$count++;
					}
				}
			}
		}
	}
echo "\n";


function buildXML($year, $filename, $foldername, $path) {
	
	$vTitleDate = "";
	$vStartingPageNum = "";
	$cXML = '';
	$cXML .= '';
	$cXML .= "";	
    $cXML .= "";
    $cXML .= "".DC_OWNER."";
    $cXML .= "".DC_SOURCE."";
     
	echo "\n\t\t\t\t'" . $foldername . "' -> " . $filename;	
	echo "\n\t\t\t\tYear: '" . $year . "'.";
	
	
	// Two diff file structures found.
	// This section shows an example of how to handle variations in the filename structure.
	// Often not all data is 100% the same ;)
	// 24.BA111016.pdf
	// BRA21ONE16101121N00.pdf
	
	if(substr(strtolower($filename), 0, 3)=="bra") {
		// Exception handling.
		// Example filename: BRA21ONE16101121N00.pdf
		$month = substr(strtolower($filename), 10, 2);
		$day = substr(strtolower($filename), 12, 2);
		$vStartingPageNum = substr(strtolower($filename), 14, 2);		
		} else {
		// Default handling
		// Example filename: 24.BA111016.pdf
		$month = substr(strtolower($filename), 7, 2);
		$day = substr(strtolower($filename), 5, 2);
		$vStartingPageNum = substr(strtolower($filename), 0, 2);
		}
		
		
	$vDate = $year."-".$month."-".$day;
	$vTitleDate = $day.".".$month.".".$year;

	// Setting document title.
	$vTitle = DC_OWNER. " ".$vTitleDate . " Page: " . $vStartingPageNum;
	echo "\n\t\t\t\tTitle: '" . $vTitle . "'.";

	if(!is_numeric($vStartingPageNum) or !is_numeric($year)) {
		echo "\nError: Page number or year not numeric '" . $vStartingPageNum . "' - '".$year."'.\n";
		exit;
		}
	
	$cXML .= "".$vTitle."";
	$cXML .= "".$vDate."";
	$cXML .= "".$vDate."";

	$cXML .= "";

	$cXML .= "";
	$cXML .= "";
	$cXML .= "".$path."/".$filename."";
	$cXML .= "original";
	$cXML .= "";
	$cXML .= "";

		
	$cXML .= "";
	

	$cXML .= "";
	$cXML .= "".DC_OWNER."";
	$cXML .= "".$vDate."";
	$cXML .= "". (int) $vStartingPageNum."";
	$cXML .= "". (int) $vStartingPageNum."";	
	$cXML .= '';
	$cXML .= "".DC_PUB_TYPE."";
	$cXML .= "";

	
	$cXML .= "";
	// echo "\n".$cXML;
	echo "\n\t\t\t\tFilename: " . $path."/".$filename.".xml";
//	echo "\n";
	file_put_contents($path."/".$filename.".xml", $cXML);
	if(!file_exists($path."/".$filename.".xml")) {
		echo "\nError: File not found after creation: '".$path."/".$filename.".xml'.\n";
		exit;
		}
	echo "\n";
	}
?>

Take the above code and save it in your data folder as create_xml.php. When you’re done editing the file to match your data, it’s time to run it. If you look at the created XML files you’ll notice how the sidecar points at its parent file on line 108 (local_path tag).

php create_xml.php > dc-xml_creation.log

The script creates the XML files at the same level as your original files (in this case PDF files). To prepare for the next step we need a file that lists all the XML files. The following command finds the XML files and writes their paths to a text file.

find . -type f -name "*.xml" -print  > files_to_import.txt
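Since find picks up every .xml file, it is worth confirming that each entry in the list has its original next to it. A minimal sketch, assuming the sidecar naming used by the script above (original filename plus .xml); missing_originals is a hypothetical helper and has to be run from the folder in which the list was created, because the paths are relative.

```shell
# Hypothetical helper: print every listed sidecar whose original file is missing.
missing_originals() {
    list=$1
    while IFS= read -r xml; do
        original=${xml%.xml}               # 24.BA111016.pdf.xml -> 24.BA111016.pdf
        [ -f "$original" ] || echo "$xml"
    done < "$list"
}
```

Running missing_originals files_to_import.txt should print nothing. Any line it does print is either a stray XML file or a sidecar whose original has gone missing.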

2. Import files

Time for the batch import. If you don’t have any sidecar files this is pretty straightforward.

Copy the code below and save the file (filename: import.sh) onto your DC-X system in the folder where you have your data.

#!/bin/bash
WORKINGDIR=`pwd`
COUNTER=0
APPNAME="customer"

while IFS= read -r XMLFILE
do

    let COUNTER++

    echo `date` "[$COUNTER] working with $XMLFILE: "

    fileToImport=`basename "$XMLFILE"`

	echo "Working on: $fileToImport"
	echo "Path: $XMLFILE"
    # Change into the directory containing the XML file, so that referenced files can be found

	DIRNAME=`dirname "${XMLFILE}"`
	cd "$DIRNAME"
	  
    # Import XML into DC-X
    DOCID=`php /opt/dcx/bin/dcx_import.php --app $APPNAME --index-low-priority "$fileToImport"`    

    rc=$?
	if [[ $rc != 0 ]] ; then
      echo "Error code: " $rc

	fi

	echo `date` "[$COUNTER] DOC-ID: $DOCID "

    # Asynchronously render previews for images, PDFs etc.
    echo -n `date` "[$COUNTER] Creating previews for $DOCID: "
    php /opt/dcx/bin/dcx_export.php --app $APPNAME -t document $DOCID -r '$job = new DCX_Job($obj->app); $job->setWorkflow("recreate_previews"); $job->setPriority(2); $job->setStatus(DCX_Job::STATUS_TODO); $job->addDocument($obj->getId(), "input"); $job->save(); echo $job->getId() . "\n";'
	
	# Grab content of PDF file
	php /opt/dcx/bin/dcx_add_fulltext.php --app $APPNAME $DOCID -f
	
    cd "$WORKINGDIR"
done

In the above file we also do some post-processing. The main action is the actual import on line 23. You can take this line and run it manually in a shell to import a single file. The post-processing happens on lines 35 and 38: line 35 queues a job that renders a preview of the PDF file, and line 38 asks DC-X to extract any text contained in the PDF file. This post-processing is not needed for images.

If you don’t have sidecar files you still need a text file that lists the files you want to import. You can simply add a few files to a text file manually: the path and filename of each file to import, one file per line. Then start the import with the list file:

bash import.sh < files_to_import.txt &

The import.sh script prints the doc-id of each created document, so it’s easy to find the newly created documents in DC-X and make sure everything looks OK.
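If the script’s output is captured in a file, the doc-ids can be pulled out afterwards. A minimal sketch that relies only on the DOC-ID: and Error code: lines import.sh prints; the log filename import.log used in the usage note and both helper names are assumptions.

```shell
# Print one doc-id per line from import.sh output read on stdin.
# Lines with an empty doc-id (failed imports) are skipped.
list_doc_ids() {
    sed -n 's/.*DOC-ID: *\([^ ][^ ]*\).*/\1/p'
}

# Count the lines on which import.sh reported a non-zero exit code.
count_import_errors() {
    grep -c 'Error code:'
}
```

For example, list_doc_ids < import.log gives a list to spot-check in DC-X, and count_import_errors < import.log shows at a glance whether any import failed.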

Conclusion

The PHP script above is pretty specific and only works for a couple of scenarios. But its structure gives you an idea of how to build DC-XML sidecar files.
I really encourage you to approach this with a trial-and-error mindset. Go through each step, test, make a change and try again. On the last step, import maybe 10-20 files and let someone else look at the imported files in DC-X. If it’s a migration, let someone who knows the old system and its data have a look at the “new” data in DC-X.
All this is to make sure you don’t import a million documents and then find out you missed something important, which would force you to post-process on a production system or reimport the entire dataset.
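One way to follow this advice is to split a small batch off the list before the full run. A minimal sketch; the output name test_batch.txt and the helper name are assumptions for illustration.

```shell
# Hypothetical helper: copy the first N entries of a list into a separate batch file
# and print how many entries the batch contains.
make_test_batch() {
    list=$1
    size=$2
    out=$3
    head -n "$size" "$list" > "$out"
    wc -l < "$out" | tr -d ' '
}
```

After make_test_batch files_to_import.txt 20 test_batch.txt, the smaller file can be fed to import.sh on standard input in place of the full list.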