File Staging Environment

Introduction

The enov8 Test Data Management (TDM) Tool supports masking, profiling, and validation of structured file formats directly on disk. This document covers the storage configuration, filesystem permissions, and troubleshooting steps required for the TDM tool to read, mask, and write files in staging environments.

Supported Formats

The TDM tool supports the following file formats:

| Category | Formats |
|---|---|
| Text / Flat files | Delimited (CSV, TSV, pipe-delimited, custom delimiter), Fixed-width (mainframe extracts) |
| Structured data | XML, JSON |
| Columnar binary | Apache Avro, Apache Parquet, Apache ORC |

1. Standard File Server Configuration

1.1 Hardware Requirements

Hardware requirements are driven by the size of the files being processed and the number of concurrent masking jobs.

CPU

  • Cores: Start with 4 cores, which covers up to 10 GB of data masked concurrently. Add 2 cores for each additional 10 GB.
  • Example:
    • Masking 10 GB of files at a time: 4 cores.
    • Masking 50 GB of files at a time: 4 + (4 × 2) = 12 cores.
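
The core-count rule can be written as a small calculation. This is a minimal sketch that assumes the two examples above define the rule: the base 4 cores cover the first 10 GB, and each further 10 GB adds 2 cores. Rounding partial 10 GB blocks up is an assumption, and the function name `recommended_cores` is illustrative only.

```python
import math


def recommended_cores(concurrent_gb: float) -> int:
    """Estimate CPU cores for a volume of concurrently masked data.

    Assumption: 4 base cores cover the first 10 GB; every further
    10 GB block (rounded up) adds 2 cores.
    """
    extra_blocks = max(0, math.ceil(concurrent_gb / 10) - 1)
    return 4 + 2 * extra_blocks


print(recommended_cores(10))  # 4
print(recommended_cores(50))  # 12
```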

Memory (RAM)

  • Base Requirement: 8 GB minimum.
  • Text and flat files (delimited, fixed-width, XML, JSON) are streamed or loaded line-by-line where possible. Memory usage scales with the largest single line or document.
  • Binary formats (Avro, Parquet, ORC) are fully loaded into memory during format conversion. Budget at least 3× the size of the largest single file being masked simultaneously.
  • Example:
    • Masking a 2 GB Parquet file: at least 6 GB of available RAM.

Disk

  • Type: SSDs are recommended for working directories.
  • Capacity: For binary formats, the working directory must have at least 3× the size of the largest file available as free space to accommodate temporary conversion files and masked output.

Network

  • Bandwidth: At least 1 Gbps NIC, particularly when files are stored on mounted remote filesystems.
  • Latency: Low-latency connection to the file storage. High latency on remote mounts significantly impacts throughput for large files and binary format conversion.

1.2 Supported File Storage Types

The TDM tool connects to file storage natively using the Server Type configured in the connection. No filesystem mounts are required for cloud storage — credentials are provided directly in the TDM connection configuration.


Local

Files are accessed directly via a filesystem path on the TDM server. Any storage mounted to the TDM server (local disk, SAN, NFS, CIFS) is accessible as a local path.

  • Required field: File Path
  • The TDM service user must have the appropriate read/write permissions on the path. See Section 3 for details.

FTP

Files are accessed from a remote FTP server.

  • Required fields: File Path, Server, Port (default: 21)
  • Optional fields: Username, Password

Amazon S3 Standard

Direct connection to an AWS S3 bucket using static IAM credentials.

  • Required fields: File Path, Access Key, Secret Key
  • Optional fields: Region Name

Amazon S3 Compatible

Connection to any S3-compatible object store (e.g. MinIO, Ceph, Wasabi) using a custom endpoint.

  • Required fields: File Path, Access Key, Secret Key, Endpoint URL

Amazon S3 EC2 IAM

Connection to S3 using the EC2 instance's attached IAM role. No static credentials are required — the TDM server must be running on an EC2 instance with an appropriate IAM role attached.

  • Required fields: File Path

Amazon S3 AssumeRole

Connection to S3 by assuming a cross-account or delegated IAM role.

  • Required fields: File Path, Access Key, Secret Key, Account ID, Role Name
  • Optional fields: Region Name
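
AWS identifies the role to assume by its ARN, which follows the standard `arn:aws:iam::<account-id>:role/<role-name>` pattern. The sketch below shows how the Account ID and Role Name fields would combine into that ARN. Whether the TDM tool builds the ARN exactly this way is an assumption, and the values shown are placeholders.

```python
def role_arn(account_id: str, role_name: str) -> str:
    """Build a standard IAM role ARN from an account ID and a role name."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


# Placeholder values, not real identifiers.
print(role_arn("123456789012", "tdm-staging-role"))
# arn:aws:iam::123456789012:role/tdm-staging-role
```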

Azure File Storage

Connects to an Azure File Share using one of three authentication methods:

| Server Type | Authentication | Required Fields |
|---|---|---|
| Azure File - Shared Access | Account name + access key | Account Name, Azure Access Key |
| Azure File - SAS | Account name + SAS token via access key | Account Name, Azure Access Key |
| Azure File - Connection String | Full connection string built from key | Account Name, Azure Access Key |
  • Optional fields (all Azure File types): Default Endpoint Protocol (http/https, default https), Endpoint Suffix (default core.windows.net)

Azure Blob Storage

Connects to an Azure Blob Storage container using one of three authentication methods:

| Server Type | Authentication | Required Fields |
|---|---|---|
| Azure Blob - Shared Access | Account name + access key | Account Name, Azure Access Key |
| Azure Blob - SAS | Account name + SAS token via access key | Account Name, Azure Access Key |
| Azure Blob - Connection String | Full connection string built from key | Account Name, Azure Access Key |
  • Optional fields (all Azure Blob types): Default Endpoint Protocol (http/https, default https), Endpoint Suffix (default core.windows.net)
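
For the Connection String server types, the fields above map onto the standard Azure Storage connection string format. The following is a minimal sketch of that mapping; the function name is hypothetical and the way the TDM tool assembles the string internally is an assumption.

```python
def azure_connection_string(
    account_name: str,
    access_key: str,
    protocol: str = "https",
    endpoint_suffix: str = "core.windows.net",
) -> str:
    """Assemble an Azure Storage connection string from its parts."""
    if protocol not in ("http", "https"):
        raise ValueError("protocol must be 'http' or 'https'")
    parts = [
        f"DefaultEndpointsProtocol={protocol}",
        f"AccountName={account_name}",
        f"AccountKey={access_key}",
        f"EndpointSuffix={endpoint_suffix}",
    ]
    return ";".join(parts)


# Placeholder credentials for illustration only.
print(azure_connection_string("examplestorage", "<access-key>"))
```

The defaults mirror the optional fields listed above (https and core.windows.net).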

1.3 File Format Behaviour and Working Directory

Each file format has different temporary file requirements. Understanding these is important for configuring permissions correctly.

Delimited Files (CSV, TSV, pipe-delimited, custom)

  • Reads from the File Path configured in the connection.
  • Writes the masked output to the Masked File Directory (with an optional Masked File Suffix appended to the filename).
  • No temporary files are created in the source directory.
  • Supports configurable File Delimiter, File Quote, and File Encoding.
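
To illustrate how the delimiter and quote settings affect parsing, here is a minimal sketch using Python's standard `csv` module. It is not the TDM tool's implementation: the function `mask_column` and the asterisk masking rule are hypothetical stand-ins, and file encoding would be applied when opening the real file.

```python
import csv
import io


def mask_column(text, column, delimiter=",", quotechar='"'):
    """Replace every value in one column with asterisks of equal length."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames,
        delimiter=delimiter,
        quotechar=quotechar,
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        row[column] = "*" * len(row[column])  # hypothetical masking rule
        writer.writerow(row)
    return out.getvalue()


sample = "id|name\n1|Alice\n2|Bob\n"
print(mask_column(sample, "name", delimiter="|"))
```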

Fixed-Width Files (Mainframe Extracts)

  • Reads from the File Path configured in the connection, using the Width Definition to determine column boundaries.
  • Writes the masked output to the Masked File Directory (with an optional Masked File Suffix appended to the filename).
  • No temporary files are created in the source directory.
  • Supports configurable File Encoding with backslashreplace error handling for non-UTF-8 mainframe encodings.
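
The sketch below shows what column slicing with `backslashreplace` decoding looks like in Python. It assumes the Width Definition can be represented as a list of column widths, which the source does not specify, and `split_fixed_width` is an illustrative name.

```python
def split_fixed_width(raw: bytes, widths, encoding="utf-8"):
    """Decode one record and slice it into columns by width.

    Undecodable bytes are kept as escape sequences (e.g. \\xe9)
    instead of raising an error.
    """
    line = raw.decode(encoding, errors="backslashreplace")
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].rstrip())
        start += width
    return fields


record = b"00042JOHN SMITH NSW"
print(split_fixed_width(record, [5, 11, 3]))
# ['00042', 'JOHN SMITH', 'NSW']
```

Note that in this sketch an escaped byte occupies four characters after decoding, so any columns to its right shift. Configuring the correct encoding avoids the problem.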

XML

  • Reads the file at the File Path configured in the connection and parses it into an element tree in memory.
  • Writes the masked output to the Masked File Directory (with an optional Masked File Suffix appended to the filename).
  • No temporary files are created in the source directory.
  • Encoding is determined by the File Encoding setting (XML Declaration, utf-8, or utf-16).

JSON

  • Reads the file at the File Path configured in the connection.
  • Optionally validates against a schema if a Schema Path is provided.
  • Writes the masked output to the Masked File Directory (with an optional Masked File Suffix appended to the filename).
  • No temporary files are created in the source directory.

Apache Avro

  • Reads the file at the File Path configured in the connection.
  • Creates the following temporary files in the same directory as the source file:
    • {filename}.json — records converted to JSON
    • {filename}_schema.json — extracted Avro schema
    • {filename}.avro.masked — intermediate masked JSON
  • Writes final masked output to {filename}_masked.avro in the same directory, or to the Masked File Directory if configured.
  • All temporary files are deleted after masking completes.

Apache Parquet

  • Reads the file at the File Path configured in the connection.
  • Creates the following temporary files in the same directory as the source file:
    • {filename}.json — records converted to JSON
    • {filename}.json.masked — intermediate masked JSON
  • Writes final masked output to {filename}_masked.parquet in the same directory, or to the Masked File Directory if configured.
  • The tool checks that the source directory is writable before proceeding. If it is not, the job fails immediately.
  • All temporary files are deleted after masking completes.

Apache ORC

  • Reads the file at the File Path configured in the connection.
  • Creates the following temporary files in the same directory as the source file:
    • {filename}.json — records converted to JSON
    • {filename}.orc.masked — intermediate masked JSON
    • schemas/{filename}.schema — ORC schema extracted into a schemas/ subdirectory
  • Writes final masked output to {filename}_masked.orc in the same directory, or to the Masked File Directory if configured.
  • The schemas/ subdirectory is created automatically. The service user must have permission to create subdirectories in the source file's directory.
  • All temporary files and the schemas/ subdirectory are deleted after masking completes.
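
The naming rules for the three binary formats can be summarised in code, which is handy when checking a directory after a failed run. This sketch assumes `{filename}` means the source file name without its extension (suggested by names such as `{filename}_masked.avro`); the helper names are hypothetical.

```python
from pathlib import Path

# Temporary file patterns per format, as listed above.
TEMP_PATTERNS = {
    ".avro": ["{stem}.json", "{stem}_schema.json", "{stem}.avro.masked"],
    ".parquet": ["{stem}.json", "{stem}.json.masked"],
    ".orc": ["{stem}.json", "{stem}.orc.masked", "schemas/{stem}.schema"],
}


def expected_temp_files(source):
    """Return the temporary paths a binary-format job would create."""
    path = Path(source)
    patterns = TEMP_PATTERNS.get(path.suffix.lower(), [])
    return [path.parent / p.format(stem=path.stem) for p in patterns]


def default_output(source):
    """Return the output path used when no Masked File Directory is set."""
    path = Path(source)
    return path.with_name(f"{path.stem}_masked{path.suffix}")


for p in expected_temp_files("/mnt/staging/sales.orc"):
    print(p)
print(default_output("/mnt/staging/sales.orc"))
```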

1.4 Network Configuration

The TDM server must be able to reach the configured file storage. Ensure the following ports are open from the TDM server outbound:

| Storage Type | Port | Protocol |
|---|---|---|
| FTP | 21 (default, configurable) | TCP |
| Amazon S3 (all variants) | 443 | HTTPS |
| Amazon S3 Compatible | Custom (configured via Endpoint URL) | HTTPS or HTTP |
| Azure File / Blob (all variants) | 443 | HTTPS |
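
Outbound reachability can be checked from the TDM server with a plain TCP connection attempt. The following is a minimal sketch using Python's standard `socket` module. It only confirms that the port accepts connections, not that credentials or TLS are valid, and `can_connect` is an illustrative name.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Call it with the storage endpoint's host name and the port from the table above, for example port 21 for an FTP server or 443 for S3 and Azure endpoints.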

1.5 Security

  • S3 and Azure credentials are stored within the TDM application's encrypted connection configuration. Do not embed access keys or connection strings in scripts or configuration files outside of the TDM platform.
  • For Amazon S3 EC2 IAM, use a least-privilege IAM role policy that grants only s3:GetObject, s3:PutObject, and s3:ListBucket on the specific bucket(s) used for staging.
  • For Amazon S3 AssumeRole, ensure the assumed role's trust policy restricts which accounts or principals can assume it.
  • For local storage, apply group-based permissions so only the TDM service user and authorised operators can read and write the staging directory. Ensure staging files are not world-readable.
  • Regularly audit staged file directories to ensure masked output files from previous runs are not left accessible indefinitely.
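
As an illustration of the least-privilege guidance, the sketch below generates an IAM policy document limited to the three actions named above. In IAM, `s3:ListBucket` applies to the bucket ARN, while `s3:GetObject` and `s3:PutObject` apply to object ARNs. The bucket name is a placeholder and the function name is hypothetical.

```python
import json


def staging_bucket_policy(bucket: str) -> dict:
    """Build a policy granting list, read, and write on one bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Placeholder bucket name for illustration.
print(json.dumps(staging_bucket_policy("example-staging-bucket"), indent=2))
```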

2. Troubleshooting Steps

2.1 Verify the Service User Can Access the Files

  • Purpose: Confirm the TDM service user has the necessary permissions before running a masking job.
  • Commands:
    # Test read access to source file
    sudo -u <service_user> cat /mnt/staging/sample.csv > /dev/null && echo "READ OK"

    # Test write access to the output directory
    sudo -u <service_user> touch /mnt/staging/out/test_write && echo "WRITE OK" && rm /mnt/staging/out/test_write

    # For binary formats — test write + subdirectory creation in the source directory
    sudo -u <service_user> touch /mnt/staging/test_write && echo "WRITE OK" && rm /mnt/staging/test_write
    sudo -u <service_user> mkdir /mnt/staging/schemas_test && echo "MKDIR OK" && rmdir /mnt/staging/schemas_test
  • Verification: All tests relevant to the file format being used must pass.
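
The same checks can be scripted. This is a minimal sketch using `os.access`, which evaluates permissions for the user running the script, so it would need to be run as the service user (for example via `sudo -u`). The function name and result keys are illustrative, and on some network filesystems `os.access` can disagree with what an actual read or write would do.

```python
import os


def check_access(source_file, output_dir, binary_format=False):
    """Report whether the current user has the access a masking job needs."""
    results = {
        "read_source": os.access(source_file, os.R_OK),
        # Creating files in a directory needs write and execute permission.
        "write_output": os.access(output_dir, os.W_OK | os.X_OK),
    }
    if binary_format:
        # Avro, Parquet, and ORC also write temp files next to the source.
        source_dir = os.path.dirname(os.path.abspath(source_file))
        results["write_source_dir"] = os.access(source_dir, os.W_OK | os.X_OK)
    return results
```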

2.2 Check Disk Space in the Working Directory

  • Purpose: For binary format masking, if the source directory runs out of space during temp file creation, the job fails mid-run and leaves orphaned temporary files.
  • Commands:
    df -h /mnt/staging/
  • Verification: Free space must be at least 3× the size of the largest binary file to be masked. For text formats, headroom equal to the source file size is sufficient for the output file.
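
The 3× rule can be checked programmatically before a job starts. The sketch below uses `shutil.disk_usage` from the standard library; `has_headroom` is a hypothetical helper, not part of the TDM tool.

```python
import os
import shutil


def has_headroom(directory, largest_file_bytes, factor=3):
    """Return True if free space is at least factor x the largest file size."""
    free = shutil.disk_usage(directory).free
    return free >= factor * largest_file_bytes


def largest_file_size(paths):
    """Return the size in bytes of the largest file in paths (0 if empty)."""
    return max((os.path.getsize(p) for p in paths), default=0)
```

For text formats, passing `factor=1` matches the guidance that headroom equal to the source file size is sufficient.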

2.3 Identify Orphaned Temporary Files

  • Purpose: If a binary format masking job fails mid-execution, temporary files may be left behind in the source directory.
  • Commands:
    find /mnt/staging/ \( -name "*.masked" -o -name "*_schema.json" -o -name "*_masked.avro" \
    -o -name "*_masked.parquet" -o -name "*_masked.orc" -o -name "*.json" \) 2>/dev/null
  • Cleanup: Review and remove orphaned files from previous failed runs before re-running a masking job. For ORC, also check for orphaned schemas/ subdirectories.
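
The `find` patterns above are deliberately broad: `*.json` also matches legitimate JSON source files, and the `*_masked.*` patterns match completed output. As a narrower alternative, this sketch reports only names that are unambiguous as temporary artifacts (`*.masked`, `*_schema.json`, and `schemas/` directories). It is an illustration under that assumption, and it lists candidates without deleting anything.

```python
from pathlib import Path

# Suffixes used only by intermediate files in section 1.3.
TEMP_SUFFIXES = (".masked", "_schema.json")


def find_orphan_candidates(root):
    """List likely leftover temp files and schemas/ directories under root."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name.endswith(TEMP_SUFFIXES):
            hits.append(path)
        elif path.is_dir() and path.name == "schemas":
            hits.append(path)
    return sorted(hits)
```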

2.4 Verify File Encoding

  • Purpose: Delimited, fixed-width, XML, and JSON files use a configurable encoding (default UTF-8). Files with incompatible encodings fail to parse.
  • Commands:
    file -i /mnt/staging/sample.csv
    file -i /mnt/staging/extract.txt
  • Expected Output:
    /mnt/staging/sample.csv: text/plain; charset=utf-8
    /mnt/staging/extract.txt: text/plain; charset=iso-8859-1
  • Verification: Ensure the encoding configured in the TDM masking job matches the actual file encoding. For mainframe fixed-width extracts with EBCDIC encoding, convert to a supported encoding before processing:
    iconv -f EBCDIC-US -t UTF-8 extract.dat > extract_utf8.dat
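
A strict decode is a quick way to confirm that a file's bytes are valid in the configured encoding before a job runs. This is a minimal sketch; the function name is illustrative.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data is valid in the given encoding (strict decoding)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = "café".encode("iso-8859-1")
print(decodes_cleanly(latin1_bytes, "utf-8"))       # False
print(decodes_cleanly(latin1_bytes, "iso-8859-1"))  # True
```

A passing result is weak evidence for single-byte encodings such as ISO-8859-1, which accept any byte sequence, so a failure is more informative than a success.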

2.5 Verify Binary File Integrity

  • Purpose: Binary format files (Avro, Parquet, ORC) that are corrupt or truncated fail during format conversion.

  • Commands:

    # Parquet
    python3 -c "
    import pyarrow.parquet as pq
    t = pq.read_table('/mnt/staging/sample.parquet')
    print(t.schema)
    print(f'Rows: {t.num_rows}')
    "

    # Avro
    python3 -c "
    import fastavro
    with open('/mnt/staging/sample.avro', 'rb') as f:
        reader = fastavro.reader(f)
        print(reader.schema)
        print(f'Records: {sum(1 for _ in reader)}')
    "

    # ORC
    python3 -c "
    from pyorc import Reader
    with open('/mnt/staging/sample.orc', 'rb') as f:
        r = Reader(f)
        print(r.schema)
    "
  • Verification: If any of these fail with a parse error, the source file must be regenerated before masking.


2.6 Check Local Mount Health

  • Purpose: For the local server type, if the underlying storage is a network-backed mount (e.g. NFS), a stale or disconnected mount causes silent file access failures.
  • Commands:
    # Verify mount is present
    mount | grep /mnt/staging

    # Test responsiveness
    timeout 5 ls /mnt/staging/ && echo "MOUNT HEALTHY" || echo "MOUNT STALE"
  • Verification: If ls hangs for more than a few seconds, the mount is stale. Remount:
    umount -l /mnt/staging
    mount /mnt/staging

2.7 System Resource Monitoring

Monitor Disk I/O During Masking

iostat -x 1 5
  • Verification: %util should remain below 80% on the working directory's device. Sustained high utilization during binary format masking indicates a storage bottleneck.

Monitor Memory During Masking

watch -n 2 free -h
  • Verification: Available RAM should remain above 20% of total RAM during binary file masking. If it drops below this, reduce concurrent masking jobs or increase RAM.
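
The 20% threshold can also be checked from a script. This sketch parses the `MemTotal` and `MemAvailable` fields of `/proc/meminfo`, so it applies to Linux only; the function names are illustrative.

```python
def available_fraction(meminfo_text: str) -> float:
    """Return MemAvailable / MemTotal from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # first token is the number
    return values["MemAvailable"] / values["MemTotal"]


def current_available_fraction() -> float:
    """Read the live value on a Linux host."""
    with open("/proc/meminfo") as f:
        return available_fraction(f.read())
```

A result below 0.20 corresponds to the condition described above.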

3. File Permissions

The TDM service user requires the following permissions depending on the file format being processed.

Permission Requirements by Format

| Format | Read: Source File | Write: Output Directory | Write: Source Directory | Create Subdirectory in Source Dir |
|---|---|---|---|---|
| Delimited | Yes | Yes | No | No |
| Fixed-width | Yes | Yes | No | No |
| XML | Yes | Yes | No | No |
| JSON | Yes | Yes | No | No |
| Avro | Yes | Yes (masked output) | Yes (temp files) | No |
| Parquet | Yes | Yes (masked output) | Yes (temp files) | No |
| ORC | Yes | Yes (masked output) | Yes (temp files) | Yes (schemas/ subdir) |
Note: For Avro, Parquet, and ORC, the output file is written to the Masked File Directory if one is set, and otherwise to the same directory as the source file. In both cases the source directory must be writable for temporary files (and, for ORC, for subdirectory creation). Read permission on the source file is always required.

Local Storage — Setting Permissions

For the local server type, the TDM service user must have the correct OS-level permissions on the staging directory.

# Set ownership of the staging directory to the TDM service user
chown -R <service_user>:<service_group> /mnt/staging/

# Set permissions: owner and group get read+write+execute, no world access.
# The leading 2 sets the setgid bit, so new files inherit the directory's group.
chmod -R 2770 /mnt/staging/
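
To confirm the resulting mode, the permission bits can be inspected with Python's standard `stat` module. This is a minimal sketch; `audit_mode` and `audit_path` are hypothetical helper names.

```python
import os
import stat


def audit_mode(mode: int) -> dict:
    """Summarise the permission bits relevant to a staging directory."""
    return {
        "world_access": bool(mode & stat.S_IRWXO),
        "group_rwx": (mode & stat.S_IRWXG) == stat.S_IRWXG,
        "setgid": bool(mode & stat.S_ISGID),
    }


def audit_path(path: str) -> dict:
    """Audit the mode of an existing file or directory."""
    return audit_mode(stat.S_IMODE(os.stat(path).st_mode))


print(audit_mode(0o2770))
# {'world_access': False, 'group_rwx': True, 'setgid': True}
```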

Cloud Storage — Access Control

For non-local server types, access is governed by the credentials configured in the TDM connection, not by OS-level permissions on the TDM server.

| Server Type | Access Control |
|---|---|
| FTP | Username and password configured in the TDM connection |
| Amazon S3 Standard | IAM user with s3:GetObject, s3:PutObject, s3:ListBucket on the target bucket |
| Amazon S3 Compatible | Access key and secret key for the compatible store |
| Amazon S3 EC2 IAM | EC2 instance IAM role; attach a least-privilege policy to the instance profile |
| Amazon S3 AssumeRole | The base IAM user must have sts:AssumeRole permission; the assumed role must have S3 access |
| Azure File / Blob (all variants) | Azure storage account access key or SAS token with read + write permissions on the target container or file share |

Verifying Local Access

# Find the TDM service user identity
id <service_user>

# Confirm the service user can read and write the staging directory
sudo -u <service_user> ls /mnt/staging/
sudo -u <service_user> touch /mnt/staging/test && rm /mnt/staging/test