File Staging Environment
Introduction
The enov8 Test Data Management (TDM) Tool supports masking, profiling, and validation of structured file formats directly on disk. This document covers the storage configuration, filesystem permissions, and troubleshooting steps required for the TDM tool to read, mask, and write files in staging environments.
The TDM tool supports the following file formats:
| Category | Formats |
|---|---|
| Text / Flat files | Delimited (CSV, TSV, pipe-delimited, custom delimiter), Fixed-width (mainframe extracts) |
| Structured data | XML, JSON |
| Columnar binary | Apache Avro, Apache Parquet, Apache ORC |
1. Standard File Server Configuration
1.1 Hardware Requirements
Hardware requirements are driven by the size of the files being processed and the number of concurrent masking jobs.
CPU
- Cores: Start with 4 cores, which covers up to 10 GB of data masked concurrently. Add 2 cores for each additional 10 GB.
- Example:
  - Masking 10 GB of files at a time: 4 cores.
  - Masking 50 GB of files at a time: 12 cores (4 base + 2 × 4 additional 10 GB blocks).
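The core-count guideline can be expressed as a small helper. This is a minimal sketch, not part of the TDM tool: it assumes the base 4 cores cover the first 10 GB (the reading that reproduces both examples above) and that a partial 10 GB block is rounded up. The function name `recommended_cores` is hypothetical.

```python
import math

BASE_CORES = 4       # starting allocation from the guideline
CORES_PER_BLOCK = 2  # extra cores per additional 10 GB block
BLOCK_GB = 10

def recommended_cores(concurrent_gb: float) -> int:
    """Estimate CPU cores for a volume of concurrently masked data, in GB."""
    extra_gb = max(0.0, concurrent_gb - BLOCK_GB)
    # Assumption: a partial block counts as a full block.
    extra_blocks = math.ceil(extra_gb / BLOCK_GB)
    return BASE_CORES + CORES_PER_BLOCK * extra_blocks

print(recommended_cores(10))  # 4
print(recommended_cores(50))  # 12
```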
Memory (RAM)
- Base Requirement: 8 GB minimum.
- Text and flat files (delimited, fixed-width, XML, JSON) are streamed or loaded line-by-line where possible. Memory usage scales with the largest single line or document.
- Binary formats (Avro, Parquet, ORC) are fully loaded into memory during format conversion. Budget available RAM of at least 3× the size of the largest file being masked.
- Example:
  - Masking a 2 GB Parquet file: at least 6 GB of available RAM.
Disk
- Type: SSDs are recommended for working directories.
- Capacity: For binary formats, the working directory must have at least 3× the size of the largest file available as free space to accommodate temporary conversion files and masked output.
Network
- Bandwidth: At least 1 Gbps NIC, particularly when files are stored on mounted remote filesystems.
- Latency: Low-latency connection to the file storage. High latency on remote mounts significantly impacts throughput for large files and binary format conversion.
1.2 Supported File Storage Types
The TDM tool connects to file storage natively using the Server Type configured in the connection. No filesystem mounts are required for cloud storage — credentials are provided directly in the TDM connection configuration.
Local
Files are accessed directly via a filesystem path on the TDM server. Any storage mounted to the TDM server (local disk, SAN, NFS, CIFS) is accessible as a local path.
- Required field: File Path
- The TDM service user must have the appropriate read/write permissions on the path. See Section 3 for details.
FTP
Files are accessed from a remote FTP server.
- Required fields: File Path, Server, Port (default: 21)
- Optional fields: Username, Password
Amazon S3 Standard
Direct connection to an AWS S3 bucket using static IAM credentials.
- Required fields: File Path, Access Key, Secret Key
- Optional fields: Region Name
Amazon S3 Compatible
Connection to any S3-compatible object store (e.g. MinIO, Ceph, Wasabi) using a custom endpoint.
- Required fields: File Path, Access Key, Secret Key, Endpoint URL
Amazon S3 EC2 IAM
Connection to S3 using the EC2 instance's attached IAM role. No static credentials are required — the TDM server must be running on an EC2 instance with an appropriate IAM role attached.
- Required fields: File Path
Amazon S3 AssumeRole
Connection to S3 by assuming a cross-account or delegated IAM role.
- Required fields: File Path, Access Key, Secret Key, Account ID, Role Name
- Optional fields: Region Name
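AWS identifies an IAM role by its ARN, which is composed from the account ID and the role name. How the TDM tool combines the Account ID and Role Name fields internally is not described here; the sketch below only shows the standard ARN layout, which can help when checking the values entered in the connection or when writing the role's trust policy. The function name `role_arn` is hypothetical.

```python
def role_arn(account_id: str, role_name: str, partition: str = "aws") -> str:
    """Build the ARN of an IAM role from its account ID and role name."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    if not role_name:
        raise ValueError("role_name must not be empty")
    return f"arn:{partition}:iam::{account_id}:role/{role_name}"

# Example values are placeholders, not real identifiers.
print(role_arn("123456789012", "tdm-staging-access"))
```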
Azure File Storage
Connects to an Azure File Share using one of three authentication methods:
| Server Type | Authentication | Required Fields |
|---|---|---|
| Azure File - Shared Access | Account name + access key | Account Name, Azure Access Key |
| Azure File - SAS | Account name + SAS token via access key | Account Name, Azure Access Key |
| Azure File - Connection String | Full connection string built from key | Account Name, Azure Access Key |
- Optional fields (all Azure File types): Default Endpoint Protocol (http/https, default https), Endpoint Suffix (default core.windows.net)
Azure Blob Storage
Connects to an Azure Blob Storage container using one of three authentication methods:
| Server Type | Authentication | Required Fields |
|---|---|---|
| Azure Blob - Shared Access | Account name + access key | Account Name, Azure Access Key |
| Azure Blob - SAS | Account name + SAS token via access key | Account Name, Azure Access Key |
| Azure Blob - Connection String | Full connection string built from key | Account Name, Azure Access Key |
- Optional fields (all Azure Blob types): Default Endpoint Protocol (http/https, default https), Endpoint Suffix (default core.windows.net)
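For the Connection String server types, the fields above map onto the standard Azure Storage connection string format. The following is a sketch of that mapping under the assumption that the tool assembles the string in the usual `Key=Value;` layout; the function name is hypothetical and the account values are placeholders.

```python
def azure_connection_string(account_name: str, access_key: str,
                            protocol: str = "https",
                            endpoint_suffix: str = "core.windows.net") -> str:
    """Assemble an Azure Storage connection string from its parts."""
    if protocol not in ("http", "https"):
        raise ValueError("protocol must be 'http' or 'https'")
    parts = {
        "DefaultEndpointsProtocol": protocol,
        "AccountName": account_name,
        "AccountKey": access_key,
        "EndpointSuffix": endpoint_suffix,
    }
    # Dicts preserve insertion order, so the keys appear in the usual sequence.
    return ";".join(f"{key}={value}" for key, value in parts.items())

print(azure_connection_string("examplestorage", "<access-key>"))
```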
1.3 File Format Behaviour and Working Directory
Each file format has different temporary file requirements. Understanding these is important for configuring permissions correctly.
Delimited Files (CSV, TSV, pipe-delimited, custom)
- Reads from the File Path configured in the connection.
- Writes the masked output to the Masked File Directory (with an optional Masked File Suffix appended to the filename).
- No temporary files are created in the source directory.
- Supports configurable File Delimiter, File Quote, and File Encoding.
Fixed-Width Files (Mainframe Extracts)
- Reads from the File Path configured in the connection, using the Width Definition to determine column boundaries.
- Writes the masked output to the Masked File Directory (with an optional Masked File Suffix appended to the filename).
- No temporary files are created in the source directory.
- Supports configurable File Encoding with backslashreplace error handling for non-UTF-8 mainframe encodings.
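To illustrate how column widths and backslashreplace decoding interact, here is a minimal sketch. It is not the tool's implementation: it assumes the Width Definition can be reduced to a list of byte widths (its real format is not shown in this document) and slices the raw bytes before decoding, so that an escaped byte does not shift later columns.

```python
def split_fixed_width(record: bytes, widths, encoding: str = "utf-8"):
    """Split one fixed-width record into decoded fields.

    widths is a list of column widths in bytes (an assumed representation).
    """
    fields = []
    start = 0
    for width in widths:
        chunk = record[start:start + width]
        # Undecodable bytes become \xNN escapes instead of raising an error.
        fields.append(chunk.decode(encoding, errors="backslashreplace"))
        start += width
    return fields

# 0xFF is not valid UTF-8, so it is kept as a visible escape sequence.
print(split_fixed_width(b"AB\xff12", [3, 2]))
```

Slicing bytes first only makes sense for single-byte encodings or byte-oriented layouts; for multi-byte encodings a column boundary could fall inside a character.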
XML
- Reads the file at the File Path configured in the connection and parses it into an element tree in memory.
- Writes the masked output to the Masked File Directory (with an optional Masked File Suffix appended to the filename).
- No temporary files are created in the source directory.
- Encoding is determined by the File Encoding setting (XML Declaration, utf-8, or utf-16).
JSON
- Reads the file at the File Path configured in the connection.
- Optionally validates against a schema if a Schema Path is provided.
- Writes the masked output to the Masked File Directory (with an optional Masked File Suffix appended to the filename).
- No temporary files are created in the source directory.
Apache Avro
- Reads the file at the File Path configured in the connection.
- Creates the following temporary files in the same directory as the source file:
  - {filename}.json: records converted to JSON
  - {filename}_schema.json: extracted Avro schema
  - {filename}.avro.masked: intermediate masked JSON
- Writes final masked output to {filename}_masked.avro in the same directory, or to the Masked File Directory if configured.
- All temporary files are deleted after masking completes.
Apache Parquet
- Reads the file at the File Path configured in the connection.
- Creates the following temporary files in the same directory as the source file:
  - {filename}.json: records converted to JSON
  - {filename}.json.masked: intermediate masked JSON
- Writes final masked output to {filename}_masked.parquet in the same directory, or to the Masked File Directory if configured.
- The tool checks that the source directory is writable before proceeding. If it is not, the job fails immediately.
- All temporary files are deleted after masking completes.
Apache ORC
- Reads the file at the File Path configured in the connection.
- Creates the following temporary files in the same directory as the source file:
  - {filename}.json: records converted to JSON
  - {filename}.orc.masked: intermediate masked JSON
  - schemas/{filename}.schema: ORC schema extracted into a schemas/ subdirectory
- Writes final masked output to {filename}_masked.orc in the same directory, or to the Masked File Directory if configured.
- The schemas/ subdirectory is created automatically. The service user must have permission to create subdirectories in the source file's directory.
- All temporary files and the schemas/ subdirectory are deleted after masking completes.
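The temporary file names listed for the binary formats can be derived from the source path, which is useful when planning permissions or checking what a job will create. The sketch below assumes that {filename} means the source file name without its extension; this document does not state that explicitly, so verify it against a real run. The helper name `expected_temp_paths` is hypothetical.

```python
from pathlib import Path

# Temporary file patterns per format, taken from the lists in this section.
TEMP_PATTERNS = {
    ".avro": ["{stem}.json", "{stem}_schema.json", "{stem}.avro.masked"],
    ".parquet": ["{stem}.json", "{stem}.json.masked"],
    ".orc": ["{stem}.json", "{stem}.orc.masked", "schemas/{stem}.schema"],
}

def expected_temp_paths(source: str) -> list:
    """Return the temporary paths a masking run is expected to create."""
    src = Path(source)
    patterns = TEMP_PATTERNS.get(src.suffix.lower(), [])  # text formats: none
    return [src.parent / pattern.format(stem=src.stem) for pattern in patterns]

for path in expected_temp_paths("/mnt/staging/sales.orc"):
    print(path)
```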
1.4 Network Configuration
The TDM server must be able to reach the configured file storage. Ensure the following ports are open from the TDM server outbound:
| Storage Type | Port | Protocol |
|---|---|---|
| FTP | 21 (default, configurable) | TCP |
| Amazon S3 (all variants) | 443 | HTTPS |
| Amazon S3 Compatible | Custom (configured via Endpoint URL) | HTTPS or HTTP |
| Azure File / Blob (all variants) | 443 | HTTPS |
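A quick way to confirm that a port from the table is reachable from the TDM server is a plain TCP connection attempt. This is a minimal sketch using only the standard library; it checks that the port accepts connections, not that credentials or TLS are valid. The function name `can_connect` is hypothetical.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `can_connect("<ftp_server>", 21)` tests the default FTP port, with the host name replaced by the Server value from the TDM connection.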
1.5 Security
- S3 and Azure credentials are stored within the TDM application's encrypted connection configuration. Do not embed access keys or connection strings in scripts or configuration files outside of the TDM platform.
- For Amazon S3 EC2 IAM, use a least-privilege IAM role policy that grants only s3:GetObject, s3:PutObject, and s3:ListBucket on the specific bucket(s) used for staging.
- For Amazon S3 AssumeRole, ensure the assumed role's trust policy restricts which accounts or principals can assume it.
- For local storage, apply group-based permissions so only the TDM service user and authorised operators can read and write the staging directory. Ensure staging files are not world-readable.
- Regularly audit staged file directories to ensure masked output files from previous runs are not left accessible indefinitely.
2. Troubleshooting Steps
2.1 Verify the Service User Can Access the Files
- Purpose: Confirm the TDM service user has the necessary permissions before running a masking job.
- Commands:
# Test read access to source file
sudo -u <service_user> cat /mnt/staging/sample.csv > /dev/null && echo "READ OK"
# Test write access to the output directory
sudo -u <service_user> touch /mnt/staging/out/test_write && echo "WRITE OK" && rm /mnt/staging/out/test_write
# For binary formats — test write + subdirectory creation in the source directory
sudo -u <service_user> touch /mnt/staging/test_write && echo "WRITE OK" && rm /mnt/staging/test_write
sudo -u <service_user> mkdir /mnt/staging/schemas_test && echo "MKDIR OK" && rmdir /mnt/staging/schemas_test
- Verification: All tests relevant to the file format being used must pass.
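The same checks can be scripted. The sketch below uses `os.access`, which reports permissions for the user running the script, so it must be run as the TDM service user (for example through `sudo -u`). The function name and the result keys are hypothetical.

```python
import os

def check_access(source_file: str, output_dir: str,
                 needs_source_dir_write: bool = False) -> dict:
    """Report whether the current user has the access a masking job needs."""
    results = {
        "read_source": os.access(source_file, os.R_OK),
        # Creating files in a directory needs both write and execute on it.
        "write_output": os.access(output_dir, os.W_OK | os.X_OK),
    }
    if needs_source_dir_write:  # binary formats write temp files next to the source
        source_dir = os.path.dirname(os.path.abspath(source_file))
        results["write_source_dir"] = os.access(source_dir, os.W_OK | os.X_OK)
    return results
```

Set `needs_source_dir_write=True` for Avro, Parquet, and ORC. Note that `os.access` is a point-in-time check; on some network filesystems the actual write can still fail, so the touch and mkdir tests above remain the more direct test.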
2.2 Check Disk Space in the Working Directory
- Purpose: For binary format masking, if the source directory runs out of space during temp file creation, the job fails mid-run and leaves orphaned temporary files.
- Commands:
df -h /mnt/staging/
- Verification: Free space must be at least 3× the size of the largest binary file to be masked. For text formats, headroom equal to the source file size is sufficient for the output file.
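The 3× rule can be checked automatically before a job starts. This is a minimal sketch assuming binary files are identified by the extensions .avro, .parquet, and .orc and that only the top level of the directory is scanned; both function names are hypothetical.

```python
import os
import shutil

BINARY_EXTENSIONS = (".avro", ".parquet", ".orc")

def largest_binary_file(directory: str) -> int:
    """Return the size in bytes of the largest binary-format file (0 if none)."""
    sizes = [
        entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.lower().endswith(BINARY_EXTENSIONS)
    ]
    return max(sizes, default=0)

def has_headroom(directory: str, multiplier: int = 3) -> bool:
    """Check free space against multiplier × the largest binary file."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= multiplier * largest_binary_file(directory)
```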
2.3 Identify Orphaned Temporary Files
- Purpose: If a binary format masking job fails mid-execution, temporary files may be left behind in the source directory.
- Commands:
find /mnt/staging/ \( -name "*.masked" -o -name "*_schema.json" -o -name "*_masked.avro" \
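-o -name "*_masked.parquet" -o -name "*_masked.orc" -o -name "*.json" \) 2>/dev/null
- Cleanup: Review and remove orphaned files from previous failed runs before re-running a masking job. The *.json pattern also matches legitimate JSON source files, and the *_masked.* patterns match completed output files, so inspect each result before deleting it. For ORC, also check for orphaned schemas/ subdirectories.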
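Because a plain `*.json` match also returns legitimate JSON files, a narrower scan can help with triage. The sketch below flags a .json file only when a binary file with the same base name sits beside it. It assumes {filename} in the temporary file names is the source name without its extension, and the function name is hypothetical; treat the result as candidates for review, not as a delete list.

```python
from pathlib import Path

BINARY_SUFFIXES = (".avro", ".parquet", ".orc")

def find_orphan_candidates(directory: str) -> list:
    """List files that look like leftovers from a failed binary masking job."""
    candidates = []
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        name = path.name
        if name.endswith(".masked") or name.endswith("_schema.json"):
            candidates.append(path)
        elif path.suffix == ".schema" and path.parent.name == "schemas":
            candidates.append(path)
        elif path.suffix == ".json" and any(
            path.with_suffix(suffix).exists() for suffix in BINARY_SUFFIXES
        ):
            # A .json next to a same-named binary file is likely a conversion temp file.
            candidates.append(path)
    return sorted(candidates)
```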
2.4 Verify File Encoding
- Purpose: Delimited, fixed-width, XML, and JSON files use a configurable encoding (default UTF-8). Files with incompatible encodings fail to parse.
- Commands:
file -i /mnt/staging/sample.csv
file -i /mnt/staging/extract.txt
- Expected Output:
/mnt/staging/sample.csv: text/plain; charset=utf-8
/mnt/staging/extract.txt: text/plain; charset=iso-8859-1
- Verification: Ensure the encoding configured in the TDM masking job matches the actual file encoding. For mainframe fixed-width extracts with EBCDIC encoding, convert to a supported encoding before processing:
iconv -f EBCDIC-US -t UTF-8 extract.dat > extract_utf8.dat
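Where `iconv` is not available, the same conversion can be done with Python's built-in codecs. This sketch assumes the extract uses code page cp037 (EBCDIC US/Canada); mainframe extracts may use a different page such as cp500 or cp1140, so confirm the code page with the team that produced the file. The function name is hypothetical.

```python
def ebcdic_to_utf8(src_path: str, dst_path: str,
                   codec: str = "cp037", chunk_size: int = 1 << 20) -> None:
    """Re-encode an EBCDIC file as UTF-8, reading it in chunks."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # EBCDIC code pages are single-byte, so chunk boundaries are safe.
            dst.write(chunk.decode(codec).encode("utf-8"))
```

A byte-for-byte re-encoding like this is only appropriate for character data; fields holding packed decimal or binary values would be corrupted by any character set conversion.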
2.5 Verify Binary File Integrity
- Purpose: Binary format files (Avro, Parquet, ORC) that are corrupt or truncated fail during format conversion.
- Commands:
# Parquet
python3 -c "
import pyarrow.parquet as pq
t = pq.read_table('/mnt/staging/sample.parquet')
print(t.schema)
print(f'Rows: {t.num_rows}')
"
# Avro
python3 -c "
import fastavro
with open('/mnt/staging/sample.avro', 'rb') as f:
    reader = fastavro.reader(f)
    print(reader.writer_schema)
    print(f'Records: {sum(1 for _ in reader)}')
"
# ORC
python3 -c "
from pyorc import Reader
with open('/mnt/staging/sample.orc', 'rb') as f:
    r = Reader(f)
    print(r.schema)
"
- Verification: If any of these fail with a parse error, the source file must be regenerated before masking.
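When the parsing libraries are not installed on the server, a check of each format's magic bytes catches the most common truncation cases. This sketch relies on the published file signatures (Parquet files begin and end with PAR1, Avro container files begin with Obj followed by byte 0x01, ORC files begin with ORC) and picks the format from the file extension, which is an assumption. Passing this check does not prove the file is valid; it is only a fast first filter. The function name is hypothetical.

```python
import os

def quick_magic_check(path: str) -> bool:
    """Return True if the file carries the expected magic bytes for its format."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(4)
        tail = b""
        if size >= 4:
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    extension = os.path.splitext(path)[1].lower()
    if extension == ".parquet":
        # Header magic + footer length field + footer magic need at least 12 bytes.
        return size >= 12 and head == b"PAR1" and tail == b"PAR1"
    if extension == ".avro":
        return head == b"Obj\x01"
    if extension == ".orc":
        return head[:3] == b"ORC"
    raise ValueError(f"unsupported extension: {extension}")
```

A truncated Parquet file typically fails this check because the trailing PAR1 marker is missing.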
2.6 Check Local Mount Health
- Purpose: For the local server type, if the underlying storage is a network-backed mount (e.g. NFS), a stale or disconnected mount causes silent file access failures.
- Commands:
# Verify mount is present
mount | grep /mnt/staging
# Test responsiveness
timeout 5 ls /mnt/staging/ && echo "MOUNT HEALTHY" || echo "MOUNT STALE"
- Verification: If ls hangs for more than a few seconds, the mount is stale. Remount:
umount -l /mnt/staging
mount /mnt/staging
2.7 System Resource Monitoring
Monitor Disk I/O During Masking
iostat -x 1 5
- Verification: %util should remain below 80% on the working directory's device. Sustained high utilization during binary format masking indicates a storage bottleneck.
Monitor Memory During Masking
watch -n 2 free -h
- Verification: Available RAM should remain above 20% of total RAM during binary file masking. If it drops below this, reduce concurrent masking jobs or increase RAM.
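The 20% threshold can be evaluated programmatically on Linux by reading /proc/meminfo. This is a minimal sketch, assuming the kernel exposes the MemAvailable field (present on current Linux kernels); the function names are hypothetical.

```python
def available_fraction(meminfo_text: str) -> float:
    """Return MemAvailable / MemTotal from the contents of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # values are reported in kB
    return values["MemAvailable"] / values["MemTotal"]

def read_available_fraction(path: str = "/proc/meminfo") -> float:
    """Read the live value from the running system (Linux only)."""
    with open(path) as f:
        return available_fraction(f.read())
```

A monitoring loop could call `read_available_fraction()` every few seconds during a masking run and raise an alert when the result falls below 0.20.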
3. File Permissions
The TDM service user requires the following permissions depending on the file format being processed.
Permission Requirements by Format
| Format | Read: Source File | Write: Output Directory | Write: Source Directory | Create Subdirectory in Source Dir |
|---|---|---|---|---|
| Delimited | Yes | Yes | No | No |
| Fixed-width | Yes | Yes | No | No |
| XML | Yes | Yes | No | No |
| JSON | Yes | Yes | No | No |
| Avro | Yes | Yes (masked output) | Yes (temp files) | No |
| Parquet | Yes | Yes (masked output) | Yes (temp files) | No |
| ORC | Yes | Yes (masked output) | Yes (temp files) | Yes (schemas/ subdir) |
For Avro, Parquet, and ORC, if Masked File Directory is not set, the output file is written to the same directory as the source file; if set, output is written there. In both cases the source directory must be writable for temporary files (and for ORC, subdirectory creation). The source file read permission is always required.
Local Storage — Setting Permissions
For the local server type, the TDM service user must have the correct OS-level permissions on the staging directory.
# Set ownership of the staging directory to the TDM service user
chown -R <service_user>:<service_group> /mnt/staging/
# Set permissions: owner and group read+write+execute, no world access;
# the leading 2 sets the setgid bit so new files inherit the directory's group
chmod -R 2770 /mnt/staging/
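After applying the permissions, the resulting mode can be audited from a script. This sketch inspects the mode bits with the standard library and reports the three properties the `chmod` above is meant to establish; the function name and result keys are hypothetical.

```python
import os
import stat

def audit_mode(path: str) -> dict:
    """Summarise the permission bits relevant to a staging directory."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return {
        "mode": oct(mode),
        "world_accessible": bool(mode & stat.S_IRWXO),          # any 'other' bit set
        "group_rwx": (mode & stat.S_IRWXG) == stat.S_IRWXG,      # group has rwx
        "setgid": bool(mode & stat.S_ISGID),                     # new files inherit group
    }
```

For a directory set to 2770, the expected result is `world_accessible` false with `group_rwx` and `setgid` both true.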
Cloud Storage — Access Control
For non-local server types, access is governed by the credentials configured in the TDM connection, not by OS-level permissions on the TDM server.
| Server Type | Access Control |
|---|---|
| FTP | Username and password configured in the TDM connection |
| Amazon S3 Standard | IAM user with s3:GetObject, s3:PutObject, s3:ListBucket on the target bucket |
| Amazon S3 Compatible | Access key and secret key for the compatible store |
| Amazon S3 EC2 IAM | EC2 instance IAM role — attach a least-privilege policy to the instance profile |
| Amazon S3 AssumeRole | The base IAM user must have sts:AssumeRole permission; the assumed role must have S3 access |
| Azure File / Blob (all variants) | Azure storage account access key or SAS token with read + write permissions on the target container or file share |
Verifying Local Access
# Show the TDM service user's UID and group memberships
id <service_user>
# Confirm the service user can read and write the staging directory
sudo -u <service_user> ls /mnt/staging/
sudo -u <service_user> touch /mnt/staging/test && rm /mnt/staging/test