Data Preparation
In this section, we'll look at how to organize your data and what CSV formatting requirements need to be met.
The Data Folder
Your CSV files can live anywhere on your machine. Our data upload utility, Conductor, accepts a file path at upload time, so the portal doesn't require a fixed location. For this workshop, however, we use the data/ directory at the project root as a convenient working area. Place a representative subset of your data here so we have something to configure and test against:
project-root/
└── data/
    ├── datatable1.csv   # → a representative subset of your data
    ├── datatable2.csv   # → optional additional table
    └── ...
CSV Requirements
File Format & Header Rules
Your CSV column headers become field names in PostgreSQL, Elasticsearch, and GraphQL, so they must be valid across all three.
| Rule | Details |
|---|---|
| Format | CSV (comma-separated). Other delimiters are supported via --delimiter, but for simplicity we recommend comma-separated files. |
| Header row | Required as the first line |
| Prohibited characters | : > < . [space] , / \ ? # [ ] { } " * \| + @ & ( ) ! ^ |
| Max length | At most 63 characters per header name (PostgreSQL's limit is 63 bytes). PostgreSQL truncates longer identifiers, raising only a notice rather than an error, which can cause mismatches between your schema and index |
| Reserved words | These are internal field names used by Elasticsearch and GraphQL. Using them will conflict with system internals and cause indexing or query errors: _type _id _source _all _parent _field_names _routing _index _size _timestamp _ttl _meta _doc __typename __schema __type |
| Best practices | Prefer lowercase snake_case (camelCase also works); keep names descriptive but concise, with no special characters or spaces |
Here are some examples to help illustrate:
| Good Headers | Bad Headers | Why |
|---|---|---|
| donor_id | Donor ID! | Spaces and ! are prohibited characters |
| age_at_diagnosis | Age at Diagnosis | Spaces are prohibited; use lowercase |
| primary_site | Primary.Site | . is a prohibited character |
| treatment_response | treatment/response | / is a prohibited character |
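The header rules above can be checked mechanically before upload. The following is a minimal Python sketch, not part of Conductor or the Config Generator: the function names `check_header` and `check_csv_headers` are hypothetical, and the character and reserved-word lists are simply copied from the rules table.

```python
import csv
import io

# Characters listed as prohibited in the rules table.
PROHIBITED = set(':><. ,/\\?#[]{}"*|+@&()!^')

# Reserved Elasticsearch / GraphQL field names from the rules table.
RESERVED = {
    "_type", "_id", "_source", "_all", "_parent", "_field_names",
    "_routing", "_index", "_size", "_timestamp", "_ttl", "_meta",
    "_doc", "__typename", "__schema", "__type",
}

# PostgreSQL's identifier limit is 63 bytes, which equals 63 characters
# for plain ASCII header names.
MAX_BYTES = 63


def check_header(name: str) -> list[str]:
    """Return the problems found in one header name (empty list if none)."""
    problems = []
    if not name:
        problems.append("empty header")
    bad = sorted({c for c in name if c in PROHIBITED or c.isspace()})
    if bad:
        problems.append(f"prohibited characters: {bad}")
    if len(name.encode("utf-8")) > MAX_BYTES:
        problems.append("longer than 63 bytes")
    if name in RESERVED:
        problems.append("reserved word")
    return problems


def check_csv_headers(text: str) -> dict[str, list[str]]:
    """Check the first row of CSV text; map each bad header to its problems."""
    headers = next(csv.reader(io.StringIO(text)), [])
    return {h: p for h in headers if (p := check_header(h))}


if __name__ == "__main__":
    print(check_header("donor_id"))      # []
    print(check_header("Primary.Site"))  # ["prohibited characters: ['.']"]
```

The sketch only reports problems and does not rename anything, so you stay in control of how each header is fixed. It does not enforce the lowercase or snake_case recommendations, which are style guidance rather than hard rules.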
Data Types
The Config Generator will automatically infer field types when generating Elasticsearch mappings:
| CSV Content | Elasticsearch Type | Example |
|---|---|---|
| Text/categorical values | keyword | "Lung", "Female", "Complete Response" |
| Whole numbers | integer | 45, 120, 365 |
| Decimal numbers | float | 3.14, 0.95 |
| Dates (ISO format) | date | 2024-01-15 |
The goal at this stage is to get the structure right. You can review and adjust individual type assignments after the Config Generator produces the mapping, which we'll cover in the next section.
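To anticipate which type a column is likely to receive, you can approximate the table above yourself. This is a rough Python sketch of that kind of inference, not the Config Generator's actual logic: the function name `infer_type` is hypothetical, and the rule that every non-empty value in a column must parse for a type to apply is an assumption.

```python
from datetime import date


def infer_type(values: list[str]) -> str:
    """Guess an Elasticsearch type for one column from its non-empty values."""
    values = [v.strip() for v in values if v.strip()]
    if not values:
        # Assumption: with nothing to infer from, fall back to keyword.
        return "keyword"

    def all_parse(parser) -> bool:
        """True if every value is accepted by the given parser."""
        for v in values:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    # Try the narrowest type first, then widen.
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(date.fromisoformat):
        return "date"
    return "keyword"


if __name__ == "__main__":
    print(infer_type(["45", "120", "365"]))      # integer
    print(infer_type(["3.14", "0.95"]))          # float
    print(infer_type(["2024-01-15"]))            # date
    print(infer_type(["Lung", "Female"]))        # keyword
```

Python's own parsers are more permissive than a mapping generator is likely to be (for example, `float` accepts "nan" and `int` accepts "1_000"), so treat the result as a hint about where a column may need manual review.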
LLMs can be a powerful aid for reviewing, refining and troubleshooting configurations. Just ensure any data shared with an external model complies with your institution's data governance and privacy requirements.
Version Control Best Practices
Keep data files out of version control. This keeps the repository lightweight and prevents accidentally publishing raw or sensitive data. The prelude repository's .gitignore already excludes common data file patterns, but if you are bringing your own data, verify that your files are covered:
# In your .gitignore
*.csv
*.tsv
*.xlsx
data/
What these patterns do
| Pattern | What it ignores |
|---|---|
| *.csv | All CSV files anywhere in the repository |
| *.tsv | All tab-separated files |
| *.xlsx | All Excel files |
| data/ | The entire data/ directory and everything inside it |
Ignore rules apply only to untracked files, so adding these patterns will not affect files that Git is already tracking. If a data file was previously committed, you'll need to untrack it first (this removes it from the index but leaves the file on disk):
git rm --cached path/to/your-file.csv
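To check whether a file is still being tracked (the command exits with an error if it is not):
git ls-files --error-unmatch path/to/your-file.csv
To confirm that an untracked file is matched by an ignore rule, and to see which rule matches:
git check-ignore -v path/to/your-file.csv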
If you're working with data that has any access restrictions, use anonymized or synthetic samples during development and only load real data in controlled environments.
Recommended Data Size
There are no strict size limits beyond Docker and Elasticsearch resource constraints; in fact, we've scaled this resource to hundreds of millions of records. For development and testing, however, a representative sample of approximately 500 records works well. Start small and load larger datasets once your configuration is working.
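If your full dataset is large, one way to produce a sample of roughly 500 records is to draw rows at random rather than take the first 500, which may not be representative. The following Python sketch uses reservoir sampling so the source file is streamed once; the function name `sample_csv` and its parameters are hypothetical and not part of the workshop tooling.

```python
import csv
import random


def sample_csv(src: str, dst: str, n: int = 500, seed: int = 0) -> int:
    """Write the header plus up to n randomly chosen rows of src to dst.

    Uses reservoir sampling, so the source file is read once and never
    loaded fully into memory. Returns the number of data rows written.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # the header row is required
        reservoir = []
        for i, row in enumerate(reader):
            if i < n:
                # Fill the reservoir with the first n rows.
                reservoir.append(row)
            else:
                # Replace an existing entry with probability n / (i + 1).
                j = rng.randint(0, i)
                if j < n:
                    reservoir[j] = row

    with open(dst, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

For example, `sample_csv("full_export.csv", "data/datatable1.csv")` would write the header and 500 sampled rows into the workshop's data/ directory, where `full_export.csv` stands in for wherever your full dataset lives. Sampled rows are not guaranteed to keep their original order, and the sketch assumes a comma-delimited, UTF-8 file with a header row.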
Checkpoint
Before proceeding, confirm:
- A representative subset of your data is in the data/ directory
- Running head -5 data/datatable1.csv shows your headers and data rows
- Headers use snake_case with no spaces or special characters
- You understand that each CSV file becomes one data table in the portal
Next: With data prepared, we'll use the Config Generator to produce the Elasticsearch and Arranger configuration files.