Data Preparation

In this section, we'll look at how to organize your data and what CSV formatting requirements need to be met.

The Data Folder

Your CSV files can live anywhere on your machine: our data upload utility, Conductor, accepts a file path at upload time, so the portal doesn't require a fixed location. However, for this workshop we use the data/ directory at the project root as a convenient working area. Place a representative subset of your data there so we have something to configure and test against:

project-root/
└── data/
    ├── datatable1.csv   # → a representative subset of your data
    ├── datatable2.csv   # → optional additional table
    └── ...

CSV Requirements

File Format & Header Rules

Your CSV column headers become field names in PostgreSQL, Elasticsearch, and GraphQL, so they must be valid across all three.

| Rule | Details |
|------|---------|
| Format | CSV (comma-separated); other delimiters are supported via `--delimiter`, but for simplicity we recommend comma-separated files. |
| Header row | Required as the first line. |
| Prohibited characters | `>` `<` `.` space `,` `/` `\` `?` `#` `[` `]` `{` `}` `"` `*` `\|` `+` `@` `&` `(` `)` `!` `^` |
| Max length | 63 characters per header name. PostgreSQL silently truncates longer identifiers, which can cause mismatches between your schema and index. |
| Reserved words | Internal field names used by Elasticsearch and GraphQL. Using them will conflict with system internals and cause indexing or query errors: `_type` `_id` `_source` `_all` `_parent` `_field_names` `_routing` `_index` `_size` `_timestamp` `_ttl` `_meta` `_doc` `__typename` `__schema` `__type` |
| Best practices | Prefer lowercase snake_case (camelCase also works); keep names descriptive but concise, with no special characters or spaces. |

Here are some examples to help illustrate:

| Good Header | Bad Header | Why |
|-------------|------------|-----|
| donor_id | Donor ID! | Spaces and `!` are prohibited characters |
| age_at_diagnosis | Age at Diagnosis | Spaces are prohibited; use lowercase |
| primary_site | Primary.Site | `.` is a prohibited character |
| treatment_response | treatment/response | `/` is a prohibited character |
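To sanity-check your own headers against these rules before uploading, a minimal sketch is shown below. The function name `validate_header` is ours for illustration (it is not part of the Prelude tooling); the rule set is transcribed from the table above:

```python
# Header rules transcribed from the table above.
PROHIBITED = set('><. ,/\\?#[]{}"*|+@&()!^')  # note: includes the space character
RESERVED = {
    "_type", "_id", "_source", "_all", "_parent", "_field_names",
    "_routing", "_index", "_size", "_timestamp", "_ttl", "_meta",
    "_doc", "__typename", "__schema", "__type",
}
MAX_LENGTH = 63  # PostgreSQL identifier limit


def validate_header(name: str) -> list[str]:
    """Return a list of rule violations for a single CSV header (empty = valid)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if name in RESERVED:
        problems.append("reserved word")
    bad = sorted(set(name) & PROHIBITED)
    if bad:
        problems.append(f"prohibited characters: {bad}")
    return problems


print(validate_header("donor_id"))   # → []
print(validate_header("Donor ID!"))  # space and "!" flagged
print(validate_header("_id"))        # reserved word flagged
```

Running this over every header in a file (e.g. the first row from `csv.reader`) gives a quick pre-upload check.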

Data Types

The Config Generator will automatically infer field types when generating Elasticsearch mappings:

| CSV Content | Elasticsearch Type | Example |
|-------------|--------------------|---------|
| Text/categorical values | keyword | "Lung", "Female", "Complete Response" |
| Whole numbers | integer | 45, 120, 365 |
| Decimal numbers | float | 3.14, 0.95 |
| Dates (ISO format) | date | 2024-01-15 |

The goal at this stage is to get the structure right; you can review and adjust individual type assignments after the Config Generator produces the mapping, which we'll cover in the next section.
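The kind of inference described above can be approximated with a small sketch. The Config Generator's actual logic may differ, and `infer_es_type` is a name we've made up for illustration:

```python
from datetime import date


def _is_int(v: str) -> bool:
    try:
        int(v)
        return True
    except ValueError:
        return False


def _is_float(v: str) -> bool:
    try:
        float(v)
        return True
    except ValueError:
        return False


def _is_iso_date(v: str) -> bool:
    try:
        date.fromisoformat(v)
        return True
    except ValueError:
        return False


def infer_es_type(values: list[str]) -> str:
    """Map one column's string values onto an Elasticsearch field type."""
    non_empty = [v for v in values if v != ""]
    # Check the narrowest type first: integer before float before date.
    if non_empty and all(_is_int(v) for v in non_empty):
        return "integer"
    if non_empty and all(_is_float(v) for v in non_empty):
        return "float"
    if non_empty and all(_is_iso_date(v) for v in non_empty):
        return "date"
    return "keyword"  # fall back to text/categorical


print(infer_es_type(["45", "120", "365"]))  # → integer
print(infer_es_type(["3.14", "0.95"]))      # → float
print(infer_es_type(["2024-01-15"]))        # → date
print(infer_es_type(["Lung", "Female"]))    # → keyword
```

Note the ordering: every whole number also parses as a float, so integer is tested first.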

tip

LLMs can be a powerful aid for reviewing, refining and troubleshooting configurations. Just ensure any data shared with an external model complies with your institution's data governance and privacy requirements.

Version Control Best Practices

Keep data files out of version control. This keeps the repository lightweight and prevents accidentally publishing raw or sensitive data. The prelude repository's .gitignore already excludes common data file patterns, but if you are bringing your own data, verify that your files are covered:

# In your .gitignore
*.csv
*.tsv
*.xlsx
data/
What these patterns do

| Pattern | What it ignores |
|---------|-----------------|
| *.csv | All CSV files anywhere in the repository |
| *.tsv | All tab-separated files |
| *.xlsx | All Excel files |
| data/ | The entire data/ directory and everything inside it |

Git ignores are additive: adding these patterns will not affect files that are already being tracked. If a data file was previously committed, you'll need to untrack it first:

git rm --cached path/to/your-file.csv

To check whether a file is covered by an ignore pattern (and see which pattern matches it):

git check-ignore -v path/to/your-file.csv

If you're working with data that has any access restrictions, use anonymized or synthetic samples during development and only load real data in controlled environments.

There are no strict size limits beyond Docker and Elasticsearch resource constraints; in fact, we've scaled this stack to hundreds of millions of records. For development and testing, however, a representative sample of approximately 500 records works well. You can start small and load larger datasets once your configuration is working.
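If your source file is large, you can cut a ~500-row working sample with only the standard library. This is a quick sketch; the file names are placeholders, and for skewed data a random sample (e.g. via `random.sample`) may be more representative than the first rows:

```python
import csv
import itertools


def sample_csv(src: str, dst: str, n: int = 500) -> None:
    """Copy the header plus the first n data rows of src into dst."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))             # keep the header row
        writer.writerows(itertools.islice(reader, n))


# Example (placeholder paths):
# sample_csv("full-export.csv", "data/datatable1.csv", n=500)
```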

Checkpoint

Before proceeding, confirm:

  1. A representative subset of your data is in the data/ directory
  2. Running head -5 data/datatable1.csv shows your headers and data rows
  3. Headers use snake_case with no spaces or special characters
  4. You understand that each CSV file becomes one data table in the portal

Next: With data prepared, we'll use the Config Generator to produce the Elasticsearch and Arranger configuration files.