Creating Spacewalk Data Files
There are several ways to create a Spacewalk Binary File (.sw). Choose the method that best matches your source data:
From CSV Data
If your data is in CSV format with columns for genomic positions and 3D coordinates, you can use our interactive notebook:
The notebook guides you through:
- Loading your CSV data
- Formatting genomic positions and spatial coordinates
- Creating a properly structured .sw file
- Validating the output
Example input CSV format:
chromosome,start,end,x,y,z
chr1,1000000,2000000,0.5,1.2,3.4
chr1,2000000,3000000,1.5,2.2,4.4
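As a rough illustration of the loading and formatting steps, the sketch below splits CSV rows in the format shown above into genomic regions and spatial coordinates. It is not the notebook's code; the function name split_csv_rows is hypothetical, and it assumes the six column names from the example header.

```python
import csv
import io


def split_csv_rows(text):
    """Split CSV text into (chromosome, start, end) regions and (x, y, z) points."""
    regions, xyz = [], []
    for row in csv.DictReader(io.StringIO(text)):
        # Genomic positions are integers, spatial coordinates are floats.
        regions.append((row["chromosome"], int(row["start"]), int(row["end"])))
        xyz.append((float(row["x"]), float(row["y"]), float(row["z"])))
    return regions, xyz


SAMPLE = """chromosome,start,end,x,y,z
chr1,1000000,2000000,0.5,1.2,3.4
chr1,2000000,3000000,1.5,2.2,4.4
"""

regions, xyz = split_csv_rows(SAMPLE)
print(regions)
print(xyz)
```

To read from disk instead, pass the contents of the file, for example open('data.csv').read(), where data.csv is a placeholder name.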
From Legacy Format
If you have data in the legacy Spacewalk Text format (.swt), use the swt2sw conversion tool:
- Install the tool:
pip install git+https://github.com/jrobinso/hdf5-indexer.git
pip install git+https://github.com/turner/swt2sw.git
- Convert your file:
- For Ball & Stick data:
swt2sw -f input.swt -n output -single-point
- For Point Cloud data:
swt2sw -f input.swt -n output -multi-point
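When converting many files, it may help to build the command line in a script. The sketch below only assembles the arguments shown above (-f, -n, -single-point, -multi-point); the helper name build_swt2sw_command and the keys of POINT_FLAGS are hypothetical and not part of swt2sw.

```python
import shlex

# Map a data kind to the swt2sw flag used in the commands above.
POINT_FLAGS = {
    "ball-and-stick": "-single-point",
    "point-cloud": "-multi-point",
}


def build_swt2sw_command(input_path, output_name, kind):
    """Return the swt2sw argument list for one input file."""
    return ["swt2sw", "-f", str(input_path), "-n", output_name, POINT_FLAGS[kind]]


cmd = build_swt2sw_command("input.swt", "output", "ball-and-stick")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once swt2sw is installed.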
Using HDF5 Directly
For advanced users who want to create .sw files programmatically:
import h5py
import numpy as np

# Create the file; the context manager closes it on exit
with h5py.File('output.sw', 'w') as f:
    # Add header
    header = f.create_group('header')
    header.attrs['format'] = 'sw'
    header.attrs['genome'] = 'hg38'
    header.attrs['pointtype'] = 'SINGLE_POINT'  # or 'MULTI_POINT'

    # Add genomic positions as (chromosome, start, end) rows.
    # h5py cannot store NumPy Unicode ('U') arrays, so convert to byte strings.
    # Check the File Format Specification for the dtype the reader expects.
    genomic = f.create_group('genomic_position')
    regions = np.array([
        ['chr1', 1000000, 2000000],
        ['chr1', 2000000, 3000000]
    ]).astype('S')
    genomic.create_dataset('regions', data=regions)

    # Add spatial positions, one (x, y, z) row per region
    spatial = f.create_group('spatial_position')
    xyz = np.array([
        [0.5, 1.2, 3.4],
        [1.5, 2.2, 4.4]
    ])
    spatial.create_dataset('t_0', data=xyz)
File Validation
After creating your .sw file:
- Use myHDF5 to inspect the file structure
- Check that all required groups and attributes are present
- Verify genomic positions are properly sorted
- Ensure spatial coordinates match your expectations
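The structural part of this checklist can also be scripted. The sketch below is a minimal check, assuming the required groups are the three created in the HDF5 example (header, genomic_position, spatial_position) and the required attributes are format, genome, and pointtype; the function name find_layout_problems is hypothetical, and the File Format Specification remains the authority on what is required.

```python
REQUIRED_GROUPS = ("header", "genomic_position", "spatial_position")
REQUIRED_ATTRS = ("format", "genome", "pointtype")


def find_layout_problems(root):
    """List missing groups and header attributes.

    root is an open h5py.File, or any mapping whose 'header' entry
    exposes a mapping-like .attrs.
    """
    problems = []
    for name in REQUIRED_GROUPS:
        if name not in root:
            problems.append(f"missing group: {name}")
    if "header" in root:
        attrs = root["header"].attrs
        for key in REQUIRED_ATTRS:
            if key not in attrs:
                problems.append(f"missing header attribute: {key}")
    return problems
```

With h5py installed, opening the file with h5py.File('output.sw', 'r') and passing it as root should return an empty list for a file written by the example above.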
Common Issues
- Missing Header Attributes: Ensure all required attributes (format, genome, pointtype) are set
- Unsorted Regions: Genomic regions must be sorted by start position
- Mismatched Counts: Number of spatial positions must match genomic regions for single-point data
- Invalid Genome ID: Use standard genome identifiers (e.g., hg38, mm10)
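The sorting and count rules can be checked before writing the file. The sketch below works on plain Python tuples; the function name find_region_problems is hypothetical, and checking start order separately for each chromosome is an assumption, since the rule above does not say how files with several chromosomes are ordered.

```python
def find_region_problems(regions, n_points, pointtype="SINGLE_POINT"):
    """Report unsorted regions and mismatched counts.

    regions:  list of (chromosome, start, end) tuples
    n_points: number of spatial positions in one trace
    """
    problems = []
    last_start = {}
    for chrom, start, _end in regions:
        # Assumption: start positions must not decrease within a chromosome.
        if start < last_start.get(chrom, start):
            problems.append(f"{chrom}: start {start} is out of order")
        last_start[chrom] = start
    # The count rule applies to single-point data only.
    if pointtype == "SINGLE_POINT" and n_points != len(regions):
        problems.append(
            f"{n_points} spatial positions for {len(regions)} genomic regions"
        )
    return problems


good = [("chr1", 1000000, 2000000), ("chr1", 2000000, 3000000)]
print(find_region_problems(good, n_points=2))
```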
Need Help?
- Check the File Format Specification
- See Data Structure for detailed format information
- Open an issue if you encounter problems