Dataset creation made simple: geo2ml

Janne Mäyrä

Finnish Environment Institute (Syke)

What is it?

Sampling features from a raster using point geometries or polygons
Tiling larger rasters and shapefiles into smaller patches
Rasterizing polygon geometries for semantic segmentation tasks
Converting vector data to COCO and YOLO formats and creating required dataset files
Visualization of generated datasets

The example starts with getting the data

Image data

The image data are from NLS Finland, converted into GeoTIFF

Annotation data

This example uses https://github.com/microsoft/GlobalMLBuildingFootprints as the source for reference polygons
The code for downloading the data is adapted from their examples
First, get the CRS and bounds of our image

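import os
from pathlib import Path
import rasterio as rio

# image_path (the GeoTIFF described above) is defined elsewhere
building_path = Path('../data/building_footprints/')
os.makedirs(building_path, exist_ok=True)
fn = 'L4131H.geojson'

with rio.open(image_path) as src:
    in_crs = src.crs
    bounds = src.bounds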

The easiest way to access the data is by Mercator tile ID (quadkey), so check which zoom level 9 tiles overlap the image and how many there are

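import mercantile
import rasterio.warp as riowarp
import shapely.geometry
from rasterio.crs import CRS
from shapely.geometry import shape

# Reproject the image bounds to WGS84 and collect the overlapping zoom 9 tiles
feature_proj = shape(riowarp.transform_geom(in_crs,
                                            CRS.from_epsg(4326),
                                            shapely.geometry.box(*bounds)))
minx, miny, maxx, maxy = feature_proj.bounds
quad_keys = set()
for tile in mercantile.tiles(minx, miny, maxx, maxy, zooms=9):
    quad_keys.add(int(mercantile.quadkey(tile)))
quad_keys = list(quad_keys)
print(f"The input area spans {len(quad_keys)} tiles: {quad_keys}")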

The input area spans 1 tiles: [120120211]
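To make the tile ID above less opaque, here is a minimal standard-library sketch of the usual Web Mercator tile and quadkey arithmetic (the Bing Maps tile scheme). It is not taken from mercantile or geo2ml; the function names and the test point near Helsinki are hypothetical, with the point chosen so that it falls in tile 120120211.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the Web Mercator tile containing a point."""
    n = 2 ** zoom
    lat_rad = math.radians(lat)
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def tile_to_quadkey(x, y, zoom):
    """Interleave the bits of x and y into a base-4 quadkey string."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


# Hypothetical point near Helsinki, not read from the image itself
x, y = lonlat_to_tile(24.95, 60.05, zoom=9)
quadkey = tile_to_quadkey(x, y, zoom=9)
print(x, y, quadkey)
```

Each extra zoom level adds one digit to the quadkey, so a zoom 9 tile always has a nine-digit ID.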

Next, read dataset-links.csv and keep the rows for the tiles we need

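import pandas as pd

df = pd.read_csv(
    "https://minedbuildings.blob.core.windows.net/global-buildings/dataset-links.csv"
)
fin_df = df[df.QuadKey.isin(quad_keys)]
# Read the first matching file; additional rows would need to be read and concatenated
fin_buildings = pd.read_json(fin_df.Url.iloc[0], lines=True)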
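The lines=True argument means that every line of the file is parsed as its own JSON record. As a rough illustration of that layout using only the standard library, the sketch below parses two hypothetical records shaped like GeoJSON features; the coordinates are made up and the helper name is not part of any library.

```python
import json

# Two hypothetical line-delimited records shaped like GeoJSON features
sample = "\n".join([
    '{"type": "Feature", "properties": {}, "geometry": {"type": "Polygon", '
    '"coordinates": [[[24.9, 60.1], [24.91, 60.1], [24.91, 60.11], [24.9, 60.1]]]}}',
    '{"type": "Feature", "properties": {}, "geometry": {"type": "Polygon", '
    '"coordinates": [[[25.0, 60.2], [25.01, 60.2], [25.01, 60.21], [25.0, 60.2]]]}}',
])


def read_features(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


features = read_features(sample)
geometry_types = [feature["geometry"]["type"] for feature in features]
print(len(features), geometry_types)
```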

Then convert the geometries to shapely objects, build a gpd.GeoDataFrame, reproject it to EPSG:3067 and clip it to the image bounds

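import geopandas as gpd
import shapely.geometry

# Parse geometries, reproject from WGS84 to EPSG:3067, clip to the image bounds
fin_buildings['geometry'] = fin_buildings.geometry.apply(shapely.geometry.shape)
gdf = gpd.GeoDataFrame(fin_buildings, crs=4326).to_crs('EPSG:3067').clip(bounds)
gdf['label'] = 'building'
gdf.to_file(building_path/fn)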

Visualize the data

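import matplotlib.pyplot as plt
import rasterio.plot as rioplot

fig, axs = plt.subplots(1, 1, dpi=150)

# Draw the image first, then the building outlines on top
with rio.open(image_path) as src:
    rioplot.show(src, ax=axs)

gdf.plot(ax=axs, facecolor='none', edgecolor='#64C1CB')
axs.set_title('Buildings in L4131H')
plt.show()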

Dataset conversions

geo2ml can create three types of computer vision datasets:
- COCO with box, polygon and oriented box annotations
- YOLO with box, polygon and oriented box annotations
- Image-mask dataset for semantic segmentation tasks

Semantic segmentation dataset

All chipped images go into the folder outpath/images
All masks, with filenames matching their images, go into the folder outpath/mask_images

from geo2ml.scripts.data import create_raster_dataset

unet_dataset_path = Path('../data/unet')

create_raster_dataset(raster_path=image_path, mask_path=building_path/fn, 
                      outpath=unet_dataset_path, save_grid=True, 
                      target_column='label', gridsize_x=512, gridsize_y=512)
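The gridsize_x and gridsize_y parameters control how the raster is cut into chips. The sketch below shows one way such a grid could be laid out. It assumes chips are named R{row}C{col} (inferred from file names such as R0C0.tif in the generated datasets) and that partial chips at the right and bottom edges are kept; geo2ml may handle edges differently, and chip_grid is a hypothetical helper, not part of the library.

```python
import math


def chip_grid(width, height, gridsize_x, gridsize_y):
    """List (name, col_off, row_off) for each chip covering the raster.

    Assumption: chips are named R{row}C{col} and partial chips at the
    right and bottom edges are included.
    """
    n_cols = math.ceil(width / gridsize_x)
    n_rows = math.ceil(height / gridsize_y)
    chips = []
    for row in range(n_rows):
        for col in range(n_cols):
            chips.append((f"R{row}C{col}", col * gridsize_x, row * gridsize_y))
    return chips


# Hypothetical 2000 x 1200 pixel raster cut into 512 x 512 chips
chips = chip_grid(2000, 1200, 512, 512)
print(len(chips), chips[0], chips[-1])
```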

Convert to COCO dataset

COCO datasets are stored in JSON files
Chipped images go to outpath/images
Box coordinate format: [xmin, ymin, xdelta, ydelta], where xdelta and ydelta are the box width and height
Polygon coordinate format: [x0, y0, x1, y1,...]
Oriented box format: [xcenter, ycenter, w, h, angle]

Create a dataset with
- 512x512 image chips
- category id from column label
- in polygon annotation format
- minimum bounding box area of 0 pixels

from geo2ml.scripts.data import create_coco_dataset

coco_path = Path('../data/coco')

create_coco_dataset(raster_path=image_path, polygon_path=building_path/fn, 
                    outpath=coco_path, target_column='label', gridsize_x=512, 
                    gridsize_y=512, dataset_name='example_buildings',
                    ann_format='polygon', min_bbox_area=0)

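import json

with open(coco_path/'example_buildings.json') as f:
    coco = json.load(f)
print(coco['info'])
print(coco['categories'])
# Show the first two image records and the first two annotations
for image in coco['images'][:2]:
    print(image)
for annotation in coco['annotations'][:2]:
    print(annotation)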

{'description': 'example_buildings', 'version': 0.1, 'year': 2024, 'contributor': 'Your name', 'date_created': '2024/02/26'}
[{'supercategory': 'object', 'id': 1, 'name': 'building'}]
{'file_name': 'R0C0.tif', 'id': 0, 'height': 512, 'width': 512}
{'file_name': 'R0C1.tif', 'id': 1, 'height': 512, 'width': 512}
{'segmentation': [[22.576, 173.859, 19.079, 168.677, 24.193, 165.238, 27.69, 170.42]], 'area': 38.52693100000003, 'bbox': [19.079, 165.238, 8.611, 8.62100000000001], 'category_id': 1, 'id': 1, 'image_id': 0, 'iscrowd': 0}
{'segmentation': [[0.0, 150.577, 10.462, 144.792, 10.898, 145.577, 42.566, 128.067, 53.372, 147.541, 24.381, 163.573, 24.752, 164.241, 18.062, 167.941, 18.185, 168.163, 12.473, 171.323, 14.449, 174.885, 4.998, 180.112, 4.751, 179.666, 0.0, 182.293]], 'area': 1318.3801059999998, 'bbox': [0.0, 128.067, 53.372, 54.226], 'category_id': 1, 'id': 2, 'image_id': 0, 'iscrowd': 0}
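The bbox and area fields can be checked by hand from the segmentation coordinates. The sketch below recomputes both for the first annotation printed above, using the minimum and maximum coordinates for the box and the shoelace formula for the area. It is a standalone check rather than the code geo2ml runs, and polygon_to_coco is a hypothetical helper name.

```python
def polygon_to_coco(flat_coords):
    """Return (bbox, area) for a flat [x0, y0, x1, y1, ...] polygon."""
    xs = flat_coords[0::2]
    ys = flat_coords[1::2]
    xmin, ymin = min(xs), min(ys)
    bbox = [xmin, ymin, max(xs) - xmin, max(ys) - ymin]
    # Shoelace formula over the closed ring
    n = len(xs)
    twice_area = 0.0
    for i in range(n):
        j = (i + 1) % n
        twice_area += xs[i] * ys[j] - xs[j] * ys[i]
    return bbox, abs(twice_area) / 2.0


# Segmentation of the annotation with id 1
segmentation = [22.576, 173.859, 19.079, 168.677, 24.193, 165.238, 27.69, 170.42]
bbox, area = polygon_to_coco(segmentation)
print([round(v, 3) for v in bbox], round(area, 6))
```

The values agree with the stored bbox and area up to floating point rounding.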

Convert to YOLO dataset

YOLO format requires the folders images and labels
- If an image has annotations, it needs a corresponding txt file with the same name in labels
Dataset information is collated in a yaml file (not yet automatically generated)
Box annotation format: classid x_center y_center w h
Polygon and oriented box annotation format: classid x0 y0 x1 y1...
Coordinates are normalized by the image width and height, so they fall between 0 and 1
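As an illustration of the two annotation layouts, the sketch below converts pixel coordinates into YOLO label lines by dividing x values by the chip width and y values by the chip height. The helper names and the example square are hypothetical, and the six-decimal formatting is a choice made here, not something the format requires.

```python
def polygon_to_yolo(class_id, pixel_coords, width, height):
    """Format a flat pixel-coordinate polygon as a normalized YOLO line."""
    values = []
    for i in range(0, len(pixel_coords), 2):
        values.append(pixel_coords[i] / width)
        values.append(pixel_coords[i + 1] / height)
    return " ".join([str(class_id)] + [f"{v:.6f}" for v in values])


def box_to_yolo(class_id, xmin, ymin, box_w, box_h, width, height):
    """Convert a COCO-style [xmin, ymin, w, h] pixel box to a YOLO box line."""
    x_center = (xmin + box_w / 2) / width
    y_center = (ymin + box_h / 2) / height
    return (f"{class_id} {x_center:.6f} {y_center:.6f} "
            f"{box_w / width:.6f} {box_h / height:.6f}")


# Hypothetical 64 x 64 pixel square in a 640 x 640 chip
polygon_line = polygon_to_yolo(0, [64, 64, 128, 64, 128, 128, 64, 128], 640, 640)
box_line = box_to_yolo(0, 64, 64, 64, 64, 640, 640)
print(polygon_line)
print(box_line)
```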

Create a dataset with
- 640x640 image chips
- category id from column label
- in polygon annotation format
- minimum bounding box area of 0 pixels

from geo2ml.scripts.data import create_yolo_dataset
yolo_path = Path('../data/yolo')

create_yolo_dataset(raster_path=image_path, polygon_path=building_path/fn, 
                    outpath=yolo_path, target_column='label', gridsize_x=640, 
                    gridsize_y=640, ann_format='polygon', min_bbox_area=0)
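Because the dataset yaml is not generated automatically, it has to be written by hand. The sketch below builds a minimal file in the layout used by Ultralytics YOLO trainers (path, train, val, names). Which trainer is used is an assumption here, as is the class index 0 for building, and yolo_dataset_yaml is a hypothetical helper.

```python
def yolo_dataset_yaml(root, train_dir, val_dir, class_names):
    """Build the text of an Ultralytics-style dataset YAML file."""
    lines = [f"path: {root}", f"train: {train_dir}", f"val: {val_dir}", "names:"]
    for class_id, name in enumerate(class_names):
        lines.append(f"  {class_id}: {name}")
    return "\n".join(lines) + "\n"


# Placeholder: train and val both point at the same folder because
# no split has been made yet; replace them with real split folders.
yaml_text = yolo_dataset_yaml("../data/yolo", "images", "images", ["building"])
print(yaml_text)
```

Writing yaml_text to a file such as a hypothetical ../data/yolo/data.yaml would complete the dataset, once the images have been divided into training and validation sets.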

Print the first five annotations of a random label file

import os
import random

label = random.choice(os.listdir(yolo_path/'labels'))
print(f'Annotations for {label}')
with open(yolo_path/'labels'/label) as f:
    for line in f.readlines()[:5]:
        print(line.strip())

Annotations for R9C4.txt
0 0.39561562499999997 0.9983078125 0.38075625 0.9892406250000001 0.390253125 0.973728125 0.41713125 0.990128125 0.420934375 0.9886609375000001 0.43212343750000004 0.9954890625000001 0.4338703125 1.0 0.396271875 1.0
0 0.002440625 0.9633406250000001 0.018315625 1.0 0.0 1.0 0.0 0.9643937499999999
0 0.14837499999999998 0.9725453125 0.1849140625 0.9558500000000001 0.19149218750000002 0.9701968750000001 0.21418281249999999 0.9598296875000001 0.22329531249999998 0.979703125 0.18334999999999999 0.9979515625 0.1805796875 0.9919109374999999 0.1628734375 1.0 0.1609609375 1.0
0 0.314078125 0.9868734375000001 0.3008546875 0.979959375 0.319 0.945365625 0.3548640625 0.9641140625 0.3361375 0.999821875 0.3250015625 0.994 0.32215937499999997 0.9994234375 0.31065156250000003 0.9934078124999999
0 0.0340953125 0.9526265625000001 0.02433125 0.9222625000000001 0.06349375 0.9097125 0.07325781249999999 0.9400765625
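To sanity check a label file, the normalized values can be scaled back to pixel coordinates. The sketch below does this for the second annotation line printed above, assuming the 640x640 chip size used when the dataset was created; parse_yolo_polygon is a hypothetical helper.

```python
def parse_yolo_polygon(line, width, height):
    """Return (class_id, [(x, y), ...]) in pixel coordinates."""
    parts = line.split()
    class_id = int(parts[0])
    values = [float(v) for v in parts[1:]]
    points = [(values[i] * width, values[i + 1] * height)
              for i in range(0, len(values), 2)]
    return class_id, points


line = ("0 0.002440625 0.9633406250000001 0.018315625 1.0 "
        "0.0 1.0 0.0 0.9643937499999999")
class_id, points = parse_yolo_polygon(line, 640, 640)
print(class_id, [(round(x, 3), round(y, 3)) for x, y in points])
```

Values of exactly 0.0 or 1.0 map to the chip border, which is what a building cut by the chip edge looks like.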

Library available on GitHub

https://github.com/mayrajeo/geo2ml
Documentation and some examples on GitHub pages: https://mayrajeo.github.io/geo2ml/
Don’t be surprised if there are bugs. I’ll fix them when they appear