Dataset creation made simple: geo2ml

Janne Mäyrä

Finnish Environment Institute (Syke)

What is it?

  • Sampling features from a raster using point geometries or polygons
  • Tiling larger rasters and shapefiles into smaller patches
  • Rasterizing polygon geometries for semantic segmentation tasks
  • Converting vector data to COCO and YOLO formats and creating required dataset files
  • Visualization of generated datasets

Example starts with getting the data

Image data

  • For image data, use data from the National Land Survey of Finland (NLS) converted to GeoTIFF

Annotation data

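import os
from pathlib import Path

import rasterio as rio

# image_path is assumed to point to the GeoTIFF from the image data step
building_path = Path('../data/building_footprints/')
os.makedirs(building_path, exist_ok=True)
fn = 'L4131H.geojson'

# Read the CRS and extent of the image
with rio.open(image_path) as src:
    in_crs = src.crs
    bounds = src.bounds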
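  • The building footprints are easiest to access by Mercator tile quadkey, so check which tiles cover the image extent and how many there are
import mercantile
import rasterio.warp as riowarp
import shapely.geometry
from rasterio.crs import CRS
from shapely.geometry import shape

# Reproject the image extent to EPSG:4326, since mercantile works in lon/lat
feature_proj = shape(riowarp.transform_geom(in_crs,
                                            CRS.from_epsg(4326),
                                            shapely.geometry.box(*bounds)))
minx, miny, maxx, maxy = feature_proj.bounds
quad_keys = set()
for tile in mercantile.tiles(minx, miny, maxx, maxy, zooms=9):
    # Cast to int so the keys can be matched against the QuadKey column read by pandas
    quad_keys.add(int(mercantile.quadkey(tile)))
quad_keys = list(quad_keys)
print(f"The input area spans {len(quad_keys)} tiles: {quad_keys}")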
The input area spans 1 tiles: [120120211]
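To make the tile ID less opaque: a quadkey encodes a tile's column, row and zoom level as a base-4 string with one digit per zoom level. The sketch below is a plain-Python reimplementation of that encoding for illustration only. The function names are made up here and are not part of mercantile or geo2ml; in practice mercantile.quadkey does this for you.

```python
def tile_to_quadkey(x, y, zoom):
    """Encode tile column x and row y at the given zoom level as a quadkey string."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:   # column bit contributes 1
            digit += 1
        if y & mask:   # row bit contributes 2
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def quadkey_to_tile(quadkey):
    """Decode a quadkey string back to (column, row, zoom)."""
    x = y = 0
    for ch in quadkey:
        digit = int(ch)
        x = (x << 1) | (digit & 1)
        y = (y << 1) | (digit >> 1)
    return x, y, len(quadkey)


# Decode the tile ID printed above
print(quadkey_to_tile("120120211"))  # → (291, 148, 9)
```

The length of the string equals the zoom level, which is why a zoom 9 tile has a nine-digit key.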
  • Next, read dataset-links.csv and keep the rows for the tiles we need
import pandas as pd

df = pd.read_csv(
    "https://minedbuildings.blob.core.windows.net/global-buildings/dataset-links.csv"
)
fin_df = df[df.QuadKey.isin(quad_keys)]
# Only one tile matches here, so read the footprints from the first URL
fin_buildings = pd.read_json(fin_df.Url.iloc[0], lines=True)
  • Then convert the geometries to shapely objects, build a gpd.GeoDataFrame, reproject it to EPSG:3067 and clip it to the image bounds
import geopandas as gpd

fin_buildings['geometry'] = fin_buildings.geometry.apply(shapely.geometry.shape)
gdf = gpd.GeoDataFrame(fin_buildings, crs=4326).to_crs('EPSG:3067').clip(bounds)
gdf['label'] = 'building'
gdf.to_file(building_path/fn)

Visualize the data

import matplotlib.pyplot as plt
import rasterio.plot as rioplot

fig, axs = plt.subplots(1, 1, dpi=150)

with rio.open(image_path) as src:
    rioplot.show(src, ax=axs)

gdf.plot(ax=axs, 
         facecolor='none', 
         edgecolor='#64C1CB')
axs.set_title('Buildings in L4131H')
plt.show()

Dataset conversions

  • geo2ml can create three types of computer vision datasets:
    • COCO with box, polygon and oriented box annotations
    • YOLO with box, polygon and oriented box annotations
    • Image-mask dataset for semantic segmentation tasks

Semantic segmentation dataset

  • All chipped images go into folder outpath/images
  • All masks with corresponding filenames go into folder outpath/mask_images
from geo2ml.scripts.data import create_raster_dataset

unet_dataset_path = Path('../data/unet')

create_raster_dataset(raster_path=image_path, mask_path=building_path/fn, 
                      outpath=unet_dataset_path, save_grid=True, 
                      target_column='label', gridsize_x=512, gridsize_y=512)
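As a rough picture of what tiling into 512x512 chips involves, the sketch below computes pixel windows for a raster of a given size and names them R{row}C{col}, mirroring chip names such as R0C0.tif and R0C1.tif that show up in the COCO output. This is not geo2ml's implementation: the function name, the row/column reading of the names, and the choice to clip partial chips at the right and bottom edges are all assumptions made for illustration.

```python
def chip_windows(width, height, gridsize_x=512, gridsize_y=512):
    """Yield (name, col_off, row_off, chip_width, chip_height) for each chip."""
    n_cols = -(-width // gridsize_x)   # ceiling division
    n_rows = -(-height // gridsize_y)
    for row in range(n_rows):
        for col in range(n_cols):
            col_off = col * gridsize_x
            row_off = row * gridsize_y
            # Assumption: edge chips are clipped to the raster extent
            chip_w = min(gridsize_x, width - col_off)
            chip_h = min(gridsize_y, height - row_off)
            yield f"R{row}C{col}", col_off, row_off, chip_w, chip_h


# Hypothetical raster of 1200 x 1024 pixels
for chip in chip_windows(1200, 1024):
    print(chip)
```

With these numbers the grid has three columns and two rows, and the last column is only 176 pixels wide.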

Convert to COCO dataset

  • COCO datasets are stored in JSON files
  • Chipped images go to outpath/images
  • Box coordinate format: [xmin, ymin, xdelta, ydelta], i.e. top-left corner followed by width and height
  • Polygon coordinate format: [x0, y0, x1, y1,...]
  • Oriented box format: [xcenter, ycenter, w, h, angle]
  • Create a dataset with
    • 512x512 image chips
    • category id from column label
    • in polygon annotation format
    • minimum bounding box area of 0 pixels
from geo2ml.scripts.data import create_coco_dataset

coco_path = Path('../data/coco')

create_coco_dataset(raster_path=image_path, polygon_path=building_path/fn, 
                    outpath=coco_path, target_column='label', gridsize_x=512, 
                    gridsize_y=512, dataset_name='example_buildings',
                    ann_format='polygon', min_bbox_area=0)
import json

# Inspect the generated annotation file
with open(coco_path/'example_buildings.json') as f:
    coco = json.load(f)
print(coco['info'])
print(coco['categories'])
for i in coco['images'][:2]: 
    print(i)
for a in coco['annotations'][:2]:
    print(a)
{'description': 'example_buildings', 'version': 0.1, 'year': 2024, 'contributor': 'Your name', 'date_created': '2024/02/26'}
[{'supercategory': 'object', 'id': 1, 'name': 'building'}]
{'file_name': 'R0C0.tif', 'id': 0, 'height': 512, 'width': 512}
{'file_name': 'R0C1.tif', 'id': 1, 'height': 512, 'width': 512}
{'segmentation': [[22.576, 173.859, 19.079, 168.677, 24.193, 165.238, 27.69, 170.42]], 'area': 38.52693100000003, 'bbox': [19.079, 165.238, 8.611, 8.62100000000001], 'category_id': 1, 'id': 1, 'image_id': 0, 'iscrowd': 0}
{'segmentation': [[0.0, 150.577, 10.462, 144.792, 10.898, 145.577, 42.566, 128.067, 53.372, 147.541, 24.381, 163.573, 24.752, 164.241, 18.062, 167.941, 18.185, 168.163, 12.473, 171.323, 14.449, 174.885, 4.998, 180.112, 4.751, 179.666, 0.0, 182.293]], 'area': 1318.3801059999998, 'bbox': [0.0, 128.067, 53.372, 54.226], 'category_id': 1, 'id': 2, 'image_id': 0, 'iscrowd': 0}
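The first annotation printed above can be checked by hand. The sketch below derives the bbox ([xmin, ymin, width, height]) and the area (shoelace formula) from the flat segmentation list. The helper names are made up for this illustration and are not geo2ml functions.

```python
def bbox_from_segmentation(seg):
    """Return [xmin, ymin, width, height] for a flat [x0, y0, x1, y1, ...] list."""
    xs, ys = seg[0::2], seg[1::2]
    xmin, ymin = min(xs), min(ys)
    return [xmin, ymin, max(xs) - xmin, max(ys) - ymin]


def area_from_segmentation(seg):
    """Return the polygon area in square pixels using the shoelace formula."""
    xs, ys = seg[0::2], seg[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2


# Segmentation of the first annotation in the output above
seg = [22.576, 173.859, 19.079, 168.677, 24.193, 165.238, 27.69, 170.42]
print([round(v, 3) for v in bbox_from_segmentation(seg)])  # → [19.079, 165.238, 8.611, 8.621]
print(round(area_from_segmentation(seg), 3))               # → 38.527
```

Both values agree with the bbox and area fields stored in the annotation.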

Convert to YOLO dataset

  • YOLO format requires the folders images and labels
    • Each image with annotations needs a txt file with the same name in labels
  • Dataset information is collated in a YAML file (not yet automatically generated)
  • Box annotation format: classid x_center y_center w h
  • Polygon and oriented box annotation format: classid x0 y0 x1 y1...
  • Create a dataset with
    • 640x640 image chips
    • category id from column label
    • in polygon annotation format
    • minimum bounding box area of 0 pixels
from geo2ml.scripts.data import create_yolo_dataset
yolo_path = Path('../data/yolo')

create_yolo_dataset(raster_path=image_path, polygon_path=building_path/fn, 
                    outpath=yolo_path, target_column='label', gridsize_x=640, 
                    gridsize_y=640, ann_format='polygon', min_bbox_area=0)
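Since the YAML file is not generated automatically, it has to be written by hand. The sketch below builds a minimal one with the standard library only. The key names (path, train, val, names) follow the dataset description used by Ultralytics YOLO, which is an assumption about the training framework and not something geo2ml prescribes. The function name is hypothetical, and pointing both train and val at the same images folder is a placeholder until a real split exists.

```python
def yolo_yaml_text(dataset_dir, class_names):
    """Return the text of a minimal YOLO-style dataset description."""
    lines = [
        f"path: {dataset_dir}",
        "train: images  # placeholder: replace with the training split",
        "val: images  # placeholder: replace with the validation split",
        "names:",
    ]
    # Class ids are assumed to start from 0, as in the label files of this example
    lines += [f"  {class_id}: {name}" for class_id, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


text = yolo_yaml_text("../data/yolo", ["building"])
print(text)
# To save it next to the dataset, e.g.:
# Path("../data/yolo/data.yaml").write_text(text)
```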
  • Print first five annotations of a random label file
import random

# Pick one label file at random
label = random.choice(os.listdir(yolo_path/'labels'))
print(f'Annotations for {label}')
with open(yolo_path/'labels'/label) as f:
    for line in f.readlines()[:5]:
        print(line.strip())
Annotations for R9C4.txt
0 0.39561562499999997 0.9983078125 0.38075625 0.9892406250000001 0.390253125 0.973728125 0.41713125 0.990128125 0.420934375 0.9886609375000001 0.43212343750000004 0.9954890625000001 0.4338703125 1.0 0.396271875 1.0
0 0.002440625 0.9633406250000001 0.018315625 1.0 0.0 1.0 0.0 0.9643937499999999
0 0.14837499999999998 0.9725453125 0.1849140625 0.9558500000000001 0.19149218750000002 0.9701968750000001 0.21418281249999999 0.9598296875000001 0.22329531249999998 0.979703125 0.18334999999999999 0.9979515625 0.1805796875 0.9919109374999999 0.1628734375 1.0 0.1609609375 1.0
0 0.314078125 0.9868734375000001 0.3008546875 0.979959375 0.319 0.945365625 0.3548640625 0.9641140625 0.3361375 0.999821875 0.3250015625 0.994 0.32215937499999997 0.9994234375 0.31065156250000003 0.9934078124999999
0 0.0340953125 0.9526265625000001 0.02433125 0.9222625000000001 0.06349375 0.9097125 0.07325781249999999 0.9400765625
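The values in the label file all fall between 0 and 1 because the YOLO format expects pixel coordinates divided by the image width and height. The sketch below shows that conversion in both directions for a polygon line. The function names are illustrative, not geo2ml API, and the 640x640 chip size is taken from the call above.

```python
def polygon_to_yolo(class_id, pixel_coords, width, height):
    """Format (x, y) pixel coordinates as a normalized YOLO polygon line."""
    parts = [str(class_id)]
    for x, y in pixel_coords:
        parts.append(str(x / width))
        parts.append(str(y / height))
    return " ".join(parts)


def yolo_to_polygon(line, width, height):
    """Parse a YOLO polygon line back into a class id and pixel coordinates."""
    class_id, *values = line.split()
    coords = [float(v) for v in values]
    points = [(x * width, y * height) for x, y in zip(coords[0::2], coords[1::2])]
    return int(class_id), points


# Hypothetical triangle on a 640 x 640 chip
line = polygon_to_yolo(0, [(160, 320), (480, 320), (320, 640)], 640, 640)
print(line)  # → 0 0.25 0.5 0.75 0.5 0.5 1.0
print(yolo_to_polygon(line, 640, 640))
```

A vertex on the right or bottom edge of the chip maps to exactly 1.0, which matches the clipped building outlines in the output above.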

Library available on GitHub