support PP-OCRv6 #17921
Open
leo-q8 wants to merge 16 commits into
Thanks for your contribution!
1. Restore set_epoch_as_seed in SimpleDataSet to re-enable the adaptive shrink_ratio (0.4→0.6) curriculum learning in MakeBorderMap and MakeShrinkMap across training epochs.
2. Remove the default p=1.0 for the Affine augmentation, restoring the albumentations default p=0.5 (50% rotation probability).

The two regressions fixed here had caused V5 dataset precision to drop significantly (e.g. blur 0.904→0.709, printing_ch 0.926→0.708), while V4 remained unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add EMA (Exponential Moving Average) support in the training pipeline: build EMA in train.py, apply/restore in program.py, save/load in save_load.py (gated by the use_ema config, disabled by default)
- Replace build_dataloader with a lightweight reset_data_lines to avoid rebuilding the DataLoader every epoch when need_reset=True
- Store _base_shrink_ratio/_total_epoch in MakeBorderMap/MakeShrinkMap to support in-place shrink_ratio updates across epochs
- Add dilated_kernel_size=7 for the PP-OCRv6 small det config
- Remove redundant balance_loss=true from the base/small configs (matches the Python default)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rruption

Previously, Global.seed (1024) was passed to build_dataloader as the epoch parameter, causing shrink_ratio to be incorrectly computed as 0.81 instead of 0.4 at training start. This commit separates the two concerns: seed controls data shuffling randomness, epoch controls adaptive shrink_ratio progression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
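The numbers quoted above are consistent with a linear schedule that moves shrink_ratio from 0.4 to 0.6 over training. The sketch below assumes that schedule and assumes total_epoch=500 (taken from the "500 epochs" figure in a later commit of this PR, not stated in this one). The function name is hypothetical.

```python
def adaptive_shrink_ratio(epoch, total_epoch, base=0.4, span=0.2):
    """Assumed linear schedule: `base` at epoch 0, `base + span` at total_epoch."""
    return base + span * epoch / float(total_epoch)


# Assumed values; only the seed (1024) is stated in the commit message itself.
TOTAL_EPOCH = 500
SEED = 1024

# Correct call: the real epoch index is passed.
correct_start = adaptive_shrink_ratio(0, TOTAL_EPOCH)

# Buggy call: the global seed is passed where the epoch is expected.
buggy_start = adaptive_shrink_ratio(SEED, TOTAL_EPOCH)

print(round(correct_start, 2), round(buggy_start, 2))
```

Under these assumptions the buggy call lands far outside the intended 0.4 to 0.6 range, which is why keeping seed and epoch as separate arguments matters.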
Core EMA implementation with threshold/exponential/normal decay types, bias-correction, ema_filter_no_grad for distillation, and exchange-pattern save/restore compatible with PaddleDetection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
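To illustrate what the three decay types and the apply/restore exchange might look like, here is a minimal sketch over a plain dict of floats. The decay formulas, the class name, and the method names are assumptions for illustration; they are not copied from this PR or from PaddleDetection.

```python
import math


class SimpleEMA:
    """Toy EMA over a dict of float parameters (stand-in for model weights)."""

    def __init__(self, params, decay=0.9998, decay_type="threshold",
                 bias_correction=False):
        self.decay = decay
        self.decay_type = decay_type
        self.bias_correction = bias_correction
        self.step = 0
        self.backup = None
        # Zero-initialised shadow needs bias correction; otherwise start from the weights.
        self.shadow = {k: 0.0 for k in params} if bias_correction else dict(params)

    def effective_decay(self):
        if self.decay_type == "threshold":      # small decay early, capped at self.decay
            return min(self.decay, (1 + self.step) / (10 + self.step))
        if self.decay_type == "exponential":    # decay ramps up smoothly with step
            return self.decay * (1 - math.exp(-(self.step + 1) / 2000))
        return self.decay                        # "normal": constant decay

    def update(self, params):
        d = self.effective_decay()
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
        self.step += 1

    def averaged(self):
        if not self.bias_correction or self.step == 0:
            return dict(self.shadow)
        correction = 1 - self.decay ** self.step
        return {k: v / correction for k, v in self.shadow.items()}

    def apply(self, params):
        """Swap the averaged weights in, keeping the live weights as a backup."""
        self.backup = dict(params)
        params.update(self.averaged())

    def restore(self, params):
        """Swap the live training weights back in."""
        params.update(self.backup)
        self.backup = None
```

A typical use would be to call update() after each optimizer step, apply() before evaluation or checkpointing, and restore() before training resumes.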
Root cause: each epoch rebuilt the DataLoader, destroying and re-forking 16 worker processes. Worker join() triggers COW page cleanup that grows linearly with parent process memory (epoch 1: ~1s, epoch 400: ~466s), totaling ~27 hours over 500 epochs.

Solution:
- Enable persistent_workers=True in DataLoader for training
- Pre-load all label lines once; use _index_map for per-epoch ratio sampling (bit-exact with original get_image_info_list + shuffle)
- Signal epoch changes to workers via multiprocessing.Value shared memory
- Workers lazily rebuild _index_map on next __getitem__ (no disk I/O)
- MakeBorderMap/MakeShrinkMap dynamically compute shrink_ratio from the shared epoch value (_get_shrink_ratio), replacing per-epoch op rebuild
- Reset reader_start after save_model to fix avg_reader_cost accounting

Verified against the 53b006f baseline with a real config (34 label files, ratio 0.01~0.8): data sampling bit-exact, shrink_ratio numerically identical, ext_data pool equivalent, DistributedBatchSampler unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
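The shared-epoch and lazy _index_map idea can be sketched without Paddle as follows. The class is a hypothetical toy: its sampling uses random.Random seeded per epoch and is not the bit-exact sampling this commit describes, and set_epoch is an assumed name for whatever the training loop calls.

```python
import multiprocessing
import random


class ToyDataset:
    """Label lines are loaded once; only the index map changes per epoch."""

    def __init__(self, lines, ratio, seed=0):
        self.lines = list(lines)            # never re-read from disk
        self.ratio = ratio
        self.seed = seed
        # Shared integer that forked, persistent workers can read.
        self._shared_epoch = multiprocessing.Value("i", 0)
        self._local_epoch = -1              # epoch the current _index_map was built for
        self._index_map = []

    def set_epoch(self, epoch):
        """Called by the training loop in the parent process."""
        with self._shared_epoch.get_lock():
            self._shared_epoch.value = epoch

    def _ensure_index_map(self):
        """Rebuild the index map lazily when the shared epoch has moved on."""
        epoch = self._shared_epoch.value
        if epoch != self._local_epoch:
            rng = random.Random(self.seed + epoch)
            self._index_map = rng.sample(range(len(self.lines)), len(self))
            self._local_epoch = epoch

    def __len__(self):
        return max(1, round(len(self.lines) * self.ratio))

    def __getitem__(self, idx):
        self._ensure_index_map()
        return {"label": self.lines[self._index_map[idx]],
                "epoch": self._local_epoch}
```

Because each worker only compares two integers on every access, an epoch change costs one in-memory resample instead of tearing down and re-forking the worker pool.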
Replace _shared_epoch injection into MakeBorderMap/MakeShrinkMap with a simpler approach: dataset.__getitem__ sets data["epoch"], and ops read it in __call__/draw_border_map. This keeps ops as pure data transforms with no cross-process state, and removes _setup_shared_epoch_in_ops, _get_shrink_ratio, and _update_epoch_in_ops (-44 lines, +7 lines).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
set_epoch_as_seed conflated two unrelated concerns: data sampling seed and shrink_ratio epoch tracking. Now that epoch flows through the data dict, ops only need total_epoch from config to store _base_shrink_ratio; there is no need to inject epoch at init time. Remove set_epoch_as_seed and _update_epoch_in_ops entirely (-25 lines).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
draw_border_map does not have access to the data dict, so the epoch-based shrink_ratio must be computed in __call__ and passed as a parameter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
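A sketch of the resulting shape of an op, assuming the epoch arrives as data["epoch"] and the ratio grows linearly by 0.2 over total_epoch. The class and helper names are hypothetical and the drawing step is reduced to a placeholder.

```python
class ToyMakeShrinkMap:
    """Stateless transform: everything epoch-dependent comes from the data dict."""

    def __init__(self, shrink_ratio=0.4, total_epoch=None):
        self._base_shrink_ratio = shrink_ratio
        self._total_epoch = total_epoch

    def _shrink_ratio_for(self, epoch):
        # Fall back to the base ratio when no schedule information is available.
        if self._total_epoch is None or epoch is None:
            return self._base_shrink_ratio
        return self._base_shrink_ratio + 0.2 * epoch / float(self._total_epoch)

    def _draw(self, polygon, shrink_ratio):
        # The helper cannot see the data dict, so the ratio is passed explicitly.
        return {"polygon": polygon, "shrink_ratio": shrink_ratio}

    def __call__(self, data):
        ratio = self._shrink_ratio_for(data.get("epoch"))
        data["shrunk"] = [self._draw(p, ratio) for p in data["polys"]]
        return data
```

Keeping the ratio computation in __call__ means the op holds no per-epoch state, so it behaves the same in the parent process and in any worker.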
When pyclipper.Execute() returns [] for degenerate polygons (very small area or floating-point precision issues), skip the polygon instead of crashing the entire sample with IndexError: list index out of range.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pyclipper internally converts float coords to integers and raises ValueError when encountering NaN. NaN can appear after aggressive geometric augmentations (affine, crop) on degenerate polygons.
- make_border_map: return early from draw_border_map if the polygon has NaN
- make_shrink_map: mark the polygon as ignored and continue if NaN is detected

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
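The two guards from this commit and the previous one (NaN coordinates, empty offset result) can be sketched together. To keep the example self-contained, the pyclipper call is replaced by an injected callable, offset_fn, which is a hypothetical stand-in; the function names are also illustrative.

```python
import math


def has_nan(polygon):
    """True if any coordinate of any point is NaN."""
    return any(math.isnan(c) for point in polygon for c in point)


def safe_shrink(polygon, offset_fn):
    """Return the first offset polygon, or None if the polygon should be skipped.

    offset_fn stands in for the real offset call and returns a list of polygons.
    """
    if has_nan(polygon):
        return None              # would otherwise fail in the float-to-int conversion
    result = offset_fn(polygon)
    if not result:
        return None              # degenerate polygon: nothing to index with [0]
    return result[0]
```

A caller can then treat None as "mark this polygon as ignored and continue" rather than losing the whole sample.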
ppocr/data/simple_dataset.py
- Accept http/https URLs as image paths in label files; local paths
unchanged (os.path.join / os.path.exists behaviour identical to before)
- _load_image_bytes(): cache hit → in-flight future → synchronous download
- _img_path_exists(): returns True for URLs (server reachability checked
lazily at download time, not up-front)
- Per-worker LRU cache (200 entries, ~54 MB/worker) backed by a 4-thread
ThreadPoolExecutor; cache and pool are fork-local, no cross-process lock
- _prefetch_epoch_urls(): called inside each worker on epoch-0 first access
and on every epoch change (via _ensure_index_map); scans _index_map and
submits the first 200 URL entries for background download so that most
URL items are warm by the time __getitem__ needs them
- _download_url_bytes(): always removes the futures-dict entry on both
success and failure, so a transiently-failing URL is retried next access
- get_ext_data(): wraps _load_image_bytes in try/except so a URL error
skips that ext sample without aborting the main item
- Missing local file behaviour is identical to before (raise → except →
log error → random retry in train, sequential retry in eval)
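The cache hit, in-flight future, synchronous download order described above could be sketched like this. The fetch callable is a hypothetical stand-in for the real HTTP download, the class name is invented, and the in-process threading lock is an addition of this sketch (the commit only states that no cross-process lock is needed).

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor
import threading


class UrlBytesCache:
    """Per-process LRU cache of downloaded bytes with background prefetch."""

    def __init__(self, fetch, max_entries=200, max_workers=4):
        self._fetch = fetch                 # hypothetical callable: url -> bytes
        self._cache = OrderedDict()         # url -> bytes, least recently used first
        self._futures = {}                  # url -> in-flight Future
        self._max_entries = max_entries
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()       # guards both dicts across pool threads

    def _download(self, url):
        try:
            data = self._fetch(url)
            with self._lock:
                self._cache[url] = data
                self._cache.move_to_end(url)
                while len(self._cache) > self._max_entries:
                    self._cache.popitem(last=False)   # evict least recently used
            return data
        finally:
            # Dropped on success and on failure, so a failed URL is retried later.
            with self._lock:
                self._futures.pop(url, None)

    def prefetch(self, urls):
        """Submit background downloads for URLs that are neither cached nor in flight."""
        with self._lock:
            for url in urls:
                if url not in self._cache and url not in self._futures:
                    self._futures[url] = self._pool.submit(self._download, url)

    def load(self, url):
        with self._lock:
            if url in self._cache:                    # 1. cache hit
                self._cache.move_to_end(url)
                return self._cache[url]
            future = self._futures.get(url)
        if future is not None:                        # 2. wait for in-flight download
            return future.result()
        return self._download(url)                    # 3. synchronous download
```

Since the cache and the pool live in ordinary instance attributes, each forked DataLoader worker ends up with its own copy, which matches the fork-local design described in the list.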
configs/det/PP-OCRv6/{base,small,tiny}_det.yml
- IaaAugment Affine: add p=0.5, widen rotate range to ±25, cap resize to 2×
- Regularizer factor: normalise scientific notation
- base: DetResizeForTest limit_side_len=4000 / limit_type=max
tools/infer_det.py
- infer_img can now be a .txt file listing one image path per line
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
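For the infer_img change in tools/infer_det.py, a minimal sketch of expanding a .txt list into image paths might look like the following. The function name is hypothetical, and skipping blank lines and assuming UTF-8 are choices of this sketch rather than documented behaviour.

```python
def list_images(infer_img):
    """Expand infer_img into image paths; a .txt file is read as one path per line."""
    if infer_img.lower().endswith(".txt"):
        with open(infer_img, "r", encoding="utf-8") as f:
            # Blank lines are skipped (an assumption of this sketch).
            return [line.strip() for line in f if line.strip()]
    return [infer_img]
```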
URLs containing CJK filenames (e.g. ancient_img dataset paths with Chinese characters) caused urllib to raise UnicodeEncodeError because http.client requires ASCII-only URLs.

Add _encode_url(), which uses urllib.parse.urlparse + quote to percent-encode only the path component, leaving scheme and netloc intact. _download_url_bytes() now encodes the URL before passing it to urlopen; the original URL is still used as the cache key so duplicate downloads are correctly deduplicated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
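A minimal sketch of path-only percent-encoding with urllib.parse, under the assumption that "%" should stay untouched so an already-encoded path is not encoded twice. The function name mirrors the commit but the body is illustrative, and example.com is a placeholder host.

```python
from urllib.parse import quote, unquote, urlparse, urlunparse


def encode_url(url):
    """Percent-encode only the path component; scheme, netloc and query are kept."""
    parts = urlparse(url)
    # "%" is kept safe to avoid double-encoding (an assumption of this sketch).
    encoded_path = quote(parts.path, safe="/%")
    return urlunparse(parts._replace(path=encoded_path))


original = "http://example.com/古籍/img 1.jpg"
encoded = encode_url(original)
print(encoded)
```

The original string, not the encoded one, would remain the cache key, so two label lines that reference the same URL still share one download.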
…mCropData

Non-quad polygons (e.g. from pyclipper) don't have exactly 4 vertices, so get_min_quad_side would produce incorrect results. Guard the check with a len==4 condition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
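The guard can be illustrated with a toy side-length check. Both functions below are hypothetical stand-ins; the real get_min_quad_side is not shown in this PR text, so its exact behaviour is an assumption.

```python
import math


def min_quad_side(poly):
    """Shortest edge of a polygon that is known to have exactly 4 vertices."""
    return min(math.dist(poly[i], poly[(i + 1) % 4]) for i in range(4))


def too_small(poly, min_side):
    # Only quads are measured; polygons with other vertex counts skip this check.
    return len(poly) == 4 and min_quad_side(poly) < min_side
```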
Some label files (e.g. ancient.txt, handwrite_en.txt) contain [null, null] coordinate points, which become NaN after np.array(..., dtype=np.float32). Instead of letting NaN propagate through the pipeline and crash in downstream transforms (CopyPaste, MakeShrinkMap, etc.), detect them at the source and mark them as ignore so they are properly masked in the loss.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
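One way to catch such points at the source is to check for null while parsing the label JSON, before any array conversion. The sketch below assumes the usual detection label layout (a JSON list of objects with "transcription" and "points") and the "*" / "###" ignore markers; the function name is hypothetical and the check on None is a stand-in for a NaN check after conversion.

```python
import json


def parse_label(label_json):
    """Return (boxes, ignore_flags); boxes with null coordinates are flagged as ignore."""
    boxes, ignore = [], []
    for ann in json.loads(label_json):
        points = ann["points"]
        # JSON null becomes None here and would become NaN in a float array.
        has_null = any(c is None for point in points for c in point)
        boxes.append(points)
        ignore.append(has_null or ann["transcription"] in ("*", "###"))
    return boxes, ignore
```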
…ention

Lower the minimum area threshold from 120 to 80 and the char height ratio from 0.5 to 0.35 to retain more partially-clipped text polygons during random cropping, reducing unnecessary label loss.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No description provided.