Commit 492d120

Merge pull request #37 from comp-strat/save_documents_to_google_drive
export mongodb data and save documents to google drive
2 parents f4b2983 + d0ca8fe commit 492d120

1 file changed

Lines changed: 60 additions & 0 deletions

README.md

@@ -107,3 +107,63 @@ mongorestore --port=27000 --username=admin --password="ourPassword" dump/ # rest
```

CAUTION: We often find big crawl jobs quickly add a lot of data to container files in subfolders of `/var/lib/docker/overlay2/` and/or log files in subfolders of `/var/lib/docker/containers/`. We continue to resolve these kinds of storage obstacles. **Keep an eye on your disk storage** and delete things as necessary.
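
One way to keep an eye on disk usage is a small check that can be run before and during a crawl. The sketch below is only an illustration, not part of this project: the helper name `disk_usage_percent` and the 80% warning threshold are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): print how full the filesystem
# holding a given path is, as an integer percentage.
disk_usage_percent() {
    # df -P prints one POSIX-format line per filesystem; column 5 is the capacity, e.g. "42%"
    df -P "${1:-/}" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

THRESHOLD=80   # assumed warning level; adjust as needed
usage=$(disk_usage_percent /)
if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: / is ${usage}% full"
else
    echo "OK: / is ${usage}% full"
fi
```

When the warning fires, `docker system df` summarizes the space used by images, containers, and volumes, and `sudo du -sh /var/lib/docker/overlay2 /var/lib/docker/containers` shows the size of the two folders mentioned above.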

## Export data from MongoDB container to local virtual machine

After the crawling process, the data is stored in the MongoDB Docker container. To export it to the local virtual machine:

```bash
# run mongoexport inside the MongoDB container; the output file is written inside the container
docker exec -it mongodb_container mongoexport --authenticationDatabase admin --username admin --password mdipass --db schoolSpider --collection text --out /text.json

# (optional) open a shell in the container to check that the file exists; type `exit` to return to the virtual machine
docker exec -it mongodb_container bash

# from the virtual machine (not from inside the container), copy the file out of the container
docker cp mongodb_container:/text.json /vol_c/data/crawled_output_2022
```
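
If several collections need to be exported, the export and copy steps can be wrapped in a small script. This is only a sketch under assumptions: the function names `run` and `export_collection` and the `DRY_RUN` switch are hypothetical, the password is read from a `MONGO_PASSWORD` environment variable (also an assumption) instead of being hard-coded, and each file is written to the container's root folder. The container name, database, and output folder are the ones used in this README.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this repo) around mongoexport and docker cp.
CONTAINER=mongodb_container
DB=schoolSpider
OUT_DIR=/vol_c/data/crawled_output_2022
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Export one collection to /<collection>.json inside the container,
# then copy that file to $OUT_DIR on the virtual machine.
export_collection() {
    local collection=$1
    run docker exec "$CONTAINER" mongoexport \
        --authenticationDatabase admin --username admin \
        --password "${MONGO_PASSWORD:-}" \
        --db "$DB" --collection "$collection" --out "/${collection}.json" &&
    run docker cp "$CONTAINER:/${collection}.json" "$OUT_DIR"
}

export_collection text
```

By default the script only prints the two commands so they can be reviewed; running it with `DRY_RUN=0` and `MONGO_PASSWORD` set executes them. Note that the dry-run output includes the password if `MONGO_PASSWORD` is set.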

## Save data from virtual machine to Google Drive

Use rclone to transfer data from the virtual machine to Google Drive:

```bash
rclone copy text.json output_drive:
```

If rclone is not installed yet, follow the full setup process below.

```bash
# install rclone on the virtual machine
curl https://rclone.org/install.sh | sudo bash
```

rclone is built around the concept of remotes. A remote is … a logical name for a remote storage location. In our case, we copy files to a Google Drive location called `output_drive`.

```bash
# configure the remote location
rclone config
```

In the configuration page:

- choose `New Remote` and give it a name like `output_drive`
- choose the number for `Google Drive`
- skip `client_id` and `client_secret` (leave them blank)
- choose `1` (Full Access)
- enter the root folder of the remote location: copy the folder ID from Google Drive and paste it into the configuration screen
- don't enter a `service_account`; we'll use the interactive login screen instead
- don't edit the advanced configuration
- Use auto config? No
- a URL will be shown in the terminal; paste it into your browser and follow the usual Google Drive authorization flow
- paste the code from the Google Drive authorization page into the configurator
- Team Drive? No
- finally, choose `Yes this is OK`

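Once the configurator exits, it can be worth confirming that the remote was actually saved before copying anything. A minimal sketch, assuming rclone's `listremotes` subcommand, which prints one configured remote per line followed by a colon; the helper name `has_remote` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the remote named "$1" appears in a list of
# remotes read from stdin (one "name:" per line, as `rclone listremotes` prints).
has_remote() {
    grep -qxF "$1:"
}

# Only query rclone when it is installed, so the check never aborts a larger script.
if command -v rclone >/dev/null 2>&1; then
    if rclone listremotes | has_remote output_drive; then
        echo "remote output_drive is configured"
    else
        echo "remote output_drive not found; run 'rclone config' again"
    fi
else
    echo "rclone is not installed"
fi
```

Running `rclone lsd output_drive:` afterwards lists the folders at the root of the remote, which also confirms that the authorization worked.
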
After creating the rclone remote, use it to transfer data from the virtual machine to Google Drive:

```bash
rclone copy text.json output_drive:
```
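
For large exports it can help to preview the transfer first and verify it afterwards. The sketch below assumes rclone's `--dry-run` and `--progress` flags and its `check` subcommand used together with an `--include` filter; the wrapper name `upload_to_drive` is hypothetical and not part of this project.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: copy one file to the remote, then compare source and destination.
REMOTE=output_drive

upload_to_drive() {
    local file=$1
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    # Preview what would be transferred without changing anything
    rclone copy --dry-run "$file" "$REMOTE:" &&
    # Do the real copy, showing live transfer statistics
    rclone copy --progress "$file" "$REMOTE:" &&
    # Compare the file's folder with the remote, restricted to this one file
    rclone check "$(dirname "$file")" "$REMOTE:" --include "$(basename "$file")"
}
```

Call it as `upload_to_drive text.json`. The function stops at the first step that fails, so a failed preview never triggers the real copy.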

For a detailed walkthrough, see this [rclone guide](https://medium.com/@houlahop/rclone-how-to-copy-files-from-a-servers-filesystem-to-google-drive-aaf21c615c5d).
