Commit | Line | Data |
---|---|---|
1 | # tap-google-sheets | |
2 | ||
3 | This is a [Singer](https://singer.io) tap that produces JSON-formatted data | |
4 | following the [Singer | |
5 | spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md). | |
6 | ||
7 | This tap: | |
8 | ||
9 | - Pulls raw data from the [Google Sheets v4 API](https://developers.google.com/sheets/api) | |
10 | - Extracts the following endpoints: | |
11 | - [File Metadata](https://developers.google.com/drive/api/v3/reference/files/get) | |
12 | - [Spreadsheet Metadata](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get) | |
13 | - [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get) | |
14 | - Outputs the following metadata streams: | |
15 | - File Metadata: Name, audit/change info from Google Drive | |
16 | - Spreadsheet Metadata: Basic metadata about the Spreadsheet: Title, Locale, URL, etc. | |
17 | - Sheet Metadata: Title, URL, Area (max column and row), and Column Metadata | |
18 | - Column Metadata: Column Header Name, Data type, Format | |
19 | - Sheets Loaded: Sheet title, load date, number of rows | |
20 | - For each Sheet: | |
21 | - Outputs the schema for each resource (based on the column header and datatypes of row 2, the first row of data) | |
22 | - Outputs a record for all columns that have column headers, and for each row of data | |
23 | - Emits a Singer ACTIVATE_VERSION message after each sheet is complete. This forces hard deletes on the data downstream if fewer records are sent. | |
24 | - Primary Key for each row in a Sheet is the Row Number: `__sdc_row` | |
25 | - Each Row in a Sheet also includes Foreign Keys to the Spreadsheet Metadata, `__sdc_spreadsheet_id`, and Sheet Metadata, `__sdc_sheet_id`. | |
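The per-sheet behaviour described above (skip columns without headers, key each row by its row number, attach the spreadsheet and sheet foreign keys) can be sketched roughly as follows. This is a minimal illustration, not the tap's actual code: the function name `rows_to_records` and its parameters are hypothetical, and it assumes the header row is row 1 so that data starts at row 2.

```python
from typing import Any, Dict, List


def rows_to_records(headers: List[str], rows: List[List[Any]],
                    spreadsheet_id: str, sheet_id: int,
                    first_data_row: int = 2) -> List[Dict[str, Any]]:
    """Build one record per data row (hypothetical helper, for illustration)."""
    records = []
    for offset, row in enumerate(rows):
        record = {
            "__sdc_spreadsheet_id": spreadsheet_id,  # foreign key to spreadsheet metadata
            "__sdc_sheet_id": sheet_id,              # foreign key to sheet metadata
            "__sdc_row": first_data_row + offset,    # primary key: the sheet row number
        }
        for col_index, header in enumerate(headers):
            if not header:
                continue  # columns without a header are skipped
            # Rows can be shorter than the header row when trailing cells are empty.
            record[header] = row[col_index] if col_index < len(row) else None
        records.append(record)
    return records


records = rows_to_records(["sku", "", "price"], [["A1", "x", 9.5], ["B2"]], "ssid", 0)
```

Here the unnamed second column is dropped from both records, and the short second row gets `None` for `price`.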
26 | ||
27 | ## API Endpoints | |
28 | [**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get) | |
29 | - Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version | |
30 | - Primary keys: id | |
31 | - Replication strategy: Incremental (GET file audit data for spreadsheet_id in config) | |
32 | - Process/Transformations: Replicate Data if Modified | |
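The "Replicate Data if Modified" check can be pictured as a comparison between the file's `modifiedTime` and the stored bookmark. The sketch below is an assumption about how such a check might look, not the tap's implementation: `parse_utc` and `should_replicate` are hypothetical names, and falling back to `start_date` when no bookmark exists is inferred from the config description.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse a UTC timestamp with or without fractional seconds."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in value else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


def should_replicate(modified_time: str, state: dict, start_date: str) -> bool:
    """Return True when the file changed after the last bookmark (hypothetical helper)."""
    # Assumption: with no bookmark in state, start_date acts as the minimum date.
    bookmark = state.get("bookmarks", {}).get("file_metadata", start_date)
    return parse_utc(modified_time) > parse_utc(bookmark)


state = {"bookmarks": {"file_metadata": "2019-09-27T22:34:39.000000Z"}}
changed = should_replicate("2019-09-28T01:02:03.456Z", state, "2019-01-01T00:00:00Z")
```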
33 | ||
34 | [**metadata (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get) | |
35 | - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2 | |
36 | - This endpoint returns spreadsheet metadata, sheet metadata, and value metadata (data type information) | |
37 | - Primary keys: Spreadsheet Id, Sheet Id, Column Index | |
38 | - Foreign keys: None | |
39 | - Replication strategy: Full (get and replace file metadata for spreadsheet_id in config) | |
40 | - Process/Transformations: | |
41 | - Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area) | |
42 | - sheetId, title, index, gridProperties (rowCount, columnCount) | |
43 | - Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column name uniqueness | |
44 | - Create/Verify Datatypes based on 2nd row value and cell metadata | |
45 | - First check: | |
46 | - [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue) | |
47 | - Valid types: numberValue, stringValue, boolValue | |
48 | - Invalid types: formulaValue, errorValue | |
49 | - Then check: | |
50 | - [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType) | |
51 | - Valid types: UNSPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC | |
52 | - Determine JSON schema column data type based on the value and the above cell metadata settings. | |
53 | - If DATE, DATE_TIME, or TIME, set JSON schema format accordingly | |
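The two-step datatype check above (first the `effectiveValue` key, then `effectiveFormat.numberFormat.type`) could be sketched as follows. This is a simplified illustration under stated assumptions: `column_schema` is a hypothetical name, the nullable type lists are a common Singer convention rather than something this README specifies, and raising an error for `formulaValue`/`errorValue` is one possible way to handle the invalid types.

```python
def column_schema(cell: dict) -> dict:
    """Derive a JSON schema fragment from a row-2 cell (hypothetical helper)."""
    effective_value = cell.get("effectiveValue", {})
    # effectiveValue carries a single key naming the value type.
    value_key = next(iter(effective_value), None)
    number_format = (cell.get("effectiveFormat", {})
                         .get("numberFormat", {})
                         .get("type", "UNSPECIFIED"))

    if value_key in ("formulaValue", "errorValue"):
        # Assumption: invalid types are rejected rather than coerced.
        raise ValueError("unsupported cell value type: " + value_key)
    if value_key == "boolValue":
        return {"type": ["null", "boolean"]}
    if value_key == "numberValue":
        # Dates and times arrive as serial numbers; the format tells them apart.
        formats = {"DATE": "date", "DATE_TIME": "date-time", "TIME": "time"}
        if number_format in formats:
            return {"type": ["null", "string"], "format": formats[number_format]}
        if number_format == "TEXT":
            return {"type": ["null", "string"]}
        return {"type": ["null", "number"]}
    # stringValue, or an empty cell with no effective value
    return {"type": ["null", "string"]}


date_cell = {
    "effectiveValue": {"numberValue": 43736.5},
    "effectiveFormat": {"numberFormat": {"type": "DATE_TIME"}},
}
schema = column_schema(date_cell)
```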
54 | ||
55 | [**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get) | |
56 | - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS | |
57 | - This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), dates and datetimes as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption) | |
58 | - Primary keys: `__sdc_row` | |
59 | - Replication strategy: Full (GET and replace sheet values for spreadsheet_id in config) | |
60 | - Process/Transformations: | |
61 | - Loop through sheets (compared to catalog selection) | |
62 | - Send metadata for sheet | |
63 | - Loop through ALL columns for columns having a column header | |
64 | - Loop through ranges of rows for ALL rows in sheet available area max row (from sheet metadata) | |
65 | - Transform values, if necessary (dates, date-times, times, boolean). | |
66 | - Date/time serial numbers converted to date, date-time, and time strings. Google Sheets uses Lotus 1-2-3 [Serial Number](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption) format for date/times. These are converted to normal UTC date-time strings. | |
67 | - Process/send records to target | |
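The serial-number conversion mentioned above can be illustrated with a short sketch. In the serial format, the whole-number part counts days since 1899-12-30 and the fractional part is the fraction of a day. The function names below are hypothetical, and the exact output string formats are assumptions rather than the tap's guaranteed output.

```python
from datetime import datetime, timedelta, timezone

# Day zero of the Sheets serial-number format.
SHEETS_EPOCH = datetime(1899, 12, 30, tzinfo=timezone.utc)


def serial_to_datetime(serial: float) -> datetime:
    """Convert a Sheets serial number to a UTC datetime."""
    return SHEETS_EPOCH + timedelta(days=serial)


def serial_to_datetime_str(serial: float) -> str:
    return serial_to_datetime(serial).strftime("%Y-%m-%dT%H:%M:%SZ")


def serial_to_date_str(serial: float) -> str:
    return serial_to_datetime(serial).strftime("%Y-%m-%d")


def serial_to_time_str(serial: float) -> str:
    # Only the fractional part of the serial number carries the time of day.
    return serial_to_datetime(serial % 1).strftime("%H:%M:%S")


stamp = serial_to_datetime_str(43466.75)
```

For example, serial `43466.75` is 43466 days plus three quarters of a day after the epoch, which is 18:00 on 2019-01-01.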
68 | ||
69 | ## Authentication | |
70 | The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc shows how to configure the Google Cloud API credentials to enable the Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters. | |
71 | - Enable Google Drive API and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly | |
72 | - Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly | |
73 | - Tap config.json parameters: | |
74 | - client_id: identifies your application | |
75 | - client_secret: authenticates your application | |
76 | - refresh_token: generates an access token to authorize your session | |
77 | - spreadsheet_id: unique identifier for each spreadsheet in Google Drive | |
78 | - start_date: absolute minimum start date to check file modified | |
79 | - user_agent: tap-name and email address; identifies your application in the Remote API server logs | |
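To show how the `client_id`, `client_secret`, and `refresh_token` parameters fit together, here is a sketch of the standard OAuth 2.0 refresh-token request they would feed into. It only builds the request and does not send it. The helper name `build_refresh_request` is hypothetical, and the use of Google's `https://oauth2.googleapis.com/token` endpoint is an assumption about this tap; its own client code may use a different URL or library.

```python
from urllib.parse import urlencode

# Assumption: Google's OAuth 2.0 token endpoint; the tap may use another URL.
TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_refresh_request(config: dict):
    """Return (url, headers, body) for a refresh-token exchange (hypothetical helper)."""
    body = urlencode({
        "grant_type": "refresh_token",
        "client_id": config["client_id"],          # identifies the application
        "client_secret": config["client_secret"],  # authenticates the application
        "refresh_token": config["refresh_token"],  # exchanged for an access token
    })
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": config["user_agent"],        # shows up in remote server logs
    }
    return TOKEN_URL, headers, body


url, headers, body = build_refresh_request({
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "refresh_token": "YOUR_REFRESH_TOKEN",
    "user_agent": "tap-google-sheets <api_user_email@example.com>",
})
```

The resulting tuple can be passed to any HTTP client as a POST; the response to such a request contains the short-lived access token used in the `Authorization` header of later API calls.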
80 | ||
81 | ## Quick Start | |
82 | ||
83 | 1. Install | |
84 | ||
85 | Clone this repository, and then install using setup.py. We recommend using a virtualenv: | |
86 | ||
87 | ```bash | |
88 | > virtualenv -p python3 venv | |
89 | > source venv/bin/activate | |
90 | > python setup.py install | |
91 | # OR | |
92 | > cd .../tap-google-sheets | |
93 | > pip install . | |
94 | ``` | |
95 | 2. Dependent libraries | |
96 | The following dependent libraries were installed. | |
97 | ```bash | |
98 | > pip install target-json | |
99 | > pip install target-stitch | |
100 | > pip install singer-tools | |
101 | > pip install singer-python | |
102 | ``` | |
103 | - [singer-tools](https://github.com/singer-io/singer-tools) | |
104 | - [target-stitch](https://github.com/singer-io/target-stitch) | |
105 | ||
106 | 3. Create your tap's `config.json` file. Include the client_id, client_secret, refresh_token, spreadsheet_id (unique identifier for the spreadsheet in Google Drive), start_date (UTC format), and user_agent (tap name with the API user email address). | |
107 | ||
108 | ```json | |
109 | { | |
110 | "client_id": "YOUR_CLIENT_ID", | |
111 | "client_secret": "YOUR_CLIENT_SECRET", | |
112 | "refresh_token": "YOUR_REFRESH_TOKEN", | |
113 | "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID", | |
114 | "start_date": "2019-01-01T00:00:00Z", | |
115 | "user_agent": "tap-google-sheets <api_user_email@example.com>" | |
116 | } | |
117 | ``` | |
118 | ||
119 | Optionally, also create a `state.json` file. `currently_syncing` is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off. | |
120 | Only the `file_metadata` stream uses a bookmark. The date-time bookmark is stored under the stream name in `bookmarks`. | |
121 | ||
122 | ```json | |
123 | { | |
124 | "currently_syncing": "file_metadata", | |
125 | "bookmarks": { | |
126 | "file_metadata": "2019-09-27T22:34:39.000000Z" | |
127 | } | |
128 | } | |
129 | ``` | |
130 | ||
131 | 4. Run the Tap in Discovery Mode | |
132 | This creates a catalog.json for selecting objects/fields to integrate: | |
133 | ```bash | |
134 | tap-google-sheets --config config.json --discover > catalog.json | |
135 | ``` | |
136 | See the Singer docs on discovery mode | |
137 | [here](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#discovery-mode). | |
138 | ||
139 | 5. Run the Tap in Sync Mode (with catalog) and [write out to state file](https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target) | |
140 | ||
141 | For Sync mode: | |
142 | ```bash | |
143 | > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json | |
144 | > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json | |
145 | ``` | |
146 | To load to JSON files and verify the outputs: | |
147 | ```bash | |
148 | > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json | |
149 | > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json | |
150 | ``` | |
151 | To pseudo-load to [Stitch Import API](https://github.com/singer-io/target-stitch) with dry run: | |
152 | ```bash | |
153 | > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json | |
154 | > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json | |
155 | ``` | |
156 | ||
157 | 6. Test the Tap | |
158 | ||
159 | While developing the Google Sheets tap, the following utilities were run in accordance with Singer.io best practices: | |
160 | Pylint to improve [code quality](https://github.com/singer-io/getting-started/blob/master/docs/BEST_PRACTICES.md#code-quality): | |
161 | ```bash | |
162 | > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments | |
163 | ``` | |
164 | Pylint test resulted in the following score: | |
165 | ```bash | |
166 | Your code has been rated at 9.78/10 | |
167 | ``` | |
168 | ||
169 | To [check the tap](https://github.com/singer-io/singer-tools#singer-check-tap) and verify that it works: | |
170 | ```bash | |
171 | > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json | |
172 | > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json | |
173 | ``` | |
174 | Check tap resulted in the following: | |
175 | ```bash | |
176 | The output is valid. | |
177 | It contained 3881 messages for 13 streams. | |
178 | ||
179 | 13 schema messages | |
180 | 3841 record messages | |
181 | 27 state messages | |
182 | ||
183 | Details by stream: | |
184 | +----------------------+---------+---------+ | |
185 | | stream | records | schemas | | |
186 | +----------------------+---------+---------+ | |
187 | | file_metadata | 1 | 1 | | |
188 | | spreadsheet_metadata | 1 | 1 | | |
189 | | Test-1 | 9 | 1 | | |
190 | | Test 2 | 2 | 1 | | |
191 | | SKU COGS | 218 | 1 | | |
192 | | Item Master | 216 | 1 | | |
193 | | Retail Price | 273 | 1 | | |
194 | | Retail Price NEW | 284 | 1 | | |
195 | | Forecast Scenarios | 2681 | 1 | | |
196 | | Promo Type | 91 | 1 | | |
197 | | Shipping Method | 47 | 1 | | |
198 | | sheet_metadata | 9 | 1 | | |
199 | | sheets_loaded | 9 | 1 | | |
200 | +----------------------+---------+---------+ | |
201 | ``` | |
202 | --- | |
203 | ||
204 | Copyright © 2019 Stitch |