How can I make my dataframe sample in each run usi

Posted 2019-09-13 03:02

Question:

I have a peculiar scenario in which the samples obtained in two consecutive samplings are not consistent, even though I've provided a seed value. I'm using the following code, which came out of an earlier discussion:

var conversionSample = sortedConversionSubset.sample(true, (sampleSize + 0.05), 3*x).limit((conversionCount * sampleSize).toInt) 

var nonConversionSample = sortedNonConversionSubset.sample(true, (sampleSize + 0.05), 3*x).limit((nonConversionCount * sampleSize).toInt) 

Here:

  1. 'sampleSize' is a constant fraction less than 0.8
  2. 'x' is a constant int that represents the xth iteration of a for loop
  3. 'conversionCount' and 'nonConversionCount' are int values giving the number of rows in each subset
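
To make the sizes concrete, here is a small sketch of the arithmetic behind the two calls above. The value 0.4 for 'sampleSize' is an assumption inferred from the tables below (15 and 80 input rows, 6 and 32 sampled rows); the object and method names are made up for illustration. With replacement, the fraction passed to sample is the expected number of times each row is drawn, so the number of rows produced before limit is only approximate.

```scala
object SampleSizeArithmetic {
  // Number of rows the sample is truncated to by `.limit(...)`.
  def limitRows(rowCount: Int, sampleSize: Double): Int =
    (rowCount * sampleSize).toInt

  // Expected (not guaranteed) number of rows produced by
  // `.sample(true, sampleSize + 0.05, seed)` before the limit is applied.
  def expectedDrawn(rowCount: Int, sampleSize: Double): Double =
    rowCount * (sampleSize + 0.05)

  def main(args: Array[String]): Unit = {
    val sampleSize = 0.4 // assumption: inferred from the table sizes, not stated in the post

    println(limitRows(15, sampleSize)) // conversion subset: 6 rows kept
    println(limitRows(80, sampleSize)) // non-conversion subset: 32 rows kept

    // Average number of rows drawn before the limit; a single run may draw fewer
    // than the limit, in which case the sample is shorter than intended.
    println(expectedDrawn(15, sampleSize))
    println(expectedDrawn(80, sampleSize))
  }
}
```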

The observation is that two successive runs generate different samples, which is not the expected behavior.

sortedConversionSubset
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|203865ed-f02a-4ed9-9098-82691de707a4_0|1         |
|203865ed-f02a-4ed9-9098-82691de707a4_1|1         |
|674e2337-aec5-434e-b56e-8c2efcc42894_1|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_0|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|7797aba3-3eea-4556-856e-753812b4b551_1|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_1|1         |
|9b606693-4ffa-44a5-bd7c-cc6974ce3e83_0|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|e7dc7fd9-32df-46a1-b3bd-793bbda09f6f_0|1         |
|eaf434da-6a8f-4ab0-a744-62bea663ed5e_0|1         |
|eaf434da-6a8f-4ab0-a744-62bea663ed5e_1|1         |
+--------------------------------------+----------+


sortedNonConversionSubset
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|0ed2e621-9ba4-46f0-8793-a84d32538c39_0|0         |
|0f9bcf42-e7fa-49a0-9d75-6c9bbc38b4d5_0|0         |
|108c5478-abc0-44d9-968b-47f81c4f5a37_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|13e3d779-026b-4d12-8619-aa5fe6ca99ed_0|0         |
|14497295-eebd-44aa-9f26-fc5e4810fb54_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|1911caf0-a470-4898-9b62-57c604422727_0|0         |
|1b91b8dc-09b8-47e2-b892-f5c14b650019_0|0         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|1e48e346-4ada-4a8d-896b-7658cc2499cd_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|2d4538a5-20e6-4742-ae46-aad0a5ed3fff_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|34442a3e-0437-4c41-86fb-1ac55062993a_0|0         |
|35151629-2f86-4917-90d2-42daa5ae4f5c_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4c3f11fa-e3ba-4eb1-977a-06f034bf8a54_0|0         |
|4ee484f4-e877-44c3-9390-c4e4072c5dee_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|57b47c74-b071-4278-89c9-f7b4cb1225d1_0|0         |
|58305773-f944-4039-8452-f5eb8d62f0cf_0|0         |
|58dfa9dd-43cf-4eb7-ade6-7235004a9815_0|0         |
|5b146218-9bb6-46f0-8c83-df131d78f591_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|64822b8c-009e-48ab-b6ca-1a7ece1106fa_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|73203f58-8be2-4716-b8f0-79c64400c57b_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8029c542-d933-43fb-b359-f2438dcd5660_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
|8fb43dff-260d-4ece-85e2-3bc2cb636ac1_0|0         |
|90f8a4cb-1956-43c4-ac7d-8c6514cd023a_0|0         |
|916f2e2a-6135-4004-8d54-d80b822ce394_0|0         |
|968a7ca3-1649-4586-9e60-b7e8565e708a_0|0         |
|a32782cc-8c4c-403b-aa83-09f1cec45fdb_0|0         |
|a63f44d5-a4d5-45a0-8a4b-cebf05df810b_0|0         |
|a6f958bc-e050-4216-b981-d51f1c0ff60d_0|0         |
|a7dba1bb-d7ff-44e6-9c4c-997ae59a2337_1|0         |
|ac33d675-d9cc-43b5-94fb-7d412773db14_0|0         |
|b1227816-9bf2-474f-8e82-5739acf6c895_0|0         |
|b1c27a2e-6efc-4869-880b-9ce0a4962edc_0|0         |
|b4ff6d43-cf0a-4f1d-9431-1edcb8ee1fb6_0|0         |
|b9e477ab-2065-42bb-832b-5d0e98ee05c7_0|0         |
|ba8c4efe-e71c-468c-b1bf-37efff596907_0|0         |
|c21eefc8-43d0-4be0-a252-b9fc4dbb7ad0_0|0         |
|c3785311-87c8-43bc-99a8-01d64f5eaa87_0|0         |
|c543bde7-deb8-4484-b0be-353c44baf6eb_1|0         |
|ca31e550-9d28-4628-bfe8-53648a2007f7_0|0         |
|cbc33697-20cb-4f8b-accd-0a6396a4ea41_0|0         |
|cc7810aa-08fc-44e7-acdc-ac948a28f9b9_0|0         |
|d1efdc7c-afb0-4995-bbbd-a76f731d2492_0|0         |
|d6a4b928-e576-41d7-9628-18709765199d_0|0         |
|d7311ec7-6c50-448d-8a6e-f690c3070d57_1|0         |
|d86b09f9-70a0-4101-a13b-129fe3a37b86_0|0         |
|d911be5b-aceb-45c8-a79e-73ccfa1b96f0_0|0         |
|db0c7b10-80f7-4071-aa53-fe0e2dc5ebce_0|0         |
|dce14c51-fa57-4e98-987d-708e2a9aa293_0|0         |
|dd026fb8-f818-4d1e-aaa4-4c9b3fd24994_0|0         |
|dfa9c55c-1e75-4010-be86-a6b1eb723672_0|0         |
|ea29f600-9e85-40f4-9f88-dcef46beb0c1_0|0         |
|eb5e58fc-eaac-4059-8ebc-1fab1ccf3555_1|0         |
|eb7568ab-83ac-45a7-bf4b-3b048d6c7c53_0|0         |
|f5b1cfc4-e397-4699-adab-0af6ee0e1b76_0|0         |
|facbfc8c-d477-4b27-bf15-52a56c26cbf6_0|0         |
|ffd03bca-ef40-4fa4-913e-73c002f29796_0|0         |
+--------------------------------------+----------+

1st Run Sample
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|203865ed-f02a-4ed9-9098-82691de707a4_1|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_0|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_1|1         |
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
+--------------------------------------+----------+

2nd Run Sample
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_1|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
+--------------------------------------+----------+

The two samples being different could be a blocker for me, so I want to check how I can make them consistent.
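
For reference, here is a minimal sketch of one direction that might be worth trying. It rests on an assumption: that the difference comes from the rows reaching sample in a different partition layout or order on each run, since a seeded sample can only repeat itself when its input is laid out the same way every time. The helper stableSample, the toy DataFrame, and the literal values are hypothetical and only stand in for the real subsets; 'clientid' is assumed to be unique, as it is in the input tables above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReproducibleSample {
  // Hypothetical helper, not part of the original code: pins the partitioning
  // and the row order before the seeded sample is taken.
  def stableSample(df: DataFrame, keyCol: String, fraction: Double,
                   seed: Long, maxRows: Int): DataFrame =
    df.repartition(1)               // single partition, so the layout is the same in every run
      .sortWithinPartitions(keyCol) // fixed row order, assuming keyCol values are unique
      .sample(true, fraction, seed) // withReplacement = true, as in the question
      .limit(maxRows)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("reproducible-sample")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for sortedConversionSubset: 15 rows with a unique key.
    val subset = (1 to 15).map(i => (f"client-$i%02d", 1)).toDF("clientid", "Conversion")

    // Same arguments both times; the two results are expected to agree.
    val first  = stableSample(subset, "clientid", 0.45, 3L, 6).collect().toSeq
    val second = stableSample(subset, "clientid", 0.45, 3L, 6).collect().toSeq
    println(first == second)

    spark.stop()
  }
}
```

A single partition is only reasonable for small subsets like the ones shown here. For larger data, repartitioning by the key column into a fixed number of partitions and then sorting within partitions is a similar idea, under the same uniqueness assumption.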